Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve BlurHashDecoder performance #4515

Merged
merged 2 commits into from
Jun 16, 2024

Conversation

cbeyls
Copy link
Contributor

@cbeyls cbeyls commented Jun 16, 2024

This pull request aims to dramatically improve the performance of BlurHashDecoder while also reducing its memory allocations.

  • Precompute cosines tables before composing the image so each cosine value is only computed once.
  • Compute cosines tables once if both are identical (for square images with the same number of colors in both dimensions).
  • Store colors in a one-dimension array instead of a two-dimension array to reduce memory allocations.
  • Use a simple String.indexOf() to find the index of a Base83 char, which is both faster and needs less memory than a HashMap thanks to better locality and no boxing of chars.
  • No cache is used, so computations may be performed in parallel on background threads without the need for synchronization which limits throughput.

Benchmarks

Simple: 4x4 colors, 32x32 pixels output. (This is what Mastodon and Tusky currently use)
Complex: 9x9 colors, 256x256 pixels output.

Pixel 7 (Android 14)

      365 738   ns          23 allocs    Trace    BlurHashDecoderBenchmark.tuskySimple
      109 577   ns           8 allocs    Trace    BlurHashDecoderBenchmark.newSimple
  108 771 647   ns          88 allocs    Trace    BlurHashDecoderBenchmark.tuskyComplex
   12 932 076   ns           8 allocs    Trace    BlurHashDecoderBenchmark.newComplex

Nexus 5 (Android 6)

    4 600 937   ns          22 allocs    Trace    BlurHashDecoderBenchmark.tuskySimple
    1 391 487   ns           7 allocs    Trace    BlurHashDecoderBenchmark.newSimple
1 260 644 948   ns          87 allocs    Trace    BlurHashDecoderBenchmark.tuskyComplex
  125 274 063   ns           7 allocs    Trace    BlurHashDecoderBenchmark.newComplex

Conclusion: The new implementation is 3 times faster than the old one for the current usage and up to 9 times faster if we decide to increase the BlurHash quality in the future.

The source code of the benchmark comparing the original untouched Kotlin implementation to the new one can be found here.

Verified

This commit was signed with the committer’s verified signature.
nikclayton Nik Clayton
@connyduck
Copy link
Collaborator

Faiphone 4 (Android 13)

360.552   ns          27 allocs    Trace    BlurHashDecoderBenchmark.originalSimpleWithCache
859.396   ns          27 allocs    Trace    BlurHashDecoderBenchmark.originalSimpleNoCache
183.598   ns           9 allocs    Trace    BlurHashDecoderBenchmark.newSimple

😲 🚀

@cbeyls
Copy link
Contributor Author

cbeyls commented Jun 16, 2024

The current Tusky code is like the nocache variant but a bit better since it doesn't allocate cache memory when no cache is used. For a fair comparison you should copy the Tusky class in the Benchmark project.

Also I just realized I can add an extra optimization so don't merge this quite yet.

Verified

This commit was signed with the committer’s verified signature.
nikclayton Nik Clayton
@cbeyls
Copy link
Contributor Author

cbeyls commented Jun 16, 2024

Okay, I made it a bit faster (and one less allocation) for square images with the same number of colors in both dimensions.
Ready to merge.

@connyduck
Copy link
Collaborator

204.078   ns           8 allocs    Trace    BlurHashDecoderBenchmark.newSimple
616.728   ns          23 allocs    Trace    BlurHashDecoderBenchmark.originalSimple

Copy link
Collaborator

@connyduck connyduck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work

@connyduck connyduck merged commit 125483d into tuskyapp:develop Jun 16, 2024
3 checks passed
nikclayton added a commit to pachli/pachli-android that referenced this pull request Jun 20, 2024
By Christophe Beyls in tuskyapp/Tusky#4515. Their
commit notes:

Improve the performance of `BlurHashDecoder` while also reducing memory
allocations.

- Precompute cosines tables before composing the image so each cosine
  value is only computed once.
- Compute cosines tables once if both are identical (for square images
  with the same number of colors in both dimensions).
- Store colors in a one-dimension array instead of a two-dimension array
  to reduce memory allocations.
- Use a simple String.indexOf() to find the index of a Base83 char,
  which is both faster and needs less memory than a HashMap thanks to
  better locality and no boxing of chars.
- No cache is used, so computations may be performed in parallel on
  background threads without the need for synchronization which limits
  throughput.
nikclayton added a commit to pachli/pachli-android that referenced this pull request Jun 20, 2024
By Christophe Beyls in tuskyapp/Tusky#4515.
Their commit notes:

Improve the performance of `BlurHashDecoder` while also reducing memory
allocations.

- Precompute cosines tables before composing the image so each cosine
value is only computed once.
- Compute cosines tables once if both are identical (for square images
with the same number of colors in both dimensions).
- Store colors in a one-dimension array instead of a two-dimension array
to reduce memory allocations.
- Use a simple String.indexOf() to find the index of a Base83 char,
which is both faster and needs less memory than a HashMap thanks to
better locality and no boxing of chars.
- No cache is used, so computations may be performed in parallel on
background threads without the need for synchronization which limits
throughput.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants