Improve BlurHashDecoder performance #4515

cbeyls · 2024-06-16T16:08:45Z

This pull request aims to dramatically improve the performance of BlurHashDecoder while also reducing its memory allocations.

- Precompute the cosine tables before composing the image, so each cosine value is computed only once.
- Compute the cosine table only once if both tables would be identical (square images with the same number of colors in both dimensions).
- Store colors in a one-dimensional array instead of a two-dimensional array to reduce memory allocations.
- Use a simple String.indexOf() to find the index of a Base83 char, which is both faster and needs less memory than a HashMap thanks to better locality and no boxing of chars.
- Use no cache, so computations can run in parallel on background threads without synchronization, which would limit throughput.
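To make the first and third points concrete, here is a minimal sketch of the idea, written in Java rather than the Kotlin of the actual patch. It is not the code from this pull request: the class name `CosineTableSketch`, the method names, and the array layouts are assumptions chosen for illustration. It fills one cosine table per axis up front and reads the colors from a single flat array while composing the pixels.

```java
// Illustrative sketch only; names and array layouts are hypothetical, not taken from the PR.
final class CosineTableSketch {

    // table[p * numComp + i] = cos(PI * p * i / size) for pixel position p and component i.
    static double[] cosineTable(int size, int numComp) {
        double[] table = new double[size * numComp];
        for (int p = 0; p < size; p++) {
            for (int i = 0; i < numComp; i++) {
                table[p * numComp + i] = Math.cos(Math.PI * p * i / size);
            }
        }
        return table;
    }

    // colors is one flat array of linear RGB triplets: colors[(j * numCompX + i) * 3 + channel].
    // Returns linear RGB per pixel, also flat: out[(y * width + x) * 3 + channel].
    static float[] compose(float[] colors, int numCompX, int numCompY, int width, int height) {
        // Each cosine is computed exactly once here instead of once per pixel and component.
        double[] cosX = cosineTable(width, numCompX);
        double[] cosY = cosineTable(height, numCompY);
        float[] out = new float[width * height * 3];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                float r = 0f, g = 0f, b = 0f;
                for (int j = 0; j < numCompY; j++) {
                    double cy = cosY[y * numCompY + j];
                    for (int i = 0; i < numCompX; i++) {
                        float basis = (float) (cosX[x * numCompX + i] * cy);
                        int c = (j * numCompX + i) * 3;
                        r += colors[c] * basis;
                        g += colors[c + 1] * basis;
                        b += colors[c + 2] * basis;
                    }
                }
                int o = (y * width + x) * 3;
                out[o] = r;
                out[o + 1] = g;
                out[o + 2] = b;
            }
        }
        return out;
    }
}
```

With this layout the inner loops only perform table lookups and multiplications, and the method allocates three arrays in total regardless of the number of components. Because it keeps no shared state, several calls can run on different threads at once.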

Benchmarks

Simple: 4x4 colors, 32x32 pixels output. (This is what Mastodon and Tusky currently use)
Complex: 9x9 colors, 256x256 pixels output.

Pixel 7 (Android 14)

      365 738   ns          23 allocs    Trace    BlurHashDecoderBenchmark.tuskySimple
      109 577   ns           8 allocs    Trace    BlurHashDecoderBenchmark.newSimple
  108 771 647   ns          88 allocs    Trace    BlurHashDecoderBenchmark.tuskyComplex
   12 932 076   ns           8 allocs    Trace    BlurHashDecoderBenchmark.newComplex

Nexus 5 (Android 6)

    4 600 937   ns          22 allocs    Trace    BlurHashDecoderBenchmark.tuskySimple
    1 391 487   ns           7 allocs    Trace    BlurHashDecoderBenchmark.newSimple
1 260 644 948   ns          87 allocs    Trace    BlurHashDecoderBenchmark.tuskyComplex
  125 274 063   ns           7 allocs    Trace    BlurHashDecoderBenchmark.newComplex

Conclusion: The new implementation is about 3.3 times faster than the old one for the current usage (Simple) and roughly 8 to 10 times faster for the Complex case, which matters if we decide to increase the BlurHash quality in the future.
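The speedup factors follow directly from the timings listed above. As a quick check, this small Java sketch (the class name `SpeedupSketch` and its methods are made up for this example) divides the old timing by the new one for each device and case, using only the numbers from the two benchmark tables.

```java
import java.util.Locale;

// Illustrative arithmetic only; the timings are copied from the benchmark tables above.
final class SpeedupSketch {

    // Ratio of old to new duration; values above 1 mean the new code is faster.
    static double speedup(long oldNs, long newNs) {
        return (double) oldNs / newNs;
    }

    static String report() {
        StringBuilder sb = new StringBuilder();
        appendLine(sb, "Pixel 7 simple", speedup(365_738L, 109_577L));
        appendLine(sb, "Pixel 7 complex", speedup(108_771_647L, 12_932_076L));
        appendLine(sb, "Nexus 5 simple", speedup(4_600_937L, 1_391_487L));
        appendLine(sb, "Nexus 5 complex", speedup(1_260_644_948L, 125_274_063L));
        return sb.toString();
    }

    private static void appendLine(StringBuilder sb, String label, double ratio) {
        sb.append(String.format(Locale.ROOT, "%s: %.2fx%n", label, ratio));
    }
}
```

Calling `System.out.print(SpeedupSketch.report())` prints one ratio per device and case.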

The source code of the benchmark comparing the original untouched Kotlin implementation to the new one can be found here.

connyduck · 2024-06-16T17:17:29Z

Fairphone 4 (Android 13)

360 552   ns          27 allocs    Trace    BlurHashDecoderBenchmark.originalSimpleWithCache
859 396   ns          27 allocs    Trace    BlurHashDecoderBenchmark.originalSimpleNoCache
183 598   ns           9 allocs    Trace    BlurHashDecoderBenchmark.newSimple

😲 🚀

cbeyls · 2024-06-16T17:21:48Z

The current Tusky code is like the NoCache variant, but a bit better, since it doesn't allocate cache memory when no cache is used. For a fair comparison, you should copy the Tusky class into the benchmark project.

Also I just realized I can add an extra optimization so don't merge this quite yet.

cbeyls · 2024-06-16T18:02:25Z

Okay, I made it a bit faster (and one less allocation) for square images with the same number of colors in both dimensions.
Ready to merge.
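A minimal sketch of what such a shortcut might look like, in Java and with hypothetical names (`SharedCosineTableSketch`, `cosineTables`); the actual change in the pull request may be structured differently. When the output is square and both axes use the same number of components, the two tables hold the same values, so one array can serve both axes.

```java
// Illustrative sketch only; names are hypothetical and not taken from the PR.
final class SharedCosineTableSketch {

    // table[p * numComp + i] = cos(PI * p * i / size) for pixel position p and component i.
    static double[] cosineTable(int size, int numComp) {
        double[] table = new double[size * numComp];
        for (int p = 0; p < size; p++) {
            for (int i = 0; i < numComp; i++) {
                table[p * numComp + i] = Math.cos(Math.PI * p * i / size);
            }
        }
        return table;
    }

    // Returns {cosX, cosY}. Both entries are the same array instance when the
    // dimensions and component counts match, which saves one computation and one allocation.
    static double[][] cosineTables(int width, int height, int numCompX, int numCompY) {
        double[] cosX = cosineTable(width, numCompX);
        double[] cosY = (width == height && numCompX == numCompY)
                ? cosX
                : cosineTable(height, numCompY);
        return new double[][] {cosX, cosY};
    }
}
```

For the Simple benchmark case (4x4 colors, 32x32 pixels) this condition holds, which is consistent with the one fewer allocation reported above.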

connyduck · 2024-06-16T18:10:28Z

204 078   ns           8 allocs    Trace    BlurHashDecoderBenchmark.newSimple
616 728   ns          23 allocs    Trace    BlurHashDecoderBenchmark.originalSimple

connyduck

Amazing work

By Christophe Beyls in tuskyapp/Tusky#4515. Their commit notes:

Improve the performance of `BlurHashDecoder` while also reducing memory allocations.

- Precompute cosines tables before composing the image so each cosine value is only computed once.
- Compute cosines tables once if both are identical (for square images with the same number of colors in both dimensions).
- Store colors in a one-dimension array instead of a two-dimension array to reduce memory allocations.
- Use a simple String.indexOf() to find the index of a Base83 char, which is both faster and needs less memory than a HashMap thanks to better locality and no boxing of chars.
- No cache is used, so computations may be performed in parallel on background threads without the need for synchronization which limits throughput.
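The Base83 lookup described in these notes can be sketched roughly as follows. This is a Java illustration with a hypothetical class name (`Base83Sketch`), not the decoder from the pull request; the alphabet string is the standard BlurHash Base83 character set. The position of a character in the string is its digit value, so no map and no boxed `Character` keys are needed.

```java
// Illustrative sketch only; class and method names are hypothetical.
final class Base83Sketch {

    // The 83 characters of the BlurHash alphabet; a character's index is its digit value.
    static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz#$%*+,-.:;=?@[]^_{|}~";

    // Decodes str[from, to) as a big-endian Base83 number.
    static int decode(String str, int from, int to) {
        int result = 0;
        for (int i = from; i < to; i++) {
            // Linear scan over a small contiguous string instead of a HashMap lookup.
            int index = ALPHABET.indexOf(str.charAt(i));
            if (index < 0) {
                throw new IllegalArgumentException("Not a Base83 character: " + str.charAt(i));
            }
            result = result * 83 + index;
        }
        return result;
    }
}
```

For example, `Base83Sketch.decode("10", 0, 2)` evaluates to 1 * 83 + 0 = 83.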

improve BlurHashDecoder performance and reduce memory allocations

d111276

optimization: don't compute cosinus tables twice if they are identical

377453e

connyduck approved these changes Jun 16, 2024

connyduck merged commit 125483d into tuskyapp:develop Jun 16, 2024
3 checks passed
