
Betsy: Add caching and BC1 compression support #95915

Merged 2 commits into godotengine:master from BlueCube3310:betsy-bc1 on Sep 16, 2024

Conversation

BlueCube3310
Contributor

@BlueCube3310 BlueCube3310 commented Aug 21, 2024

Adds BC1 compression to Betsy, which greatly improves the image quality:

[comparison screenshots: etcpak vs. Betsy]

In terms of import time, this is currently slightly slower than etcpak due to creating a RenderingDevice for every compress operation (an additional 100-400 ms on a debug build). The compression itself takes about the same time.

Edit: After implementing caching, it's now ~1.75x faster than etcpak. Layered textures now also import significantly faster.

@BlueCube3310 BlueCube3310 requested review from a team as code owners August 21, 2024 18:46
@BlueCube3310 BlueCube3310 force-pushed the betsy-bc1 branch 2 times, most recently from 4192f74 to 225cd4b Compare August 21, 2024 18:55
@fire
Member

fire commented Aug 21, 2024

I recommend we merge only once it is clearly faster, e.g. after we have a batching system for creating the compute resources.

@clayjohn
Member

For the RD I would probably just make the local RenderingDevice a static variable, then initialize it the first time the function is called (this will need a mutex). And then free it in uninitialize_betsy_module()

The alternative would be to try to run this on the main RenderingDevice. But then you have to ensure that the whole function is called on the rendering thread (and it won't work with the Compatibility renderer anyway)
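The static-device idea above can be sketched roughly as follows. This is a minimal illustration, not Godot's actual API: `LocalDevice` and both function names are hypothetical stand-ins for the real RenderingDevice plumbing.

```cpp
#include <mutex>

// Hypothetical stand-in for a local RenderingDevice; not Godot's API.
struct LocalDevice {
	int id = 0;
};

static LocalDevice *betsy_device = nullptr;
static std::mutex betsy_device_mutex;

// Lazily create the shared device the first time compression runs.
// The mutex guards against two threads racing to create it.
LocalDevice *get_betsy_device() {
	std::lock_guard<std::mutex> lock(betsy_device_mutex);
	if (betsy_device == nullptr) {
		betsy_device = new LocalDevice{ 1 };
	}
	return betsy_device;
}

// Would be called from uninitialize_betsy_module() so the device
// does not outlive the module.
void free_betsy_device() {
	std::lock_guard<std::mutex> lock(betsy_device_mutex);
	delete betsy_device;
	betsy_device = nullptr;
}
```

Every caller then shares one device instead of paying the 100-400 ms creation cost per compress call.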

@BlueCube3310
Contributor Author

BlueCube3310 commented Aug 22, 2024

RenderingDevice, shader and pipeline caching is now implemented. Betsy is now ~1.75x faster than etcpak.

Layered textures (such as lightmaps) now compress even faster:
1024x1024x7 lightmap (with cache):
[benchmark screenshot]

@BlueCube3310 BlueCube3310 changed the title Betsy: Add BC1 compression support Betsy: Add caching and BC1 compression support Aug 22, 2024
@BlueCube3310 BlueCube3310 requested review from a team as code owners August 22, 2024 11:11
@BlueCube3310
Contributor Author

Caching can now be toggled from the project settings

@clayjohn
Member

I think that the entire compression function might need to be behind a mutex, now that I think about it. We will have issues if you try to compress multiple textures from multiple different threads, as there is state that gets changed by submit and sync.

I'm going on holidays, but I would like @DarioSamo to chime in with his thoughts, as this will also be impacted by #90400.

I think the current code works because of the following:

  1. Textures are all compressed from the same calling thread (I think?)
  2. If multiple threads call the function they each have their own device
  3. RenderingDevice has its own mutex that is locked for both submit and sync.

2 is removed by this PR. 3 is removed by the ubershader PR. And 1 might not be true, and even if it is, it might eventually change as we optimize the import process and multi thread more parts of the engine.

I think we probably have 2 robust solutions:

  1. Lazily allocate a mutex/RD for each thread that calls this function. This would mean you could potentially end up with N RDs if you have N threads.
  2. Add a call queue to Betsy so that Betsy is essentially single-threaded and just accepts tasks in a queue.
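Option 2 can be sketched as below, assuming a plain `std::thread` worker rather than Godot's WorkerThreadPool; all names are illustrative, not the PR's actual implementation. Any thread may push work, but only the single worker thread executes it, so compression jobs are serialized.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical single-threaded task queue for Betsy.
class BetsyQueue {
public:
	BetsyQueue() :
			worker(&BetsyQueue::run, this) {}

	~BetsyQueue() {
		{
			std::lock_guard<std::mutex> lock(mutex);
			exiting = true;
		}
		cond.notify_one();
		worker.join(); // Remaining tasks are drained before join returns.
	}

	// Callers from any thread enqueue work; only the worker runs it.
	void push(std::function<void()> task) {
		{
			std::lock_guard<std::mutex> lock(mutex);
			tasks.push(std::move(task));
		}
		cond.notify_one();
	}

private:
	void run() {
		std::unique_lock<std::mutex> lock(mutex);
		while (true) {
			cond.wait(lock, [this] { return exiting || !tasks.empty(); });
			if (tasks.empty() && exiting) {
				return;
			}
			std::function<void()> task = std::move(tasks.front());
			tasks.pop();
			lock.unlock();
			task(); // One compression job at a time, outside the lock.
			lock.lock();
		}
	}

	std::queue<std::function<void()>> tasks; // Declared before `worker` so
	std::mutex mutex; // all state exists when the thread starts.
	std::condition_variable cond;
	bool exiting = false;
	std::thread worker;
};

// Demo helper: push n jobs, then let the destructor drain and join.
int run_demo(int n) {
	int count = 0;
	{
		BetsyQueue q;
		for (int i = 0; i < n; i++) {
			q.push([&count] { count++; });
		}
	}
	return count;
}
```

The queue gives the same safety as one mutex around the whole compress function, but without blocking the calling threads while the GPU works.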

@clayjohn
Member

I discussed this with @DarioSamo and we agreed that option 2 is probably the way to go. That means we should add a call queue and always call compress_betsy from a dedicated thread. We can use a setup similar to how we do multithreaded rendering for the servers (i.e. create a thread on the WorkerThreadPool).

I know that blows the complexity of this up. Do you want to give that a try, or would you prefer that I add that on top of what you have already done?

@BlueCube3310
Contributor Author

I'm not too familiar with multithreading/synchronization yet, so I'd prefer if someone more experienced than me handled this.

The latest push only rebases onto the master branch.

@clayjohn
Member

clayjohn commented Sep 12, 2024

Here is a little MRP that reproduces the threading issue I was worried about: multithread-compress.zip

When running this I get error spam, then I lose the Vulkan device and it crashes.

@clayjohn
Member

clayjohn commented Sep 12, 2024

Here is a commit adding a CommandQueue to Betsy clayjohn@74a1c85. What this does is route all commands to a dedicated thread. That way we can ensure that only one texture is being compressed at a time. I also took the chance to update the caching a bit (we now cache the sampler and the DXT1 buffer).

Unfortunately, there is a bug in master right now that causes a deadlock when exiting the engine with this commit: #96931

Once https://github.com/godotengine/godot/issues/96931 is resolved, we can add my commit on top of this PR and then merge everything at once.

Edit: #96959 fixes the issue!

@clayjohn
Member

I've pushed an additional commit with the CommandQueue implementation. I've tested it locally and it seems to work fine.

Once #96959 is merged, then this PR should be good as well

@clayjohn
Member

Taking a look at performance, it seems now that the most expensive parts of Betsy are:

  1. The format conversion (for a 4k texture, just going from RGB8 to RGBA8 takes 250-300 ms, which is easily half of the total compression time)
  2. Creating the temporary textures in the mipmap loop

For a 4k texture the timing is roughly as follows:

Format conversion (RGB8 -> RGBA8): 252 ms
Draw call CPU: 85 ms   // (calculate offset and size, create textures, create uniform set, set push constant, dispatch)
Draw call GPU: 37 ms
Copy back to CPU: 3 ms

Since everything runs in lockstep, the total cost is 377ms (That is slightly wrong as I am only including the first mip level for the compression, but I include all mips for the format conversion).

I bet we can get the whole thing under 100 ms. Simple RGB->RGBA conversions we should be able to handle on the GPU (i.e. by reading the RGB format and then writing out to RGBA). This would be almost free if we did it in the same dispatch as the compression. Then, for the per-draw-call costs we have two choices:

  1. We can overlap the GPU and the CPU work by delaying calling sync() and doing the readback until right before we do the next dispatch.
  2. We can modify Betsy to compress all mip levels at once. So we only have to setup the GPU texture once, then compress all levels. This should allow the GPU to be fully saturated which will help with some of the overhead.

In either case we will only really benefit when using mipmaps; single-mip textures won't be impacted. Given that the format conversion is the most significant part, I suggest we tackle that before even considering optimizing the loop.

Also, all of this is for a future PR. These potential optimizations shouldn't block this one.
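As a rough back-of-envelope model of option 1 (my own sketch, not code from the PR): in lockstep, each iteration pays CPU and GPU time serially, while with one dispatch kept in flight each middle iteration costs roughly max(CPU, GPU), because the CPU setup for the next mip hides behind the GPU work for the current one.

```cpp
#include <algorithm>

// Simplistic pipeline model; real timings vary per mip level.
int lockstep_ms(int cpu_ms, int gpu_ms, int iters) {
	// submit(); sync(); readback(); every iteration: costs add up.
	return (cpu_ms + gpu_ms) * iters;
}

int overlapped_ms(int cpu_ms, int gpu_ms, int iters) {
	// Delay sync()/readback until just before the next dispatch:
	// the first CPU setup and the last GPU job are still exposed.
	return cpu_ms + std::max(cpu_ms, gpu_ms) * (iters - 1) + gpu_ms;
}
```

Plugging in the 85 ms CPU / 37 ms GPU figures quoted above with, say, 8 dispatches gives 976 ms lockstep vs. 717 ms overlapped; the model deliberately ignores the 252 ms format conversion, which would shrink separately if moved into the compression dispatch.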

@BlueCube3310
Contributor Author

Thank you for the help! I've tested it locally and can confirm it works correctly.

@clayjohn clayjohn modified the milestones: 4.x, 4.4 Sep 15, 2024
@clayjohn
Member

Note for other maintainers: this PR runs Betsy on a long-running background thread using the same technique as multithreaded physics/rendering (i.e. it uses a dedicated thread, but yields to the WTP). This means it will be a good test case for our WTP, as multithreaded servers are currently all disabled by default. The flipside is that if we accidentally break the WTP, we have code running in production that will break, so big changes to the WTP and MT are slightly more risky.

@akien-mga akien-mga merged commit 67c9708 into godotengine:master Sep 16, 2024
20 checks passed
@akien-mga
Member

Thanks!
