Betsy: Add caching and BC1 compression support #95915
Conversation
Force-pushed from 4192f74 to 225cd4b
I recommend we merge only when it is clearly faster, like after we have a batching system for creating the compute
For the RD, I would probably just make the local RenderingDevice a static variable, then initialize it the first time the function is called (this will need a mutex), and then free it. The alternative would be to try to run this on the main RenderingDevice, but then you have to ensure that the whole function is called on the rendering thread (and it won't work with the Compatibility renderer anyway).
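The lazy, mutex-guarded initialization suggested here could look roughly like this (a minimal sketch in plain C++; `RenderingDevice` is a stand-in for Godot's class, and `get_betsy_device` is a hypothetical name):

```cpp
#include <mutex>

// Hypothetical stand-in for Godot's RenderingDevice.
struct RenderingDevice {};

// Returns a shared device, creating it lazily on the first compress call.
// The mutex guards against two threads racing to create the device.
RenderingDevice *get_betsy_device() {
	static std::mutex init_mutex;
	static RenderingDevice *device = nullptr;
	std::lock_guard<std::mutex> lock(init_mutex);
	if (device == nullptr) {
		device = new RenderingDevice(); // freed later, e.g. on engine shutdown
	}
	return device;
}
```

Every caller then gets the same device instead of paying the creation cost per compress operation.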
Force-pushed from 225cd4b to 6ecfabb
Force-pushed from 6ecfabb to e749ccf
Force-pushed from e749ccf to 55838ca
Caching can now be toggled from the project settings.
Now that I think about it, the entire compression function might need to be behind a mutex. We will have issues if you try to compress multiple textures from multiple different threads, as there is state that gets changed by submit and sync. I'm going on holidays, but I would like @DarioSamo to chime in with his thoughts, as this will also be impacted by #90400. I think the current code works because of the following:
2 is removed by this PR. 3 is removed by the ubershader PR. And 1 might not be true; even if it is, it might eventually change as we optimize the import process and multithread more parts of the engine. I think we probably have two robust solutions:
I discussed this with @DarioSamo and we agreed that option 2 is probably the way to go. That means we should add a call queue and always call through it. I know that blows up the complexity of this. Do you want to give that a try, or would you prefer that I add that on top of what you have already done?
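A call queue of the kind proposed here could be sketched as follows (a minimal stand-alone C++ version; `BetsyCommandQueue` and its members are illustrative names, not Godot's actual implementation):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Sketch: all compression requests are pushed onto a queue and drained by one
// dedicated worker thread, so only one texture is ever being compressed at a time.
class BetsyCommandQueue {
public:
	BetsyCommandQueue() :
			worker([this] { run(); }) {}

	~BetsyCommandQueue() {
		{
			std::lock_guard<std::mutex> lock(mutex);
			exiting = true;
		}
		cond.notify_one();
		worker.join(); // drains remaining commands before returning
	}

	void push(std::function<void()> command) {
		{
			std::lock_guard<std::mutex> lock(mutex);
			commands.push(std::move(command));
		}
		cond.notify_one();
	}

private:
	void run() {
		std::unique_lock<std::mutex> lock(mutex);
		while (true) {
			cond.wait(lock, [this] { return exiting || !commands.empty(); });
			if (commands.empty() && exiting) {
				return;
			}
			std::function<void()> command = std::move(commands.front());
			commands.pop();
			lock.unlock();
			command(); // executes serially, on the single worker thread
			lock.lock();
		}
	}

	std::mutex mutex;
	std::condition_variable cond;
	std::queue<std::function<void()>> commands;
	bool exiting = false;
	std::thread worker; // declared last so the other members exist before run() starts
};
```

Callers never touch the RenderingDevice directly; they only enqueue work, which removes the submit/sync race by construction.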
Force-pushed from 55838ca to 606eedb
I'm not too familiar with multithreading/synchronization yet, so I'd prefer if someone more experienced than me handled this. The latest push only rebased onto the master branch.
Here is a little MRP that reproduces the threading issue I was worried about: multithread-compress.zip. When running this I get error spam, and then I lose the Vulkan device and it crashes.
Here is a commit adding a CommandQueue to Betsy: clayjohn@74a1c85. What this does is route all commands to a dedicated thread. That way we can ensure that only one texture is being compressed at a time. I also took the chance to update the caching a bit (we now cache the sampler and the DXT1 buffer). Unfortunately, there is a bug in master right now that causes a deadlock when exiting the engine with this commit: #96931. Once https://github.com/godotengine/godot/issues/96931 is resolved, we can add my commit on top of this PR and then merge everything at once. Edit: #96959 fixes the issue!
I've pushed an additional commit with the CommandQueue implementation. I've tested it locally and it seems to work fine. Once #96959 is merged, this PR should be good as well.
Taking a look at performance, it seems now that the most expensive parts of Betsy are:
For a 4k texture the timing is roughly as follows:
Since everything runs in lockstep, the total cost is 377 ms (that is slightly off, as I am only including the first mip level for the compression, but all mips for the format conversion). I bet we can get the whole thing under 100 ms. For simple RGB->RGBA conversion we should be able to handle it on the GPU (i.e. by reading the RGB format and then writing out to RGBA). This would be almost free if we do it in the same dispatch as the compression. Then, for the per-draw-call costs we have two choices:
In either case we will only really benefit when using mipmaps; single-mip textures won't be impacted. Given that the format conversion is the most significant part, I suggest we do that before even considering optimizing the loop. Also, all of this is for a future PR; these potential optimizations shouldn't block this one.
Thank you for the help! I've tested it locally and can confirm it works correctly.
Note for other maintainers: this PR runs Betsy on a long-running background thread using the same technique as multithreaded physics/rendering (i.e. it uses a dedicated thread, but yields to the WTP). This means it will be a good test case for our WTP, as multithreaded servers are currently all disabled by default. The flip side is that if we accidentally break the WTP, we have code running in production that will break, so big changes to the WTP and MT are slightly more risky.
Thanks!
Adds BC1 compression to Betsy, which greatly improves the image quality:
In terms of import time, this is currently slightly slower than etcpak due to creating a RenderingDevice for every compress operation (an additional 100-400 ms on a debug build). The compression itself takes about the same time. Edit: After implementing caching, it's now ~1.75x faster than etcpak. Layered textures now also import significantly faster.
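For context, BC1 packs each 4x4 texel tile into 8 bytes: two RGB565 endpoints plus 2-bit per-texel indices. A minimal sketch of the layout and the 5/6-bit channel expansion (struct and helper names are illustrative, not from this PR):

```cpp
#include <cstdint>

// BC1 (DXT1) block layout: 8 bytes encode a 4x4 texel tile.
struct BC1Block {
	uint16_t color0; // RGB565 endpoint 0
	uint16_t color1; // RGB565 endpoint 1
	uint32_t indices; // 2 bits per texel, selecting between endpoints/interpolants
};

// Expand 5-bit and 6-bit channels to 8 bits by bit replication.
static inline uint8_t expand5(uint16_t v) {
	return static_cast<uint8_t>((v << 3) | (v >> 2));
}

static inline uint8_t expand6(uint16_t v) {
	return static_cast<uint8_t>((v << 2) | (v >> 4));
}
```

The compressor's job (what Betsy does on the GPU) is picking the two endpoints and per-texel indices that minimize error for each tile, which is why a GPU-quality encoder can beat a fast CPU one like etcpak on quality.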