
Setting preprocess to 600 on a GPUParticles2D causes it to take forever to load. (4.4 Dev 6) #100098

Closed
produno opened this issue Dec 6, 2024 · 21 comments · Fixed by #100378

Comments

@produno

produno commented Dec 6, 2024

Tested versions

Reproducible in Godot 4.4 Dev6
Works fine in Godot 4.4 Dev4

System information

Godot v4.4.dev6 - Windows 10.0.22631 - Single-window, 2 monitors - Vulkan (Forward+) - dedicated NVIDIA GeForce RTX 4090 (NVIDIA; 32.0.15.6614) - 13th Gen Intel(R) Core(TM) i7-13700K (24 threads)

Issue description

Setting the preprocess of a GPUParticles2D to 600 causes any scene using this node to take an extremely long time to load. This happens both in the editor and when running a game.

Steps to reproduce

Create a GPUParticles2D and set the preprocess to 600. Save, close Godot, and reopen the project on that scene. Alternatively, run the scene.

Minimal reproduction project (MRP)

particle-test.zip

Because the project opens on that scene, you will see that it takes a very long time to load.

@AThousandShips
Member

AThousandShips commented Dec 6, 2024

Not sure what time scale the preprocess runs at, i.e. how much faster than real time it runs.

Please clarify what "ages" means; it's very vague here.

This would be a preprocess of 10 minutes, so I'm unsure how long this would be expected to take in real time and what the potential regression would be.

@produno
Author

produno commented Dec 6, 2024

Not sure what time scale the preprocess runs at, i.e. how much faster than real time it runs.

Please clarify what "ages" means; it's very vague here.

This would be a preprocess of 10 minutes, so I'm unsure how long this would be expected to take in real time and what the potential regression would be.

Ages means way, way longer than it should, lol. You can check the MRP I added in Dev 4 and Dev 6; you will see a very noticeable difference in load times.
In Dev 4 it opens almost instantaneously. In Dev 6 it takes several minutes.

I use particles on my main menu, and my game takes 3-4 minutes to load that menu. If I disable particles, it takes around 10 seconds.

@AThousandShips
Member

Details on how much things slow down are important, as it might be hardware-specific. If I test this and it isn't very slow, then that is relevant information.

@AThousandShips AThousandShips moved this from Unassessed to Very Bad in 4.x Release Blockers Dec 6, 2024
@AThousandShips
Member

Can replicate, will bisect later today

@matheusmdx
Contributor

I bisected this issue to #99426, @DarioSamo

Image


After #99426, the scene not only takes more time to load (50-55 s) but also uses much more RAM (the CPU usage increase didn't come from #99426).

Before:

Image

After:

Image

@AThousandShips
Member

@matheusmdx please avoid bisecting when someone else is already doing this 🙃 I just spent about 40 minutes for nothing now because of this...

@matheusmdx
Contributor

@AThousandShips Oh sorry, I didn't see you were already on this. I started bisecting before you said you would, but I had to leave with 2 steps remaining, so I only just arrived home and finished the bisect. I didn't see your message.

@AThousandShips
Member

I'd suggest in the future commenting that you're working on it, like I did.

@DarioSamo
Contributor

The regression is mostly unrelated to the PR itself. The render graph is able to re-order operations because of the changes introduced in that PR. Sadly, that means the sorting time grew exponentially because it can now detect a lot more dependencies. It is pretty much spending its entire time just sorting the command graph.

The algorithm itself is designed to work with a much lower scope of nodes per frame, and it's basically blowing up because of the huge amount of commands generated by the preprocess. A brief look revealed there's 36k commands being sorted.

Bisecting into the PR won't do us much good here; it'll be best to either just avoid reordering in these scenarios or find some way to optimize the ordering process.

Most of the time was spent entirely just on the ::end() step.

@DarioSamo
Contributor

DarioSamo commented Dec 9, 2024

As a fun note, here's how big the adjacency list grows during the frame that particles get preprocessed.

Image

A linked list of 324 million elements.

@DarioSamo
Contributor

DarioSamo commented Dec 9, 2024

For some extra context, while I understand Godot offers this feature, the strain on the system will be completely unsustainable for anything that isn't a high-end GPU. While I agree there's a perceived regression, the expectation around fixing this will likely require some irregular handling for preprocess in particular (e.g. the graph has to get out of the way and fall back to a linear process).

Preprocess introduces a highly irregular workload and batches a ton of work into a single frame. You're running a really high risk of encountering a TDR (a Timeout Detection and Recovery reset, where the OS kills the GPU driver for taking too long) on a GPU that can't take this much work in a single frame.

While the current waiting time is pretty much entirely the CPU's fault on Godot's side, you're basically playing the lottery as to whether the driver will accept a giant command list like this without a DEVICE_LOST on the GPU afterwards, depending on the strength of the system.

@matheusmdx
Contributor

I think at least a note should be added to the GPUParticles2D docs (and I suppose GPUParticles3D will suffer from this too).

@darksylinc
Contributor

darksylinc commented Dec 9, 2024

I agree with Dario.

Preprocessing 10 minutes worth of particle simulation in a single frame is too much of an extreme value, and not the general case (optimizations are about optimizing for general cases).

While the CPU performance regression can be "fixed" by adding an exception so the render graph stays out of these commands, it is worth arguing whether a value of 600 should be valid input at all (or at least, not allowed by default).

A value of 600 will TDR slower GPUs (such as Intel iGPUs) due to the sheer amount of GPU work to process. This makes Godot look bad and unstable; whereas the problem is not within Godot.

35k dispatches in a row is way too much. Applications that need to do something like that will likely need to tweak the TdrDelay in the Registry Editor (which is a system-wide setting).

At most, the upper limit should be around 10 seconds' worth of simulation, not 600. If users want more, they'd have to flip a switch with a big fat "WARNING!!! Going beyond this value can cause crashes on slow systems".

@produno
Author

produno commented Dec 9, 2024

No one in their right mind is going to use a large preprocess the way it is now, so it may as well be removed. But this is quite a large breaking change for those that have already done so and wish to update, as a feature that was available in the engine is no longer viable.

@clayjohn
Member

clayjohn commented Dec 9, 2024

I think it would make sense to do the following:

  1. Add an option to disable either the graph, or the parts of the graph that are choking (likely just the dependency detection stuff from Dev 6).
  2. Place limits on preprocess so this doesn't bite other users (i.e. reduce the editor range of the slider and/or warn when hundreds of passes will be created). We can't blame users for dragging a slider all the way to the top and breaking things.
  3. Document the cost of preprocess.

edit: the default range for pre-process is 0-600 seconds:

ADD_PROPERTY(PropertyInfo(Variant::FLOAT, "preprocess", PROPERTY_HINT_RANGE, "0.00,600.0,0.01,exp,suffix:s"), "set_pre_process_time", "get_pre_process_time");

We could easily reduce this and add the "or_greater" option so users can use such a large value at their own risk.
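For illustration, the revised registration might look like the following (the 30-second cap is a hypothetical value chosen for this sketch; "or_greater" is the standard range-hint flag that keeps larger typed-in values valid while capping the slider):

```cpp
// Hypothetical sketch: cap the slider at 30 s, but keep "or_greater" so
// users can still type larger values at their own risk.
ADD_PROPERTY(PropertyInfo(Variant::FLOAT, "preprocess", PROPERTY_HINT_RANGE, "0.00,30.0,0.01,exp,or_greater,suffix:s"), "set_pre_process_time", "get_pre_process_time");
```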

@darksylinc
Contributor

No one in their right mind is going to use a large preprocess the way it is now, so it may as well be removed. But this is quite a large breaking change for those that have already done so and wish to update, as a feature that was available in the engine is no longer viable.

Visually, a preprocess of 10 minutes should not look much different from a preprocess of 10 seconds. If it's very different, then you're also using other extreme values (like ridiculously long lifetime values), or there is a bug in the way the simulation is preprocessed.

@DarioSamo
Contributor

But this is quite a large breaking change for those that have already done so and wish to update, as a feature that was available in the engine is no longer viable.

I can't say I agree. Even if the graph behaved exactly like in Dev 4, or didn't exist at all, the point remains: this use case runs the risk of blowing up the driver and hitting a TDR on a system that isn't using an RTX 4090. The engine shouldn't offer a feature in such a state; it has to employ a strategy to flush such a massive amount of work in sensible chunks.

Even calling swap_buffers() in the preprocess every second of preprocessing time would solve this instantaneously. I do not find any other approach viable to actually ship a game that needs to work on weaker hardware.

The graph should obviously avoid choking at a workload like this if possible as any system should, but it's very likely to run into problems elsewhere out of the scope of the engine.

@DarioSamo
Contributor

DarioSamo commented Dec 9, 2024

#100222 should hopefully address part of the issue as there was an incorrect detection that was exposed by the workload saturation caused by the particles preprocess.

The GPU-side of the issue remains unsolved as it is.

@produno
Author

produno commented Dec 9, 2024

I can't say I agree. Even if the graph behaved exactly like in Dev 4, or didn't exist at all, the point remains: this use case runs the risk of blowing up the driver and hitting a TDR on a system that isn't using an RTX 4090. The engine shouldn't offer a feature in such a state; it has to employ a strategy to flush such a massive amount of work in sensible chunks.

I'm not sure what you mean? I've run my game, which used a 600 preprocess, on my laptop, a PC with a 1080 Ti, and a Steam Deck, and the demo was downloaded by several hundred people, with no issues found by me or reported. The game actually runs very well on the Steam Deck.

Though it sounds like you are saying there was an underlying issue with the way it was implemented previously? If that's the case then obviously it needs to be addressed and fixed, but if something worked fine previously and now doesn't, that is a breaking change. If someone updates their released game to Godot 4.4 and all of a sudden some of their levels take several minutes to load instead of several seconds, that could be a problem.

Visually-wise a preprocess of 10 minutes should not be much different from a preprocess of 10 seconds. If it's very different, then you're also using other extreme values (like ridiculously long lifetime values), or there is a bug in the way the simulation is preprocessed.

Yeah, I also use long lifetime values. I use them for stars and nebulae that continuously move down the screen. The long preprocess is so the screen is already covered before the game starts. There may be better ways to do this, but it did the job, loaded quickly, has virtually no performance cost and is easy to manipulate.

@DarioSamo
Contributor

If someone updates their released game to Godot 4.4 and all of a sudden some of their levels take several minutes to load instead of several seconds, that could be a problem.

I'm not talking about the loading times from the CPU side; it's clear those need to be fixed (and should be fixed by the PR I made). I'm talking about the fact that issuing 600 * 30 (18,000) compute dispatches for every single particle system is likely to cause a GPU crash. It's also abusing the command buffer allocator a lot, overallocating far more for one single frame than it'll ever need. You may not have run into it yet, but it is not scalable. TDR defaults are extremely low, usually in the realm of 1-2 seconds.

@produno
Author

produno commented Dec 10, 2024

I'm not talking about the loading times from the CPU side; it's clear those need to be fixed (and should be fixed by the PR I made). I'm talking about the fact that issuing 600 * 30 (18,000) compute dispatches for every single particle system is likely to cause a GPU crash. It's also abusing the command buffer allocator a lot, overallocating far more for one single frame than it'll ever need. You may not have run into it yet, but it is not scalable. TDR defaults are extremely low, usually in the realm of 1-2 seconds.

Ok, understood. Thanks for explaining.
