
Setting preprocess to 600 on a GPUParticles2D causes it to take forever to load. (4.4 Dev 6) #100098

Closed
produno opened this issue Dec 6, 2024 · 21 comments · Fixed by #100378

Comments

@produno

produno commented Dec 6, 2024

Tested versions

Reproducible in Godot 4.4 Dev6
Works fine in Godot 4.4 Dev4

System information

Godot v4.4.dev6 - Windows 10.0.22631 - Single-window, 2 monitors - Vulkan (Forward+) - dedicated NVIDIA GeForce RTX 4090 (NVIDIA; 32.0.15.6614) - 13th Gen Intel(R) Core(TM) i7-13700K (24 threads)

Issue description

Setting the preprocess of a GPUParticles2D to 600 causes any scene using this node to take an extremely long time to load. This happens both in the editor and when running a game.

Steps to reproduce

Create a GPUParticles2D and set the preprocess to 600. Save, close Godot, and reopen the project on that scene. Alternatively, run the scene.

Minimal reproduction project (MRP)

particle-test.zip

Because the project opens on that scene, you will see that it takes a very long time to load.

@AThousandShips
Member

AThousandShips commented Dec 6, 2024

Not sure what time scale the preprocess runs at, i.e. how much faster than real time it runs.

Please clarify what "ages" means; it's very vague here.

This would be a preprocess of 10 minutes, so I'm unsure how long this would be expected to take in real time and what the potential regression would be.

@produno
Author

produno commented Dec 6, 2024

Not sure what time scale the preprocess runs at, i.e. how much faster than real time it runs.

Please clarify what "ages" means; it's very vague here.

This would be a preprocess of 10 minutes, so I'm unsure how long this would be expected to take in real time and what the potential regression would be.

Ages means way, way longer than it should, lol. You can check the MRP I added in Dev 4 and Dev 6; you will see a very noticeable difference in load times.
In Dev 4 it opens almost instantaneously. In Dev 6 it takes several minutes.

I use particles on my main menu, and my game takes 3-4 minutes to load that menu. If I disable particles, it takes around 10 seconds.

@AThousandShips
Member

Details on how much things slow down are important, as it might be hardware-specific. If I test this and it isn't very slow, then that is relevant information.

@AThousandShips AThousandShips moved this from Unassessed to Very Bad in 4.x Release Blockers Dec 6, 2024
@AThousandShips
Member

Can replicate, will bisect later today

@matheusmdx
Contributor

I bisected this issue to #99426, @DarioSamo

Image


After #99426, the scene not only takes more time to load (50-55 s) but also uses much more RAM (the CPU usage increase didn't come from #99426).

Before:

Image

After:

Image

@AThousandShips
Member

@matheusmdx please avoid bisecting when someone else is already doing this 🙃 I just spent about 40 minutes for nothing now because of this...

@matheusmdx
Contributor

@AThousandShips Oh sorry, I didn't see you were already on this. I started bisecting before you said you would, but I had to leave with 2 steps remaining, so I only just arrived home and finished the bisect. I didn't see your message.

@AThousandShips
Member

I'd suggest in the future commenting that you're working on it, like I did.

@DarioSamo
Contributor

The regression is mostly unrelated to the PR itself. The render graph is able to re-order operations because of the changes introduced in that PR. Sadly, that means the sorting time grew exponentially because it can now detect a lot more dependencies. It is pretty much spending its entire time just sorting the command graph.

The algorithm itself is designed to work with a much lower scope of nodes per frame, and it's basically blowing up because of the huge amount of commands generated by the preprocess. A brief look revealed there's 36k commands being sorted.

Bisecting into the PR won't do us much good here; it'll be best to either just avoid reordering in these scenarios or find some way to optimize the ordering process.

Most of the time was spent entirely just on the ::end() step.

@DarioSamo
Contributor

DarioSamo commented Dec 9, 2024

As a fun note, here's how big the adjacency list grows during the frame that particles get preprocessed.

Image

A linked list of 324 million elements.

@DarioSamo
Contributor

DarioSamo commented Dec 9, 2024

For some extra context, while I understand Godot offers this feature, the strain on the system will be completely unsustainable for anything that isn't a high-end GPU. While I agree there's a perceived regression, the expectation around fixing this will likely require some irregular handling for preprocess in particular (e.g. the graph has to get out of the way and fall back to a linear process).

Preprocess introduces a highly irregular workload and batches a ton of work into a single frame. You're running a really high risk of encountering a TDR (a Timeout Detection and Recovery reset, where the OS kills the GPU driver for taking too long) on a GPU that can't take this much work in a single frame.

While the current waiting time is pretty much entirely the CPU's fault on Godot's side, you're basically playing the lottery as to whether the driver will accept a giant command list like this without a DEVICE_LOST on the GPU afterwards, depending on the strength of the system.

@matheusmdx
Contributor

I think at least a note should be added to the GPUParticles2D docs (and I suppose GPUParticles3D will suffer from this too).

@darksylinc
Contributor

darksylinc commented Dec 9, 2024

I agree with Dario.

Preprocessing 10 minutes worth of particle simulation in a single frame is too much of an extreme value, and not the general case (optimizations are about optimizing for general cases).

While the CPU performance regression can be "fixed" by adding an exception so the render graph stays out of these commands, it is worth arguing whether a value of 600 should be valid input at all (or at least, not allowed by default).

A value of 600 will TDR slower GPUs (such as Intel iGPUs) due to the sheer amount of GPU work to process. This makes Godot look bad and unstable; whereas the problem is not within Godot.

35k dispatches in a row is way too much. Applications that need to do something like that will likely need to tweak the TdrDelay in the Registry Editor (which is a system-wide setting).

At most, the upper limit should be around 10 seconds' worth of simulation, not 600. If users want more, they'd have to flip a switch with a big fat "WARNING!!! Going beyond this value can cause crashes on slow systems".

@produno
Author

produno commented Dec 9, 2024

No one in their right mind is going to use a large preprocess the way it is now, so it may as well be removed. But this is quite a large breaking change for those that have already done so and wish to update, as a feature that was available in the engine is no longer viable.

@clayjohn
Member

clayjohn commented Dec 9, 2024

I think it would make sense to do the following:

  1. Add an option to disable either the graph, or the parts of the graph that are choking (likely just the dependency detection stuff from Dev 6).
  2. Place limits on preprocess so this doesn't bite other users (i.e. reduce the editor range of the slider and/or warn when hundreds of passes will be created). We can't blame users for dragging a slider all the way to the top and breaking things.
  3. Document the cost of preprocess.

edit: the default range for pre-process is 0-600 seconds:

ADD_PROPERTY(PropertyInfo(Variant::FLOAT, "preprocess", PROPERTY_HINT_RANGE, "0.00,600.0,0.01,exp,suffix:s"), "set_pre_process_time", "get_pre_process_time");

We could easily reduce this and add the "or_greater" option so users can use such a large value at their own risk.
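For illustration, the revised registration might look like the following (the 30-second cap is a hypothetical value chosen for this sketch; "or_greater" is the standard range-hint flag that keeps larger typed-in values valid while capping the slider):

```cpp
// Hypothetical sketch: cap the slider at 30 s, but keep "or_greater" so
// users can still type larger values at their own risk.
ADD_PROPERTY(PropertyInfo(Variant::FLOAT, "preprocess", PROPERTY_HINT_RANGE, "0.00,30.0,0.01,exp,or_greater,suffix:s"), "set_pre_process_time", "get_pre_process_time");
```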

@darksylinc
Contributor

No one in their right mind is going to use a large preprocess the way it is now, so it may as well be removed. But this is quite a large breaking change for those that have already done so and wish to update, as a feature that was available in the engine is no longer viable.

Visually, a preprocess of 10 minutes should not look much different from a preprocess of 10 seconds. If it's very different, then you're also using other extreme values (like ridiculously long lifetime values), or there is a bug in the way the simulation is preprocessed.

@DarioSamo
Contributor

But this is quite a large breaking change for those that have already done so and wish to update, as a feature that was available in the engine is no longer viable.

I can't say I agree. Even if the graph behaved exactly like in Dev 4, or didn't exist at all, the point remains: this use case runs the risk of blowing up the driver and hitting a TDR on a system that isn't using an RTX 4090. The engine shouldn't offer a feature in such a state; it has to employ a strategy to flush such a massive amount of work in sensible chunks.

Even calling swap_buffers() in the preprocess every second of preprocessing time would solve this instantaneously. I do not find any other approach viable to actually ship a game that needs to work on weaker hardware.

The graph should obviously avoid choking at a workload like this if possible as any system should, but it's very likely to run into problems elsewhere out of the scope of the engine.

@DarioSamo
Contributor

DarioSamo commented Dec 9, 2024

#100222 should hopefully address part of the issue as there was an incorrect detection that was exposed by the workload saturation caused by the particles preprocess.

The GPU-side of the issue remains unsolved as it is.

@produno
Author

produno commented Dec 9, 2024

I can't say I agree. Even if the graph behaved exactly like in Dev 4, or didn't exist at all, the point remains: this use case runs the risk of blowing up the driver and hitting a TDR on a system that isn't using an RTX 4090. The engine shouldn't offer a feature in such a state; it has to employ a strategy to flush such a massive amount of work in sensible chunks.

I'm not sure what you mean? I've run my game, which used a 600 preprocess, on my laptop, a PC with a 1080 Ti, and a Steam Deck, and the demo was downloaded by several hundred people, with no issues found by me or reported. The game actually runs very well on the Steam Deck.

Though it sounds like you are saying there was an underlying issue with the way it was implemented previously? If that's the case then obviously it needs to be addressed and fixed, but if something worked fine previously and now doesn't, that is a breaking change. If someone updates their released game to Godot 4.4 and all of a sudden some of their levels take several minutes to load instead of several seconds, that could be a problem.

Visually-wise a preprocess of 10 minutes should not be much different from a preprocess of 10 seconds. If it's very different, then you're also using other extreme values (like ridiculously long lifetime values), or there is a bug in the way the simulation is preprocessed.

Yeah, I also use long lifetime values. I use them for stars and nebulae that continuously move down the screen. The long preprocess is so the screen is already covered before the game starts. There may be better ways to do this, but it did the job, loaded quickly, has virtually no performance cost and is easy to manipulate.

@DarioSamo
Contributor

If someone updates their released game to Godot 4.4 and all of a sudden some of their levels take several minutes to load instead of several seconds, that could be a problem.

I'm not talking about the loading times from the CPU side; it's clear those need to be fixed (and should be fixed by the PR I made). I'm talking about the fact that issuing 600 * 30 (18,000) compute dispatches for every single particle system is likely to cause a GPU crash. It's also abusing the command buffer allocator a lot, overallocating far more for one single frame than it'll ever need. You may not have run into it yet, but it is not scalable. TDR defaults are extremely low, usually in the realm of 1-2 seconds.

@produno
Author

produno commented Dec 10, 2024

I'm not talking about the loading times from the CPU side; it's clear those need to be fixed (and should be fixed by the PR I made). I'm talking about the fact that issuing 600 * 30 (18,000) compute dispatches for every single particle system is likely to cause a GPU crash. It's also abusing the command buffer allocator a lot, overallocating far more for one single frame than it'll ever need. You may not have run into it yet, but it is not scalable. TDR defaults are extremely low, usually in the realm of 1-2 seconds.

Ok, understood. Thanks for explaining.
