Multithreading and Optimization Discussion #6
@TheCherno Hi, I don't know whether you already know, but you should definitely check out and play around with OpenMP (it would be a great topic for a future video too). Although my personal experience with it is very limited, it is widely used in the industry and it makes it easy to parallelize (and also vectorize) code using just one or more #pragma directives. It's really surprising how magical and simple this API is, considering how verbose things in C++ usually are compared to other languages. |
@TheCherno I checked the newest commit out and tried to run it under x64 Linux, using clang as the compiler. |
After enabling OpenMP in Project properties > C/C++ > Language, I added the pragma above the outer loop:

```cpp
#pragma omp parallel for
for (int y = 0; y < m_FinalImage->GetHeight(); y++)
{
	for (uint32_t x = 0; x < m_FinalImage->GetWidth(); x++)
	{
		glm::vec4 color = PerPixel(x, y);
		m_AccumulationData[x + y * m_FinalImage->GetWidth()] += color;
		...
	}
}
```

This gets me from 65ms single threaded to 35ms. However, the rows do not all take the same time to render. The difference comes from the default static schedule, which hands each thread one equal-sized block of rows; switching to a dynamic schedule balances the load better:

```cpp
#pragma omp parallel for schedule(dynamic, 1)
for (int y = 0; y < m_FinalImage->GetHeight(); y++)
{
	for (uint32_t x = 0; x < m_FinalImage->GetWidth(); x++)
	{
		glm::vec4 color = PerPixel(x, y);
		m_AccumulationData[x + y * m_FinalImage->GetWidth()] += color;
		...
	}
}
```

Code is in my fork at https://github.com/cvb941/RayTracing |
Speedup from 65ms to 28ms isn't very much on an 8-core machine. The profiler shows a lot of time being spent in the random number generator. Just using a simple rand()-based Random::Float() instead removes that hotspot:

```cpp
static float Float()
{
	return static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
	// return (float)s_Distribution(s_RandomEngine) / (float)std::numeric_limits<uint32_t>::max();
}
``` |
i mean, do i need to say anything? :D was pretty confused about the 8x increase in physical core count only getting a 2x speedup. (i'm getting around 3x on my i7-8750h -- 180ms -> 65ms -- both of which seem high, but that's beside the point) for this kind of simple parallelization of a compute heavy task, i'd expect a speedup close to the core count. but yeah, get some thread local RNGs going :D |
For fast non-cryptographically-secure PRNGs I usually go with the xoshiro family of generators, since they are robust enough for my applications and very fast. I've often implemented one myself, but a wrapper like Xoshiro-cpp seems quite practical. |
Adding the thread_local keyword to s_RandomEngine and s_Distribution in Walnut::Random in both the header and cpp files significantly improved the performance on my machine. The last render time went from 25-26 ms to 8-9 ms on my laptop with an AMD 5800H. I tried to push a new branch so I could make a pull request in Walnut, but I didn't have permission. Update: @leddoo already covered it. |
With this code I get around 6-7 ms on my i5 12400F; I also use thread_local and xoshiro128** for the random class.
The xoshiro128** is around 1ms faster than std::mt19937, and running the threads with std::async is around 0.5ms faster than std::for_each. |
I tried this and it worked quite well. My render times were consistently around 5-6ms, and CPU usage was well below 100%, closer to 10%. It seems that once RNG is thread local, the bottleneck is no longer the CPU, though I don't know how to verify that or figure out what the new bottleneck is. |
@Rilazy The new "bottleneck" is not a bottleneck - it's vsync, enabled by default in the ImGui renderer. So after your CPU finishes calculating the new frame image, it waits for the GPU vsync event. To test that you can increase the viewport size (on my CPU the current scene renders slower than 60fps in a 4k viewport), or disable vsync in ImGui by uncommenting the relevant define. |
Interesting, thanks! |
So what are you guys using for profiling? I set up Tracy, but it's not really usable due to the large number of threads being created and destroyed with the current architecture. |
I've just been using Visual Studio's built-in profiler when I need proper profiling, but mostly I'm looking at the frame times and seeing what they hover around. |
Alright, it's done here. Added a thread pool and queue for the rendering jobs. Every line of the image is submitted to the queue individually. Here is how it looks in Tracy: performance is worse than the std::for_each method though. I was expecting some improvement by removing the overhead of thread creation and destruction. |
Have you implemented the optimization making random number generation thread local? If not, it's possible your thread pool just isn't handling synchronization as well as the std::for_each method. |
@soerenfi Disregard my previous comment for now; it looks like it's the same thing I ran into with vsync, since your framerate looks to be locked at pretty much exactly 60fps. What's happening is that ImGui is set up by default to wait for the next time your monitor refreshes the image to start drawing the new one. LeoMash explained it here and ways to get around it when I ran into the same thing. |
@Rilazy yeah, I saw that. For now I just compared CPU load between the approaches and varied the amount of threads in the pool. Profiler shows that the workers are idling some of the time so that's probably the difference. Did not expect to write a better scheduler than the OS's in one evening. No point in optimising this further since the ray tracing algorithm itself is so inefficient right now. We could do the PerPixel() intersection calculations on the GPU using CUDA though, just as an exercise.. |
Another option for moving work to the GPU would be the Vulkan ray tracing pipeline itself. I'm not super familiar with GPU programming, but reading up on the Vulkan API, it looks like we could still use implicit functions for our spheres by passing them as enclosing AABBs, then doing the more expensive intersection testing in the intersection shader; from there, I think all of our existing code could be ported pretty much directly to shader code. I'd have tested it myself already if I were familiar enough with graphics programming to know how to set up Vulkan to draw to the framebuffer of our viewport - I've only ever done the "Draw a triangle" Vulkan tutorial. I've been reading the docs, but there are a lot of docs. Of course, that all depends on a graphics driver that supports the Vulkan ray tracing extensions. |
I believe that is what @TheCherno is aiming to do, and if so, I am looking forward to it very much. Since Vulkan takes care of building BVHs and (HW-)accelerating ray-triangle intersections, there is not much point in spending time implementing them on the CPU (still worth understanding them though). Meanwhile you can check out the NVIDIA guide to VK ray tracing (uses NVIDIA HW-specific support libraries) and this amazing implementation of Peter Shirley's Ray Tracing in One Weekend using Vulkan ray tracing. |
Neat! I'll have to check that out. |
I'll also add this more simplified path tracing example that I'm about to dive into. |
I stumbled onto your video, and I think there is a better way to use iterators for std::for_each.
And you can use it this way: |
SHISHUA is faster than xoshiro256+x8. |
Has anyone tried Intel oneAPI TBB for the thread parallelization? |
Initializing a vector of random unit normals gave me another 30-40% performance improvement on top of making the random number generator thread_local, as @TheCherno described in his corresponding YouTube episode. |
OK, here is what I have done later on: I added a timer that measures just the code that renders the image (basically a timer surrounding just the code inside #define MT) and takes the average render time (accumulate the timer and divide by the total number of frames so far). By caching the normals and measuring the performance this way, my render time dropped from about 30ms to 24ms on average. There are about 1 million cached random directions, so the image quality more or less remains the same (at least visually). |
Regarding different threading and parallelization methods: I think we should first measure the performance of the scheduled rendering code inside each thread, and then the overall threading method together with the rendering code. The difference will tell us how much room there is for improvement.

For this I have done the following: leaving the code as it is, I measured the time it takes for rendering only (basically, the code inside the lambda function) and associated this measured time with its corresponding thread. A natural way to do this would be an associative container (std::map or std::unordered_map), but I didn't want to add extra overhead by including a lock for that map as well. Instead I used a preallocated array of float values to keep track of the total rendering time inside each thread. This eliminates the overhead of locking & unlocking a map, so it has minimal impact on the measurements.

Now why am I talking about threads here? @TheCherno uses for_each(std::execution::par, ...) anyway, right? The std implementation must be using some kind of thread pool in the background. I tried this in a separate project: I submitted a bunch of trivial jobs and looked at the unique thread IDs. It turns out that the number of actual worker threads is less than the number of submitted jobs, and if some jobs take more time than others, the number of threads tends to increase.

To be honest, first I implemented my own thread pool and quickly realized there are lots of improvements that could be made to it. Then I wondered if it was worth it for this particular rendering problem (it is of course always a good idea to think about potential improvements regardless of the use case; I am just talking in the context of this problem). That's when and how I thought of a method for finding out how much room there is for improvement.
I would love to hear your thoughts about this, and let me know if something is not clear. |
@mmerabet42 Hi Mohammed, I saw your post on this page. It works for for_each() with 2 arguments, but when I tried to use it with std::execution::par, it failed to compile. Would you happen to know how to solve it? |
I'm not sure, but I think it probably has to do with it being a forward iterator as opposed to a random-access iterator. Because the iterator doesn't provide random access, it must be iterated sequentially, since knowing the index of each pixel/column technically requires knowing the index of the previous one, as computed by a mutable, non-shareable object. |
Hi @Rilazy, |
another idea: why not sample the ray just over the hemisphere in tangent space? This has two benefits: first, we will use only two random samples (in spherical coordinates, two angles are sufficient to determine a unit direction vector). Also, since we are sampling over the hemisphere looking outwards, all of the rays will go out into the scene, while in @TheCherno's implementation all directions are sampled over the whole unit sphere, so on average half of the bouncing rays will point back into the surface from which they emanate and won't contribute to the final result. To measure the effectiveness of this, we should define a performance metric that measures how much time it takes for the final image to settle to its 'equilibrium' state. I am putting equilibrium in quotation marks because there is no true equilibrium reached in a finite amount of time, but a visually pleasing one should be sufficient. What do you think? |
You could define this equilibrium state to be when the overall (absolute) change of pixel values between consecutive accumulated frames drops below some threshold. From there you can measure either the number of frames or the wall-clock time needed to reach it. |
The first metric favors implementations that converge to their eigenstate quickly, while the second favors implementations that avoid overhead and thus offer high render performance. Incidentally, implementations good at 1 also have an advantage in 2, as fewer iterations have to be computed. A third approach would divide the time taken to reach equilibrium by the number of frames needed, resulting in a time-per-frame. While high single-frame performance might be of interest, this is by far the least interesting aspect, as bad-looking results at high performance is like watching static being generated at random in response to a task to generate the Mona Lisa. That said, maybe a quick note re optimizations: one should also take into account the specifics of the hardware things run on. When working with large blocks of memory it might be good practice to provide the CPU with some information for prefetching, as well as, e.g., timing memory accesses to avoid DRAM refreshes, or working from cache while memory write-back could present a bottleneck. Have a look at CPU and GPU hardware effects for other types of things that might be relevant for structuring your workloads. HTH. |
Hi @BenBE, thank you. I had similar things in mind. I will post here once I have initial implementations. Also, it looks like most of the discussion is in the Discord channel, right? |
I'm not in the Discord channel, so I cannot say much about the discussion there. If there are any important new findings, I'd appreciate hearing of them here. |
Btw, you can view my fork at the link below (let me know if there are any problems): |
Hi everyone. I have modified the code to schedule the tasks at different granularity levels. At the beginning of the Renderer.cpp file there are some preprocessor switches for selecting the scheduling method. Regarding the timers: the overall average rendering time and the time it takes to execute the code inside each worker thread are measured individually and printed to the console at 1-second intervals. This is to give an idea of how much of the total time goes to the actual rendering versus the threading overhead.
I would be glad if you could try it and share your results and thoughts. |
Hi again, I made a further improvement by eliminating whole BUNDLES (BEAMS) of rays by testing them against each sphere in the scene. There is a new preprocessor switch called 'USE_TILE_BEAM_INTERSECTION_TEST'; define it if you want to use it. However, it can only be used with tiled rendering for now. Note that I have deliberately used a red color to show the parts of the screen where the intersection test fails and just the sky color should be drawn. I am aware of it, so don't kill me for it please 😅 |
Has anyone taken a look at https://learn.microsoft.com/en-us/windows/win32/dxmath/directxmath-portal to see if that's faster? |
You mean in Cherno's code? |
> performance is worse than the std::for_each method though. I was expecting some improvement by removing the overhead of thread creation and destruction. I tried the same thing expecting a speedup for the same reason and didn't get it either. |
@mgradwohl I added code to measure the time for each separate thread as well as the total time, to find out how much overhead there is (although I have to admit there are some issues with it). I still implemented a thread pool (as an exercise in itself), but again there was not much improvement. It looks like after a certain point, further performance improvements come mostly from algorithmic improvements. |
Yes, I found that caching values, adding const, moving things out of loops when they didn't need to be there, changing compiler settings, etc. helped. I think one big one was defining GLM_FORCE_INTRINSICS. |
@mgradwohl cool. I don't think I tried compiler settings or glm intrinsics. Which settings did you use? Could you share those, if that's OK with you? |
For release/distro I am using the preprocessor defines from the project file linked below. Remember, I have ported this to use WinUI/WinAppSDK and the image is on a Win2D canvas, so that's what the LEFT_HANDED and FORCE_DEPTH_ZERO_TO_ONE defines are for. For C++ code generation I have Advanced Vector Extensions 2 (x86/x64) (/arch:AVX2) for my machines. I haven't tried pushing it higher (not sure there's any benefit). I also enabled full link-time code generation, string pooling, /O2, /Ot, /GT, /GL, and I'm going to try /Ob2. You should be able to see it all here: https://github.com/mgradwohl/Butternut/blob/main/Butternut.vcxproj |
Just an update. I tried two other random number generators Xoshiro256++ and std::ranlux48 and Xoshiro wins. I am at 10 bounces, 60fps, and my CPU is not pegged at all. All three are in my repo and easy to try out (look for // PERF TEST comments). The PRNG that is in the original code really hits the CPU. |
Given the massively parallel nature of the task, I would expect that pushing to the widest possible SIMD constructs available would always have some benefit. Am I wrong to think that AVX512 would likely go even faster? |
I'll test again, my PC at work is only AVX2. |
I tried AVX2 vs. AVX512 and saw no difference. However, this may just mean the compiler isn't emitting code that actually benefits from the wider vector registers. |
I'm also curious about this |
Hey all! For those following along with the series on YouTube, I hope you've been enjoying it thus far! In the latest episode (Ep 11), we introduced multithreading into our code base by using std::for_each with the parallel execution policy. I mentioned that if the community has any other suggestions, and wants to do some testing to see if we can multithread in a more efficient way, I'd open a GitHub issue for this discussion - and here we are!

I figured crowd-sourcing this would be a good idea since your mileage may vary - certain techniques might be better or worse on certain hardware and architectures. Feel free to fork this repository and implement something, and then drop it in a comment here so we can test. If your method is faster than our std::for_each method, make sure to include some profiling data and your hardware specifications.

Thanks all! ❤️