Multithreading and Optimization Discussion #6

TheCherno · 2022-12-10T10:29:27Z

Hey all! For those following along with the series on YouTube, I hope you've been enjoying it thus far! In the latest episode (Ep 11), we introduced multithreading into our code base by using std::for_each with the parallel execution policy. I mentioned that if the community has any other suggestions, and wants to do some testing to see if we can multithread in a more efficient way, I'd open a GitHub issue for this discussion - and here we are!

I figured crowd-sourcing this would be a good idea since your mileage may vary - certain techniques might be better or worse on certain hardware and architectures. Feel free to fork this repository and implement something, and then drop it in a comment here so we can test. If your method is faster than our std::for_each method, make sure to include some profiling data and your hardware specifications.

Thanks all! ❤️

Cherno

The text was updated successfully, but these errors were encountered:

cvb941 · 2022-12-10T11:34:42Z

@TheCherno Hi, I don't know whether you already know, but you should definitely checkout and play around with OpenMP (it would be a great topic for your future video too).

Although my personal experience with it is very limited, it is widely used in the industry and it makes it easy to parallelize (and also vectorize) code using just one or more #pragmas above for loops and other parts of code.

It's really surprising how magical and simple this API is, considering how verbose usually things in C++ are compared to other languages.

scorpion-26 · 2022-12-10T11:52:58Z

@TheCherno I checked the newest commit out and tried to run it under x64 Linux and using clang as the compiler.
Aside from some difficulties compiling under Linux, std::for_each() by default only runs on 1 thread. Apparently, for multithreading to work with libstdc++, one needs to install Intel TBB and link to it with -ltbb (Docs). With tbb installed and linked, it now runs fairly slowly(Ryzen 7 3800x, default camera angle, 1600x900-> 40-50ms). CPU usage hovers around 90%, so not perfect.
I'll try to implement a manual Thread-Pool style approach to see if that is any better.

cvb941 · 2022-12-10T12:49:38Z

@TheCherno Hi, I don't know whether you already know, but you should definitely checkout and play around with OpenMP (it would be a great topic for your future video too).

Although my personal experience with it is very limited, it is widely used in the industry and it makes it easy to parallelize (and also vectorize) code using just one or more #pragmas above for loops and other parts of code.

It's really surprising how magical and simple this API is, considering how verbose usually things in C++ are compared to other languages.

After enabling OpenMP in Project properties > C/C++ > Language, I have added the #pragma omp parallel for clause to the outer loop:

#pragma omp parallel for
for (int y = 0; y < m_FinalImage->GetHeight(); y++)
{
    for (uint32_t x = 0; x < m_FinalImage->GetWidth(); x++)
    {
        glm::vec4 color = PerPixel(x, y);
        m_AccumulationData[x + y * m_FinalImage->GetWidth()] += color;
        ...
    }
}

This gets me from 65ms single threaded to 35ms. However, the for_each implementation runs even lower at 28ms.

The difference comes from the default static schedule which splits the for loop into equal chunks for each CPU thread.
This makes the threads that render the sky finish faster than the ones rendering the ball.
Adding schedule(dynamic, 1) should distribute the work row-by-row to the threads, making it behave similarly to for_each and
speeding up the rendering down to 28ms.

#pragma omp parallel for schedule(dynamic, 1)
for (int y = 0; y < m_FinalImage->GetHeight(); y++)
{
    for (uint32_t x = 0; x < m_FinalImage->GetWidth(); x++)
    {
        glm::vec4 color = PerPixel(x, y);
        m_AccumulationData[x + y * m_FinalImage->GetWidth()] += color;
        ...
    }
}

Code is in my fork at https://github.com/cvb941/RayTracing

cvb941 · 2022-12-10T14:03:12Z

Speedup from 65ms to 28ms isn't very much on an 8-core machine. The profiler shows a lot of time being spent in the Random::Vec3 function, where the random number generator probably synchronizes the threads, causing a bottleneck there.

Just using a simple rand() in the Rand::Float() function reduces the frame time from 28ms to 10ms. Theoretically, this can be reduced further with a faster random (returning a constant achieves 7ms).

static float Float()
{
    return static_cast <float> (rand()) / static_cast <float> (RAND_MAX);
    // return (float)s_Distribution(s_RandomEngine) / (float)std::numeric_limits<uint32_t>::max();
}

leddoo · 2022-12-10T14:06:00Z

i mean, do i need to say anything? :D

was pretty confused about the 8x increase in physical core count only getting a 2x speedup. (i'm getting around 3x on my i7-8750h -- 180ms -> 65ms -- which both seems high, but that's beside the point)

for this kind of simple parallelization of a compute heavy task, i'd expect a speedup of 90% * #physical-cores. (hyperthreading doesn't tend to help much, when you're compute bound. and there are always some losses, hence the 90%)

but yeah, get some thread local RNGs going :D

HildingLinden · 2022-12-10T14:53:04Z

For fast non-cryptographically-secure PRNGs I usually go with the xoshiro family of generators since they are robust enough for my applications and very fast. I've often implemented one myself but a wrapper like Xoshiro-cpp seem quite practical.

athieman · 2022-12-10T16:38:25Z

Adding the thread_local keyword to s_RandomEngine and s_Distribution in Walnut::Random in both the header and cpp files significantly improved the performance on my machine. The last render time went from 25-26 ms to 8-9 ms on my laptop with an AMD 5800H. I tried to push a new branch so I could make a pull request in Walnut, but I didn't have permission.

Update: @leddoo already covered it.

xLoge · 2022-12-11T15:36:52Z

With this code i get arround 6-7 ms on my i5 12400F, i also use the 'thread_local' and 'xoshiro128**' for the random class

std::vector<std::future<void>> threads;
threads.reserve(m_ImageVerticalIter.size());

	for (const uint32_t& y : m_ImageVerticalIter)
	{
		auto async_thread = [this, y]() -> void
		{
			for (const uint32_t& x : m_ImageHorizontalIter)
			{
				glm::vec4 pixel = PerPixel(x, y);
				m_AccumulationData[x + y * m_FinalImage->GetWidth()] += pixel;

				glm::vec4 accumulatedColor = (m_AccumulationData[x + y * m_FinalImage->GetWidth()]) / (float)m_FrameIndex;

				accumulatedColor = glm::clamp(accumulatedColor, glm::vec4(0.0f), glm::vec4(1.0f));

				m_ImageData[x + y * m_FinalImage->GetWidth()] = Utils::ConvertToRGBA(accumulatedColor);
			}
		};

		threads.emplace_back(std::async(std::launch::async, async_thread));
	};

#define MT 0
#if MT
	std::for_each(std::execution::par, threads.begin(), threads.end(), [](const std::future<void>& thread) {
		thread.wait();
	});
#else
	for (const std::future<void>& thread : threads)
		thread.wait();
#endif

The xoshiro128** is arround 1ms faster than std::mt19937 and running the threads with std::async is around 0.5ms faster than std::for_each.
https://github.com/xLoge/RayTracing

LilyIsTrans · 2022-12-16T08:40:16Z

Adding the thread_local keyword to s_RandomEngine and s_Distribution in Walnut::Random in both the header and cpp files significantly improved the performance on my machine. The last render time went from 25-26 ms to 8-9 ms on my laptop with an AMD 5800H. I tried to push a new branch so I could make a pull request in Walnut, but I didn't have permission.

I tried this and it worked quite well. My render times were consistently around 5-6ms, and CPU usage was well below 100%, closer to 10%. It seems that once RNG is thread local, the bottleneck is no longer the CPU, though I don't know how to verify that or figure out what the new bottleneck is.

LeoMash · 2022-12-24T00:17:59Z

@Rilazy The new "bottleneck" is not a bottleneck - it's a vsync enabled by default in the Imgui renderer. So after your CPU finishes calculation for new frame image, it waits for gpu vsync event. To test that you can increase viewport size (on my CPU current scene renders slower than 60fps in 4k viewport), or disable vsync in Imgui by uncommenting define IMGUI_UNLIMITED_FRAME_RATE in Application.cpp.

LilyIsTrans · 2022-12-24T01:20:19Z

Interesting, thanks!

soerenfi · 2022-12-29T19:03:53Z

So what are you guys using for profiling? I set up Tracy but It's not really usable due to the large amount of threads being created and destroyed with the current architecture.
I am going to look into creating a threadpool with number of threads according to available cores. Does that make sense?

LilyIsTrans · 2022-12-29T21:24:58Z

I've just been using Visual Studio's built in profiler when I need proper profiling, but mostly I'm looking at the frame times and seeing what they hover around

soerenfi · 2022-12-29T23:33:24Z

Alright, it's done here. Added a thread pool and queue for the rendering jobs. Every Line of the image is submitted to the queue individually. Here is how it looks in Tracy:

performance is worse than the std::for_each method though. I was expecting some improvement by removing the overhead of thread creation and destruction.

LilyIsTrans · 2022-12-29T23:34:59Z

Have you implemented the optimization making random number generation thread local? If not, it's possible your thread pool just isn't handling synchronization as well as the std::foreach method

LilyIsTrans · 2022-12-30T01:39:26Z

@soerenfi Disregard my previous comment for now; it looks like it's the same thing I ran into with vsync, since your framerate looks to be locked at pretty much exactly 60fps. What's happening is that ImGui is set up by default to wait for the next time your monitor refreshes the image to start drawing the new one. LeoMash explained it here and ways to get around it when I ran into the same thing.

soerenfi · 2022-12-30T09:29:48Z

@Rilazy yeah, I saw that. For now I just compared CPU load between the approaches and varied the amount of threads in the pool.

Profiler shows that the workers are idling some of the time so that's probably the difference. Did not expect to write a better scheduler than the OS's in one evening.

No point in optimising this further since the ray tracing algorithm itself is so inefficient right now. We could do the PerPixel() intersection calculations on the GPU using CUDA though, just as an exercise..

LilyIsTrans · 2022-12-30T21:50:05Z

Another option for moving work to the GPU would be the Vulkan ray tracing pipeline itself. I'm not super familiar with GPU programming, but reading up on the Vulkan API it looks like we could be allowed to still use implicit functions for our spheres by passing them as AABBs that enclose them, then doing the more expensive intersection testing in the intersection shader, and from there I think all of our existing code is valid to port pretty much directly to shader code.

I'd have tested it myself already if it weren't for the fact that I'm not familiar enough with graphics programming to have the first clue on how to actually set up Vulkan to draw to the framebuffer of our viewport, I've only ever done the "Draw a triangle" Vulkan tutorial. I've been reading the docs, but there are a lot of docs.

Of course, that all does depend on a graphics driver that supports the Vulkan Ray Tracing Extensions.

soerenfi · 2022-12-30T22:27:33Z

I believe that is what @TheCherno is aiming to do, and if so, I am looking forward to it very much. Since Vulkan takes care of building BVHs and (HW-)accelerating ray-triangle intersections, there is not much point in spending time to implement them on the CPU (still worth understanding them though).

Meanwhile you can check out the NVIDIA guide to VK raytracing (uses NVIDIA HW specific support libraries and this amazing implementation of Peter Shirley's Raytracing in one Weekend using Vulkan Raytracing.

LilyIsTrans · 2022-12-30T22:34:22Z

Neat! I'll have to check that out.

soerenfi · 2022-12-31T00:23:34Z

I'll also add this more simplified path tracing example that I'm about to dive into.

mmerabet42 · 2023-01-11T16:10:19Z

I stumbled into your video, and i think there is a better way to use iterators for std::for_each.
Basically you could make your own iterator with all the appropriate operators (following the LegacyForwardIt requirements) which simply increments an integer. This way you're not intializing a whole vector with N elements going from X to Y. But instead you'll have two iterators, one representing the moving index and a second one representing the end iterator, and thanks to the != operator of the iterator, std::for_each knows when to stop.

template <typename T>
class ForEachIterator
{
public:
  ForEachIterator(const T &p_value) : _value(p_value) {}

  ForEachIterator<T> &operator++()
  {
    ++this->_value;
    return *this;
  }

  T &operator*()
  {
    return this->_value;
  }

  bool operator!=(const ForEachIterator<T> &p_it)
  {
    return this->_value != p_it._value;
  }

private:
  T _value;
};

And you can you use it this way:

std::for_each(ForEachIterator<uint32_t>(0), ForEachIterator<uint32_t>(width), [](const uint32_t &i) {
  ...
});

WhoseTheNerd · 2023-01-20T13:44:04Z

With this code i get arround 6-7 ms on my i5 12400F, i also use the 'thread_local' and 'xoshiro128**' for the random class

std::vector<std::future<void>> threads;
threads.reserve(m_ImageVerticalIter.size());

	for (const uint32_t& y : m_ImageVerticalIter)
	{
		auto async_thread = [this, y]() -> void
		{
			for (const uint32_t& x : m_ImageHorizontalIter)
			{
				glm::vec4 pixel = PerPixel(x, y);
				m_AccumulationData[x + y * m_FinalImage->GetWidth()] += pixel;

				glm::vec4 accumulatedColor = (m_AccumulationData[x + y * m_FinalImage->GetWidth()]) / (float)m_FrameIndex;

				accumulatedColor = glm::clamp(accumulatedColor, glm::vec4(0.0f), glm::vec4(1.0f));

				m_ImageData[x + y * m_FinalImage->GetWidth()] = Utils::ConvertToRGBA(accumulatedColor);
			}
		};

		threads.emplace_back(std::async(std::launch::async, async_thread));
	};

#define MT 0
#if MT
	std::for_each(std::execution::par, threads.begin(), threads.end(), [](const std::future<void>& thread) {
		thread.wait();
	});
#else
	for (const std::future<void>& thread : threads)
		thread.wait();
#endif

The xoshiro128** is arround 1ms faster than std::mt19937 and running the threads with std::async is around 0.5ms faster than std::for_each. https://github.com/xLoge/RayTracing

SHISHUA is faster than xoshiro256+x8.

Paulo-D2000 · 2023-01-21T06:47:25Z

Has anyone tried Intel OneAPI TBB for the threads parallelization?

denizdiktas · 2023-01-31T11:03:30Z

Initializing a vector of random unit normals gave me another 30-40% performance improvement on top of making the random number generator thread_local , as @TheCherno described in his corresponding YouTube episode.

denizdiktas · 2023-02-03T07:18:31Z

ok here is what I have done later on: I have added a timer that just measures the time it takes for the code that renders the image (basically a timer surrounding just the code with #define MT) by taking the average render time (accumulate the timer and divide by the number of total frames so far). By caching the normals and measuring the performance this way my render time dropped from about 30ms to 24ms on average. there are about 1 million cached random directions, so the image quality more or less remains the same (at least visually).

denizdiktas · 2023-02-06T07:00:33Z

Regarding different threading and parallelization methods: I think we should first measure the performance of the scheduled rendering code inside the thread and the overall threading method together with the rendering code. The difference will then tell us how much of a room there is for improvement. For this I have done the following: leaving the code as it is, I have measured the time it takes for rendering only (basically, the code inside the lambda function) and associated this measured time with its corresponding thread: now to do this a natural way would be to use an associative container (std::map or std::unordered_map) but I didn't want to add an extra overhead by including a lock for this map as well, instead I used a preallocated array of float values to keep track of the total rendering time inside each thread. This eliminated the overhead of locking & unlocking a map, so it has minimal impact on the measurements.

Now why am I talking about threads here? @TheCherno uses for_each(std::execution::par,..) anyway, right? The std implementation must be using some kind of a thread-pool in the background, I have tried this in a separate project: I have submitted a bunch of trivial jobs and looked at unique thread-ids. It turns out that the number of acutal worker-threads is less than the number of submitted jobs, and if some jobs take more time than others, the number of threads tends to increase.
So applying this method to the ray-tracing code, I have found out that there are about 8-9 threads (I have 8 cores) and there is very little difference between the time it takes just for rendering and the overall time for rendering + scheduling. Of course the average time in each thread is going to vary, but in my case the variation was like 1ms in 22ms. Also the difference between the thread rendering times and the overall rendering time was like about 1-2 ms (overall was about 23-24ms). So I am not sure if this is a significant overhead when using for_each(std::execution::par).

To be honest, first I implemented my own thread-pool and quickly realized that there are lots of improvements that can be done on it. Then I wondered if it was worth it for this particular rendering problem (it is still always a good idea to think about the potential improvements regardless of the use case of course, I am aware of this. I am just talking in the context of this problem). That's when and how I thought of a method for finding out how much room there is for improvement. I would love to hear your thoughts about this and let me know if something is not clear.

denizdiktas · 2023-02-07T13:33:37Z

@mmerabet42 Hi Mohammed, I saw your post on this page. It works for for_each() with 2 arguments but when I tried to use it with std::execution::par, it failed to compile. Would you happen to know how to solve it?

LilyIsTrans · 2023-02-13T18:40:31Z

@mmerabet42 Hi Mohammed, I saw your post on this page. It works for for_each() with 2 arguments but when I tried to use it with std::execution::par, it failed to compile. Would you happen to know how to solve it?

I'm not sure, but I think it probably has to do with it being a Forward iterator as opposed to a Random Access iterator. Because the iterator doesn't provide us with Random Access, it must be iterated sequentially, since knowing the index of each pixel/column technically requires knowing the index of the previous, as computed by a mutable, non shareable object

denizdiktas · 2023-02-14T04:29:01Z

Hi @Rilazy,
Ok thanks, I'll look into it. I was trying to do a tiled-rendering version and didn't want to change the usage of for_each(execution::par). So I had to implement my own tile-iterators, otherwise I have to define a long list of tiles and then for_each() needs to touch each memory location, which seems inefficient. It is also a good exercise to implement such an iterator compatible with for_each, just for its own sake 🙂

denizdiktas · 2023-02-15T19:43:54Z

another idea: why not sample the ray just over the hemi-sphere in the tangent-space? This has 2 benefits: first we will use only two random samples (in spherical coordinates just two angles are sufficient to determine a unit direction vector). Also since we are sampling over the hemi-sphere looking outwards, all of the rays will go out into the scene, while in @TheCherno implementation all directions are sampled over the whole unit sphere, so on average half of the bouncing rays will hit the surface from which they emanate and won't contribute to the final result. To measure the effectiveness of this, we should define a performance metric that measures how much time it takes for the final image to settle to its 'equilibrium' state. I am saying equilibrium in quotation marks, because there is no true equilibrium in the sense we reach in finite amount of time, but a visually pleasing one should be sufficient. What do you think?

BenBE · 2023-02-15T21:13:26Z

You could define this equilibrium state to be when the overall (absolute) change of pixel values ((old-new)^2)is pushed below a certain threshold. The metric would then be two-fold:

How many iterations does it take for the threshold to be reached
How much (absolute) time does it take to render that many frames

The first metric favors implementations that converge to their eigenstate quickly, while the second metric favors implementations that avoid overhead in the implementation and thus offer high render performance. Incidently, implementations good at 1 have an advantage in 2 also as fewer iterations have to be computed.

A third approach would divide the time taken to reach equilibrium by the number of frames needed, resulting in a time-per-frame. While high single-frame performance might be of interest,this is by far the least interesting aspect, as bad-looking results at high performance is like watching static being generated at random in response to a task to generate the Mona Lisa.

That said, maybe a quick note re optimizations: One should also take into account the specifics of the hardware things run on. When working with large blocks of memory it might be good practice, to provide the CPU with some information for prefetching, as well as e.g. timing memory accesses in a way to avoid DRAM refreshes or working from cache while memory write-back can present a bottleneck. Have a look at CPU and GPU hardware effects for other types of things that might be relevant for structuring your workloads. HTH.

denizdiktas · 2023-02-20T12:01:45Z

Hi @BenBE, thank you. I had similar things in mind. I will post here once have initial implemetations. Also it looks like most of the discussions are in the Discord channel, right?

BenBE · 2023-02-20T12:04:48Z

I'm not in the discord channel, thus cannot say too much about the discussion there. So, if there are any important new findings I'd appreciate hearing of them here.

denizdiktas · 2023-02-23T07:28:47Z

Btw, you can view my fork at the link below (let me know if there are any problems):
https://github.com/denizdiktas/RayTracing/

denizdiktas · 2023-03-03T07:41:35Z

Hi everyone. I have modified the code to schedule the tasks at different granularity levels. At the beginning of the Rendere.cpp file there are some preprocessor switches that begin with MT_TASK_GRANULARITY_
Comment in the one you want to activate and comment out the others. On my system
'MT_TASK_GRANULARITY_ ROW' seems to be the best performer.

Regarding the timers: The overall average rendering time and the time it takes to execute the code inside each worker-thread are measured individually and printed to the console at 1sec intervals. This is to give an idea on the following

How much variation there is among the code run in worker-threads
The overhead for scheduling the tasks

I would be glad if you could try it and share your results and thoughts.
You can checkout the code at the following link:
https://github.com/denizdiktas/RayTracing/

denizdiktas · 2023-03-03T09:03:23Z

Hi again, I made further improvement by eliminating whole BUNDLES (BEAMS) of rays by testing them against each sphere in the scene. There is a new preprocessor called 'USE_TILE_BEAM_INTERSECTION_TEST'. Comment it in if you want to use it. However, it can only be used with tiled-rendering for now. Note that I have delibarately used red color to show the parts on the screen where the intersection test fails and just the sky-color should be printed, I am aware of it, so don't kill me for it please 😅

mgradwohl · 2023-03-13T06:12:30Z

Has anyone taken a look at https://learn.microsoft.com/en-us/windows/win32/dxmath/directxmath-portal to see if that's faster?

nickphilomath · 2023-03-15T20:01:23Z

i learned a lot with these series. i wanna thank TheCherno. you'r the best :)
also i learned to render triangles with this program
look how beautiful they are:

denizdiktas · 2023-03-18T10:36:28Z

Has anyone taken a look at https://learn.microsoft.com/en-us/windows/win32/dxmath/directxmath-portal to see if that's faster?

You mean in Cherno's code?

mgradwohl · 2023-03-27T05:24:11Z

Alright, it's done [here]

> performance is worse than the std::for_each method though. I was expecting some improvement by removing the overhead of thread creation and destruction.

I tried the same thing expecting a speedup for the same reason and didn't get it either.

denizdiktas · 2023-03-27T05:57:34Z

@mgradwohl I added code to measure the time for each separate thread as well as the total time to find out how much overhead there is (although I have to admit that it there are some issues with it). I still implemented a thread pool (as an exercise in itself) but again there was not much improvement. It looks like that after a certain point further performance improvements come mostly due to algorithmic improvements.

mgradwohl · 2023-03-27T06:51:02Z

@mgradwohl I added code to measure the time for each separate thread as well as the total time to find out how much overhead there is (although I have to admit that it there are some issues with it). I still implemented a thread pool (as an exercise in itself) but again there was not much improvement. It looks like that after a certain point further performance improvements come mostly due to algorithmic improvements.

Yes i found that caching values, adding const, moving things out of the loops when they didn't need to be, changing compiler settings, etc helped. I think one big one was a glm force intrinsics

denizdiktas · 2023-03-27T08:30:15Z

@mgradwohl cool. I think I didn't try compiler settings and glm-intrinsics. which settings did you use? could you share those, if it is ok for you?

mgradwohl · 2023-03-27T16:33:35Z

For release/distro I am using the following preprocessor defines:
GLM_FORCE_INTRINSICS; GLM_FORCE_LEFT_HANDED;GLM_FORCE_DEPTH_ZERO_TO_ONE

Remember, I have ported this to use WinUI/WinAppSDK and the image is on a Win2D canvas, so that's what the LEFT_HANDED, and FORCE_DEPTH_ZERO_TO_ONE are for.

For C++ Code Generation I have Advanced Vector Extensions 2 (X86/X64) (/arch:AVX2) for my machines. I haven't tried pushing it higher (not sure any benefit).

I also enabled full link time code generation, string pooling, /O2, /Ot, /GT, /GL and I'm going to try /Ob2

You should be able to see it all here https://github.com/mgradwohl/Butternut/blob/main/Butternut.vcxproj

mgradwohl · 2023-03-28T06:13:53Z

Just an update. I tried two other random number generators Xoshiro256++ and std::ranlux48 and Xoshiro wins. I am at 10 bounces, 60fps, and my CPU is not pegged at all. All three are in my repo and easy to try out (look for // PERF TEST comments). The PRNG that is in the original code really hits the CPU.

LilyIsTrans · 2023-03-28T13:41:37Z

For release/distro I am using the following preprocessor defines: GLM_FORCE_INTRINSICS; GLM_FORCE_LEFT_HANDED;GLM_FORCE_DEPTH_ZERO_TO_ONE

Remember, I have ported this to use WinUI/WinAppSDK and the image is on a Win2D canvas, so that's what the LEFT_HANDED, and FORCE_DEPTH_ZERO_TO_ONE are for.

For C++ Code Generation I have Advanced Vector Extensions 2 (X86/X64) (/arch:AVX2) for my machines. I haven't tried pushing it higher (not sure any benefit).

I also enabled full link time code generation, string pooling, /O2, /Ot, /GT, /GL and I'm going to try /Ob2

You should be able to see it all here https://github.com/mgradwohl/Butternut/blob/main/Butternut.vcxproj

Given the massively parallel nature of the task, I would expect that pushing to the widest possible SIMD constructs available would always have some benefit. Am I wrong to think that AVX512 would likely go even faster?

mgradwohl · 2023-03-28T23:10:57Z

For release/distro I am using the following preprocessor defines: GLM_FORCE_INTRINSICS; GLM_FORCE_LEFT_HANDED;GLM_FORCE_DEPTH_ZERO_TO_ONE
Remember, I have ported this to use WinUI/WinAppSDK and the image is on a Win2D canvas, so that's what the LEFT_HANDED, and FORCE_DEPTH_ZERO_TO_ONE are for.
For C++ Code Generation I have Advanced Vector Extensions 2 (X86/X64) (/arch:AVX2) for my machines. I haven't tried pushing it higher (not sure any benefit).
I also enabled full link time code generation, string pooling, /O2, /Ot, /GT, /GL and I'm going to try /Ob2
You should be able to see it all here https://github.com/mgradwohl/Butternut/blob/main/Butternut.vcxproj

Given the massively parallel nature of the task, I would expect that pushing to the widest possible SIMD constructs available would always have some benefit. Am I wrong to think that AVX512 would likely go even faster?

I'll test again, my PC at work is only AVX2.

mgradwohl · 2023-03-29T04:26:18Z

I tried AVX2 vs. AVX512, no difference. However this may mean:

The compiler does not do a good job of taking advantage of AVX512
GLM does not do a good job of taking advantage of AVX512
The compiler and GLM do a good job, but the code is not suited.

duckdoom4 · 2023-04-26T07:59:21Z

Has anyone tried Intel OneAPI TBB for the threads parallelization?

I'm also curious about this

Multithreading and Optimization Discussion #6

Multithreading and Optimization Discussion #6

Comments

TheCherno commented Dec 10, 2022

cvb941 commented Dec 10, 2022

scorpion-26 commented Dec 10, 2022

cvb941 commented Dec 10, 2022 • edited Loading

cvb941 commented Dec 10, 2022

leddoo commented Dec 10, 2022

HildingLinden commented Dec 10, 2022

athieman commented Dec 10, 2022 • edited Loading

xLoge commented Dec 11, 2022 • edited Loading

LilyIsTrans commented Dec 16, 2022 • edited Loading

LeoMash commented Dec 24, 2022

LilyIsTrans commented Dec 24, 2022

soerenfi commented Dec 29, 2022

LilyIsTrans commented Dec 29, 2022

soerenfi commented Dec 29, 2022

LilyIsTrans commented Dec 29, 2022 • edited Loading

LilyIsTrans commented Dec 30, 2022

soerenfi commented Dec 30, 2022

LilyIsTrans commented Dec 30, 2022 • edited Loading

soerenfi commented Dec 30, 2022

LilyIsTrans commented Dec 30, 2022

soerenfi commented Dec 31, 2022

mmerabet42 commented Jan 11, 2023

WhoseTheNerd commented Jan 20, 2023

Paulo-D2000 commented Jan 21, 2023

denizdiktas commented Jan 31, 2023 • edited Loading

denizdiktas commented Feb 3, 2023

denizdiktas commented Feb 6, 2023 • edited Loading

denizdiktas commented Feb 7, 2023

LilyIsTrans commented Feb 13, 2023

denizdiktas commented Feb 14, 2023

denizdiktas commented Feb 15, 2023 • edited Loading

BenBE commented Feb 15, 2023

denizdiktas commented Feb 20, 2023

BenBE commented Feb 20, 2023

denizdiktas commented Feb 23, 2023

denizdiktas commented Mar 3, 2023

denizdiktas commented Mar 3, 2023

mgradwohl commented Mar 13, 2023

nickphilomath commented Mar 15, 2023

denizdiktas commented Mar 18, 2023

mgradwohl commented Mar 27, 2023

denizdiktas commented Mar 27, 2023

mgradwohl commented Mar 27, 2023

denizdiktas commented Mar 27, 2023

mgradwohl commented Mar 27, 2023

mgradwohl commented Mar 28, 2023

LilyIsTrans commented Mar 28, 2023

mgradwohl commented Mar 28, 2023

mgradwohl commented Mar 29, 2023

duckdoom4 commented Apr 26, 2023

cvb941 commented Dec 10, 2022 •

edited

Loading

athieman commented Dec 10, 2022 •

edited

Loading

xLoge commented Dec 11, 2022 •

edited

Loading

LilyIsTrans commented Dec 16, 2022 •

edited

Loading

LilyIsTrans commented Dec 29, 2022 •

edited

Loading

LilyIsTrans commented Dec 30, 2022 •

edited

Loading

denizdiktas commented Jan 31, 2023 •

edited

Loading

denizdiktas commented Feb 6, 2023 •

edited

Loading

denizdiktas commented Feb 15, 2023 •

edited

Loading