[SYCL][CUDA] Don't restore CUDA contexts #4442
Conversation
This patch fixes intel#4171.

The issue highlighted in that ticket is that CUDA contexts are bound to threads, and PI calls are executed both on the main thread and on threads of the thread pool. To ensure the active context is correct, the CUDA plugin uses the `ScopedContext` RAII struct to set the active context to the PI context and restore the previously active context at the end of the PI call. However, as an optimization, `ScopedContext` skips the context recovery if there was originally no active context on the thread, which means it leaves the PI context active on the thread. In addition, deleting a CUDA context only deactivates it on the thread where it is deleted; it stays active on other threads after being deleted.

This means that if you start from an application with no active CUDA context, create a SYCL queue, run an operation, and then delete the SYCL queue, the context on the current thread is created, set active, deactivated, and deleted properly. However, it is not deactivated on the threads of the thread pool, so if we create another queue and run SYCL operations on the thread pool again, that second queue sets up its own context on those threads but then tries to restore the deleted context from the previous queue.

This patch aims to fix that issue by simply never restoring the previously active context: PI calls from the second queue running on the thread pool just override the deleted context and never try to restore it. This should work well in SYCL-only code, as all PI calls are guarded by the `ScopedContext` and change the active context accordingly. In fact, it may even improve performance in certain multi-context scenarios, because the current implementation only really avoids context switches for the first context used, whereas this change avoids context switches for the most recently used context instead.

In CUDA interop scenarios, however, it does mean that after running any SYCL code, CUDA interop code cannot make assumptions about the active context and needs to reset it to whatever context it requires. But as far as I'm aware, this is already the current practice in `oneMKL` and `oneDNN`, where they also use a `ScopedContext` mechanism.

In summary, this patch should:

* Fix trying to restore deleted contexts in internal SYCL threads
* Possibly improve performance in certain multi-context scenarios
* Break assumptions about the active context for CUDA interop code
I must have missed this discussion. It has been a while since we last touched this part of the implementation, but I think I understand the problem. I remember we have had other problems with "zombie contexts" previously.
As you correctly state in the comment, this was a performance optimization, but as you also state it is a trade-off between whether or not we prioritize the first context to be used in a given thread for the entirety of its life. The effect of this is that we instead assume that once a context has been used it is likely to be the next context to be used as well, which is arguably not a bad assumption either. It would be interesting to see the effect this has on existing code. Either way, I think the most important thing is that we don't set a context every time we do an operation with a context.
I expect that the potential drawbacks are outweighed by the added stability, so I think this solution is the best option for now.
That said, I am not sure I see a need for this to be a class anymore. The RAII aspect is gone and it is no longer a "scoped context". Should this be made a function instead?
There are more details on the associated ticket too, but the comment here should explain the bulk of it: #4171
I actually believe it should be a better assumption in all cases, because even if the first context is the most used, this would really only end up moving the context switch from the end of the work in the second context to the beginning of the work in the first context.
I tried a couple of benchmarks (BabelStream and Lulesh) without seeing a significant performance difference. I'm not sure if you have any other benchmark in mind. I suppose I could try to add a couple of specific multi-context tests to trip this up, and see if we can get performance results from them.
You're right that the RAII aspect is no longer needed. I wanted to keep the patch fairly short so it's easier to see what's changed, as the problem is a bit difficult to explain. I'm happy to change it to a simple function if you prefer, although I kind of like the "Scoped" aspect even if it's no longer used as such: it better conveys how this is meant to be used. With a free-standing function, for example, someone might be tempted to move it to just the context or queue creation.
Not particularly. I'm just convinced the optimization was put in place for a reason, but maybe it is not as relevant anymore.
If it's possible with the current testing infrastructure, that would be nice. Seems like it should be possible to trigger the problem relatively consistently. Also gives you a reproducer in case you'd like to revisit the strategy in the future.
That's fair. I do not feel strongly about it. Especially in the case that someone in the future might try their hand at other strategies.
It was introduced to match the behaviour of the CUDA Runtime API, and in particular, the performance expectations from people porting code iteratively.
I feel, however, that we should keep the RAII aspect of this. Even if we don't recover the context anymore, the idea of portions of functions working on a specific context remains. It also means that if we have to recover the original behavior, it's a single-line change.
Nice to see such a well described PR!
Unfortunately, almost nothing of this appears in the comments of the PR.
Is it possible to save some pieces as various comments in the code or some design information in some file header comments?
Good point, I have updated it to flesh out the comment on the class a bit more. I think it hits all the important points without being quite as verbose as this PR's description; what do you think?
LGTM. @npmiller Are you planning on adding the aforementioned test?
I had a look, but this is actually trickier to test than I expected, as even the original snippet doesn't fail deterministically. Also, I wonder if a test like that should go in
Is it possible to reproduce this bug using CUDA plug-in unit tests?
Oh right, that might be a good place for it! I'll have another look at testing this; it might be easier calling the PI functions directly.
Thanks, that looks good.
Alright, I've just pushed 3 tests: two checking the active contexts in regular scenarios, and one re-creating the thread pool scenario in a simpler single-thread test.
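The single-thread scenario might be sketched roughly as follows against the CUDA Driver API directly. This is only an illustration of the pattern being checked, not the actual plugin unit tests, and it requires a CUDA device to run:

```cpp
#include <cuda.h>
#include <cassert>

// Sketch: after a first context is created, used, and destroyed, work through
// a second context should leave that second context active on the thread,
// with no attempt to restore the destroyed one.
int main() {
  CUdevice dev;
  CUcontext ctx1, ctx2, active;

  assert(cuInit(0) == CUDA_SUCCESS);
  assert(cuDeviceGet(&dev, 0) == CUDA_SUCCESS);

  // First "queue": create a context (cuCtxCreate makes it current),
  // then destroy it on this thread.
  assert(cuCtxCreate(&ctx1, 0, dev) == CUDA_SUCCESS);
  assert(cuCtxDestroy(ctx1) == CUDA_SUCCESS);

  // Second "queue": a new context becomes, and stays, the active one.
  assert(cuCtxCreate(&ctx2, 0, dev) == CUDA_SUCCESS);
  assert(cuCtxGetCurrent(&active) == CUDA_SUCCESS);
  assert(active == ctx2);

  assert(cuCtxDestroy(ctx2) == CUDA_SUCCESS);
}
```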
New tests LGTM. Minor comment.