
[SYCL][CUDA] Don't restore CUDA contexts #4442

Merged 3 commits on Sep 3, 2021
Conversation

@npmiller (Contributor)

This patch fixes #4171.

The issue highlighted in that ticket is that CUDA contexts are bound to
threads and PI calls are executed both in the main thread and in threads
of the thread pool.

To ensure the active context is correct, the CUDA plugin uses the
`ScopedContext` RAII struct to set the active context to the PI context
and to restore the previously active context at the end of the PI call.

However, as an optimization, `ScopedContext` skips the restore if there
was originally no active context on the thread, which means it leaves the
PI context active on the thread.
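A minimal sketch of that pre-patch pattern, using illustrative names rather than the plugin's exact code:

```cpp
#include <cuda.h>

// Pre-patch behaviour (sketch): set the PI context on construction and
// conditionally restore the previous context on destruction.
class ScopedContext {
  CUcontext previous_ = nullptr;
  bool needsRestore_ = false;

public:
  explicit ScopedContext(CUcontext desired) {
    cuCtxGetCurrent(&previous_);
    if (previous_ != desired) {
      // The optimization described above: only restore later if the thread
      // already had some other context active; if it had none, the PI
      // context is simply left active on the thread.
      needsRestore_ = previous_ != nullptr;
      cuCtxSetCurrent(desired);
    }
  }

  ~ScopedContext() {
    if (needsRestore_)
      cuCtxSetCurrent(previous_); // this restore is what the patch removes
  }
};
```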

In addition, deleting a CUDA context only deactivates it on the thread
where it is deleted; it stays active on other threads even after being
deleted.
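This driver behaviour can be demonstrated directly. The following sketch (assuming a CUDA-capable device 0 is available) shows a destroyed context remaining current on another thread:

```cpp
#include <cuda.h>
#include <cstdio>
#include <future>
#include <thread>

int main() {
  cuInit(0);
  CUdevice dev;
  cuDeviceGet(&dev, 0);
  CUcontext ctx;
  cuCtxCreate(&ctx, 0, dev); // creates ctx and makes it current on this thread

  std::promise<void> madeCurrent, destroyed;

  std::thread worker([&] {
    cuCtxSetCurrent(ctx); // ctx is now current on this thread too
    madeCurrent.set_value();
    destroyed.get_future().wait(); // wait for the main thread to destroy ctx

    CUcontext current = nullptr;
    cuCtxGetCurrent(&current);
    // `current` still equals the destroyed ctx; any API call that uses it
    // now fails with CUDA_ERROR_CONTEXT_IS_DESTROYED.
    std::printf("stale context still current: %d\n", current == ctx);
  });

  madeCurrent.get_future().wait();
  cuCtxDestroy(ctx); // deactivates ctx on the main thread only
  destroyed.set_value();
  worker.join();
}
```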

This means that if an application starts with no active CUDA context,
creates a SYCL queue, runs an operation, and then deletes the queue, the
context on the current thread is created, set active, deactivated, and
deleted properly.

However, it is not deactivated in the threads of the thread pool, which
means that if we create another queue and run SYCL operations on the
thread pool again, the second queue will set up its own context in those
threads but then try to restore the deleted context from the previous
queue.
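Roughly, the failing scenario looks like this (illustrative SYCL 2020 code; the failure is not fully deterministic):

```cpp
#include <sycl/sycl.hpp>

int main() {
  {
    sycl::queue q1{sycl::gpu_selector_v};
    q1.single_task([] {}).wait(); // worker threads activate q1's context
  } // q1 and its context are destroyed here, but the context remains
    // active in the runtime's worker threads

  sycl::queue q2{sycl::gpu_selector_v};
  // PI calls for q2 in the worker threads set q2's context, then try to
  // restore q1's already-deleted context on the way out.
  q2.single_task([] {}).wait();
}
```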

This patch fixes that issue by simply never restoring the previous active
context, which means that PI calls from the second queue running in the
thread pool just override the deleted context and never try to restore it.
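Relative to the sketch above, the change amounts to dropping the restore from the destructor, so the RAII type only ever sets the context (again a sketch, not the plugin's exact code):

```cpp
#include <cuda.h>

class ScopedContext {
public:
  explicit ScopedContext(CUcontext desired) {
    CUcontext current = nullptr;
    cuCtxGetCurrent(&current);
    if (current != desired)
      cuCtxSetCurrent(desired); // only switch when needed
  }

  // No restore on destruction: the PI context stays active on the thread,
  // so a stale, possibly destroyed, previous context is never re-activated.
  ~ScopedContext() = default;
};
```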

This should work well in SYCL-only code, as all the PI calls are guarded
by the `ScopedContext` and will change the active context accordingly. In
fact, it may even provide a performance improvement in certain
multi-context scenarios: the current implementation only really prevents
context switches for the first context used, whereas this change prevents
them for the most recently used context instead.

In CUDA interop scenarios, however, it does mean that after running any
SYCL code, CUDA interop code cannot make assumptions about the active
context and needs to reset it to whatever context it requires. But as far
as I'm aware, this is already the current practice in oneMKL and oneDNN,
where a similar `ScopedContext` mechanism is used.
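For illustration, interop code following that practice re-establishes its own context before issuing CUDA calls; `appCtx` here is a hypothetical application-owned context handle:

```cpp
#include <cuda.h>

void runCudaInteropWork(CUcontext appCtx) {
  // After any SYCL work, the active context must be treated as unknown,
  // so reset it to the context this code needs before doing anything else.
  cuCtxSetCurrent(appCtx);
  // ... issue CUDA driver API calls that rely on appCtx being active ...
}
```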

In summary, this patch should:

  • Fix attempts to restore deleted contexts in internal SYCL threads
  • Possibly improve performance in certain multi-context scenarios
  • Break assumptions about the active context in CUDA interop code

@npmiller requested a review from a team as a code owner August 31, 2021 13:02
@npmiller requested a review from steffenlarsen August 31, 2021 13:02
@steffenlarsen (Contributor) left a comment

I must have missed this discussion. It has been a while since we last touched this part of the implementation, but I think I understand the problem. I remember we have had other problems with "zombie contexts" previously.

As you correctly state in the comment, this was a performance optimization, but as you also state, it is a trade-off over whether we keep prioritizing the first context used in a given thread for the entirety of its life. The effect of this is that we instead assume that once a context has been used it is likely to be the next context to be used as well, which is arguably not a bad assumption either. It would be interesting to see the effect this has on existing code. Either way, I think the most important thing is that we don't set a context every time we do an operation with a context.

I expect that the potential drawbacks are outweighed by the added stability, so I think this solution is the best option for now.
That said, I am not sure I see a need for this to be a class anymore. The RAII aspect is gone and it is no longer a "scoped context". Should this be made a function instead?

@npmiller (Contributor, Author)

I must have missed this discussion. It has been a while since we last touched this part of the implementation, but I think I understand the problem. I remember we have had other problems with "zombie contexts" previously.

There are more details on the associated ticket too, but the comment here should explain the bulk of it: #4171

The effect of this is that we instead assume that once a context has been used it is likely to be the next context to be used as well, which is arguably not a bad assumption either.

I actually believe it should be a better assumption in all cases, because even if the first context is the most used, this change really only moves the context switch from the end of the work in the second context to the beginning of the work in the first context.

It would be interesting to see the effect this has on existing code.

I tried a couple of benchmarks (BabelStream and Lulesh) without seeing a significant performance difference; I'm not sure if you have any other benchmark in mind.

I suppose I could try to add a couple specific multi-context tests to trip this up, and see if we can get performance results from them.

That said, I am not sure I see a need for this to be a class anymore. The RAII aspect is gone and it is no longer a "scoped context". Should this be made a function instead?

You're right that the RAII aspect is no longer needed; I wanted to keep the patch fairly short so it's easier to see what changed, as the problem is a bit difficult to explain.

I'm happy to change it to a simple function if you prefer, although I kind of like the "Scoped" aspect even if it's no longer used, as it better conveys how this is meant to be used; with a free-standing function, someone might be tempted to call it only at context or queue creation.

@steffenlarsen (Contributor)

I'm not sure if you have any other benchmark in mind.

Not particularly. I'm just convinced the optimization was put in place for a reason, but maybe it is not as relevant anymore.

I suppose I could try to add a couple specific multi-context tests to trip this up, and see if we can get performance results from them.

If it's possible with the current testing infrastructure, that would be nice. Seems like it should be possible to trigger the problem relatively consistently. Also gives you a reproducer in case you'd like to revisit the strategy in the future.

You're right that the RAII aspect is no longer needed; I wanted to keep the patch fairly short so it's easier to see what changed, as the problem is a bit difficult to explain.

I'm happy to change it to a simple function if you prefer, although I kind of like the "Scoped" aspect even if it's no longer used, as it better conveys how this is meant to be used; with a free-standing function, someone might be tempted to call it only at context or queue creation.

That's fair. I do not feel strongly about it. Especially in the case that someone in the future might try their hand at other strategies.

@Ruyk (Contributor) commented Aug 31, 2021

I'm not sure if you have any other benchmark in mind.

Not particularly. I'm just convinced the optimization was put in place for a reason, but maybe it is not as relevant anymore.

It was introduced to match the behaviour of the CUDA Runtime API, and in particular, the performance expectations from people porting code iteratively.
This does not seem to be a major requirement anymore, so we could ignore it.

I suppose I could try to add a couple specific multi-context tests to trip this up, and see if we can get performance results from them.

If it's possible with the current testing infrastructure, that would be nice. Seems like it should be possible to trigger the problem relatively consistently. Also gives you a reproducer in case you'd like to revisit the strategy in the future.

You're right that the RAII aspect is no longer needed; I wanted to keep the patch fairly short so it's easier to see what changed, as the problem is a bit difficult to explain.
I'm happy to change it to a simple function if you prefer, although I kind of like the "Scoped" aspect even if it's no longer used, as it better conveys how this is meant to be used; with a free-standing function, someone might be tempted to call it only at context or queue creation.

That's fair. I do not feel strongly about it. Especially in the case that someone in the future might try their hand at other strategies.

I feel, however, that we should keep the RAII aspect of this. Even if we don't restore the context anymore, the idea of portions of functions working on a specific context remains. It also means that if we have to recover the original behavior, it's a single-line change.

@keryell (Contributor) left a comment

Nice to see such a well-described PR!
Unfortunately, almost none of this appears in the code comments of the PR.
Is it possible to save some pieces as comments in the code, or as design information in file header comments?

@npmiller (Contributor, Author) commented Sep 1, 2021

Nice to see such a well-described PR!
Unfortunately, almost none of this appears in the code comments of the PR.
Is it possible to save some pieces as comments in the code, or as design information in file header comments?

Good point. I have updated the patch to flesh out the comment on the class a bit more; I think it hits all the important points without being quite as verbose as this PR's description. What do you think?

@bader added the cuda CUDA back-end label Sep 1, 2021
@bader requested a review from steffenlarsen September 1, 2021 15:20
@steffenlarsen previously approved these changes Sep 1, 2021
@steffenlarsen (Contributor) left a comment
LGTM. @npmiller Are you planning on adding the aforementioned test?

@npmiller (Contributor, Author) commented Sep 1, 2021

LGTM. @npmiller Are you planning on adding the aforementioned test?

I had a look, but this is actually trickier to test than I expected, as even the original snippet doesn't fail deterministically. I also wonder whether a test like that should go in llvm-test-suite now that the on-device tests have moved there. So it might be best to merge this as-is and maybe just add more generic multi-context testing later on.

@bader (Contributor) commented Sep 1, 2021

Is it possible to reproduce this bug using CUDA plug-in unit tests?

@npmiller (Contributor, Author) commented Sep 1, 2021

Is it possible to reproduce this bug using CUDA plug-in unit tests?

Oh right, that might be a good place for it! I'll have another look at testing this; it might be easier to call the PI functions directly.

@keryell (Contributor) commented Sep 1, 2021

Good point. I have updated the patch to flesh out the comment on the class a bit more; I think it hits all the important points without being quite as verbose as this PR's description. What do you think?

Thanks, that looks good.

@npmiller (Contributor, Author) commented Sep 2, 2021

Alright, I've just pushed three tests: two checking the active contexts in regular scenarios, and one recreating the thread pool scenario in a simpler single-thread test.

@bader requested a review from steffenlarsen September 2, 2021 14:13
@steffenlarsen (Contributor) left a comment

New tests LGTM. Minor comment.

Labels: cuda CUDA back-end

Successfully merging this pull request may close this issue: [SYCL] [CUDA] Invalid memory access in multithreaded application when multiple contexts used

7 participants