
[SYCL][CUDA] Don't restore CUDA contexts #4442

Merged 3 commits on Sep 3, 2021
Conversation

@npmiller (Contributor)

This patch fixes #4171.

The issue highlighted in that ticket is that CUDA contexts are bound to
threads and PI calls are executed both in the main thread and in threads
of the thread pool.

To ensure the active context is correct, the CUDA plugin uses the
`ScopedContext` RAII struct to set the active context to the PI context
and to restore the previously active context at the end of the PI call.

However, as an optimization, `ScopedContext` skips the restore if there
was originally no active context on the thread, which means it leaves the
PI context active on the thread.
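A minimal sketch of that pre-patch pattern, using illustrative names rather than the plugin's exact code:

```cpp
#include <cuda.h>

// Pre-patch behaviour (sketch): set the PI context on construction and
// conditionally restore the previous context on destruction.
class ScopedContext {
  CUcontext previous_ = nullptr;
  bool needsRestore_ = false;

public:
  explicit ScopedContext(CUcontext desired) {
    cuCtxGetCurrent(&previous_);
    if (previous_ != desired) {
      // The optimization described above: only restore later if the thread
      // already had some other context active; if it had none, the PI
      // context is simply left active on the thread.
      needsRestore_ = previous_ != nullptr;
      cuCtxSetCurrent(desired);
    }
  }

  ~ScopedContext() {
    if (needsRestore_)
      cuCtxSetCurrent(previous_); // this restore is what the patch removes
  }
};
```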

In addition, deleting a CUDA context only deactivates it on the thread
where it is deleted; it stays active on other threads even after being
deleted.
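This driver behaviour can be demonstrated directly. The following sketch (assuming a CUDA-capable device 0 is available) shows a destroyed context remaining current on another thread:

```cpp
#include <cuda.h>
#include <cstdio>
#include <future>
#include <thread>

int main() {
  cuInit(0);
  CUdevice dev;
  cuDeviceGet(&dev, 0);
  CUcontext ctx;
  cuCtxCreate(&ctx, 0, dev); // creates ctx and makes it current on this thread

  std::promise<void> madeCurrent, destroyed;

  std::thread worker([&] {
    cuCtxSetCurrent(ctx); // ctx is now current on this thread too
    madeCurrent.set_value();
    destroyed.get_future().wait(); // wait for the main thread to destroy ctx

    CUcontext current = nullptr;
    cuCtxGetCurrent(&current);
    // `current` still equals the destroyed ctx; any API call that uses it
    // now fails with CUDA_ERROR_CONTEXT_IS_DESTROYED.
    std::printf("stale context still current: %d\n", current == ctx);
  });

  madeCurrent.get_future().wait();
  cuCtxDestroy(ctx); // deactivates ctx on the main thread only
  destroyed.set_value();
  worker.join();
}
```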

This means that if an application starts with no active CUDA context,
creates a SYCL queue, runs an operation, and then deletes the queue, the
context on the current thread is created, set active, deactivated, and
deleted properly.

However, it is not deactivated in the threads of the thread pool, which
means that if we create another queue and run SYCL operations on the
thread pool again, the second queue will set up its own context in those
threads but then try to restore the deleted context from the previous
queue.
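Roughly, the failing scenario looks like this (illustrative SYCL 2020 code; the failure is not fully deterministic):

```cpp
#include <sycl/sycl.hpp>

int main() {
  {
    sycl::queue q1{sycl::gpu_selector_v};
    q1.single_task([] {}).wait(); // worker threads activate q1's context
  } // q1 and its context are destroyed here, but the context remains
    // active in the runtime's worker threads

  sycl::queue q2{sycl::gpu_selector_v};
  // PI calls for q2 in the worker threads set q2's context, then try to
  // restore q1's already-deleted context on the way out.
  q2.single_task([] {}).wait();
}
```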

This patch fixes that issue by simply never restoring the previous active
context, which means that PI calls from the second queue running in the
thread pool just override the deleted context and never try to restore it.
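Relative to the sketch above, the change amounts to dropping the restore from the destructor, so the RAII type only ever sets the context (again a sketch, not the plugin's exact code):

```cpp
#include <cuda.h>

class ScopedContext {
public:
  explicit ScopedContext(CUcontext desired) {
    CUcontext current = nullptr;
    cuCtxGetCurrent(&current);
    if (current != desired)
      cuCtxSetCurrent(desired); // only switch when needed
  }

  // No restore on destruction: the PI context stays active on the thread,
  // so a stale, possibly destroyed, previous context is never re-activated.
  ~ScopedContext() = default;
};
```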

This should work well in SYCL-only code, as all the PI calls are guarded
by the `ScopedContext` and will change the active context accordingly. In
fact, it may even provide a performance improvement in certain
multi-context scenarios: the current implementation only really prevents
context switches for the first context used, whereas this change prevents
them for the most recently used context instead.

In CUDA interop scenarios, however, it does mean that after running any
SYCL code, CUDA interop code cannot make assumptions about the active
context and needs to reset it to whatever context it requires. But as far
as I'm aware, this is already the current practice in oneMKL and oneDNN,
where a similar `ScopedContext` mechanism is used.
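For illustration, interop code following that practice re-establishes its own context before issuing CUDA calls; `appCtx` here is a hypothetical application-owned context handle:

```cpp
#include <cuda.h>

void runCudaInteropWork(CUcontext appCtx) {
  // After any SYCL work, the active context must be treated as unknown,
  // so reset it to the context this code needs before doing anything else.
  cuCtxSetCurrent(appCtx);
  // ... issue CUDA driver API calls that rely on appCtx being active ...
}
```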

In summary, this patch should:

  • Fix attempts to restore deleted contexts in internal SYCL threads
  • Possibly improve performance in certain multi-context scenarios
  • Break assumptions about the active context in CUDA interop code

@npmiller requested a review from a team as a code owner August 31, 2021 13:02
@npmiller requested a review from steffenlarsen August 31, 2021 13:02
@steffenlarsen (Contributor) left a comment

I must have missed this discussion. It has been a while since we last touched this part of the implementation, but I think I understand the problem. I remember we have had other problems with "zombie contexts" previously.

As you correctly state in the comment, this was a performance optimization, but as you also state, it is a trade-off over whether we keep prioritizing the first context used in a given thread for the entirety of its life. The effect of this is that we instead assume that once a context has been used it is likely to be the next context to be used as well, which is arguably not a bad assumption either. It would be interesting to see the effect this has on existing code. Either way, I think the most important thing is that we don't set a context every time we do an operation with a context.

I expect that the potential drawbacks are outweighed by the added stability, so I think this solution is the best option for now.
That said, I am not sure I see a need for this to be a class anymore. The RAII aspect is gone and it is no longer a "scoped context". Should this be made a function instead?

@npmiller (Contributor, Author)

I must have missed this discussion. It has been a while since we last touched this part of the implementation, but I think I understand the problem. I remember we have had other problems with "zombie contexts" previously.

There are more details on the associated ticket too, but the comment here should explain the bulk of it: #4171

The effect of this is that we instead assume that once a context has been used it is likely to be the next context to be used as well, which is arguably not a bad assumption either.

I actually believe it should be a better assumption in all cases, because even if the first context is the most used, this change really only moves the context switch from the end of the work in the second context to the beginning of the work in the first context.

It would be interesting to see the effect this has on existing code.

I tried a couple of benchmarks (BabelStream and Lulesh) without seeing a significant performance difference; I'm not sure if you have any other benchmark in mind.

I suppose I could try to add a couple specific multi-context tests to trip this up, and see if we can get performance results from them.

That said, I am not sure I see a need for this to be a class anymore. The RAII aspect is gone and it is no longer a "scoped context". Should this be made a function instead?

You're right that the RAII aspect is no longer needed; I wanted to keep the patch fairly short so it's easier to see what changed, as the problem is a bit difficult to explain.

I'm happy to change it to a simple function if you prefer, although I kind of like the "Scoped" aspect even if it's no longer used, as it better conveys how this is meant to be used; with a free-standing function, someone might be tempted to call it only at context or queue creation.

@steffenlarsen (Contributor)

I'm not sure if you have any other benchmark in mind.

Not particularly. I'm just convinced the optimization was put in place for a reason, but maybe it is not as relevant anymore.

I suppose I could try to add a couple specific multi-context tests to trip this up, and see if we can get performance results from them.

If it's possible with the current testing infrastructure, that would be nice. Seems like it should be possible to trigger the problem relatively consistently. Also gives you a reproducer in case you'd like to revisit the strategy in the future.

You're right that the RAII aspect is no longer needed; I wanted to keep the patch fairly short so it's easier to see what changed, as the problem is a bit difficult to explain.

I'm happy to change it to a simple function if you prefer, although I kind of like the "Scoped" aspect even if it's no longer used, as it better conveys how this is meant to be used; with a free-standing function, someone might be tempted to call it only at context or queue creation.

That's fair. I do not feel strongly about it. Especially in the case that someone in the future might try their hand at other strategies.

@Ruyk (Contributor) commented Aug 31, 2021

I'm not sure if you have any other benchmark in mind.

Not particularly. I'm just convinced the optimization was put in place for a reason, but maybe it is not as relevant anymore.

It was introduced to match the behaviour of the CUDA Runtime API, and in particular, the performance expectations from people porting code iteratively.
This does not seem to be a major requirement anymore, so we could ignore it.

I suppose I could try to add a couple specific multi-context tests to trip this up, and see if we can get performance results from them.

If it's possible with the current testing infrastructure, that would be nice. Seems like it should be possible to trigger the problem relatively consistently. Also gives you a reproducer in case you'd like to revisit the strategy in the future.

You're right that the RAII aspect is no longer needed; I wanted to keep the patch fairly short so it's easier to see what changed, as the problem is a bit difficult to explain.
I'm happy to change it to a simple function if you prefer, although I kind of like the "Scoped" aspect even if it's no longer used, as it better conveys how this is meant to be used; with a free-standing function, someone might be tempted to call it only at context or queue creation.

That's fair. I do not feel strongly about it. Especially in the case that someone in the future might try their hand at other strategies.

I feel, however, that we should keep the RAII aspect of this. Even if we don't restore the context anymore, the idea of portions of functions working on a specific context remains. It also means that if we have to recover the original behavior, it's a single-line change.

@keryell (Contributor) left a comment

Nice to see such a well-described PR!
Unfortunately, almost none of this appears in the code comments of the PR.
Is it possible to save some pieces as comments in the code, or as design information in file header comments?

@npmiller (Contributor, Author) commented Sep 1, 2021

Nice to see such a well-described PR!
Unfortunately, almost none of this appears in the code comments of the PR.
Is it possible to save some pieces as comments in the code, or as design information in file header comments?

Good point. I have updated the patch to flesh out the comment on the class a bit more; I think it hits all the important points without being quite as verbose as this PR's description. What do you think?

@bader added the cuda CUDA back-end label Sep 1, 2021
@bader requested a review from steffenlarsen September 1, 2021 15:20
@steffenlarsen previously approved these changes Sep 1, 2021
@steffenlarsen (Contributor) left a comment
LGTM. @npmiller Are you planning on adding the aforementioned test?

@npmiller (Contributor, Author) commented Sep 1, 2021

LGTM. @npmiller Are you planning on adding the aforementioned test?

I had a look, but this is actually trickier to test than I expected, as even the original snippet doesn't fail deterministically. I also wonder whether a test like that should go in llvm-test-suite now that the on-device tests have moved there. So it might be best to merge this as-is and maybe just add more generic multi-context testing later on.

@bader (Contributor) commented Sep 1, 2021

Is it possible to reproduce this bug using CUDA plug-in unit tests?

@npmiller (Contributor, Author) commented Sep 1, 2021

Is it possible to reproduce this bug using CUDA plug-in unit tests?

Oh right, that might be a good place for it! I'll have another look at testing this; it might be easier to call the PI functions directly.

@keryell (Contributor) commented Sep 1, 2021

Good point. I have updated the patch to flesh out the comment on the class a bit more; I think it hits all the important points without being quite as verbose as this PR's description. What do you think?

Thanks, that looks good.

@npmiller (Contributor, Author) commented Sep 2, 2021

Alright, I've just pushed three tests: two checking the active contexts in regular scenarios, and one recreating the thread pool scenario in a simpler single-thread test.

@bader requested a review from steffenlarsen September 2, 2021 14:13
@steffenlarsen (Contributor) left a comment

New tests LGTM. Minor comment.

Labels: cuda CUDA back-end

Successfully merging this pull request may close this issue: [SYCL] [CUDA] Invalid memory access in multithreaded application when multiple contexts used

7 participants