
Commit 4ccfd21

[SYCL][CUDA] Don't restore CUDA contexts (#4442)
This patch fixes #4171. The issue highlighted in that ticket is that CUDA contexts are bound to threads, and PI calls are executed both on the main thread and on threads of the thread pool. To ensure the active context is correct, the CUDA plugin uses the `ScopedContext` RAII struct to set the active context to the PI context and restore the previously active context at the end of the PI call. However, as an optimization, `ScopedContext` skips the context recovery if there was originally no active context on the thread, which means it leaves the PI context active on the thread. In addition, deleting a CUDA context only deactivates it on the thread where it is deleted; it stays active on other threads after being deleted.

As a result, if you start from an application with no CUDA context active, create a SYCL queue, run an operation, and then delete the SYCL queue, the context on the current thread will be created, set active, deactivated, and deleted properly. It will not be deactivated on the threads of the thread pool, however, so if we create another queue and run SYCL operations on the thread pool again, that second queue will set up its own context in those threads but then try to restore the deleted context from the previous queue.

This patch fixes the issue by simply never restoring the previous active context: PI calls from the second queue running in the thread pool just override the deleted context and never try to restore it. This should work well in SYCL-only code, as all PI calls are guarded by the `ScopedContext` and change the active context accordingly. It may even improve performance in certain multi-context scenarios, because the current implementation only prevents context switches for the first context used, while this change prevents context switches for the most recently used context instead.

In CUDA interop scenarios, however, it does mean that after running any SYCL code, CUDA interop code cannot make assumptions about the active context and needs to reset it to whatever context it requires (see the sketch below). As far as I'm aware, this is already the current practice in `oneMKL` and `oneDNN`, which also use a `ScopedContext` mechanism.

In summary, this patch should:

* Fix trying to restore deleted contexts in internal SYCL threads
* Possibly improve performance in certain multi-context scenarios
* Break assumptions on the active context for CUDA interop code
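To illustrate the interop implication, here is a minimal sketch of interop code that follows the new rule. The `interop_work` helper and the `myCtx`/`buf` parameters are hypothetical names for illustration, not part of this patch:

```cpp
#include <cuda.h>

// Hypothetical interop helper: after any SYCL code has run, the active
// context on this thread may be whichever context the SYCL runtime used
// last, so interop code must explicitly activate the context it needs
// instead of assuming SYCL restored it.
void interop_work(CUcontext myCtx, CUdeviceptr buf, size_t bytes) {
  cuCtxSetCurrent(myCtx);    // never assume myCtx is still current
  cuMemsetD8(buf, 0, bytes); // driver API calls now run against myCtx
}
```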
1 parent 9827dc4 commit 4ccfd21

3 files changed (+281 −32)

sycl/plugins/cuda/pi_cuda.cpp (+33 −32)
```diff
@@ -137,47 +137,48 @@ pi_result check_error(CUresult result, const char *function, int line,
 /// \cond NODOXY
 #define PI_CHECK_ERROR(result) check_error(result, __func__, __LINE__, __FILE__)
 
-/// RAII type to guarantee recovering original CUDA context
-/// Scoped context is used across all PI CUDA plugin implementation
-/// to activate the PI Context on the current thread, matching the
-/// CUDA driver semantics where the context used for the CUDA Driver
-/// API is the one active on the thread.
-/// The implementation tries to avoid replacing the CUcontext if it cans
+/// ScopedContext is used across all PI CUDA plugin implementation to ensure
+/// that the proper CUDA context is active for the given PI context.
+//
+/// This class will only replace the context if necessary, and will leave the
+/// new context active on the current thread. If there was an active context
+/// already it will simply be replaced.
+//
+/// Previously active contexts are not restored for two reasons:
+/// * Performance: context switches are expensive so leaving the context active
+///   means subsequent SYCL calls with the same context will be cheaper.
+/// * Multi-threading cleanup: contexts are set active per thread and deleting a
+///   context will only deactivate it for the current thread. This means other
+///   threads may end up with deleted active contexts. In particular this can
+///   happen with host_tasks as they run in a thread pool. When the context
+///   associated with these tasks is deleted it will remain active in the
+///   threads of the thread pool. So it would be invalid for any other task
+///   running on these threads to try to restore the deleted context. With the
+///   current implementation this is not an issue because the active deleted
+///   context will just be replaced.
+//
+/// This approach does mean that CUDA interop tasks should NOT expect their
+/// contexts to be restored by SYCL.
 class ScopedContext {
-  pi_context placedContext_;
-  CUcontext original_;
-  bool needToRecover_;
-
 public:
-  ScopedContext(pi_context ctxt) : placedContext_{ctxt}, needToRecover_{false} {
-
-    if (!placedContext_) {
+  ScopedContext(pi_context ctxt) {
+    if (!ctxt) {
       throw PI_INVALID_CONTEXT;
     }
 
-    CUcontext desired = placedContext_->get();
-    PI_CHECK_ERROR(cuCtxGetCurrent(&original_));
-    if (original_ != desired) {
-      // Sets the desired context as the active one for the thread
+    CUcontext desired = ctxt->get();
+    CUcontext original = nullptr;
+
+    PI_CHECK_ERROR(cuCtxGetCurrent(&original));
+
+    // Make sure the desired context is active on the current thread, setting
+    // it if necessary
+    if (original != desired) {
       PI_CHECK_ERROR(cuCtxSetCurrent(desired));
-      if (original_ == nullptr) {
-        // No context is installed on the current thread
-        // This is the most common case. We can activate the context in the
-        // thread and leave it there until all the PI context referring to the
-        // same underlying CUDA context are destroyed. This emulates
-        // the behaviour of the CUDA runtime api, and avoids costly context
-        // switches. No action is required on this side of the if.
-      } else {
-        needToRecover_ = true;
-      }
     }
   }
 
-  ~ScopedContext() {
-    if (needToRecover_) {
-      PI_CHECK_ERROR(cuCtxSetCurrent(original_));
-    }
-  }
+  ~ScopedContext() {}
 };
 
 /// \cond NODOXY
```
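For reference, here is a minimal sketch of how PI entry points in the plugin use this RAII type; the `cuda_piExampleCall` function is a hypothetical illustration of the pattern, not code from this diff:

```cpp
// Hypothetical PI entry point: ScopedContext makes the PI context's
// CUcontext current for the duration of the call and, after this patch,
// simply leaves it active when the call returns.
pi_result cuda_piExampleCall(pi_context context) {
  ScopedContext active(context); // activates context->get() if necessary
  // ... CUDA driver API calls here target the correct context ...
  return PI_SUCCESS;
}
```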

sycl/unittests/pi/cuda/CMakeLists.txt (+1 −0)

```diff
@@ -2,6 +2,7 @@ set(LLVM_REQUIRES_EH 1)
 add_sycl_unittest(PiCudaTests OBJECT
   test_base_objects.cpp
   test_commands.cpp
+  test_contexts.cpp
   test_device.cpp
   test_interop_get_native.cpp
   test_kernels.cpp
```
sycl/unittests/pi/cuda/test_contexts.cpp (new file, +247 −0)

```cpp
//==---- test_contexts.cpp --- PI unit tests -------------------------------==//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#include "gtest/gtest.h"

#include <condition_variable>
#include <thread>
#include <mutex>

#include <cuda.h>

#include "TestGetPlugin.hpp"
#include <CL/sycl.hpp>
#include <CL/sycl/detail/pi.hpp>
#include <detail/plugin.hpp>
#include <pi_cuda.hpp>

using namespace cl::sycl;

struct CudaContextsTest : public ::testing::Test {

protected:
  detail::plugin *plugin = pi::initializeAndGet(backend::cuda);

  pi_platform platform_;
  pi_device device_;

  void SetUp() override {
    // skip the tests if the CUDA backend is not available
    if (!plugin) {
      GTEST_SKIP();
    }

    pi_uint32 numPlatforms = 0;
    ASSERT_EQ(plugin->getBackend(), backend::cuda);

    ASSERT_EQ((plugin->call_nocheck<detail::PiApiKind::piPlatformsGet>(
                  0, nullptr, &numPlatforms)),
              PI_SUCCESS)
        << "piPlatformsGet failed.\n";

    ASSERT_EQ((plugin->call_nocheck<detail::PiApiKind::piPlatformsGet>(
                  numPlatforms, &platform_, nullptr)),
              PI_SUCCESS)
        << "piPlatformsGet failed.\n";

    ASSERT_EQ((plugin->call_nocheck<detail::PiApiKind::piDevicesGet>(
                  platform_, PI_DEVICE_TYPE_GPU, 1, &device_, nullptr)),
              PI_SUCCESS);
  }

  void TearDown() override {}

  CudaContextsTest() = default;

  ~CudaContextsTest() = default;
};

TEST_F(CudaContextsTest, ContextLifetime) {
  // start with no active context
  cuCtxSetCurrent(nullptr);

  // create a context
  pi_context context;
  ASSERT_EQ((plugin->call_nocheck<detail::PiApiKind::piContextCreate>(
                nullptr, 1, &device_, nullptr, nullptr, &context)),
            PI_SUCCESS);
  ASSERT_NE(context, nullptr);

  // create a queue from the context, this should use the ScopedContext
  pi_queue queue;
  ASSERT_EQ((plugin->call_nocheck<detail::PiApiKind::piQueueCreate>(
                context, device_, 0, &queue)),
            PI_SUCCESS);
  ASSERT_NE(queue, nullptr);

  // ensure the queue has the correct context
  ASSERT_EQ(context, queue->get_context());

  // check that the context is now the active CUDA context
  CUcontext cudaCtxt = nullptr;
  cuCtxGetCurrent(&cudaCtxt);
  ASSERT_EQ(cudaCtxt, context->get());

  plugin->call<detail::PiApiKind::piQueueRelease>(queue);
  plugin->call<detail::PiApiKind::piContextRelease>(context);

  // check that the context was cleaned up properly by the destructor
  cuCtxGetCurrent(&cudaCtxt);
  ASSERT_EQ(cudaCtxt, nullptr);
}

TEST_F(CudaContextsTest, ContextLifetimeExisting) {
  // start by setting up a CUDA context on the thread
  CUcontext original;
  cuCtxCreate(&original, CU_CTX_MAP_HOST, device_->get());

  // ensure the CUDA context is active
  CUcontext current = nullptr;
  cuCtxGetCurrent(&current);
  ASSERT_EQ(original, current);

  // create a PI context
  pi_context context;
  ASSERT_EQ((plugin->call_nocheck<detail::PiApiKind::piContextCreate>(
                nullptr, 1, &device_, nullptr, nullptr, &context)),
            PI_SUCCESS);
  ASSERT_NE(context, nullptr);

  // create a queue from the context, this should use the ScopedContext
  pi_queue queue;
  ASSERT_EQ((plugin->call_nocheck<detail::PiApiKind::piQueueCreate>(
                context, device_, 0, &queue)),
            PI_SUCCESS);
  ASSERT_NE(queue, nullptr);

  // ensure the queue has the correct context
  ASSERT_EQ(context, queue->get_context());

  // check that the context is now the active CUDA context
  cuCtxGetCurrent(&current);
  ASSERT_EQ(current, context->get());

  plugin->call<detail::PiApiKind::piQueueRelease>(queue);
  plugin->call<detail::PiApiKind::piContextRelease>(context);

  // check that the context was cleaned up, the old context will be restored
  // automatically by cuCtxDestroy in piContextRelease, as it was pushed on the
  // stack by cuCtxCreate
  cuCtxGetCurrent(&current);
  ASSERT_EQ(current, original);

  // release original context
  cuCtxDestroy(original);
}

// In some cases (for host_task), the SYCL runtime may call PI API functions
// from threads of the thread pool. This can cause issues because with the CUDA
// plugin these functions will set an active CUDA context on these threads, but
// never clean it up, as it will only get cleaned up in the main thread.
//
// So the following test aims to reproduce the scenario where there is a
// dangling deleted context in a separate thread, and checks that PI calls are
// still able to work correctly in that thread.
TEST_F(CudaContextsTest, ContextThread) {
  // start with no active context
  cuCtxSetCurrent(nullptr);

  // create two PI contexts
  pi_context context1;
  ASSERT_EQ((plugin->call_nocheck<detail::PiApiKind::piContextCreate>(
                nullptr, 1, &device_, nullptr, nullptr, &context1)),
            PI_SUCCESS);
  ASSERT_NE(context1, nullptr);

  pi_context context2;
  ASSERT_EQ((plugin->call_nocheck<detail::PiApiKind::piContextCreate>(
                nullptr, 1, &device_, nullptr, nullptr, &context2)),
            PI_SUCCESS);
  ASSERT_NE(context2, nullptr);

  // set up synchronization variables between the main thread and the testing
  // thread
  std::mutex m;
  std::condition_variable cv;
  bool released = false;
  bool thread_done = false;

  // create a testing thread that will create a queue with the first context,
  // release the queue, then wait for the main thread to release the first
  // context, and then create and release another queue with the second context
  // this time
  auto test_thread = std::thread([&] {
    CUcontext current = nullptr;

    // create a queue with the first context
    pi_queue queue;
    ASSERT_EQ((plugin->call_nocheck<detail::PiApiKind::piQueueCreate>(
                  context1, device_, 0, &queue)),
              PI_SUCCESS);
    ASSERT_NE(queue, nullptr);

    // ensure the queue has the correct context
    ASSERT_EQ(context1, queue->get_context());

    // check that the first context is now the active CUDA context
    cuCtxGetCurrent(&current);
    ASSERT_EQ(current, context1->get());

    plugin->call<detail::PiApiKind::piQueueRelease>(queue);

    // mark the first set of processing as done and notify the main thread
    std::unique_lock<std::mutex> lock(m);
    thread_done = true;
    lock.unlock();
    cv.notify_one();

    // wait for the main thread to release the first context
    lock.lock();
    cv.wait(lock, [&] { return released; });

    // check that the first context is still active, this is because deleting a
    // context only cleans up the current thread
    cuCtxGetCurrent(&current);
    ASSERT_EQ(current, context1->get());

    // create a queue with the second context
    ASSERT_EQ((plugin->call_nocheck<detail::PiApiKind::piQueueCreate>(
                  context2, device_, 0, &queue)),
              PI_SUCCESS);
    ASSERT_NE(queue, nullptr);

    // ensure the queue has the correct context
    ASSERT_EQ(context2, queue->get_context());

    // check that the second context is now the active CUDA context
    cuCtxGetCurrent(&current);
    ASSERT_EQ(current, context2->get());

    plugin->call<detail::PiApiKind::piQueueRelease>(queue);
  });

  // wait for the thread to be done with the first queue, then release the
  // first context
  std::unique_lock<std::mutex> lock(m);
  cv.wait(lock, [&] { return thread_done; });
  plugin->call<detail::PiApiKind::piContextRelease>(context1);

  // notify the other thread that the context was released
  released = true;
  lock.unlock();
  cv.notify_one();

  // wait for the thread to finish
  test_thread.join();

  plugin->call<detail::PiApiKind::piContextRelease>(context2);

  // check that there is no context set on the main thread
  CUcontext current = nullptr;
  cuCtxGetCurrent(&current);
  ASSERT_EQ(current, nullptr);
}
```
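These tests exercise the PI API directly; at the SYCL level, the scenario they guard against looks roughly like the sketch below. This is a hypothetical reproduction for illustration only, assuming host tasks run on runtime thread-pool threads as described in the commit message; it is not a test from this commit:

```cpp
#include <CL/sycl.hpp>

int main() {
  using namespace cl::sycl;
  {
    queue q1{gpu_selector{}};
    // host_task runs on a runtime thread-pool thread, where the CUDA
    // plugin leaves q1's context active
    q1.submit([&](handler &cgh) { cgh.host_task([] {}); }).wait();
  } // q1 and its context are destroyed here, but the deleted context
    // stays active on the pool threads

  queue q2{gpu_selector{}};
  // before this patch, PI calls made for q2 on those pool threads could
  // try to restore the deleted context; now they simply replace it
  q2.submit([&](handler &cgh) { cgh.host_task([] {}); }).wait();
}
```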
