
Support multi-stream allocation for CUDA place #37290

Merged (13 commits) on Nov 25, 2021

Conversation

@From00 From00 commented Nov 17, 2021

PR types

New features

PR changes

Others

Describe

This PR supports multi-stream alloc and free for the CUDA place.
A new StreamSafeCUDAAllocator is implemented, which supports safe and efficient CUDA memory allocation and GC. The core ideas are:

  1. an allocation is associated with a CUDA stream, i.e., the stream that first requests the allocation
  2. the allocation can only be re-allocated to its associated stream
  3. other streams that use the allocation asynchronously must be recorded proactively
  4. when an allocation is freed, CUDA events are created for the recorded streams, while records from the associated stream are ignored
  5. the actual free is delayed until all events from the other streams have completed, to ensure the correctness of asynchronous CUDA kernel execution
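The five steps above can be sketched in plain C++, with ints standing in for CUDA streams and a query callback standing in for cudaEventQuery; all names here are illustrative stand-ins, not the PR's actual classes:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <vector>

// Hypothetical sketch of the stream-safe allocator idea.
struct Allocation {
  void* ptr;
  int owning_stream;               // (1) stream that first requested the memory
  std::set<int> recorded_streams;  // (3) other streams that used it asynchronously
};

class StreamSafeAllocatorSketch {
 public:
  Allocation* Alloc(std::size_t size, int stream) {
    // (2) memory is handed out per-stream; real code would only reuse a
    // cached block for its owning stream.
    return new Allocation{::operator new(size), stream, {}};
  }

  void RecordStream(Allocation* a, int stream) {
    if (stream != a->owning_stream) a->recorded_streams.insert(stream);
  }

  // (4)+(5) free is deferred until every recorded stream's "event" completes;
  // event_done stands in for cudaEventQuery.
  void Free(Allocation* a, const std::function<bool(int)>& event_done) {
    pending_.push_back(a);
    ProcessEvents(event_done);
  }

  void ProcessEvents(const std::function<bool(int)>& event_done) {
    std::vector<Allocation*> still_pending;
    for (Allocation* a : pending_) {
      bool all_done = true;
      for (int s : a->recorded_streams) {
        if (!event_done(s)) { all_done = false; break; }
      }
      if (all_done) {
        ::operator delete(a->ptr);
        delete a;
      } else {
        still_pending.push_back(a);
      }
    }
    pending_ = std::move(still_pending);
  }

  std::size_t pending_count() const { return pending_.size(); }

 private:
  std::vector<Allocation*> pending_;
};
```

An allocation freed while another recorded stream is still "busy" stays pending and is only destroyed on a later ProcessEvents pass, mirroring the delayed free described in step 5.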

Interface changes

  1. The old interfaces "AllocShared", "Alloc", and "Release" implicitly use the NULL stream.
  2. A new set of interfaces is exposed that allows a stream parameter to be passed in.
  3. A "RecordStream" interface is exposed. When memory is reused from another CUDA stream, "RecordStream" should be called on the host side after the kernel launch.
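The interface split above might look roughly like the following self-contained sketch; the types and signatures are placeholders for illustration, not Paddle's real API:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Stand-in for the opaque CUDA stream handle (cudaStream_t is a pointer type).
using gpuStream_t = void*;

// Placeholder allocation record, not Paddle's Allocation class.
struct AllocationStub {
  std::size_t size;
  gpuStream_t stream;
};

// Legacy interface: the NULL (default) stream is implied.
std::shared_ptr<AllocationStub> AllocShared(std::size_t size) {
  return std::make_shared<AllocationStub>(AllocationStub{size, nullptr});
}

// New stream-aware overload: the caller passes the stream explicitly.
std::shared_ptr<AllocationStub> AllocShared(std::size_t size,
                                            gpuStream_t stream) {
  return std::make_shared<AllocationStub>(AllocationStub{size, stream});
}
```

A reviewer below suggests the alternative of a single signature with `stream` as a defaulted trailing parameter, which collapses the two overloads into one.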

Notes

  1. Only the auto_growth allocator strategy is supported for now
  2. Set "FLAGS_use_stream_safe_cuda_allocator=true" to enable it

@wanghuancoder wanghuancoder (Contributor) left a comment:

Regarding the retry mechanism: when GPU memory is freed on the current stream, there is no need to notify the other streams.

From00 commented Nov 18, 2021

Regarding the retry mechanism: when GPU memory is freed on the current stream, there is no need to notify the other streams.

The newly pushed commit has made this change. When GPU memory is freed, only retries on the current stream are notified in the RetryAllocator; other streams are not notified. In addition, if the RetryAllocator's retries time out and fail, the upper-layer AllocatorFacade will try to release memory from all streams and then attempt the allocation again.

Comment on lines 242 to 245
if (FLAGS_use_stream_safe_cuda_allocator && platform::is_gpu_place(place) &&
size > 0) {
return GetCUDAAllocator(BOOST_GET_CONST(platform::CUDAPlace, place),
default_stream_);
Contributor comment:

The logic here gives FLAGS_use_stream_safe_cuda_allocator the highest priority, above both size == 0 and FLAGS_use_system_allocator. I think it would be more reasonable for FLAGS_use_stream_safe_cuda_allocator to have lower priority than the other two.

Author reply:

Thanks, fixed. A FLAGS_use_system_allocator == false check was added after size > 0, so the multi-stream logic is only taken when FLAGS_use_system_allocator == false:

FLAGS_use_system_allocator == false) {

Comment on lines 29 to 36
void StreamSafeCUDAAllocation::RecordStream(gpuStream_t stream) {
VLOG(8) << "Record stream " << stream << " to " << ptr();
if (stream == owning_stream_) {
return;
}
std::lock_guard<std::mutex> lock(mutex_);
recorded_streams_->insert(stream);
}
Contributor comment:

It would be better to record the event right here. As it stands, we call CreateEventForAllRecordedStream to record events at FreeAllocation time. The earlier an event is recorded, the earlier the memory can be released.

Author reply:

Creating the event directly in RecordStream would reduce the delay before the memory is released, but RecordStream may be called multiple times for the same stream, which would create multiple events for it. Creating the events only later, when a new alloc triggers the free processing, keeps the number of events down.
Since creating and querying events also carries a significant time cost, which implementation is better depends on the memory usage pattern of the upper layers. If they call RecordStream repeatedly and frequently, the current implementation is better; otherwise, creating the event directly in RecordStream is better. This is still undecided and needs to be measured on real models, so one approach was picked for now; if later measurements clearly show that creating events earlier is better, we will improve this.
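The trade-off described above can be illustrated with a toy comparison, where a counter stands in for CUDA event creation; both types below are hypothetical, not the PR's code:

```cpp
#include <cassert>
#include <set>

// Current PR approach: record streams into a set (deduplicating repeats)
// and create events only once, at free time.
struct DedupRecorder {
  std::set<int> streams;
  int events_created = 0;
  void Record(int s) { streams.insert(s); }
  void Free() { events_created += static_cast<int>(streams.size()); }
};

// Alternative discussed by the reviewer: create an event eagerly on every
// Record call, so the free can happen earlier but repeated records on the
// same stream each cost an event.
struct EagerRecorder {
  int events_created = 0;
  void Record(int /*s*/) { ++events_created; }
};
```

Recording the same stream three times costs one event under the deduplicating scheme but three under the eager one, which is exactly why the better choice depends on how often upper layers repeat RecordStream calls.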

bool StreamSafeCUDAAllocator::IsAllocThreadSafe() const { return true; }

Allocation* StreamSafeCUDAAllocator::AllocateImpl(size_t size) {
std::lock_guard<std::recursive_mutex> lock(mutex_);
Contributor comment:

Why use recursive_mutex here? In earlier experiments we found that the allocator's alloc/free lock is contended very frequently. If lock contention puts a thread to sleep, it must be woken up once the other thread releases the lock, and that wake-up is very expensive. That is why we later switched to a spin lock here.

Author reply:

Thanks, changed to a spin lock.
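For reference, a minimal spin lock of the kind discussed here can be built on std::atomic_flag; this is an illustrative sketch, not Paddle's actual SpinLock implementation:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>   // std::lock_guard
#include <thread>

// Contended threads busy-wait instead of being descheduled, avoiding the
// expensive wake-up of a blocking mutex on short critical sections.
class SpinLock {
 public:
  void lock() {
    while (flag_.test_and_set(std::memory_order_acquire)) {
      std::this_thread::yield();  // back off instead of burning the core
    }
  }
  void unlock() { flag_.clear(std::memory_order_release); }

 private:
  std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};
```

Because it exposes lock()/unlock(), it is a drop-in replacement for the mutex in std::lock_guard<SpinLock>, matching how the allocator code above guards its critical sections.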

AllocationPtr underlying_allocation = underlying_allocator_->Allocate(size);
StreamSafeCUDAAllocation* allocation = new StreamSafeCUDAAllocation(
std::move(underlying_allocation), default_stream_);
allocation_info_map_[allocation] = std::make_shared<AllocationInfo>();
Contributor comment:

I suggest storing the allocated allocations and the recorded allocations separately. Keep in mind that there are many allocations, and many of them are allocated and freed frequently, while only a very small fraction ever get recorded. If they are stored separately, ProcessEventsAndFree only needs to iterate over the recorded allocations; otherwise its performance will suffer.

Author reply:

Thanks, fixed. Nothing is inserted into the map at alloc time; an allocation's info is only added to the map when it is freed. This avoids repeatedly iterating over a large number of still-live allocations on every ProcessEventsAndFree.

Comment on lines +95 to +96
std::deque<gpuEvent_t>& outstanding_events =
outstanding_events_map_[allocation];
Contributor comment:

Could we first get dynamic_cast<StreamSafeCUDAAllocation*>(allocation)->GetRecordedStreams().get(), check whether the set is empty, and only if it is non-empty do outstanding_events_map_[allocation] = CreateEventForAllRecordedStream()?

Author reply:

Thanks, fixed. A check was added in FreeImpl: allocations whose recorded_streams is empty are deleted directly, and only non-empty ones go through the outstanding_events creation and map insertion logic in FreeStreamSafeCUDAAllocation:

void StreamSafeCUDAAllocator::FreeImpl(Allocation* allocation) {
  std::lock_guard<SpinLock> lock_guard(spin_lock_);
  if (dynamic_cast<StreamSafeCUDAAllocation*>(allocation)
          ->GetRecordedStreams()
          ->empty()) {
    delete allocation;
  } else {
    FreeStreamSafeCUDAAllocation(allocation);
  }
}

Performance improvement: directly delete allocation when the recorded_streams is empty in FreeImpl of StreamSafeCUDAAllocator
wanghuancoder
wanghuancoder previously approved these changes Nov 22, 2021
@wanghuancoder wanghuancoder (Contributor) left a comment:

LGTM

return allocation::AllocatorFacade::Instance().Release(place);
}

#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
std::shared_ptr<Allocation> AllocShared(const platform::CUDAPlace& place,
Contributor comment:

You can make stream the last parameter and give it a default value of nullptr.

Author reply:

done, thx

Comment on lines 303 to 320

try {
return cuda_allocator->Allocate(size);
} catch (BadAlloc&) {
VLOG(9) << "Allocation failed when allocating " << size
<< " bytes for stream " << stream;
for (auto pair : cuda_allocators_[place]) {
pair.second->Release(place);
}
try {
return cuda_allocator->Allocate(size);
} catch (...) {
VLOG(9) << "Still allocation failed "
<< "after release memory from all streams";
throw;
}
} catch (...) {
throw;
Contributor comment:

This code can be moved into StreamSafeCUDAAllocator.

Author reply:

done, thx

@wanghuancoder wanghuancoder (Contributor) left a comment:

LGTM

@zhiqiu zhiqiu (Contributor) left a comment:

LGTM

@zhiqiu zhiqiu merged commit b9c464c into PaddlePaddle:develop Nov 25, 2021
@From00 From00 deleted the stream-safe-cuda-allocator branch December 5, 2021 03:17
Zjq9409 pushed a commit to Zjq9409/Paddle that referenced this pull request Dec 10, 2021
* Support multi-stream allocation for CUDA place

* Do not notify the retrying from other streams when free CUDA allocation

* Fix compile error for CPU

* Fix compile error for HIP

* Release memory for StreamSafeCUDAAllocaRetry in malloc_test

* Add FLAGS_use_stream_safe_cuda_allocator

* Fix CI error for 'set_tests_properties'

* Invalidate stream safe CUDA allocator for naive_best_fit and thread_local strategy

* Performance improvement: insert allocation pair to outstanding_events_map when free but not alloc; replace recursive_mutex with SpinLock

* FLAGS priority changes: FLAGS_use_system_allocator > FLAGS_use_stream_safe_cuda_allocator

* Performance improvement: directly delete allocation when the recorded_streams is empty in FreeImpl of StreamSafeCUDAAllocator

* Add UT for alloc interface

* Changes multi-stream interface; move retry code from AllocatorFacadePrivate to StreamSafeCUDAAllocator
4 participants