[Draft] Variant alloc #100861
Any idea if this is faster when we're on mimalloc? (compared with and without this)
The main thing this allocator does is reduce the amount of time spent under lock. PagedAllocator is protected by a global lock; VariantAllocator has thread-local locks and only needs a global lock roughly once every 64 allocations. A sketch of the idea follows.
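A minimal sketch of that locking scheme, with hypothetical names and raw-storage semantics (this is not the PR's actual code; the batch size of 64 just mirrors the number quoted above):

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical sketch of a thread-cached pool: allocations are served from a
// thread-local free list, and the global lock is only taken roughly once per
// BATCH operations, when the local list is refilled or trimmed. Returned
// pointers are raw storage; callers construct/destroy with placement new.
template <class T, size_t BATCH = 64>
class ThreadCachedPool {
	static inline std::mutex global_lock; // shared; contended rarely
	static inline std::vector<T *> global_free; // shared free list
	static inline thread_local std::vector<T *> local_free; // per-thread cache

public:
	static T *alloc() {
		if (local_free.empty()) {
			std::lock_guard<std::mutex> guard(global_lock); // rare path
			while (local_free.size() < BATCH && !global_free.empty()) {
				local_free.push_back(global_free.back());
				global_free.pop_back();
			}
		}
		if (local_free.empty()) {
			// Pool exhausted: grab fresh storage (no lock needed).
			return static_cast<T *>(::operator new(sizeof(T)));
		}
		T *p = local_free.back(); // hot path: thread-local only
		local_free.pop_back();
		return p;
	}

	static void free(T *p) {
		local_free.push_back(p); // hot path: thread-local only
		if (local_free.size() > 2 * BATCH) {
			std::lock_guard<std::mutex> guard(global_lock); // rare path
			while (local_free.size() > BATCH) {
				global_free.push_back(local_free.back());
				local_free.pop_back();
			}
		}
	}
};
```

In this shape, a burst of allocations on one thread takes the global lock once per 64 blocks instead of once per block, which is where the reduction in contention comes from.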
This is still a draft; I am using this PR to make sure that it passes all the tests, particularly on MSVC. The plan is to replace paged_allocator with this entirely, but I still have to think about how to do the non-thread-safe version of it. I might end up splitting it in two, making an even simpler one for the non-thread-safe version.

I should ask, are you aware of #97016?

I was not aware of it, but that one also has a global lock, so I don't think it'll be faster than this. EDIT: The other PR's goals and use are a lot different in general. I don't really think it is worth comparing the two.
> This is still a draft; I am using this PR to make sure that it passes all the tests, particularly on MSVC. The plan is to replace paged_allocator with this entirely, but I still have to think about how to do the non-thread-safe version of it. I might end up splitting it in two, making an even simpler one for the non-thread-safe version.
As in, a drop-in replacement for PagedAllocator? If so, that'll take care of the name concern entirely; but if not, we can just cross that bridge once this is out of the draft stage. Got a few codestyle tweaks in the meantime
core/templates/variant_allocator.h (Outdated)
```cpp
#if (__has_builtin(__builtin_popcountll))
#define builtin_popcountll
#endif
#if (__has_builtin(__builtin_ffsll))
#define builtin_ffsll
```
Because we don't know for certain if `uint64_t` is `unsigned long long`, I'd rather use the generic equivalents:
```diff
-#if (__has_builtin(__builtin_popcountll))
-#define builtin_popcountll
-#endif
-#if (__has_builtin(__builtin_ffsll))
-#define builtin_ffsll
+#if __has_builtin(__builtin_popcountg)
+#define builtin_popcountg
+#endif
+#if __has_builtin(__builtin_ffsg)
+#define builtin_ffsg
```
Clang doesn't have `__builtin_ffsg`, or at least the version I use doesn't. I have added some static assertions to ensure the types are equivalent:
```cpp
#if defined(__GNUC__) || defined(__clang__)
#if (__has_builtin(__builtin_popcountll))
#define builtin_popcountll
static_assert(sizeof(unsigned long long) == sizeof(uint64_t), "uint64_t and unsigned long long must have the same size");
static_assert((uint64_t)(-1) == (unsigned long long)(-1), "uint64_t and unsigned long long must have the same representation");
#endif
#if (__has_builtin(__builtin_ffsll))
#define builtin_ffsll
static_assert(sizeof(long long) == sizeof(int64_t), "int64_t and long long must have the same size");
static_assert((int64_t)(-1) == (long long)(-1), "int64_t and long long must have the same representation");
#endif
#endif
```
Does this work for you?
I like the idea, but this would entirely lock out those with `unsigned long` as their `uint64_t`. If we don't have a generic, then `__builtin_popcountll` should do the trick; I was just concerned about unnecessary conversion on those platforms (or maybe that's optimized out, I'm ignorant on compiler magic).
Alternatively, if we still want to cover all of our bases, we could try this:
```cpp
#if defined(__GNUC__) || defined(__clang__)
// The preprocessor cannot evaluate std::is_same_v, so compare the limit
// macros instead to detect a 64-bit unsigned long long.
// (UINT64_MAX comes from <cstdint>, ULLONG_MAX from <climits>.)
#if ULLONG_MAX == UINT64_MAX
#if __has_builtin(__builtin_popcountll)
#define builtin_popcount_uint64 __builtin_popcountll
#endif // __has_builtin(__builtin_popcountll)
#if __has_builtin(__builtin_ffsll)
#define builtin_ffs_uint64 __builtin_ffsll
#endif // __has_builtin(__builtin_ffsll)
#else
#if __has_builtin(__builtin_popcountl)
#define builtin_popcount_uint64 __builtin_popcountl
#endif // __has_builtin(__builtin_popcountl)
#if __has_builtin(__builtin_ffsl)
#define builtin_ffs_uint64 __builtin_ffsl
#endif // __has_builtin(__builtin_ffsl)
#endif // ULLONG_MAX == UINT64_MAX
#endif // defined(__GNUC__) || defined(__clang__)
```
…Though this is really excessive & probably not necessary.
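For completeness: if the codebase can assume C++20, the `<bit>` header sidesteps the builtin-selection question portably. A hedged sketch, not something either comment above proposes:

```cpp
#include <bit>
#include <cstdint>

// C++20: std::popcount and std::countr_zero accept uint64_t directly on
// every conforming compiler; GCC and Clang lower them to the same builtins
// discussed above.
inline int popcount_u64(uint64_t v) {
	return std::popcount(v);
}

// __builtin_ffsll-style "find first set": 1-based index of the lowest set
// bit, or 0 when no bit is set (std::countr_zero(0) would return 64).
inline int ffs_u64(uint64_t v) {
	return v == 0 ? 0 : std::countr_zero(v) + 1;
}
```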
Yeah, this will be a drop-in replacement for PagedAllocator, at least for the `thread_safe = true` version of it.
This can actually be done using … I will try to do this. If everything works, then I will make a PR.
The block allocator still has different characteristics than this one, though. The slab allocator of this PR is very fast; it spends almost no time under lock. I don't think an allocator intended for more generic use is going to beat this one, especially given Godot's allocation patterns: the vast majority of allocation happens on the main thread, but deallocation happens all over the place. A benchmark needs to actually reproduce Godot's allocation/deallocation patterns; a synthetic benchmark that tries to stress the allocator by massively allocating on multiple threads isn't going to carry over to Godot (see the sketch below).
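To make that shape concrete, here is a hypothetical harness (not from either PR): one thread does nearly all the allocating while worker threads do the freeing. The allocator under test would replace `new`/`delete`:

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical benchmark shape: the main thread allocates, worker threads
// deallocate, mirroring the "alloc on main, free anywhere" pattern above.
int main() {
	constexpr int NUM_WORKERS = 3;
	constexpr int NUM_ALLOCS = 1000000;

	std::mutex queue_lock;
	std::queue<int *> to_free;
	std::atomic<bool> done{ false };

	std::vector<std::thread> workers;
	for (int w = 0; w < NUM_WORKERS; w++) {
		workers.emplace_back([&] {
			// Busy-waiting is acceptable for a short benchmark sketch.
			while (true) {
				int *p = nullptr;
				{
					std::lock_guard<std::mutex> guard(queue_lock);
					if (!to_free.empty()) {
						p = to_free.front();
						to_free.pop();
					}
				}
				if (p) {
					delete p; // deallocation happens off the main thread
				} else if (done.load()) {
					return; // queue drained and producer finished
				}
			}
		});
	}

	auto start = std::chrono::steady_clock::now();
	for (int i = 0; i < NUM_ALLOCS; i++) {
		int *p = new int(i); // the vast majority of allocation: main thread
		std::lock_guard<std::mutex> guard(queue_lock);
		to_free.push(p);
	}
	done.store(true);
	for (std::thread &t : workers) {
		t.join();
	}
	auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start);
	printf("%d allocations freed across %d threads in %lld ms\n", NUM_ALLOCS, NUM_WORKERS, (long long)ms.count());
	return 0;
}
```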
Being under the lock does not by itself play any role in multithreaded performance. The bottleneck is the SpinLock itself and the allocator data, since they are shared between threads.

[Test results: Voxel, tps-demo]
Custom SpinLock code:

```cpp
// Instrumented SpinLock used only to count contention: every spin iteration
// logs how often waiting happened and which threads were involved.
class SpinLock2 {
	mutable std::atomic<bool> locked = false;

public:
	_ALWAYS_INLINE_ void lock() const {
		// Written under the lock, read without it; racy, but acceptable for
		// debug logging.
		static Thread::ID thread_id_in_use;
		while (true) {
			bool expected = false;
			if (locked.compare_exchange_weak(expected, true, std::memory_order_acquire, std::memory_order_relaxed)) {
				break;
			}
			do {
				static int count = 0; // contention counter (per spin iteration)
				count++;
				print_line("Locked Count:", count, " thread id using shared resource:", thread_id_in_use, "waiting thread id:", Thread::get_caller_id());
				_cpu_pause();
			} while (locked.load(std::memory_order_relaxed));
		}
		thread_id_in_use = Thread::get_caller_id();
	}

	_ALWAYS_INLINE_ void unlock() const {
		locked.store(false, std::memory_order_release);
	}
};
```

The probability that the SpinLock will block a thread is very low (0-10 times while generating one chunk, or 53 times per minute of gameplay in tps-demo, most of them during loading), because a thread only blocks if we simultaneously allocate data in another thread. I haven't fully looked into the …
In almost all cases, this lands in the padding between elements anyway, except when the items are already aligned. But this allocator was made for Variant specifically, where this is never true. The index is free in all cases where the allocator is used, and it means that there are no hashes, no lookups of any kind; everything is just pointer arithmetic on data already in cache. It might indeed be slightly less friendly if the items already happen to be aligned, and then it could potentially waste some space, but since the index sits directly after the item I don't think it is cache-unfriendly. The bitmap itself is also aligned such that it should land on a different cache line than the main structure. It really is quite fast :)
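The cache-line separation mentioned above can be pictured like this (a hypothetical layout assuming 64-byte cache lines; field names are illustrative, not the PR's struct):

```cpp
#include <cstddef>
#include <cstdint>

// alignas pushes the occupancy bitmap onto its own cache line so that
// scanning it does not share a line with the slab's hot header fields.
struct SlabSketch {
	uint32_t used_count = 0; // hot header fields (illustrative)
	uint32_t first_free = 0;
	alignas(64) uint64_t bitmap[4] = {}; // begins on a fresh cache line
};

static_assert(offsetof(SlabSketch, bitmap) % 64 == 0, "bitmap starts a new cache line");
```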
The problem with this approach is that doing extra work in the spinlock invalidates the results. We need to actually benchmark in game: the odds of a spinlock contending are changed dramatically by how much work is being done inside it. It's a bit of a Heisenberg problem 😄 If your allocator achieves better frame rates than mine does, then it is better! I'm not saying it is not better; it very well might be! However, I don't think your instrumented spinlock proves whether or not it is better.
Well, in general, I wanted to note that blocking is not frequent. Even if there were 2-3 times more locks, it would not be significant.
Structures and classes cannot use the padding between elements. To prove it, you can write `sizeof(ThreadSafeSlabAllocator<Variant::Pools::BucketSmall>::Item)`. This returns 28 bytes, while BucketSmall itself is 24. The same will be true for BucketMedium and BucketLarge. And `sizeof(ThreadSafeSlabAllocator<Variant>::Item)` returns 32, not 24.
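Those numbers follow directly from alignment rules; here is a self-contained illustration with stand-in types (the 24-byte sizes and alignments are assumptions matching the figures above, and `Item` is a hypothetical layout, not the PR's actual struct):

```cpp
#include <cstdint>

// Stand-in for a 24-byte element with 8-byte alignment, like Variant on the
// platform quoted above (assumed for illustration).
struct alignas(8) Elem24 {
	uint8_t data[24];
};

// Stand-in for a 24-byte bucket with 4-byte alignment.
struct BucketSmall24 {
	uint32_t data[6];
};

// The pattern under discussion: a bookkeeping index stored directly after
// the element.
template <class T>
struct Item {
	T value;
	uint32_t index;
};

// 24 + 4 = 28: with alignof(T) == 4 the index adds exactly four bytes.
static_assert(sizeof(Item<BucketSmall24>) == 28, "no trailing padding needed");
// With alignof(T) == 8, sizeof rounds 28 up to the next multiple of 8.
static_assert(sizeof(Item<Elem24>) == 32, "tail padding rounds size up to 32");
```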
The HashMap lookup is called only once for each thread. Using the id (which is already cached), we find our allocator in the allocator array, which is the same kind of pointer arithmetic operation; see the sketch below.
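A rough sketch of that once-per-thread pattern (hypothetical names, with std::unordered_map standing in for the HashMap; not the actual code):

```cpp
#include <cstdio>
#include <mutex>
#include <thread>
#include <unordered_map>

// The map lookup (under a lock) runs only on a thread's first allocation;
// afterwards the result is cached in a thread_local pointer, so the hot
// path is a plain read.
struct PerThreadAllocator {
	int dummy = 0;
};

static std::mutex map_lock;
static std::unordered_map<std::thread::id, PerThreadAllocator> allocators;

static PerThreadAllocator *lookup_slow() {
	std::lock_guard<std::mutex> guard(map_lock); // paid once per thread
	return &allocators[std::this_thread::get_id()];
}

static PerThreadAllocator *get_thread_allocator() {
	static thread_local PerThreadAllocator *cached = lookup_slow();
	return cached; // every later call is just a TLS read
}

int main() {
	std::thread t([] { get_thread_allocator(); });
	get_thread_allocator();
	t.join();
	printf("threads registered: %zu\n", allocators.size()); // prints 2
	return 0;
}
```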
Now, since you asked for benchmarks: I did the same tests as you did in this PR, merged your PR with #97016, and compared these branches: […] You can see that my allocator has a higher FPS.

In general, what I want to say: you wrote in rocket chat that you don't want to have a discussion with me and that I'd better stop working on this. And I say your allocator is great, but you're just using it in the wrong place, because we can store the allocator metadata directly in the …
I hope you agree with me :)
I'll address the rest of your post later, but I'd like to clarify that I really didn't mean to suggest I didn't want to talk with you. If you ping me on rocket.chat I'd be happy to explain what I meant. I'm very sorry if I gave you the impression that I just didn't want to discuss things with you. That is really not what I wanted to say.
You are right, I don't know what I was thinking. I thought I had measured this but apparently not. Or I tested something completely unrelated.
New allocator for Variant.
Performance improvements were measured using the TPS demo: start the level and don't touch anything. Random spawn points have been removed.