GPU data races as LLVMGPUDistribute has ops distributed to multiple threads performing the same memory accesses #20358
Note that the LLVMGPUDistribute pipeline is set in 3 cases in compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp:
Case 1 is what we have seen as numerical failures on CI, because the data races in Sort lead to different numerical results. Case 3 has not been observed as a CI failure so far, but that may only be because the data race there is silent: different threads write the same value to the same location. Even though that has not manifested as incorrect numerical results so far, it is still a data race, and therefore still undefined behavior in LLVM IR's concurrent memory model. Case 2 has not been observed yet; that might only be because Fft is rarely used. Depending on how exactly it is implemented (how it uses intermediate buffers), it could resemble either Case 1 or Case 3.
The LinalgExt::SortOp is one of several ops handled by LLVMGPUDistribute. This pipeline mismanages thread distribution, resulting in racing memory accesses that are causing certain tests to become flaky. So this op is being moved to the more robust LLVMGPUTileAndFuse pipeline.

Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
Created a PR for the first case that Benoit identified in an earlier comment: LinalgExt::SortOp. Will create follow-up PRs for Cases 2 and 3 as well.
We are seeing many CI flakes with multiple ROCm targets (RDNA3, CDNA3, CDNA2). The common theme seems to be ops handled by LLVMGPUDistribute that end up not actually being thread-distributed, so all threads perform the exact same memory accesses, racing with each other.
As discussed on discord and in #18649 and #20327.
Sample failure: https://github.com/iree-org/iree/actions/runs/14038781847/job/39303988986?pr=20355
Isolating this `sort3D` function as a testcase and compiling:
-> print-IR-after-all log: https://gist.github.com/bjacob/1dfc92d50e61865304b9eaf05fcb6a3f
-> target asm: https://gist.github.com/bjacob/908ace1302fb8dd98965c241156d3145
A workgroup size of 128 is being selected here: https://gist.github.com/bjacob/1dfc92d50e61865304b9eaf05fcb6a3f#file-log-mlir-L11874
But no corresponding thread-distribution of the code is done.
Should probably instead select the smallest possible workgroup size (e.g. the subgroup size) and then have an `if (lane_id == 0) {...}` guard in the generated code?

@qedawkins suggested it might be a matter of switching to LLVMGPUTileAndFuse: https://discord.com/channels/689900678990135345/1353760643668512931/1353768473167134784