GPU data races as LLVMGPUDistribute has ops distributed to multiple threads performing the same memory accesses #20358

Open
bjacob opened this issue Mar 24, 2025 · 2 comments
Labels: bug 🐞, codegen/rocm

Comments

bjacob commented Mar 24, 2025

We are seeing many CI flakes on multiple ROCm targets (RDNA3, CDNA3, CDNA2). The common theme seems to be ops handled by LLVMGPUDistribute that do not actually get thread-distributed, so every thread performs the exact same memory accesses and the threads race with each other.

This was discussed on Discord and in #18649 and #20327.

Sample failure: https://github.com/iree-org/iree/actions/runs/14038781847/job/39303988986?pr=20355

122/176 Test  #733: iree/tests/e2e/stablehlo_ops/check_rocm_hip_stream_sort.mlir .....................................***Failed    0.53 sec
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from module
[ RUN      ] module.sort1D
[       OK ] module.sort1D (405 ms)
[ RUN      ] module.sort2D
[       OK ] module.sort2D (26 ms)
[ RUN      ] module.sort3D
iree/runtime/src/iree/modules/check/module.cc:478: Failure
Failed
Expected equality of these values. Contents does not match.
  lhs:
    1x2x4xi32=[[1 2 3 4][1 2 4 4]]
  rhs:
    1x2x4xi32=[[1 2 3 4][1 2 3 4]]

Isolating this sort3D function as a testcase:

func.func @sort3D() {
  %input = util.unfoldable_constant dense<[[[1, 2, 3, 4],
                                            [4, 3, 2, 1]]]> : tensor<1x2x4xi32>

  %sort = "stablehlo.sort"(%input) ( {
  ^bb0(%arg1: tensor<i32>, %arg2: tensor<i32>):  // no predecessors
    %compare = "stablehlo.compare"(%arg1, %arg2) {comparison_direction = #stablehlo<comparison_direction LT>} : (tensor<i32>, tensor<i32>) -> tensor<i1>
    "stablehlo.return"(%compare) : (tensor<i1>) -> ()
  }) {dimension = 2 : i64, is_stable = false} : (tensor<1x2x4xi32>) -> tensor<1x2x4xi32>

  check.expect_eq_const(%sort, dense<[[[1, 2, 3, 4], [1, 2, 3, 4]]]> : tensor<1x2x4xi32>) : tensor<1x2x4xi32>
  return
}

Compiling:

tools/iree-compile --output-format=vm-bytecode --mlir-print-op-on-diagnostic=false --iree-hal-target-backends=rocm --iree-input-type=stablehlo --iree-hip-target=gfx942 ~/a.mlir -o  ~/a.vmfb

-> print-IR-after-all log: https://gist.github.com/bjacob/1dfc92d50e61865304b9eaf05fcb6a3f
-> target asm: https://gist.github.com/bjacob/908ace1302fb8dd98965c241156d3145

A workgroup size of 128 is being selected here: https://gist.github.com/bjacob/1dfc92d50e61865304b9eaf05fcb6a3f#file-log-mlir-L11874

But no corresponding thread-distribution of the code is done.

Should we instead select the smallest possible workgroup size (e.g. the subgroup size) and then wrap the code in an if (lane_id == 0) {...}?
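To make the lane-guard idea concrete, here is a minimal Python sketch, not IREE code. The names `kernel` and `launch` and the workgroup size of 64 are assumptions for illustration, and `buf.sort()` stands in for whatever serial body the compiler would emit. It only shows that with the guard exactly one lane touches memory, so there is nothing left to race with.

```python
def kernel(buf, lane_id):
    """Hypothetical kernel body: only lane 0 runs the serial code."""
    if lane_id != 0:
        return False  # all other lanes exit without touching memory
    buf.sort()  # stand-in for the serial, non-distributed op
    return True


def launch(data, workgroup_size):
    """Run the kernel once per lane and count lanes that accessed memory."""
    buf = list(data)
    active_lanes = sum(kernel(buf, lane) for lane in range(workgroup_size))
    return buf, active_lanes


# Assumed workgroup size of one 64-lane subgroup.
result, active_lanes = launch([4, 3, 2, 1], workgroup_size=64)
print(result, active_lanes)  # [1, 2, 3, 4] 1
```

This leaves all but one lane idle, which is why a small workgroup size matters for this approach.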

@qedawkins suggested it might be a matter of switching to LLVMGPUTileAndFuse: https://discord.com/channels/689900678990135345/1353760643668512931/1353768473167134784

bjacob added the bug 🐞 and codegen/rocm labels on Mar 24, 2025

bjacob commented Mar 25, 2025

Note that the LLVMGPUDistribute pipeline is set in 3 cases in compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp:

  1. For LinalgExt::SortOp (setSortConfig). I guess that TopK also bottoms out on Sort and so hits that case too.
  2. For LinalgExt::FftOp (setFftConfig).
  3. For untuned Linalg ops (setRootDefaultConfig).

Case 1. is what we have seen as numerical failures on CI, because the data races in Sort are leading to different numerical results.
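As a rough illustration of how a redundantly executed in-place sort can corrupt data, the Python sketch below simulates two threads running the same bubble sort on one shared buffer. This is an assumption-laden model, not the code IREE generates: `sort_kernel`, `run`, and the schedule are hypothetical, and each `yield` marks a point where the other thread may be scheduled. When one thread is paused halfway through a swap, its late write clobbers the other thread's work, which duplicates one value and loses another, much like the `[1 2 4 4]` result in the failing test.

```python
def sort_kernel(buf):
    """One simulated thread: in-place bubble sort on the shared buffer."""
    n = len(buf)
    for end in range(n - 1, 0, -1):
        for i in range(end):
            a = buf[i]
            yield  # preemption point after each memory access
            b = buf[i + 1]
            yield
            if b < a:
                buf[i] = b
                yield  # swap is half done here: buf holds b twice
                buf[i + 1] = a
                yield


def run(data, num_threads, schedule):
    """Step threads in the given order, then drain them one by one."""
    buf = list(data)
    threads = [sort_kernel(buf) for _ in range(num_threads)]
    for tid in schedule:
        next(threads[tid], None)  # one memory access of thread `tid`
    for t in threads:
        for _ in t:
            pass
    return buf


data = [3, 2, 1]
# No interleaving: thread 0 finishes before thread 1 starts.
serial = run(data, 2, [])
# Thread 0 is paused mid-swap (after 7 accesses), thread 1 runs to the end.
raced = run(data, 2, [0] * 7 + [1] * 20)
print(serial)  # [1, 2, 3]
print(raced)   # [1, 1, 3]
```

The raced result is not even a permutation of the input, so no later sorting pass can repair it.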

Case 3. has not been observed as a CI failure so far, but that may only be because the data race is silent in that case: different threads write the same value to the same location. Even though it has not produced incorrect numerical results so far, it is still a data race, and therefore still undefined behavior in LLVM IR's concurrent memory model.
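The silent variant can be sketched the same way. The Python model below is hypothetical (the function names and the `x + 1` elementwise op are assumptions, not taken from IREE): it counts how many times each output location is written when every thread runs the whole op, compared with a version where each thread owns a disjoint set of elements. Both produce the same values, which is why the first form goes unnoticed in tests.

```python
def run_undistributed(src, num_threads):
    """Every thread executes the full op: same values, same locations."""
    dst = [0] * len(src)
    writes = [0] * len(src)
    for _tid in range(num_threads):
        for i, x in enumerate(src):
            dst[i] = x + 1
            writes[i] += 1
    return dst, writes


def run_distributed(src, num_threads):
    """Each thread handles a disjoint, strided subset of the elements."""
    dst = [0] * len(src)
    writes = [0] * len(src)
    for tid in range(num_threads):
        for i in range(tid, len(src), num_threads):
            dst[i] = src[i] + 1
            writes[i] += 1
    return dst, writes


src = [10, 20, 30, 40]
dst_u, writes_u = run_undistributed(src, 4)
dst_d, writes_d = run_distributed(src, 4)
print(dst_u == dst_d)  # True: results are identical
print(writes_u)        # [4, 4, 4, 4]: every location written by all threads
print(writes_d)        # [1, 1, 1, 1]: one writer per location
```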

Case 2. has not been observed yet; it might be only because Fft is rarely used. Depending on how exactly it is implemented (how it uses intermediate buffers) it might be similar to either case 1 or case 3.

Muzammiluddin-Syed-ECE added a commit to Muzammiluddin-Syed-ECE/iree that referenced this issue Mar 28, 2025
The LinalgExt::SortOp is one of several ops that are handled by
LLVMGPUDistribute. This pipeline mismanages thread distribution resulting in racing
memory accesses which are causing certain tests to become flaky. So this
op is being moved to the more robust LLVMGPUTileAndFuse pipeline.

Muzammiluddin-Syed-ECE commented Mar 28, 2025

Created PR for the first case that Benoit identified in an earlier comment: LinalgExt::SortOp.

Will create follow-up PRs for Cases 2 and 3 as well.
