
Ukernel lowering for data-tiled multi_mma with mfma_i32_16x16x32_i8 #19522

Merged
merged 1 commit into from
Dec 19, 2024

Conversation

bjacob
Contributor

@bjacob bjacob commented Dec 18, 2024

This finishes implementing an initial ukernel for multi_mma for DataTiledMMAAttr with kind = mfma_i32_16x16x32_i8.

The ukernel takes the unroll and subgroup parameters as function parameters. The idea is that once inlining works as intended, these parameters will become compile-time constants, and the optimized code will be the same as if we had hardcoded specific values. That inlining isn't happening at the moment, which is a bug we should fix first. It does happen in LLVMCPU, so something is probably missing in LLVMGPU.

The ukernel file has a comment with a few TODOs to get from this initial naive ukernel to something faster. The first step is to fix the above-mentioned inlining problem, then get shared memory, then get better instruction scheduling.

@bjacob bjacob marked this pull request as ready for review December 18, 2024 22:19
@bjacob
Contributor Author

bjacob commented Dec 19, 2024

Just thought of a problem: while it takes an int unroll_k parameter, the only value that works is 2, because it uses a fixed vector type of that size. I'll fix that tomorrow.

@MaheshRavishankar
Contributor

Just thought of a problem: while it takes an int unroll_k parameter, the only value that works is 2, because it uses a fixed vector type of that size. I'll fix that tomorrow.

Drive-by comment: aren't unroll_k etc. implementation details of the compiler? Do they need to cross the ukernel API boundary? I would expect the ukernel to only worry about the problem size/architecture, not these details.

@bjacob
Contributor Author

bjacob commented Dec 19, 2024

These unroll_{m,n,k} parameters control the tile shape and layout; the corresponding unrolling of the kernel code is just an implementation detail. Setting unroll_k = 2 causes the tile to have a 2x larger K-dimension size. See the effect here in the TileSwizzle calculation (which is where both tile shape and layout are decided):

if (mma.getUnrollK() > 1) {
  expand(swizzle, 1, {Kind::CrossIntrinsic, mma.getUnrollK()});
  int interleavingIdx =
      getInnermostNonInternalDimIdx(swizzle.expandShape[1]);
  interleave(swizzle, 1, interleavingIdx);
}

Comment on lines 627 to 632
// Preserve the lowering_config attribute for GPULowerToUKernelsPass.
constexpr char loweringConfigAttrName[] = "lowering_config";
if (mmaOp->hasAttr(loweringConfigAttrName)) {
  newMmaOp->setAttr(loweringConfigAttrName,
                    mmaOp->getAttr(loweringConfigAttrName));
}
Contributor

nit: use kConfigAttrName.

constexpr StringLiteral kConfigAttrName = "lowering_config";

Question: do we preserve all the discardable attributes (i.e., additional attributes that are not defined by the op itself)? If so, you can do something like

newMmaOp->setDiscardableAttrs(mmaOp->getDiscardableAttrDictionary());

Contributor Author


Thanks, that worked! Earlier I had tried something with setAttrs / getAttrs and that caused other tests to fail. I didn't know about setDiscardableAttrs.

Comment on lines -623 to -629
auto newKind = mmaOp.getKind();
if (auto dataTiledMma = dyn_cast<DataTiledMMAAttr>(newKind)) {
  newKind = DataTiledMMAAttr::get(
      context, dataTiledMma.getIntrinsic(), dataTiledMma.getUnrollM(),
      /*subgroups_m=*/1, dataTiledMma.getUnrollN(),
      /*subgroups_n=*/1, dataTiledMma.getUnrollK());
}
Contributor


I did not pay much attention to the changes to DataTiledMMAAttr. Why do we drop the newKind here? Does it impact the codegen path, or is it handled by the attribute interface implementation? Thanks in advance for your explanation; perhaps we can put the information into the PR description.

Contributor Author


We are now transporting the old kind, unchanged. The code being deleted here was creating a new kind that preserved only the unroll_* parameters and set the subgroups_* parameters to 1. I had thought those parameters were inherently not needed after thread-distribution; that changed with ukernels. While it remains true that the subgroups_* parameters should not be needed after thread-distribution, avoiding them requires codegen to make use of the stride information for all the expanded dimensions. We could pass all these strides to the ukernel, but that would be cumbersome, particularly as the number of dimensions varies when unit dimensions are omitted. So in this case, passing the original DataTiledMMAAttr parameters and letting the ukernel infer the strides results in much simpler code. The drawback is an interaction-at-a-distance in the layouts implied by these parameters, but I think that's OK.

Base automatically changed from users/bjacob/subgroup-dim-outer to main December 19, 2024 14:42
Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
@bjacob bjacob force-pushed the users/bjacob/ukernel-lowering branch from e4fa8e7 to e23f5a2 Compare December 19, 2024 16:43
@bjacob
Contributor Author

bjacob commented Dec 19, 2024

Just thought of a problem: while it takes an int unroll_k parameter, the only value that works is 2, because it uses a fixed vector type of that size. I'll fix that tomorrow.

Resolved.

@bjacob bjacob requested review from kuhar and hanhanW December 19, 2024 16:53
@bjacob bjacob merged commit fb4d094 into main Dec 19, 2024
43 checks passed
@bjacob bjacob deleted the users/bjacob/ukernel-lowering branch December 19, 2024 17:31