
Ukernel lowering for data-tiled multi_mma with mfma_i32_16x16x32_i8 #19522

Merged
merged 1 commit into from
Dec 19, 2024

Conversation

bjacob
Contributor

@bjacob bjacob commented Dec 18, 2024

This finishes implementing an initial ukernel for multi_mma for DataTiledMMAAttr with kind = mfma_i32_16x16x32_i8.

The ukernel takes the unroll and subgroup parameters as function parameters. The idea is that once inlining works as intended, these parameters will become compile-time constants, and the optimized code will be the same as if we had hardcoded specific values. That inlining isn't happening at the moment, which is a bug we should fix first. It does happen in LLVMCPU, so something is probably missing in LLVMGPU.

The ukernel file has a comment with a few TODOs to get from this initial naive ukernel to something faster. The first step is to fix the above-mentioned inlining problem, then get shared memory, then get better instruction scheduling.

@bjacob bjacob marked this pull request as ready for review December 18, 2024 22:19
@bjacob
Contributor Author

bjacob commented Dec 19, 2024

Just thought of a problem: while it takes an int unroll_k parameter, the only value that works is 2, because it uses a fixed vector type of that size. I'll fix that tomorrow.

@MaheshRavishankar
Contributor

Just thought of a problem: while it takes an int unroll_k parameter, the only value that works is 2, because it uses a fixed vector type of that size. I'll fix that tomorrow.

Drive-by comment: aren't unroll_k etc. implementation details of the compiler? Do they need to cross the ukernel API boundary? I would expect the ukernel to only worry about the problem size/architecture, not these details.

@bjacob
Contributor Author

bjacob commented Dec 19, 2024

These unroll_{m,n,k} parameters control the tile shape and layout; the corresponding unrolling of the kernel code is just an implementation detail. Setting unroll_k = 2 causes the tile to have a 2x larger K-dimension size. See the effect here in the TileSwizzle calculation (which is where both tile shape and layout are decided):

if (mma.getUnrollK() > 1) {
  expand(swizzle, 1, {Kind::CrossIntrinsic, mma.getUnrollK()});
  int interleavingIdx =
      getInnermostNonInternalDimIdx(swizzle.expandShape[1]);
  interleave(swizzle, 1, interleavingIdx);
}

Comment on lines 627 to 632
// Preserve the lowering_config attribute for GPULowerToUKernelsPass.
constexpr char loweringConfigAttrName[] = "lowering_config";
if (mmaOp->hasAttr(loweringConfigAttrName)) {
  newMmaOp->setAttr(loweringConfigAttrName,
                    mmaOp->getAttr(loweringConfigAttrName));
}
Contributor

nit: use kConfigAttrName.

constexpr StringLiteral kConfigAttrName = "lowering_config";

Question: do we preserve all the discardable attributes (i.e., additional attributes that are not defined by the op itself)? If so, you can do something like

newMmaOp->setDiscardableAttrs(mmaOp->getDiscardableAttrDictionary());

Contributor Author


Thanks, that worked! Earlier I had tried something with setAttrs / getAttrs and that caused other tests to fail. I didn't know about setDiscardableAttrs.

Comment on lines -623 to -629
auto newKind = mmaOp.getKind();
if (auto dataTiledMma = dyn_cast<DataTiledMMAAttr>(newKind)) {
  newKind = DataTiledMMAAttr::get(
      context, dataTiledMma.getIntrinsic(), dataTiledMma.getUnrollM(),
      /*subgroups_m=*/1, dataTiledMma.getUnrollN(),
      /*subgroups_n=*/1, dataTiledMma.getUnrollK());
}
Contributor


I did not pay much attention to the changes to DataTiledMMAAttr. Why do we drop the newKind here? Does it impact the codegen path, or is it handled by the attribute interface implementation? Thanks in advance for your explanation; perhaps we can put the information into the PR description.

Contributor Author


We are now transporting the old kind, unchanged. The code being deleted here was creating a new kind that preserved only the unroll_* parameters and set the subgroups_* parameters to 1. I had thought those parameters were inherently not needed after thread-distribution; that changed with ukernels. While it remains true that the subgroups_* parameters should not be needed after thread-distribution, avoiding them requires codegen to make use of the stride information for all the expanded dimensions. We could pass all these strides to the ukernel, but that would be cumbersome, particularly as the number of dimensions varies when unit dimensions are omitted. So in this case, passing the original DataTiledMMAAttr parameters and letting the ukernel infer the strides results in much simpler code. The drawback is an interaction-at-a-distance in the layouts implied by these parameters, but I think that's OK.

Base automatically changed from users/bjacob/subgroup-dim-outer to main December 19, 2024 14:42
Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
@bjacob bjacob force-pushed the users/bjacob/ukernel-lowering branch from e4fa8e7 to e23f5a2 Compare December 19, 2024 16:43
@bjacob
Contributor Author

bjacob commented Dec 19, 2024

Just thought of a problem: while it takes an int unroll_k parameter, the only value that works is 2, because it uses a fixed vector type of that size. I'll fix that tomorrow.

Resolved.

@bjacob bjacob requested review from kuhar and hanhanW December 19, 2024 16:53
@bjacob bjacob merged commit fb4d094 into main Dec 19, 2024
43 checks passed
@bjacob bjacob deleted the users/bjacob/ukernel-lowering branch December 19, 2024 17:31