[Issue]: Triton Compiler Takes Indefinite Time in ttgir -> llir Stage. #596
Comments
The following passes do not exist in the newer code: triton/lib/Target/LLVMIR/LLVMIRTranslation.cpp, lines 481 to 484 in 9b73a54
Maybe we can try to add them to the make_llir function to see if it can fix the problem. |
Update 06/05/2024: We have those two passes in upstream
The hanging happens in the add_builtin_func_to_llvmir pass: https://github.com/triton-lang/triton/blob/fa2271e37f4e0ccfa8829501e40533060937cfe5/third_party/amd/backend/compiler.py#L185 |
I am working on this, because it looks (very) slightly simpler than https://github.com/ROCm/triton-internal/issues/104 This is what I got so far:
|
Another note that might help
|
I guess the comment here basically confirms that
|
So yes, I was aware of this comment, but @antiagainst was asking if there could be a simpler solution than implementing buffer loads. I guess the main question is: is this a bug, or is it unavoidable because there are so many blocks? Should you, me and @antiagainst have a chat to decide the best step forward? |
So this is the situation we have in the CFG (cc @antiagainst ): I think the problem is that MLIR is trying to produce a single big |
So, I think all in all this is a correct transformation, also in our case. What happens is that we meet the following cases:
Store case
In this case those blocks can be merged, and the merged block will have +2 operands
Insertelement case
In this case the blocks are still structurally similar, but we are doubling the number of input operands of the merged block. When we do that 64 times, we get to blocks that have
Possible (quick) workaround
We can introduce a threshold: don't merge the blocks if this results in more than |
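To make the growth in the insertelement case concrete, here is a toy model in Python. It is only a sketch of the arithmetic, not MLIR's block-merging algorithm: the function name, the `max_operands` threshold parameter and the starting count of one operand are all assumptions made for illustration.

```python
def operands_after_merges(initial_operands, merges, max_operands=None):
    """Toy model: every merge doubles the operand count of the merged block.

    max_operands is a hypothetical threshold: a merge is skipped (and merging
    stops) if it would push the operand count above that value.
    Returns (final operand count, number of merges actually performed).
    """
    count = initial_operands
    performed = 0
    for _ in range(merges):
        if max_operands is not None and count * 2 > max_operands:
            break  # threshold reached: leave the remaining blocks unmerged
        count *= 2
        performed += 1
    return count, performed


# Without a threshold, 64 doublings starting from one operand.
unbounded, _ = operands_after_merges(1, 64)

# With a (hypothetical) threshold of 16 operands, merging stops early.
bounded, done = operands_after_merges(1, 64, max_operands=16)

print(unbounded)      # 2**64
print(bounded, done)  # operand count and merges performed under the threshold
```

With the threshold in place the merged block stays small, at the cost of leaving the remaining blocks unmerged, which is the branchy-code trade-off raised later in the thread.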
A threshold is a good idea, I think. |
Yes, we can have an option like
I also want to underline that by not merging those blocks we are creating super branchy code that will probably be very slow. So once I implement this, I will try to finish the buffer_load implementation |
Yup agreed that having a threshold in the greedy pattern rewriter configuration to control this would be good. Once you have the patch to mlir please add me as a reviewer. |
They were faster than me :) : llvm/llvm-project#95057 Not sure if the threshold solution is better or not, but I commented on the PR instead of creating a different one (note, once the PR is merged, we should upgrade Triton commit to get the change) |
@giuseros Have you verified the upstream PR will address the two use cases? This ticket and https://github.com/ROCm/triton-internal/issues/104 |
Yes, it disables block merging during canonicalization, which is the root cause of both. |
Update on this: they made a further change (or the change was already there and escaped my notice) so that block merging is now enabled in the rewriter. If they stick with that, we will still have the hang (see llvm/llvm-project#95057 (comment)). Either we convince them to disable merging in the rewriter, or I will (urgently) have to implement this: |
After thinking about this, I guess we can set:
When we instantiate the rewriter. And meanwhile I can work on llvm/llvm-project#63230 to solve the core issue. |
I was trying to follow your discussion with Mehdi on that upstream PR. What does "they disable block merging for canonicalization but enable it for rewriter" mean? |
Both the canonicalize pass and the rewriter use the |
Does the rewriter call simplifyRegions (and probably other passes to canonicalize stuff) after it matches and rewrites all the ops?
We only need to set it for the rewriter in |
Yes
And anytime we invoke the rewriter after that. I see that builtin_func_to_llvm is the last pass, so it shouldn't be an issue. Of course there is the core drawback that we will disable block merging in all cases. But this is something we can worry about later (and I will try to work on it in my "spare" time) |
Problem Description
Full source code to reproduce:
rep.py.gz
Triton version: upstream
d688063f731cfc4d9431bb8c0d0d73dce8cd1c38
Docker Container:
rocm/pytorch-private:compute-rocm-rel-6.1-116_ubuntu22.04_py3.9_pytorch_rocm6.1_internal_testing_ae01701
Can be reproduced on both MI200 (gfx90a) and Navi3x. Debug prints show that the compiler hangs during the ttgir->llir stage.
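Because the symptom is a hang rather than a crash, a timeout wrapper can make the reproduction scriptable. The following is a minimal sketch using only the Python standard library; the helper name `hangs` and the idea of running the reproducer under a time limit are assumptions for illustration, not part of the original report.

```python
import subprocess
import sys


def hangs(cmd, timeout_s):
    """Return True if `cmd` is still running after `timeout_s` seconds.

    subprocess.run kills the child process when the timeout expires,
    so no stray compiler process is left behind.
    """
    try:
        subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return True
    return False


# Example (not executed here): assuming the attachment was unpacked to rep.py,
# and picking an arbitrary 10 minute limit:
#   hangs([sys.executable, "rep.py"], 600)
```

The time limit is a judgment call: it needs to be comfortably longer than a normal compile of the same kernel.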
Operating System
Ubuntu 22.04.4 LTS (Jammy Jellyfish)
CPU
AMD Ryzen Threadripper PRO 5975WX 32-Cores
GPU
AMD Instinct MI210
ROCm Version
ROCm 6.1.0
ROCm Component
No response
Steps to Reproduce
Download the rep.py.gz in the Description section, and then
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response