
[Issue]: Triton Compiler Takes Indefinite Time in ttgir -> llir Stage. #596

Closed
xinyazhang opened this issue Jun 3, 2024 · 20 comments

@xinyazhang

Problem Description

Full source code to reproduce:
rep.py.gz

Triton version: upstream d688063f731cfc4d9431bb8c0d0d73dce8cd1c38
Docker Container: rocm/pytorch-private:compute-rocm-rel-6.1-116_ubuntu22.04_py3.9_pytorch_rocm6.1_internal_testing_ae01701

Reproducible on both MI200 (gfx90a) and Navi3x. Debug prints show the compiler hangs during the ttgir -> llir stage.

Operating System

Ubuntu 22.04.4 LTS (Jammy Jellyfish)

CPU

AMD Ryzen Threadripper PRO 5975WX 32-Cores

GPU

AMD Instinct MI210

ROCm Version

ROCm 6.1.0

ROCm Component

No response

Steps to Reproduce

Download rep.py.gz from the Problem Description section, and then:

gunzip rep.py.gz
python rep.py

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@xinyazhang
Author

The following passes do not exist in the newer code

#ifdef USE_ROCM
pm.addPass(mlir::createConvertSCFToCFPass());
pm.addPass(createConvertControlFlowToLLVMPass());
#endif

Maybe we can try adding them to the make_llir function to see whether that fixes the problem.
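
A minimal sketch of what re-adding them could look like; the wrapper function is a placeholder, and in current Triton the make_llir pipeline is assembled from Python via bindings, so this only illustrates the two addPass calls (exact headers and namespaces depend on the MLIR revision):

#include "mlir/Conversion/ControlFlowToLLVM/ControlFlowToLLVM.h"
#include "mlir/Conversion/SCFToControlFlow/SCFToControlFlow.h"
#include "mlir/Pass/PassManager.h"

// Hypothetical helper: append the two lowering passes quoted above to the
// ttgir -> llir pass manager.
static void addLegacyCfLoweringPasses(mlir::PassManager &pm) {
  // Lower structured control flow (scf.for / scf.if) to unstructured cf branches...
  pm.addPass(mlir::createConvertSCFToCFPass());
  // ...then lower cf dialect branches to the LLVM dialect.
  pm.addPass(mlir::createConvertControlFlowToLLVMPass());
}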

@zhanglx13 zhanglx13 assigned zhanglx13 and unassigned jayfurmanek Jun 5, 2024

@giuseros

giuseros commented Jun 10, 2024

I am working on this because it looks (very) slightly simpler than https://github.com/ROCm/triton-internal/issues/104

This is what I got so far:

  • The bug (as mentioned previously) originates from convert-builtin-func-to-llvm.
  • The main issue for this specific case is the stores. I made a reproduction, bug.mlir, which is the IR just before convert-builtin-func-to-llvm (attached to this comment), and commented out all the loads and some of the stores. While triton-opt terminates, the output produced is massively large. The more stores we add back into bug.mlir, the longer it takes to complete (I think that if we leave it long enough it will eventually finish).
  • The source of the issue looks like the mergeIdenticalBlocks transformation contained in the simplifyRegion utility. If I disable that transformation (enableRegionSimplify=false), compilation is quite quick (a programmatic sketch of this knob follows below).
  • I produced a disable_simplify.mlir output that comes from bug.mlir when passing enableRegionSimplify=false to the rewriter. If we run triton-opt --canonicalize disable_simplify.mlir, we see the same massive output as before, and triton-opt takes some time to finish. Instead, if we run triton-opt --canonicalize="region-simplify=false" disable_simplify.mlir, the output is normal and triton-opt terminates quickly.

repro.zip
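
For reference, a minimal sketch of driving that knob programmatically (assuming the MLIR API of this era, where enableRegionSimplification is still a bool); it mirrors the --canonicalize="region-simplify=false" invocation above, and the helper name is only illustrative:

#include "mlir/Pass/PassManager.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"
#include "mlir/Transforms/Passes.h"

// Build the canonicalizer with region simplification disabled, so the
// mergeIdenticalBlocks step never runs.
static void addCanonicalizeWithoutRegionSimplify(mlir::PassManager &pm) {
  mlir::GreedyRewriteConfig config;
  config.enableRegionSimplification = false;
  pm.addPass(mlir::createCanonicalizerPass(config));
}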

@jayfurmanek
Collaborator

Another note that might help:
In the repro script, if I break the nested if below (at line 367) by just deleting the else there, then it doesn't hang.
Perhaps this is related to stores inside nested if statements.

    if q_padded:
        if PADDED_HEAD:
            tl.store(O_block_ptr, acc.to(Out.type.element_ty), boundary_check=(0,1))
        else:
            tl.store(O_block_ptr, acc.to(Out.type.element_ty), boundary_check=(0,))
    else:
        if PADDED_HEAD:
            tl.store(O_block_ptr, acc.to(Out.type.element_ty), boundary_check=(1,))
        else:
            tl.store(O_block_ptr, acc.to(Out.type.element_ty))

@jayfurmanek
Collaborator

jayfurmanek commented Jun 10, 2024

I guess the comment here basically confirms that:

  # This pass (`add_builtin_func_to_llvmir`) serves as a temporary workaround to address the issue of excessive basic block
  # count caused by predicated loads/stores. In certain kernels, the addition of these blocks can cause the MLIR
  # canonicalizer to never finish when attempting to merge blocks. The permanent solution under consideration
  # involves using MUBUF instructions that have built-in out-of-bounds checks, which would eliminate the need
  # for conditional branching around memory accesses.

@giuseros

So yes, I was aware of this comment, but @antiagainst was asking if there could be a simpler solution than implementing buffer loads. I guess the main question is: is this a bug, or is it unavoidable because of so many blocks?

Should you, me, and @antiagainst have a chat to decide the best step forward?

@giuseros

giuseros commented Jun 10, 2024

So this is the situation we have in the CFG (cc @antiagainst ):
[screenshot of the CFG attached]

I think the problem is that MLIR is trying to produce a single big if-block into which to put all those subgraphs.

@giuseros

So, all in all, I think this is a correct transformation, even in our case. What happens is that we hit the following cases:

Store case

leader block:
^bb152:  // pred: ^bb151
  llvm.store %3316, %3245 : i16, !llvm.ptr<1>
  llvm.br ^bb153
blocks to merge:
^bb181:  // pred: ^bb180
  "llvm.store"(%3409, %3375) <{ordering = 0 : i64}> : (i16, !llvm.ptr<1>) -> ()
  "llvm.br"()[^bb153] : () -> ()

In this case the blocks can be merged, and the merged block will have two additional operands.

Insertelement case

^bb151:  // 2 preds: ^bb149, ^bb150
  %3315 = llvm.insertelement %3148, %3251[%60 : i32] : vector<1xf16>
  %3316 = llvm.bitcast %3315 : vector<1xf16> to i16
  llvm.cond_br %3254, ^bb152(%3316, %3245 : i16, !llvm.ptr<1>), ^bb153
blocks to merge:
^bb180:  // 2 preds: ^bb178, ^bb179
  %3412 = "llvm.insertelement"(%3385, %3336, %60) : (vector<1xf16>, f16, i32) -> vector<1xf16>
  %3413 = "llvm.bitcast"(%3412) : (vector<1xf16>) -> i16
  "llvm.cond_br"(%3384, %3413, %3379)[^bb152, ^bb153] <{operandSegmentSizes = array<i32: 1, 2, 0>}> : (i1, i16, !llvm.ptr<1>) -> ()

In this case the blocks are still structurally similar, but we are doubling the number of input operands of the merged block. When we do that 64 times, we end up with blocks that have 32764 input operands, which is very slow to handle.

Possible (quick) workaround

We can introduce a threshold: don't merge the blocks if doing so would result in more than K (defaulting to 16?) input operands in the resulting block. A rough sketch follows below.
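
A rough sketch of the kind of check I have in mind; the helper and the way the number of newly created block arguments is obtained are only illustrative, since the real change would live inside MLIR's block-merging logic in simplifyRegions:

#include "mlir/IR/Block.h"

// Hypothetical predicate: refuse to merge a candidate block into the leader
// when the merged block would end up with more than maxBlockArguments block
// arguments (each differing operand becomes one new block argument).
static bool mergeWouldExceedThreshold(mlir::Block &leader,
                                      unsigned newArgsFromMerge,
                                      unsigned maxBlockArguments = 16) {
  return leader.getNumArguments() + newArgsFromMerge > maxBlockArguments;
}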

@jayfurmanek
Collaborator

A threshold is a good idea, I think.
Where would we implement the threshold? In the canonicalizer?

@giuseros

Yes, we can have an option like maxBlockArguments in the canonicalizer pass, defaulting to 16. I tried hardcoding that and indeed it works fine. I will try to put up a patch.

I also want to underline that by not merging those blocks we are creating super-branchy code that will probably be very slow. So once I implement this, I will try to finish the buffer_load implementation.

@antiagainst

Yup, agreed that having a threshold in the greedy pattern rewriter configuration to control this would be good. Once you have the patch to MLIR, please add me as a reviewer.

@giuseros

giuseros commented Jun 11, 2024

They were faster than me :) : llvm/llvm-project#95057

Not sure whether the threshold solution is better, but I commented on the PR instead of creating a separate one.

(Note: once the PR is merged, we should bump the Triton commit to pick up the change.)

@jerryyin
Member

@giuseros Have you verified that the upstream PR will address both use cases, i.e. this ticket and https://github.com/ROCm/triton-internal/issues/104?

@giuseros

Yes, it disables block merging during canonicalization, which is the root cause of both.

@giuseros

giuseros commented Jun 12, 2024

Update on this: they made a further change (or the change was already there and escaped my eye) by which they now enable block merging in the rewriter. If they stick with that, we will still have the hang (see llvm/llvm-project#95057 (comment)).

Either we convince them to disable merging in the rewriter, or I will have to (urgently) implement this:

@giuseros

giuseros commented Jun 12, 2024

After thinking about this, I guess we can set

 GreedySimplifyRegionLevel enableRegionSimplification = GreedySimplifyRegionLevel::Normal;

when we instantiate the rewriter. Meanwhile, I can work on llvm/llvm-project#63230 to solve the core issue.
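
A minimal sketch of what that would look like wherever we drive the rewriter (e.g. in the builtin_func_to_llvm lowering), assuming the enum-based API from that PR; the wrapper and the pattern set are placeholders:

#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

// Run a pattern set with region simplification capped at Normal: unreachable
// blocks are still erased, but identical blocks are no longer merged, which
// avoids the block-argument blow-up described above.
static mlir::LogicalResult applyPatternsWithoutBlockMerging(
    mlir::Operation *op, const mlir::FrozenRewritePatternSet &patterns) {
  mlir::GreedyRewriteConfig config;
  config.enableRegionSimplification = mlir::GreedySimplifyRegionLevel::Normal;
  return mlir::applyPatternsAndFoldGreedily(op, patterns, config);
}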

@zhanglx13

I was trying to follow your discussion with Mehdi on that upstream PR. What is meant by "they disable block merging for canonicalization but enable it for the rewriter"?

@giuseros

Both the canonicalize pass and the rewriter use the simplifyRegions function. The solution Mehdi is proposing is to default to simplifyRegions(normal) in the canonicalize pass (block merging disabled) and simplifyRegions(aggressive) in the rewriter (block merging enabled -> hang). We can change the default behaviour of the rewriter in Triton (so that it calls simplifyRegions(normal)), but this means passing a config every time we invoke it (with config.enableRegionSimplification = Normal).

@zhanglx13

Does the rewriter call simplifyRegions (and probably other passes to canonicalize stuff) after it matches and rewrites all the ops?

We can change the default behaviour of the rewriter in Triton (so that it calls simplifyRegions(normal)), but this means passing a config every time we invoke it (with config.enableRegionSimplification = Normal).

We only need to set it for the rewriter in the builtin_func_to_llvm pass, right? If so, are there any other drawbacks?

@giuseros

Does the rewriter call simplifyRegions (and probably other passes to canonicalize stuff) after it matches and rewrites all the ops?

Yes

We only need to set it for the rewriter in the builtin_func_to_llvm pass, right? If so, are there any other drawbacks?

And anytime we invoke the rewriter after that. I see that builtin_func_to_llvm is the last pass, so it shouldn't be an issue.

Of course there is the core drawback that we will disable block merging in all cases. But this is something we can worry about later (and I will try to work on it in my "spare" time).

xinyazhang added a commit to ROCm/aotriton that referenced this issue Jul 15, 2024