Switch to upstream Triton compiler, and related changes #36

xinyazhang · 2024-07-15T23:14:40Z

Switch to the performance kernel for the forward pass. The old Triton kernel does not work with the new compiler.
Support AOT-based autotune, which includes:
1. Add argument aotriton::v2::flash::ExtraArguments to all aotriton::v2::flash APIs
2. Add build option AOTRITON_BUILD_FOR_TUNING to build all possible GPU kernels. The configurations are supplied by KernelDescription.gen_autotune_configs, which is compatible with triton.Config.
3. AOTRITON_BUILD_FOR_TUNING also enables force_kernel_index and other fields in aotriton::v2::flash::ExtraArguments, so users can manually select a kernel and bypass the autotune mechanism.
4. Add test/tune_flash.py and cpp_autotune.py, and change test/attn_torch_function.py to support AOT autotune (aka cpp autotune)
  - test/tune_flash.py runs the UT before measuring a triton.Config's performance, to avoid including faulty kernels.
Add Navi31/32 compiler options (not added to the default config due to compiler problems)
Add --use_multigpu to test/tune_flash.py. The script now supports tuning GPU kernels on all GPUs simultaneously, with the following extra features:
- It also runs the UT in a separate process (referred to as the minesweeper process here), in case a faulty kernel triggers a segfault and crashes the worker process.
  - Thus tune_flash.py needs 1*(main)+n*(worker)+n*(minesweeper)+1*(db access)+1*(table_tool.py) processes
  - For better performance, the minesweeper process is reused and only recreated if the previous one hit a segfault (or another failure).
- --json_file is also added, since the new architecture has a unified database access process that accepts outputs from all worker processes, and this new process can write to a separate json file. This is currently the recommended way to store the results of the tuning script. Users are supposed to run v2python.table_tool later to update the tuning database.
- --continue_from_json_file is introduced. Meanwhile, the result and _debug_task_id fields are also attached to the output json object, so that a tuning process can be resumed according to the _debug_task_id and its tuning status
- v2python.table_tool is improved to support the new version of the json file
Tuning results of the forward kernel are updated for MI200/MI300X with the new compiler. Most UTs passed (see comments for known failures on MI300X)
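
The process count quoted in the --use_multigpu notes above reduces to 2n+3. A minimal sketch of that arithmetic, assuming n is the number of GPUs being tuned; the function name is hypothetical and not part of tune_flash.py:

```python
def tuning_process_count(n_gpus: int) -> int:
    """Total processes for one multi-GPU tuning run, per the breakdown above."""
    main = 1               # the main process
    workers = n_gpus       # one worker per GPU (assumption: n == number of GPUs)
    minesweepers = n_gpus  # one UT (minesweeper) process per worker
    db_access = 1          # unified database access process
    table_tool = 1         # table_tool.py
    return main + workers + minesweepers + db_access + table_tool


print(tuning_process_count(8))  # 19
```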

CAVEAT: The new AOT based autotune script test/tune_flash.py isn't capable of handling backward pass yet.
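
The minesweeper idea described above (run the risky check in a child process, reuse that child, and recreate it only after a crash) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in test/tune_flash.py: the Minesweeper class, the line-based protocol, and the "crash" task are all hypothetical, and os._exit stands in for a real segfault.

```python
import subprocess
import sys

# Child program: reads one task name per line and answers "ok".
# The task named "crash" simulates a faulty kernel by killing the process.
CHILD_SOURCE = r"""
import os, sys
for line in iter(sys.stdin.readline, ""):
    if line.strip() == "crash":
        os._exit(1)  # stand-in for a segfault
    print("ok", flush=True)
"""


class Minesweeper:
    """Runs checks in a reusable child process; respawns it only after a crash."""

    def __init__(self):
        self.proc = None
        self.spawn_count = 0

    def _spawn(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", CHILD_SOURCE],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            text=True, bufsize=1,
        )
        self.spawn_count += 1

    def _discard(self):
        # Closing stdin makes a live child exit; a dead child is simply reaped.
        for pipe in (self.proc.stdin, self.proc.stdout):
            try:
                pipe.close()
            except OSError:
                pass
        self.proc.wait()
        self.proc = None

    def check(self, task: str) -> str:
        if self.proc is not None and self.proc.poll() is not None:
            self._discard()
        if self.proc is None:
            self._spawn()
        self.proc.stdin.write(task + "\n")
        self.proc.stdin.flush()
        reply = self.proc.stdout.readline()
        if reply == "":  # EOF: the child died while handling this task
            self._discard()
            return "crashed"
        return reply.strip()

    def close(self):
        if self.proc is not None:
            self._discard()


sweeper = Minesweeper()
results = [sweeper.check(t) for t in ["kernel_a", "crash", "kernel_b", "kernel_c"]]
sweeper.close()
print(results, sweeper.spawn_count)
```

The crash only costs one respawn; the tasks after it reuse the replacement child, which is the performance point made in the description.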

* Add new build option AOTRITON_BUILD_FOR_TUNING. Required by all features below.
* KernelDescription.gen_autotune_configs is introduced to specify autotune.Config objects. All kernels will be built by the build system.
* Users now can select kernel through ExtraArguments::force_kernel_index

Now it is possible to continue a failed tuning run (for example after a power failure or a full disk). CAVEAT: this option assumes the new pass uses the same set of samples (defined by the --seqlen_q/k, etc. options). Correctness is not guaranteed if a different set of samples was used.
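
A rough sketch of how resuming from a previous json output could work, and why the caveat above matters. The field names result and _debug_task_id come from the PR description, and the value "skipped" from a commit title; everything else is an assumption made for illustration: one JSON object per line, task ids equal to positions in the generated task sequence, the "tuned" value, and both function names.

```python
import json


def load_finished_task_ids(json_lines):
    """Collect the _debug_task_id of every record that already has a result."""
    finished = set()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "result" in record and "_debug_task_id" in record:
            finished.add(record["_debug_task_id"])
    return finished


def remaining_tasks(all_tasks, finished_ids):
    """Yield (task_id, task) pairs that still need tuning.

    Ids are positions in the task sequence (an assumption), so they only
    match the old run if the same samples are generated in the same order.
    """
    for task_id, task in enumerate(all_tasks):
        if task_id not in finished_ids:
            yield task_id, task


previous_output = [
    '{"_debug_task_id": 0, "result": "tuned"}',    # "tuned" is a made-up value
    '{"_debug_task_id": 1, "result": "skipped"}',
]
tasks = ["seqlen 64", "seqlen 128", "seqlen 256", "seqlen 512"]
todo = list(remaining_tasks(tasks, load_finished_task_ids(previous_output)))
print(todo)  # [(2, 'seqlen 256'), (3, 'seqlen 512')]
```

If the resumed run were started with different sample options, the same ids would point at different tasks, which is the failure mode the caveat warns about.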

xinyazhang · 2024-07-15T23:16:59Z

Known failures:

test_op_bwd_with_matrix_bias[False-1.2-dtype2-0.0-2048-143-256-4-4]

FAILED ../test/test_backward.py::test_op_bwd[False-1.2-dtype1-0.0-True-4-4-256-1-1] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[False-1.2-dtype1-0.0-True-4-4-256-1-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[False-1.2-dtype1-0.0-True-4-4-256-4-1] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[False-1.2-dtype1-0.0-True-4-4-256-4-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[False-1.2-dtype1-0.0-True-8-8-256-1-1] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[False-1.2-dtype1-0.0-True-8-8-256-1-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[False-1.2-dtype1-0.0-True-8-8-256-4-1] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[False-1.2-dtype1-0.0-True-8-8-256-4-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-4-4-256-1-1] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-4-4-256-1-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-4-4-256-4-1] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-4-4-256-4-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-8-8-256-1-1] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-8-8-256-1-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-8-8-256-4-1] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-8-8-256-4-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True

Produced by pytest test/test_backward.py -v -k 1.2 on MI300X

xinyazhang · 2024-07-15T23:18:43Z

The UT on MI200 has better results:

FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-8-8-256-4-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True

Tested with pytest test/test_backward.py -v -k 1.2

groenenboomj

Looks mostly good. What is the new library size?

.gitmodules

groenenboomj · 2024-07-25T15:17:13Z

test/attn_torch_function.py

+                o.fill_(float('nan'))
+                return ipc_func(extargs.force_kernel_index)
+                # print(f'running attn_fwd with {extargs.force_kernel_index=}')
+            tuning_result = cpp_autotune(ExtraArguments, func,


Any plan to expose this function (ExtraArguments) or is it only for tuning?

This will be part of the public AOTriton API, to allow possible further extensions.
For example, we could add a second hipStream to ExtraArguments for the bwd kernel, to run dkdv and dq concurrently.

However, for now cpp tuning is the main use case of the extra arguments.
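
To illustrate the tuning use case mentioned above, here is a minimal sketch of brute-force kernel selection driven by force_kernel_index. It is not the actual cpp_autotune implementation: ExtraArgs is a hypothetical Python stand-in for aotriton::v2::flash::ExtraArguments, and measure is an assumed callback that runs the kernel selected by the given arguments and returns its time.

```python
from dataclasses import dataclass


@dataclass
class ExtraArgs:
    """Hypothetical stand-in for aotriton::v2::flash::ExtraArguments."""
    force_kernel_index: int


def pick_fastest_kernel(measure, num_kernels):
    """Force each kernel index in turn and return the index with the lowest time."""
    timings = {}
    for index in range(num_kernels):
        timings[index] = measure(ExtraArgs(force_kernel_index=index))
    return min(timings, key=timings.get)


# Fake timings (milliseconds) replace real GPU measurements in this sketch.
fake_timings_ms = [0.42, 0.31, 0.55]
best = pick_fastest_kernel(
    lambda extargs: fake_timings_ms[extargs.force_kernel_index],
    len(fake_timings_ms),
)
print(best)  # 1
```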

tritonsrc/attn_torch_function.py

tritonsrc/bwd_split_kernel.py

xinyazhang · 2024-07-25T15:32:22Z

Looks mostly good. What is the new library size?

I don't have the size of the all-architecture build without zstd. The MI300X-only build with zstd is 321M.

xinyazhang added 30 commits July 15, 2024 22:40

A new Triton compiler sans CUDA support.

8d5ba3f

Fix the compiler for new Triton

003b06e

Mitigate compiler bug (ROCm/triton#596)

cd9615c

add wheel as another required package.

77fa1e1

Port to performant kernel and moving away from block pointers for tl.…

579bf82

…store

Fix the off_h_k computation.

2a6872a

The old pattern triggers a compiler bug

Fix writing to encoded_softmax

c62c904

Submit the Triton kernel as we are testing. All UTs passed

8be265d

remove debugging output

e150f54

v2src/flash/attn_fwd: add missing num_head_q and num_head_k

36ca9fe

Flash API now returns selected psels and copts to extra arguments, if…

63cc8fb

… AOTRITON_BUILD_FOR_TUNING=1

Implement tune_flash with AOT kernels

fdd14e1

Fix the dropout_mask and add a progressbar to test/tune_flash.py

0e988f9

Save memory for long seq length

c4c201f

Update the tuning database for MI200 only GPUs

8528c7a

Remove seqlen_q/k >= 32k rows from the database

1d4fbd0

Fix CMakeLists. Do not pass empty string as cmd argument if GENERATE_…

dd6a26b

…OPTION is empty

Return hipErrorSharedObjectSymbolNotFound for untuned cases.

98c404c

This is to address is_causal=true while seqlen_q != seqlen_k cases. For these cases the inputs are not supported and this solution makes sense. However for more complicated cases interpolation will be needed.

Fix test/test_backward.py

ec12934

Fix AUTOTUNE_KEYS for backward kernels.

4513dfe

tritonsrc: add type annotation 'i32' to num_seqlens, and fix varlen h…

350b6bb

…andling accordingly. Without type annotation the compiler behaves differently in JIT mode and then bugs will slip into AOT mode.

fix the assignment of .num_head_q/k

ece99b8

Add Navi 31/32 compiler options.

b3f9dab

Note the database has not been updated yet.

Fix various problems and now most fwd kernel tests passed.

d33cf43

The only things left are bias with very irregular shapes (e.g., False-1.2-dtype0-0.0-4-2048-32-1-1). Maybe due to tuning database extrapolation problems.

Various fixes to tune_flash

ad33017

Skip unaccepted inputs (notably causual=True has lots requirements) Add more sequence lengths

Make zstd quite

09583e2

Add draft document 'How To Generate Tuning Database.md'

87262c1

doc -> docs

f95d878

Debugging output in bwd kernel

a5a3189

xinyazhang added 17 commits July 15, 2024 22:40

Reduce the tuning time since there are too many cases to test...

22c3197

cpp autotune: x2 num_warps if warp_size == 32

0b40af3

Navi32: skip autotune configs that takes too long to build

7457a5a

Add --use_multigpu to test/tune_flash.py for multi-GPU tuning

b7d647c

test/tune_flash.py: actually distribute tensor/computing to different…

e819160

… GPUs. Confirmed with amd-smi -w 2

Move dev-only packages from requirements.txt into requirements-dev.txt

b8c702d

Mainly autotune-related scripts

tune_flash.py: Fix the slow splice_pipes

b5869a1

Fix single GPU script.

c4f0de5

Move database accessing to a separate process, and unify the task gen…

8558d73

…erator Now the whole pipeline is: TunerManager -(mp.Q)-> [Worker] * N -(mp.Q)-> DbAccessor -(stdio)-> table_tool.py

tune_flash: add --json_file, improve --dry_run to report total numbers,

18b56c0

and be less verbose.

tune_flash: Move the testing to a separate process to avoid segfault.

fcfa3e8

However this seriously reduced the tuning process. Thinking of alternatives..

Cache the minesweeping process to avoid creating processes repeatedly

f7b1f28

Performance restored to normal with this trick.

Remove 16k from seqlen_q/k, record task id and skipped tests in json

a150716

table_tool: skip result=skipped json objects

7d224cd

tuning_database: Update FLASH$attn_fwd for gfx90a and gfx942

6ef9a40

track aotriton-hyperjump branch in third_party/triton

e92f7d1

xinyazhang requested review from groenenboomj, xiaohuguo2023 and jeffdaily July 15, 2024 23:14

xinyazhang added 2 commits July 15, 2024 23:29

Fix test/performance_forward.py

7ed9ac6

Remove old_compile.py

bb1a5e8

xinyazhang mentioned this pull request Jul 24, 2024

[Perf] Is it possible that the kernels wrapped with AOT have the similar performance comparing with the original ones? #37

Open

groenenboomj reviewed Jul 25, 2024

groenenboomj approved these changes Jul 26, 2024

xinyazhang merged commit 85d120c into main Jul 26, 2024
