
Port 20250128 main perf kernel #70

Merged: 24 commits into main on Feb 4, 2025

Conversation

xinyazhang (Collaborator) commented on Feb 4, 2025

Major Changes

  • [kernel] Backport the 2025/01/28 main_perf kernel
  • [shim] Remove non-power-of-two (NPOT) head dim 72, which triggers compiler bugs on bf16
  • [db] Remove the attn_fwd table from the tuning database, since the old entries are no longer valid.
  • [db] Set all entries to num_stages=1, since num_stages=2 consistently triggers compiler bugs
  • [test] Add new head dimensions, now categorized into three groups:
    • Power-of-two head dimensions
    • Optimized NPOT head dimensions
    • Prime-number head dimensions to cover all gaps between neighboring POT and NPOT head dims (see the sketch after this list).
  • [shim] Add env var AOTRITON_SKIP_LUT_CHECK to skip LUT sanity check on certain kernels
    • As of this PR, AOTriton must be built with AOTRITON_SKIP_LUT_CHECK=flash.attn_fwd ninja install
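
The grouping below is a minimal sketch of how these three test groups could be enumerated; the concrete dimension values and the helper names are illustrative assumptions, not the actual test parametrization.

    # Illustrative sketch of the three head-dim test groups described above.
    # The concrete values below are assumptions, not the actual test parameters.
    POT_HEADDIMS = [16, 32, 64, 128, 256]           # power-of-two head dims
    NPOT_HEADDIMS = [48, 80, 96, 160, 192, 224]     # optimized non-power-of-two head dims

    def _is_prime(n):
        return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

    def prime_gap_headdims(pot, npot):
        """Pick one prime inside every gap between neighboring POT/NPOT head dims."""
        anchors = sorted(set(pot) | set(npot))
        primes = []
        for lo, hi in zip(anchors, anchors[1:]):
            gap_primes = [n for n in range(lo + 1, hi) if _is_prime(n)]
            if gap_primes:
                primes.append(gap_primes[len(gap_primes) // 2])  # roughly mid-gap
        return primes

    PRIME_HEADDIMS = prime_gap_headdims(POT_HEADDIMS, NPOT_HEADDIMS)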

Minor Changes

  • [build] Bump the version number to 0.9.0. (This should have been done at the beginning of 0.9 development.)
  • [API] In the API, move bias tensor to the position immediately after v tensor, matching the kernel argument order
  • [shim] Add TensorView<0>::get_null_tensor
  • [test] Change AttentionExtraArgs from namedtuple to dataclass for easier-to-read default values (see the sketch after this list).
  • [mptune] Change output json format to match kernel argument changes.
  • [test] Use the CPU reference implementation when seqlen_k == 579 (used by test_gqa tests); the GPU reference triggers a segfault.
  • [test] Change the default value_fudge_factor to 36.0 (should be 40.0 if GQA tests are considered)
  • [shim] Fix the code path when the tuning database is not available
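
The following is a minimal sketch of the namedtuple-to-dataclass change; the field names and defaults are illustrative assumptions rather than the actual AttentionExtraArgs definition.

    from collections import namedtuple
    from dataclasses import dataclass

    # Before (illustrative): defaults live in a separate tuple, far from the field names.
    AttentionExtraArgsTuple = namedtuple(
        'AttentionExtraArgs',
        ['return_encoded_softmax', 'autotune', 'return_autotune'],
        defaults=(False, False, False))

    # After (illustrative): each default sits next to its field, which reads more clearly.
    @dataclass
    class AttentionExtraArgs:
        return_encoded_softmax: bool = False
        autotune: bool = False
        return_autotune: bool = False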

Known Problems

  • The tuning database for the flash.attn_fwd kernel has been cleared, and there is no plan to rebuild it at the moment due to imminent additional changes to the forward kernel.

pytest tritonsrc/test_backward -k '1.2 and 4-4]' passed

Only POT head dims have been tested; NPOT optimization will be added later.
NaN problems occur for head dims in [68, 72) when running on MI200 with bf16.
Notable changes for users
1. attn_fwd is untuned for now (there is no point in using the old database
   due to its outdated arguments).
2. Only one combination of perf/copts is built for untuned kernels.
   Previously this relied on there being only one combination of perf
   options (PERF_CHOICES) in KernelDescription. Now all possible options
   can be listed there, with the first choice serving as the default.
3. Replace num_stages=2 with num_stages=1 in the tuning database for all
   remaining kernels. num_stages=2 triggers compiler bugs: "error: operation
   scheduled before its operands" (see the illustrative sketch below).
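
The snippet below is a purely illustrative sketch of that num_stages rewrite, assuming the tuning entries can be treated as a list of dicts; the actual tuning database format is not described here.

    # Purely illustrative: rewrite num_stages=2 -> num_stages=1 across tuning entries,
    # assuming each entry is a dict with a 'num_stages' key. The real database layout
    # used by AOTriton may differ.
    def downgrade_num_stages(entries):
        for entry in entries:
            if entry.get('num_stages') == 2:
                # num_stages=2 triggers "operation scheduled before its operands"
                entry['num_stages'] = 1
        return entries

    example_entries = [
        {'kernel': 'flash.bwd_kernel_dk_dv', 'num_stages': 2, 'num_warps': 4},
        {'kernel': 'flash.bwd_kernel_dq', 'num_stages': 1, 'num_warps': 4},
    ]
    print(downgrade_num_stages(example_entries))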

Also explicitly disable the persistent option, and fix a Debug build bug in attn_fwd.
xinyazhang (Collaborator, Author) commented:

All unit tests passed when compiled with the default options provided in v2python/rules/flash/attn_fwd.py.
Note: the reference implementation must be run on the CPU, otherwise it triggers a GPU segfault.
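
A minimal sketch of that pattern, computing the reference attention on the CPU while the kernel under test runs on the GPU; the tensor names and helper below are assumptions, not the actual test code.

    import torch

    def cpu_reference_attention(q, k, v):
        # Naive softmax(Q K^T / sqrt(d)) V reference, computed entirely on the CPU
        # to avoid the GPU segfault mentioned above.
        q, k, v = (t.detach().float().cpu() for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return torch.softmax(scores, dim=-1) @ v

    # q, k, v are hypothetical GPU tensors produced by the test fixture:
    # out_gpu = triton_attn_fwd(q, k, v)                       # kernel under test
    # out_ref = cpu_reference_attention(q, k, v).to(q.device)  # reference on CPU
    # torch.testing.assert_close(out_gpu, out_ref, atol=2e-2, rtol=2e-2)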

xinyazhang marked this pull request as ready for review on February 4, 2025 at 17:03
xinyazhang merged commit c70eab6 into main on Feb 4, 2025