Releases: ROCm/aotriton
AOTriton 0.9.2 Beta
This is a point release
- Fix a linker script problem in 0.9b, which did not import the default linker script.
- Fix the version number problem in 0.9.1b, which still reported 0.9.0 and could cause confusion.
What's Changed from Release 0.8
- Initial support for gfx950 by @xinyazhang in #64
- Note: gfx950 support is fully experimental, not built by default, and not shipped in the release binary packages unless explicitly stated in the package name
- Non Power of Two (NPOT) head dimension Optimization by @BinDinAMD and @xinyazhang in #66
- Newly added optimized NPOT head dimensions: 48, 80, 96, 160, 192, 224
- Note: older AOTriton releases do support these head dimensions, but inputs with these head dimensions were previously padded and loaded/accumulated into power-of-two in-register tensors, which caused performance issues.
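The earlier padding behavior can be illustrated with a short sketch; `next_power_of_two` is a hypothetical helper for illustration, not an AOTriton API:

```python
def next_power_of_two(d: int) -> int:
    # Older AOTriton loaded/accumulated NPOT head dimensions into
    # power-of-two in-register tensors, so part of the compute was
    # spent on padding.
    p = 1
    while p < d:
        p *= 2
    return p

# The newly optimized NPOT head dims and the in-register sizes
# older releases would have padded them to:
padded = {d: next_power_of_two(d) for d in (48, 80, 96, 160, 192, 224)}
```

For example, head dimension 192 was previously padded to 256, so a quarter of the in-register compute was wasted on padding.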
- Port 20250128 main perf kernel by @xinyazhang in #70
- Use Philox64x4 PRNG, and remove RETURN_ENCODED_SOFTMAX=True variant from compiled forward kernel by @xinyazhang in #71
  - Internally the Philox64x4 PRNG no longer converts its output to fp32; instead it views the i64x4 outputs as i32x8 and compares them with `idropout_p`, which is converted to i32 from the fp32 `dropout_p`.
  - `debug_fill_dropout_rng` and `debug_fill_dropout_rng_tensor` are deprecated and will be removed in 0.10, since they still use the Philox32x1 PRNG and their outputs do not match the actual PRNG values used in the dropout process.
  - `debug_simulate_encoded_softmax` is the new API to obtain the outputs of the Philox64x4 PRNG. It writes 0.5 if the PRNG value is greater than `dropout_p`, and -0.5 otherwise.
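The comparison convention above can be sketched in NumPy; `encode_softmax_mask` is a hypothetical illustration, and the exact fp32-to-i32 conversion of `dropout_p` is an assumption, not the library's actual code:

```python
import numpy as np

def encode_softmax_mask(prng_i64: np.ndarray, dropout_p: float) -> np.ndarray:
    """Hypothetical sketch of the described convention, not AOTriton code."""
    # View the i64x4 PRNG outputs as i32x8, as the release notes describe.
    prng_i32 = prng_i64.view(np.int32)
    # Assumed conversion of the fp32 dropout_p to an i32 threshold.
    idropout_p = np.int32(dropout_p * 0x7FFFFFFF)
    # Write 0.5 where the PRNG value exceeds the threshold, -0.5 otherwise.
    return np.where(prng_i32 > idropout_p, np.float32(0.5), np.float32(-0.5))
```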
- Add "AOTriton .." string to .comment section of libaotriton_v2.so by @xinyazhang in #74
- Now you can verify the precise AOTriton version with `readelf -p .comment libaotriton_v2.so`. An example output is:

  ```
  [     0]  GCC: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  [    2c]  AOTriton 0.9.0
  ```
- Enable Persistent Dynamic for Causal if input is not varlen by @xinyazhang in #73
  - For the `v2::flash::attn_fwd` API, `atomic_for_causal` must be a one-element GPU tensor with zero value if `is_causal` is `true`.
- Fused BWD Kernel by @BinDinAMD in #69
- No support for hdim > 256 due to register pressure.
- Empirically, this kernel outperforms the split kernel when hdim*seqlen <= 64 * 512.
- When using the fused bwd kernel, the `delta` tensor is not needed.
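The kernel-selection rule implied by these notes can be sketched as a hypothetical heuristic (`prefer_fused_bwd` is illustrative only, not an AOTriton API):

```python
def prefer_fused_bwd(head_dim: int, seqlen: int) -> bool:
    # The fused kernel does not support head_dim > 256
    # due to register pressure.
    if head_dim > 256:
        return False
    # Empirically, the fused kernel outperforms the split kernel
    # when head_dim * seqlen <= 64 * 512.
    return head_dim * seqlen <= 64 * 512
```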
- Misc changes and performance tuning for 0.9b release by @xinyazhang in #76
- Initial RX 9070XT support
- The `softmax_lse` tensor is now optional in the forward kernel. For inference-only use, this tensor is not needed.
Known Problems
- There are unidentified memory alignment requirements on input/output tensors. If possible, pad the input tensor shapes to multiples of 8 for safety (except for the batch dimension).
- FA kernels built for the 9070XT and gfx950 may cause GPU segfaults under certain unidentified conditions.
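To work around the alignment issue, input shapes can be padded before allocation; the helper below is a hypothetical sketch, assuming every non-batch dimension is rounded up to a multiple of 8:

```python
def pad_shape_to_multiple_of_8(shape: tuple) -> tuple:
    # Keep dim 0 (batch) as-is; round every other dimension up
    # to the next multiple of 8 for safety.
    return (shape[0],) + tuple((d + 7) // 8 * 8 for d in shape[1:])
```

For example, a (batch, heads, seqlen, head_dim) shape of (3, 5, 100, 48) would be padded to (3, 8, 104, 48).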
Full Changelog: 0.8b...0.9.2b
AOTriton 0.9.1 Beta
- 0.9b has a problem in its linker script. This point release is a hotfix for that problem. Users should use 0.9.1b instead of 0.9b.
- The `.comment` section still uses 0.9 rather than 0.9.1 due to the urgency of the fix.
Full Changelog: 0.9b...0.9.1b
AOTriton 0.9 Beta
There is a bug in the linker script; users should use the upcoming 0.9.1b release instead.
The release notes will be moved there as well.
Full Changelog: 0.8b...0.9b
AOTriton 0.8.2 Beta
0.8.2b is an emergency fix
We received reports of a bug causing NaN while fine-tuning the llama3.2-11b vision model. Anyone who uses 0.8 is recommended to upgrade to 0.8.2b.
Note: AOTriton 0.8.1b adds head dimension 512 support, and thus the binary size increases compared with 0.8b.
This is a point release
0.8.2b can be used as a drop-in replacement for the 0.7b shared object file.
What's Changed
- Fix missing batch_index when calculating bias pointer by @xinyazhang in #68
Full Changelog: 0.8.1b...0.8.2b
AOTriton 0.8.1 Beta
What's Changed
- Support Head Dimension 512 by @xinyazhang in #67
Note: this is not recommended unless you need to support head dimension 512 immediately.
Full Changelog: 0.8b...0.8.1b
AOTriton 0.8 Beta
What's Changed
- Add PyTorch compatibility matrix to README.md by @xinyazhang in #41
- Add cmake option AOTRITON_NAME_SUFFIX to resolve name conflicts by @xinyazhang in #42
- Merge improvements of 0.7.1b release into main by @xinyazhang in #46
- Code Clean Up by @xinyazhang in #48
- GQA Support by @xinyazhang in #49
- Kernel Storage V2 by @xinyazhang in #50
- Add docker based package builder and switch to system compiler by @xinyazhang in #51
- Add versioning support in multiple levels. by @xinyazhang in #53
- Restore the support of causal=True and seqlen_q != seqlen_k by @xinyazhang in #55
- Misc changes and performance tuning for 0.8b release by @xinyazhang in #57
Full Changelog: 0.7b...0.8b
AOTriton 0.7.3 Beta
0.7.3b is an emergency fix for 0.7.2b
0.7.2b has been removed from release due to a correctness bug.
This is a point release
0.7.3b can be used as a drop-in replacement for the 0.7b or 0.7.xb shared object file.
What's Changed (Compared with 0.7.1b)
- Fix varlen related implementation errors
- Fix NaN output when `sm_scale=0.0`, which was introduced in #45
- Fix NaN output for large numerical errors. See #54 for more details.
  - The fix in 0.7.2b introduced a bug. 0.7.3b is released to revise this fix.

Note the two fixes for NaN may have some performance impact.
Full Changelog: 0.7.1b...0.7.3b
(DO NOT USE) AOTriton 0.7.2 Beta
CAVEAT: DO NOT USE THIS RELEASE
Commit 14d673f introduced a bug.
We are going to release 0.7.3 instead for a fix.
The binary tarballs are deleted to prevent accidental usages.
This is a point release
0.7.2b can be used as a drop-in replacement for the 0.7b or 0.7.1b shared object file.
What's Changed
- Fix varlen related implementation errors
- Fix NaN output when `sm_scale=0.0`, which was introduced in #45
- Fix NaN output for large numerical errors. See #54 for more details.
Note the two fixes for NaN may have some performance impact.
Full Changelog: 0.7.1b...0.7.2b
AOTriton 0.7.1 Beta
This is a point release
0.7.1b can be used as a drop-in replacement for the 0.7b shared object file.
What's Changed
- Ignore colon suffixes in gcnArchName by @xinyazhang in #44
- FA Kernel Update for Accuracy and Performance by @xinyazhang in #45
Full Changelog: 0.7b...0.7.1b
AOTriton 0.7 Beta
What's Changed
- Default to Shared Object by @jithunnair-amd in #33
- Add varlen support to AOTriton's Flash Attention by @xinyazhang in #31
- Switch to upstream Triton compiler, and related changes by @xinyazhang in #36
- Improve Backward Performance and Experimental Navi31 Support by @xinyazhang in #39
- Introduce new tuning system based on pre-compiled GPU kernels
- Navi 31's support is still experimental
- Support hipGraph usage in PyTorch by @xinyazhang in #40
- This changes the RNG API used by FA kernels.
- Switch to new testing scheme to match PyTorch 2.5's changes
New Contributors
- @jithunnair-amd made their first contribution in #33
Full Changelog: 0.6b...0.7b