Releases: ROCm/aotriton
AOTriton 0.9.2 Beta
This is a point release
- Fix a linker script problem in 0.9b, which did not import the default linker script.
- Fix the version number problem in 0.9.1b, which still reported 0.9.0 and could cause confusion.
What's Changed from Release 0.8
- Initial support for gfx950 by @xinyazhang in #64
- Note: gfx950 support is fully experimental, not built by default, and not shipped in the release binary packages unless explicitly stated in the package name
- Non Power of Two (NPOT) head dimension Optimization by @BinDinAMD and @xinyazhang in #66
- Newly added optimized NPOT head dimensions: 48, 80, 96, 160, 192, 224
- Note: older AOTriton releases do support these head dimensions, but inputs with these head dimensions were previously padded and loaded/accumulated into power-of-two in-register tensors, which caused performance issues.
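The earlier padding behavior can be illustrated with a short sketch; `next_power_of_two` is a hypothetical helper for illustration, not an AOTriton API:

```python
def next_power_of_two(d: int) -> int:
    # Older AOTriton loaded/accumulated NPOT head dimensions into
    # power-of-two in-register tensors, so part of the compute was
    # spent on padding.
    p = 1
    while p < d:
        p *= 2
    return p

# The newly optimized NPOT head dims and the in-register sizes
# older releases would have padded them to:
padded = {d: next_power_of_two(d) for d in (48, 80, 96, 160, 192, 224)}
```

For example, head dimension 192 was previously padded to 256, so a quarter of the in-register compute was wasted on padding.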
- Port 20250128 main perf kernel by @xinyazhang in #70
- Use Philox64x4 PRNG, and remove RETURN_ENCODED_SOFTMAX=True variant from compiled forward kernel by @xinyazhang in #71
  - Internally the Philox64x4 PRNG no longer converts its output to fp32; instead it views the i64x4 outputs as i32x8 and compares them with `idropout_p`, which is converted to i32 from the fp32 `dropout_p`.
  - `debug_fill_dropout_rng` and `debug_fill_dropout_rng_tensor` are deprecated and will be removed in 0.10, since they still use the Philox32x1 PRNG and their outputs do not match the actual PRNG values used in the dropout process.
  - `debug_simulate_encoded_softmax` is the new API to obtain the outputs of the Philox64x4 PRNG. It writes 0.5 if the PRNG value is greater than `dropout_p`, and -0.5 otherwise.
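The comparison convention above can be sketched in NumPy; `encode_softmax_mask` is a hypothetical illustration, and the exact fp32-to-i32 conversion of `dropout_p` is an assumption, not the library's actual code:

```python
import numpy as np

def encode_softmax_mask(prng_i64: np.ndarray, dropout_p: float) -> np.ndarray:
    """Hypothetical sketch of the described convention, not AOTriton code."""
    # View the i64x4 PRNG outputs as i32x8, as the release notes describe.
    prng_i32 = prng_i64.view(np.int32)
    # Assumed conversion of the fp32 dropout_p to an i32 threshold.
    idropout_p = np.int32(dropout_p * 0x7FFFFFFF)
    # Write 0.5 where the PRNG value exceeds the threshold, -0.5 otherwise.
    return np.where(prng_i32 > idropout_p, np.float32(0.5), np.float32(-0.5))
```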
- Add "AOTriton .." string to .comment section of libaotriton_v2.so by @xinyazhang in #74
- Now you can verify the precise AOTriton version with `readelf -p .comment libaotriton_v2.so`. An example output is:

  ```
  [     0]  GCC: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  [    2c]  AOTriton 0.9.0
  ```
- Enable Persistent Dynamic for Causal if input is not varlen by @xinyazhang in #73
  - For the `v2::flash::attn_fwd` API, `atomic_for_causal` must be a one-element GPU tensor with zero value if `is_causal` is `true`.
- Fused BWD Kernel by @BinDinAMD in #69
- No support for hdim > 256 due to register pressure.
- Empirically, this kernel outperforms the split kernel when hdim*seqlen <= 64 * 512.
- When using the fused bwd kernel, the `delta` tensor is not needed.
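The kernel-selection rule implied by these notes can be sketched as a hypothetical heuristic (`prefer_fused_bwd` is illustrative only, not an AOTriton API):

```python
def prefer_fused_bwd(head_dim: int, seqlen: int) -> bool:
    # The fused kernel does not support head_dim > 256
    # due to register pressure.
    if head_dim > 256:
        return False
    # Empirically, the fused kernel outperforms the split kernel
    # when head_dim * seqlen <= 64 * 512.
    return head_dim * seqlen <= 64 * 512
```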
- Misc changes and performance tuning for 0.9b release by @xinyazhang in #76
- Initial RX 9070XT support
- The `softmax_lse` tensor is now optional in the forward kernel. For inference-only use, this tensor is not needed.
Known Problems
- There are unidentified memory alignment requirements on input/output tensors. If possible, pad the input tensor shapes to multiples of 8 for safety (except for the batch dimension).
- FA kernels built for the 9070XT and gfx950 may cause GPU segfaults under certain unidentified conditions.
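To work around the alignment issue, input shapes can be padded before allocation; the helper below is a hypothetical sketch, assuming every non-batch dimension is rounded up to a multiple of 8:

```python
def pad_shape_to_multiple_of_8(shape: tuple) -> tuple:
    # Keep dim 0 (batch) as-is; round every other dimension up
    # to the next multiple of 8 for safety.
    return (shape[0],) + tuple((d + 7) // 8 * 8 for d in shape[1:])
```

For example, a (batch, heads, seqlen, head_dim) shape of (3, 5, 100, 48) would be padded to (3, 8, 104, 48).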
Full Changelog: 0.8b...0.9.2b
AOTriton 0.9.1 Beta
- 0.9b has a problem in its linker script. This point release is a hotfix for that problem. Users should use 0.9.1b instead of 0.9b.
- The `.comment` section still uses 0.9 rather than 0.9.1 due to the urgency of the fix.
Full Changelog: 0.9b...0.9.1b
AOTriton 0.9 Beta
There is a bug in the linker script; users should use the upcoming 0.9.1b release instead.
The release notes will be moved there as well.
Full Changelog: 0.8b...0.9b
AOTriton 0.8.2 Beta
0.8.2b is an emergency fix
We received reports of a bug causing NaN while fine-tuning the llama3.2-11b vision model. Anyone who uses 0.8 is recommended to upgrade to 0.8.2b.
Note: AOTriton 0.8.1b adds head dimension 512 support, and thus the binary size increases compared with 0.8b.
This is a point release
0.8.2b can be used as a drop-in replacement for the 0.7b shared object file.
What's Changed
- Fix missing batch_index when calculating bias pointer by @xinyazhang in #68
Full Changelog: 0.8.1b...0.8.2b
AOTriton 0.8.1 Beta
What's Changed
- Support Head Dimension 512 by @xinyazhang in #67
Note: this is not recommended unless you need to support head dimension 512 immediately.
Full Changelog: 0.8b...0.8.1b
AOTriton 0.8 Beta
What's Changed
- Add PyTorch compatibility matrix to README.md by @xinyazhang in #41
- Add cmake option AOTRITON_NAME_SUFFIX to resolve name conflicts by @xinyazhang in #42
- Merge improvements of 0.7.1b release into main by @xinyazhang in #46
- Code Clean Up by @xinyazhang in #48
- GQA Support by @xinyazhang in #49
- Kernel Storage V2 by @xinyazhang in #50
- Add docker based package builder and switch to system compiler by @xinyazhang in #51
- Add versioning support in multiple levels. by @xinyazhang in #53
- Restore the support of causal=True and seqlen_q != seqlen_k by @xinyazhang in #55
- Misc changes and performance tuning for 0.8b release by @xinyazhang in #57
Full Changelog: 0.7b...0.8b
AOTriton 0.7.3 Beta
0.7.3b is an emergency fix for 0.7.2b
0.7.2b has been removed from release due to a correctness bug.
This is a point release
0.7.3b can be used as a drop-in replacement for the 0.7b or 0.7.xb shared object file.
What's Changed (Compared with 0.7.1b)
- Fix varlen related implementation errors
- Fix NaN output when `sm_scale=0.0`, which was introduced in #45
- Fix NaN output for large numerical errors. See #54 for more details.
  - The fix in 0.7.2b introduced a bug. 0.7.3b is released to revise this fix.

Note the two fixes for NaN may have some performance impact.
Full Changelog: 0.7.1b...0.7.3b
(DO NOT USE) AOTriton 0.7.2 Beta
CAVEAT: DO NOT USE THIS RELEASE
Commit 14d673f introduced a bug.
We are going to release 0.7.3 instead for a fix.
The binary tarballs are deleted to prevent accidental usages.
This is a point release
0.7.2b can be used as a drop-in replacement for the 0.7b or 0.7.1b shared object file.
What's Changed
- Fix varlen related implementation errors
- Fix NaN output when `sm_scale=0.0`, which was introduced in #45
- Fix NaN output for large numerical errors. See #54 for more details.
Note the two fixes for NaN may have some performance impact.
Full Changelog: 0.7.1b...0.7.2b
AOTriton 0.7.1 Beta
This is a point release
0.7.1b can be used as a drop-in replacement for the 0.7b shared object file.
What's Changed
- Ignore colon suffixes in gcnArchName by @xinyazhang in #44
- FA Kernel Update for Accuracy and Performance by @xinyazhang in #45
Full Changelog: 0.7b...0.7.1b
AOTriton 0.7 Beta
What's Changed
- Default to Shared Object by @jithunnair-amd in #33
- Add varlen support to AOTriton's Flash Attention by @xinyazhang in #31
- Switch to upstream Triton compiler, and related changes by @xinyazhang in #36
- Improve Backward Performance and Experimental Navi31 Support by @xinyazhang in #39
- Introduce new tuning system based on pre-compiled GPU kernels
- Navi 31's support is still experimental
- Support hipGraph usage in PyTorch by @xinyazhang in #40
- This changes the RNG API used by FA kernels.
- Switch to new testing scheme to match PyTorch 2.5's changes
New Contributors
- @jithunnair-amd made their first contribution in #33
Full Changelog: 0.6b...0.7b