
Releases: ROCm/aotriton

AOTriton 0.9.2 Beta

05 Mar 22:21

What's Changed from Release 0.9

  • Fix a linker script problem in 0.9b, which did not import the default linker script.
  • Fix a version number problem in 0.9.1b, which still reported 0.9.0 and could cause confusion.

What's Changed from Release 0.8

  • Initial support for gfx950 by @xinyazhang in #64
    • Note: gfx950 support is fully experimental, not built by default, and not shipped in the release binary packages unless explicitly stated in the package name
  • Non Power of Two (NPOT) head dimension Optimization by @BinDinAMD and @xinyazhang in #66
    • Newly added optimized NPOT head dimensions: 48, 80, 96, 160, 192, 224
    • Note: older AOTriton releases already support these head dimensions, but inputs with these head dimensions were previously padded, loaded, and accumulated into power-of-two in-register tensors, causing performance issues.
  • Port 20250128 main perf kernel by @xinyazhang in #70
  • Use Philox64x4 PRNG, and remove RETURN_ENCODED_SOFTMAX=True variant from compiled forward kernel by @xinyazhang in #71
    • Internally, the Philox64x4 PRNG no longer converts its output to fp32. Instead, it views the i64x4 outputs as i32x8 and compares them with idropout_p, which is converted to i32 from the fp32 dropout_p.
    • debug_fill_dropout_rng and debug_fill_dropout_rng_tensor are deprecated and will be removed in 0.10, since they still use the Philox32x1 PRNG and their outputs do not match the actual PRNG values used in the dropout process.
    • debug_simulate_encoded_softmax is the new API to get the outputs of the Philox64x4 PRNG. It writes 0.5 if the PRNG output is greater than dropout_p, and -0.5 otherwise.
  • Add "AOTriton .." string to .comment section of libaotriton_v2.so by @xinyazhang in #74
    • Now you can verify the precise AOTriton version with readelf -p .comment libaotriton_v2.so. An example output is:
[     0]  GCC: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
[    2c]  AOTriton 0.9.0
  • Enable Persistent Dynamic for Causal if input is not varlen by @xinyazhang in #73
    • For the v2::flash::attn_fwd API, atomic_for_causal must be a one-element GPU tensor with value zero if is_causal is true.
  • Fused BWD Kernel by @BinDinAMD in #69
    • No support for hdim > 256 due to register pressure.
    • Empirically, this kernel outperforms the split kernel when hdim*seqlen <= 64 * 512.
    • When using the fused bwd kernel, the delta tensor is not needed.
  • Misc changes and performance tuning for 0.9b release by @xinyazhang in #76
    • Initial RX 9070XT support
    • The softmax_lse tensor is now optional in the forward kernel. Inference-only workloads do not need this tensor.
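The integer-domain dropout comparison described above can be sketched in plain Python. This is a hypothetical illustration only: the helper names and the exact mapping of fp32 dropout_p onto the integer range are assumptions, not AOTriton's actual implementation.

```python
import struct

def u32_threshold(dropout_p: float) -> int:
    # Hypothetical mapping of dropout_p in [0, 1) onto the u32 range;
    # AOTriton's actual idropout_p conversion may differ.
    return int(dropout_p * 2**32)

def u32_words(raw_i64x4) -> list:
    # Reinterpret four i64 PRNG outputs as eight 32-bit words,
    # mirroring the i64x4 -> i32x8 view described in the notes
    # (no conversion to fp32).
    packed = struct.pack("<4q", *raw_i64x4)
    return list(struct.unpack("<8I", packed))

def simulate_encoded_softmax(raw_i64x4, dropout_p: float) -> list:
    # Sketch of the documented debug_simulate_encoded_softmax semantics:
    # 0.5 where the PRNG draw clears the dropout threshold, -0.5 otherwise.
    t = u32_threshold(dropout_p)
    return [0.5 if w >= t else -0.5 for w in u32_words(raw_i64x4)]
```

The point of the integer comparison is that each raw PRNG word is compared directly against a precomputed integer threshold, so no per-element fp32 conversion is needed in the kernel's inner loop.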

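The fused-backward guidance above amounts to a simple decision rule. The helper below is a hypothetical sketch of that rule, not an AOTriton API; the crossover point comes straight from the empirical note in the changelog.

```python
def prefer_fused_bwd(hdim: int, seqlen: int) -> bool:
    # Kernel-selection sketch based on the release notes:
    # the fused backward kernel does not support hdim > 256
    # (register pressure), and empirically outperforms the
    # split kernel when hdim * seqlen <= 64 * 512.
    FUSED_CROSSOVER = 64 * 512  # empirical crossover vs. the split kernel
    return hdim <= 256 and hdim * seqlen <= FUSED_CROSSOVER
```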
Known Problems

  • There are unidentified memory alignment requirements on input/output tensors. If possible, pad the input tensor shapes to multiples of 8 for safety (except for the batch dimension).
  • FA kernels built for the 9070XT and gfx950 may cause GPU segfaults under certain unidentified conditions.
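The padding workaround above can be computed with standard round-up arithmetic. The helper below is a hypothetical convenience function, not part of AOTriton:

```python
def pad_dim(n: int, multiple: int = 8) -> int:
    # Round a tensor dimension up to the next multiple of 8, per the
    # alignment workaround above (skip this for the batch dimension).
    # Ceiling division via negation keeps it branch-free.
    return -(-n // multiple) * multiple
```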

Full Changelog: 0.8b...0.9.2b

AOTriton 0.9.1 Beta

05 Mar 01:42
  • 0.9b has a problem in the linker script; this point release is a hotfix for that problem. Users should use 0.9.1b instead of 0.9b.
    • The .comment section still reports 0.9 rather than 0.9.1 due to the urgency of the fix.

Full Changelog: 0.9b...0.9.1b

AOTriton 0.9 Beta

04 Mar 08:04
f539cf9

There is a bug in the linker script; users should use the upcoming 0.9.1b release instead.

The release notes will be moved there as well.

Full Changelog: 0.8b...0.9b

AOTriton 0.8.2 Beta

23 Jan 17:58
b24f43a

0.8.2b is an emergency fix

We received reports of a bug causing NaN while fine-tuning the llama3.2-11b vision model. Anyone using 0.8 is recommended to upgrade to 0.8.2b.

Note: AOTriton 0.8.1b adds head dimension 512 support and thus the binary size increases compared with 0.8b

This is a point release

0.8.2b can be used as a drop-in replacement for the 0.7b shared object file.

What's Changed

  • Fix missing batch_index when calculating bias pointer by @xinyazhang in #68

Full Changelog: 0.8.1b...0.8.2b

AOTriton 0.8.1 Beta

14 Jan 16:54
3a80554

What's Changed

Note: this release is not recommended unless you need head dimension 512 support immediately.

Full Changelog: 0.8b...0.8.1b

AOTriton 0.8 Beta

26 Nov 19:42
6f8cbca

What's Changed

Full Changelog: 0.7b...0.8b

AOTriton 0.7.3 Beta

20 Nov 00:56

0.7.3b is an emergency fix for 0.7.2b

0.7.2b has been removed from release due to a correctness bug.

This is a point release

0.7.3b can be used as a drop-in replacement for the 0.7b or 0.7.xb shared object file.

What's Changed (Compared with 0.7.1b)

  • Fix varlen related implementation errors
  • Fix NaN output when sm_scale=0.0, which was introduced in #45
  • Fix NaN output for large numerical errors. See #54 for more details.
    • The fix in 0.7.2b introduced a bug; 0.7.3b is released to revise that fix.

Note that the two NaN fixes may have some performance impact.

Full Changelog: 0.7.1b...0.7.3b

(DO NOT USE) AOTriton 0.7.2 Beta

08 Nov 20:32

CAVEAT: DO NOT USE THIS RELEASE

Commit 14d673f introduced a bug.
We are going to release 0.7.3 instead for a fix.

The binary tarballs have been deleted to prevent accidental usage.

This is a point release

0.7.2b can be used as a drop-in replacement for the 0.7b or 0.7.1b shared object file.

What's Changed

  • Fix varlen related implementation errors
  • Fix NaN output when sm_scale=0.0, which was introduced in #45
  • Fix NaN output for large numerical errors. See #54 for more details.

Note that the two NaN fixes may have some performance impact.

Full Changelog: 0.7.1b...0.7.2b

AOTriton 0.7.1 Beta

04 Oct 21:50
f6b28a9

This is a point release

0.7.1b can be used as a drop-in replacement for the 0.7b shared object file.

What's Changed

Full Changelog: 0.7b...0.7.1b

AOTriton 0.7 Beta

23 Aug 16:19
9be0406

What's Changed

  • Default to Shared Object by @jithunnair-amd in #33
  • Add varlen support to AOTriton's Flash Attention by @xinyazhang in #31
  • Switch to upstream Triton compiler, and related changes by @xinyazhang in #36
  • Improve Backward Performance and Experimental Navi31 Support by @xinyazhang in #39
    • Introduce new tuning system based on pre-compiled GPU kernels
    • Navi 31 support is still experimental
  • Support hipGraph usage in PyTorch by @xinyazhang in #40
    • This changes the RNG API used by FA kernels.
    • Switch to new testing scheme to match PyTorch 2.5's changes

New Contributors

Full Changelog: 0.6b...0.7b