TensorRT-LLM v0.13 Update #2269

Merged 2 commits on Sep 30, 2024
Conversation

@Shixiaowei02 Shixiaowei02 commented Sep 30, 2024

TensorRT-LLM Release 0.13.0

Key Features and Enhancements

  • Supported lookahead decoding (experimental), see docs/source/speculative_decoding.md.
  • Added enhancements to the ModelWeightsLoader, a unified checkpoint converter (see docs/source/architecture/model-weights-loader.md).
    • Supported Qwen models.
    • Supported auto-padding for indivisible TP shape in INT4-wo/INT8-wo/INT4-GPTQ.
    • Improved loading performance for *.bin and *.pth checkpoint files.
  • Supported OpenAI Whisper in C++ runtime.
  • Added enhancements to the LLM class (a minimal usage sketch follows this list).
    • Supported LoRA.
    • Supported engine building using dummy weights.
    • Supported trust_remote_code for customized models and tokenizers downloaded from Hugging Face Hub.
  • Supported beam search for streaming mode.
  • Supported tensor parallelism for Mamba2.
  • Supported returning generation logits for streaming mode.
  • Added curand and bfloat16 support for ReDrafter.
  • Added sparse mixer normalization mode for MoE models.
  • Added support for QKV scaling in FP8 FMHA.
  • Supported FP8 for MoE LoRA.
  • Supported KV cache reuse for P-Tuning and LoRA.
  • Supported in-flight batching for CogVLM models.
  • Supported LoRA for the ModelRunnerCpp class.
  • Supported head_size=48 cases for FMHA kernels.
  • Added FP8 examples for DiT models, see examples/dit/README.md.
  • Supported decoder with encoder input features for the C++ executor API.
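
As a minimal illustration of the LLM class items above, here is a sketch assuming the high-level Python API (tensorrt_llm.LLM and SamplingParams). The model id is a placeholder, and the trust_remote_code keyword is inferred from this changelog rather than confirmed against the exact 0.13 signature.

```python
# Minimal sketch of the LLM class usage described above.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",  # hypothetical Hugging Face Hub model id
    trust_remote_code=True,          # allow custom model/tokenizer code from the Hub
)

sampling_params = SamplingParams(max_tokens=32)

for output in llm.generate(["What is speculative decoding?"], sampling_params):
    print(output.outputs[0].text)
```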

API Changes

  • [BREAKING CHANGE] Set use_fused_mlp to True by default.
  • [BREAKING CHANGE] Enabled multi_block_mode by default.
  • [BREAKING CHANGE] Enabled strongly_typed by default in builder API.
  • [BREAKING CHANGE] Renamed maxNewTokens, randomSeed and minLength to maxTokens, seed and minTokens, following OpenAI style (illustrated in the sketch after this list).
  • The LLM class
    • [BREAKING CHANGE] Updated LLM.generate arguments to include PromptInputs and tqdm.
  • The C++ executor API
    • [BREAKING CHANGE] Added LogitsPostProcessorConfig.
    • Added FinishReason to Result.
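
The renames above map onto the Python-side sampling options roughly as follows; the snake_case spellings are assumptions derived from the C++ names, so treat this as a sketch rather than the definitive 0.13 signature.

```python
# Hedged sketch of the OpenAI-style renames; the snake_case spellings
# (max_tokens, seed, min_tokens) are assumptions based on the C++ names.
from tensorrt_llm import SamplingParams

params = SamplingParams(
    max_tokens=64,  # was max_new_tokens (maxNewTokens)
    seed=42,        # was random_seed (randomSeed)
    min_tokens=1,   # was min_length (minLength)
)
```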

Model Updates

  • Supported Gemma 2, see "Run Gemma 2" section in examples/gemma/README.md.

Fixed Issues

Infrastructure Changes

  • Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.07-py3.
  • Base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.07-py3.
  • The dependent TensorRT version is updated to 10.4.0.
  • The dependent CUDA version is updated to 12.5.1.
  • The dependent PyTorch version is updated to 2.4.0 (a quick runtime check of these versions is sketched after this list).
  • The dependent ModelOpt version is updated to v0.15.
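
As a quick sanity check of these dependency bumps, the sketch below assumes it runs inside a container built on the updated base image, with torch and tensorrt importable; the expected versions come from the list above.

```python
# Minimal sketch: verify the updated dependencies from inside a container
# based on nvcr.io/nvidia/pytorch:24.07-py3.
import tensorrt
import torch

print("TensorRT:", tensorrt.__version__)  # expected 10.4.0 per these notes
print("PyTorch: ", torch.__version__)     # expected 2.4.0 per these notes
print("CUDA:    ", torch.version.cuda)    # expected 12.5.x per these notes
```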

@byshiue byshiue left a comment


LGTM
