TensorRT-LLM v0.13 Update #2269

Merged 2 commits on Sep 30, 2024
Conversation

@Shixiaowei02 Shixiaowei02 commented Sep 30, 2024

TensorRT-LLM Release 0.13.0

Key Features and Enhancements

  • Supported lookahead decoding (experimental), see docs/source/speculative_decoding.md.
  • Added enhancements to the ModelWeightsLoader, a unified checkpoint converter (see docs/source/architecture/model-weights-loader.md).
    • Supported Qwen models.
    • Supported auto-padding for indivisible TP shape in INT4-wo/INT8-wo/INT4-GPTQ.
    • Improved loading performance for *.bin and *.pth checkpoint files.
  • Supported OpenAI Whisper in C++ runtime.
  • Added enhancements to the LLM class (a minimal usage sketch follows this list).
    • Supported LoRA.
    • Supported engine building using dummy weights.
    • Supported trust_remote_code for customized models and tokenizers downloaded from Hugging Face Hub.
  • Supported beam search for streaming mode.
  • Supported tensor parallelism for Mamba2.
  • Supported returning generation logits for streaming mode.
  • Added curand and bfloat16 support for ReDrafter.
  • Added sparse mixer normalization mode for MoE models.
  • Added support for QKV scaling in FP8 FMHA.
  • Supported FP8 for MoE LoRA.
  • Supported KV cache reuse for P-Tuning and LoRA.
  • Supported in-flight batching for CogVLM models.
  • Supported LoRA for the ModelRunnerCpp class.
  • Supported head_size=48 cases for FMHA kernels.
  • Added FP8 examples for DiT models, see examples/dit/README.md.
  • Supported decoder with encoder input features for the C++ executor API.
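
As a minimal illustration of the LLM class items above, here is a sketch assuming the high-level Python API (tensorrt_llm.LLM and SamplingParams). The model id is a placeholder, and the trust_remote_code keyword is inferred from this changelog rather than confirmed against the exact 0.13 signature.

```python
# Minimal sketch of the LLM class usage described above.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",  # hypothetical Hugging Face Hub model id
    trust_remote_code=True,          # allow custom model/tokenizer code from the Hub
)

sampling_params = SamplingParams(max_tokens=32)

for output in llm.generate(["What is speculative decoding?"], sampling_params):
    print(output.outputs[0].text)
```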

API Changes

  • [BREAKING CHANGE] Set use_fused_mlp to True by default.
  • [BREAKING CHANGE] Enabled multi_block_mode by default.
  • [BREAKING CHANGE] Enabled strongly_typed by default in builder API.
  • [BREAKING CHANGE] Renamed maxNewTokens, randomSeed and minLength to maxTokens, seed and minTokens, following OpenAI style (illustrated in the sketch after this list).
  • The LLM class
    • [BREAKING CHANGE] Updated LLM.generate arguments to include PromptInputs and tqdm.
  • The C++ executor API
    • [BREAKING CHANGE] Added LogitsPostProcessorConfig.
    • Added FinishReason to Result.
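
The renames above map onto the Python-side sampling options roughly as follows; the snake_case spellings are assumptions derived from the C++ names, so treat this as a sketch rather than the definitive 0.13 signature.

```python
# Hedged sketch of the OpenAI-style renames; the snake_case spellings
# (max_tokens, seed, min_tokens) are assumptions based on the C++ names.
from tensorrt_llm import SamplingParams

params = SamplingParams(
    max_tokens=64,  # was max_new_tokens (maxNewTokens)
    seed=42,        # was random_seed (randomSeed)
    min_tokens=1,   # was min_length (minLength)
)
```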

Model Updates

  • Supported Gemma 2, see "Run Gemma 2" section in examples/gemma/README.md.

Fixed Issues

Infrastructure Changes

  • Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.07-py3.
  • Base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.07-py3.
  • The dependent TensorRT version is updated to 10.4.0.
  • The dependent CUDA version is updated to 12.5.1.
  • The dependent PyTorch version is updated to 2.4.0 (a quick runtime check of these versions is sketched after this list).
  • The dependent ModelOpt version is updated to v0.15.
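
As a quick sanity check of these dependency bumps, the sketch below assumes it runs inside a container built on the updated base image, with torch and tensorrt importable; the expected versions come from the list above.

```python
# Minimal sketch: verify the updated dependencies from inside a container
# based on nvcr.io/nvidia/pytorch:24.07-py3.
import tensorrt
import torch

print("TensorRT:", tensorrt.__version__)  # expected 10.4.0 per these notes
print("PyTorch: ", torch.__version__)     # expected 2.4.0 per these notes
print("CUDA:    ", torch.version.cuda)    # expected 12.5.x per these notes
```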

@byshiue byshiue left a comment


LGTM
