
Commit 5955b8a

kaiyux and Shixiaowei02 authored

Update TensorRT-LLM Release branch (#1192)

* Update TensorRT-LLM

Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
1 parent 2f169d1 commit 5955b8a

File tree

1,337 files changed: +3,804,632 −2,009,981 lines


.dockerignore (+2)

```diff
@@ -1,5 +1,7 @@
 build
 cpp/*build*
+cpp/cmake-*
+cpp/.ccache
 cpp/tests/resources/models
 tensorrt_llm/libs
 **/__pycache__
```
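The two new `.dockerignore` entries keep CMake build directories and the ccache cache out of the Docker build context. As a rough illustration of how such patterns match paths (Docker itself uses Go's `filepath.Match` semantics, so Python's `fnmatch` is only an approximation for this sketch):

```python
from fnmatch import fnmatch

# Patterns added or already present in this commit's .dockerignore.
# Note: Docker's real matching differs in edge cases; this is illustrative only.
patterns = ["cpp/cmake-*", "cpp/.ccache", "**/__pycache__"]

def ignored(path):
    """Return True if any ignore pattern matches the given path."""
    return any(fnmatch(path, pat) for pat in patterns)
```

For example, `ignored("cpp/cmake-build-release")` is true, while a source path such as `cpp/tests/resources/models` is not matched by these three patterns.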

.github/ISSUE_TEMPLATE/bug_report.yml (new file, +116)

```yaml
name: "Bug Report"
description: Submit a bug report to help us improve TensorRT-LLM
labels: [ "bug" ]
body:
  - type: textarea
    id: system-info
    attributes:
      label: System Info
      description: Please share your system info with us.
      placeholder: |
        - CPU architecture (e.g., x86_64, aarch64)
        - CPU/Host memory size (if known)
        - GPU properties
          - GPU name (e.g., NVIDIA H100, NVIDIA A100, NVIDIA L40S)
          - GPU memory size (if known)
          - Clock frequencies used (if applicable)
        - Libraries
          - TensorRT-LLM branch or tag (e.g., main, v0.7.1)
          - TensorRT-LLM commit (if known)
          - Versions of TensorRT, AMMO, CUDA, cuBLAS, etc. used
          - Container used (if running TensorRT-LLM in a container)
          - NVIDIA driver version
          - OS (Ubuntu 22.04, CentOS 7, Windows 10)
        - Any other information that may be useful in reproducing the bug
    validations:
      required: true

  - type: textarea
    id: who-can-help
    attributes:
      label: Who can help?
      description: |
        To expedite the response to your issue, it would be helpful if you could identify the appropriate person
        to tag using the **@** symbol. Here is a general guideline on **whom to tag**.

        Rest assured that all issues are reviewed by the core maintainers. If you are unsure about whom to tag,
        you can leave it blank, and a core maintainer will make sure to involve the appropriate person.

        Please tag fewer than 3 people.

        Quantization: @Tracin

        Documentation: @juney-nvidia

        Feature request: @ncomly-nvidia

        Performance: @kaiyux

        Others: @byshiue

      placeholder: "@Username ..."

  - type: checkboxes
    id: information-scripts-examples
    attributes:
      label: Information
      description: 'The problem arises when using:'
      options:
        - label: "The official example scripts"
        - label: "My own modified scripts"

  - type: checkboxes
    id: information-tasks
    attributes:
      label: Tasks
      description: "The tasks I am working on are:"
      options:
        - label: "An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)"
        - label: "My own task or dataset (give details below)"

  - type: textarea
    id: reproduction
    validations:
      required: true
    attributes:
      label: Reproduction
      description: |
        Kindly share a code example that demonstrates the issue you encountered. It is recommended to provide a code snippet directly.
        Additionally, if you have any error messages or stack traces related to the problem, please include them here.

        Remember to use code tags to properly format your code. You can refer to
        https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting for guidance on code formatting.

        Please refrain from using screenshots, as they can be difficult to read and prevent others from copying and pasting your code.
        It would be most helpful if we could reproduce your issue by simply copying and pasting your scripts and code.

      placeholder: |
        Steps to reproduce the behavior:

        1.
        2.
        3.

  - type: textarea
    id: expected-behavior
    validations:
      required: true
    attributes:
      label: Expected behavior
      description: "Provide a brief summary of the expected behavior of the software. Provide output files or examples if possible."

  - type: textarea
    id: actual-behavior
    validations:
      required: true
    attributes:
      label: Actual behavior
      description: "Describe the actual behavior of the software and how it deviates from the expected behavior. Provide output files or examples if possible."

  - type: textarea
    id: additional-notes
    validations:
      required: true
    attributes:
      label: Additional notes
      description: "Provide any additional context you think might be useful for the TensorRT-LLM team to help debug this issue (such as experiments done, potential things to investigate)."
```
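GitHub issue forms like the one above have a few structural invariants: required top-level keys, known block types, and unique `id`s per body block. A minimal sketch of such a check, operating on the already-parsed YAML (e.g. the dict returned by PyYAML's `safe_load`) so no YAML dependency is needed here — this is an illustration, not GitHub's actual validator:

```python
def validate_issue_form(form):
    """Return a list of structural problems found in a parsed issue form."""
    problems = []
    # Issue forms require these top-level keys.
    for key in ("name", "description", "body"):
        if key not in form:
            problems.append(f"missing top-level key: {key}")
    seen_ids = set()
    for i, block in enumerate(form.get("body", [])):
        # Each body block must have a recognized type.
        if block.get("type") not in {"textarea", "checkboxes", "input",
                                     "dropdown", "markdown"}:
            problems.append(f"body[{i}]: unknown type {block.get('type')!r}")
        # Block ids, when present, must be unique.
        block_id = block.get("id")
        if block_id in seen_ids:
            problems.append(f"body[{i}]: duplicate id {block_id!r}")
        if block_id:
            seen_ids.add(block_id)
    return problems
```

Running this over the parsed `bug_report.yml` above would return an empty list, since every block has a distinct `id` and a supported `type`.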

.pre-commit-config.yaml (+4 −2)

```diff
@@ -15,7 +15,8 @@ repos:
     rev: v4.1.0
     hooks:
       - id: check-added-large-files
-        exclude: 'cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/'
+        exclude: |
+          (?x)^(.*cubin.cpp)$
       - id: check-merge-conflict
       - id: check-symlinks
       - id: detect-private-key
@@ -45,4 +46,5 @@ repos:
         args:
           - --skip=".git,3rdparty"
           - --exclude-file=examples/whisper/tokenizer.py
-          - --ignore-words-list=rouge,inout,atleast,strat
+          - --ignore-words-list=rouge,inout,atleast,strat,nd
+        exclude: 'tests/llm-test-defs/turtle/test_input_files'
```
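The new `check-added-large-files` exclude broadens the old directory prefix to a verbose-mode regex that skips any generated `cubin.cpp` source anywhere in the tree. A quick sketch of how that pattern behaves (file names below are made up for illustration):

```python
import re

# The exclude pattern from the updated pre-commit config. (?x) enables
# verbose mode, which would ignore inline whitespace and comments.
pattern = re.compile(r"(?x)^(.*cubin.cpp)$")

# Hypothetical paths: the first two are generated kernel sources, the third
# is ordinary hand-written code.
paths = [
    "cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmha_cubin.cpp",
    "cpp/tensorrt_llm/kernels/xqa_kernel_cubin.cpp",
    "cpp/tensorrt_llm/common/logger.cpp",
]
excluded = [p for p in paths if pattern.match(p)]
```

Only the two `*cubin.cpp` paths match, so the large-file check still applies to regular sources regardless of directory.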

CHANGELOG.md (+36)

```diff
@@ -1,5 +1,41 @@
 # Change Log
 
+## Versions 0.7.0 / 0.7.1
+
+* Models
+  - BART and mBART support in encoder-decoder models
+  - FairSeq Neural Machine Translation (NMT) family
+  - Mixtral-8x7B model
+    - Support weight loading for HuggingFace Mixtral model
+  - OpenAI Whisper
+  - Mixture of Experts support
+  - MPT - Int4 AWQ / SmoothQuant support
+  - Baichuan FP8 quantization support
+* Features
+  - [Preview] Speculative decoding
+  - Add Python binding for `GptManager`
+  - Add a Python class `ModelRunnerCpp` that wraps C++ `gptSession`
+  - System prompt caching
+  - Enable split-k for weight-only cutlass kernels
+  - FP8 KV cache support for XQA kernel
+  - New Python builder API and `trtllm-build` command (already applied to [blip2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/blip2) and [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt#3-build-tensorrt-engines))
+  - Support `StoppingCriteria` and `LogitsProcessor` in the Python generate API (thanks to the contribution from @zhang-ge-hao)
+  - fMHA support for chunked attention and paged kv cache
+* Bug fixes
+  - Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
+  - Fix LLaMa with LoRA error #637
+  - Fix LLaMA GPTQ failure #580
+  - Fix Python binding for InferenceRequest issue #528
+  - Fix CodeLlama SQ accuracy issue #453
+* Performance
+  - MMHA optimization for MQA and GQA
+  - LoRA optimization: cutlass grouped gemm
+  - Optimize Hopper warp specialized kernels
+  - Optimize AllReduce for parallel attention on Falcon and GPT-J
+  - Enable split-k for weight-only cutlass kernel when SM>=75
+* Documentation
+  - Add [documentation for new builder workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/new_workflow.md)
+
 ## Versions 0.6.0 / 0.6.1
 
 * Models
```
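Among the 0.7.x features, `StoppingCriteria` and `LogitsProcessor` are hooks into the generation loop: one edits token scores before sampling, the other decides when to stop. The toy, framework-agnostic sketch below shows the idea with plain Python lists; the names, signatures, and greedy loop here are illustrative assumptions, not TensorRT-LLM's actual interfaces:

```python
def ban_token_processor(banned_id):
    """Hypothetical logits processor: mask one token id out of the scores."""
    def process(step, logits):
        logits = list(logits)
        logits[banned_id] = float("-inf")  # the banned token can never win
        return logits
    return process

def max_steps_criteria(limit):
    """Hypothetical stopping criteria: end generation after `limit` tokens."""
    def should_stop(step, generated_ids):
        return len(generated_ids) >= limit
    return should_stop

def greedy_generate(logits_per_step, processors, criteria):
    """Toy greedy decode loop wiring both kinds of hook together."""
    out = []
    for step, logits in enumerate(logits_per_step):
        for proc in processors:           # processors run before token choice
            logits = proc(step, logits)
        out.append(max(range(len(logits)), key=lambda i: logits[i]))
        if any(stop(step, out) for stop in criteria):
            break                         # any criteria can end generation
    return out
```

For instance, with per-step scores `[0.1, 0.9, 0.2]`, banning token 1 makes the greedy choice fall back to token 2, and a max-steps criteria of 3 truncates the output to three tokens.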
