
Commit 250d9c2

Authored by kaiyux, Bhuvanesh09, mfuntowicz, Eddie-Wang1120, and megha95
Update TensorRT-LLM Release branch (#1445)
* Update TensorRT-LLM

---------

Co-authored-by: Bhuvanesh Sridharan <bhuvan.sridharan@gmail.com>
Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>
Co-authored-by: Eddie-Wang1120 <wangjinheng1120@163.com>
Co-authored-by: meghagarwal <16129366+megha95@users.noreply.github.com>
1 parent 37aee91 commit 250d9c2

File tree

1,038 files changed: +3,439,884 additions, −389,685 deletions


.clang-format (+1)

@@ -59,6 +59,7 @@ PenaltyBreakString: 1000
 PenaltyExcessCharacter: 1000000
 PenaltyReturnTypeOnItsOwnLine: 60
 PointerAlignment: Left
+QualifierAlignment: Right
 ReflowComments: true
 SeparateDefinitionBlocks: Always
 SortIncludes: CaseSensitive
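For context on the new option: `QualifierAlignment: Right` tells clang-format to place cv-qualifiers to the right of the type they modify ("east const"). A minimal before/after sketch with hypothetical declarations, not code from this repository:

```cpp
#include <string>

struct Config {};

// As a developer might write it ("west const"):
const std::string& getNameWest();
const int* findSlotWest(const Config& cfg);

// What clang-format produces with QualifierAlignment: Right ("east const"):
std::string const& getNameEast();
int const* findSlotEast(Config const& cfg);
```

The formatter only moves the qualifier relative to the type; the declarations keep the same meaning.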

.gitignore (+10)

@@ -17,6 +17,16 @@ venv/
 .local/
 .hypothesis/
 .idea/
+dump*/
+.trt-internal
+*.dot
+*.prof
+*.log
+*.pkl
+*.hdf5
+*.lock
+config.json
+/*.svg
 cpp/cmake-build-*
 cpp/.ccache/
 tensorrt_llm/libs

3rdparty/cutlass

Submodule cutlass updated (1,833 files)

CHANGELOG.md (+76 −1)

@@ -1,5 +1,80 @@
 # Change Log
 
+## Versions 0.8.0
+
+* Model Support
+  - Phi-1.5/2.0
+  - Mamba support (see examples/mamba/README.md)
+    - The support is limited to beam width = 1 and single-node single-GPU
+  - Nougat support (see examples/multimodal/README.md#nougat)
+  - Qwen-VL support (see examples/qwenvl/README.md)
+  - RoBERTa support, thanks to the contribution from @erenup
+  - Skywork model support
+  - Add example for multimodal models (BLIP with OPT or T5, LLaVA)
+* Features
+  - Chunked context support (see docs/source/gpt_attention.md#chunked-context)
+  - LoRA support for the C++ runtime (see docs/source/lora.md)
+  - Medusa decoding support (see examples/medusa/README.md)
+    - The support is limited to the Python runtime for Ampere or newer GPUs with fp16 and bf16 accuracy, and the `temperature` parameter of the sampling configuration should be 0
+  - StreamingLLM support for LLaMA (see docs/source/gpt_attention.md#streamingllm)
+  - Support for the batch manager to return logits from the context and/or generation phases
+    - Include support in the Triton backend
+  - Support AWQ and GPTQ for QWEN
+  - Support ReduceScatter plugin
+  - Support for combining `repetition_penalty` and `presence_penalty` #274
+  - Support for `frequency_penalty` #275
+  - OOTB functionality support:
+    - Baichuan
+    - InternLM
+    - Qwen
+    - BART
+  - LLaMA
+    - Support enabling INT4-AWQ along with FP8 KV Cache
+    - Support BF16 for weight-only plugin
+  - Baichuan
+    - P-tuning support
+    - INT4-AWQ and INT4-GPTQ support
+  - Decoder iteration-level profiling improvements
+  - Add `masked_select` and `cumsum` functions for modeling
+  - Smooth Quantization support for ChatGLM2-6B / ChatGLM3-6B / ChatGLM2-6B-32K
+  - Add weight-only support to Whisper #794, thanks to the contribution from @Eddie-Wang1120
+  - Support FP16 fMHA on NVIDIA V100 GPU
+* API
+  - Add a set of high-level APIs for end-to-end generation tasks (see examples/high-level-api/README.md)
+  - **[BREAKING CHANGES]** Migrate models to the new build workflow, including LLaMA, Mistral, Mixtral, InternLM, ChatGLM, Falcon, GPT-J, GPT-NeoX, Medusa, MPT, Baichuan and Phi (see docs/source/checkpoint.md)
+  - **[BREAKING CHANGES]** Deprecate the `LayerNorm` and `RMSNorm` plugins and remove the corresponding build parameters
+  - **[BREAKING CHANGES]** Remove the optional parameter `maxNumSequences` for GPT manager
+* Bug fixes
+  - Fix the issue of the first token being abnormal when `--gather_all_token_logits` is enabled #639
+  - Fix a build failure for LLaMA with LoRA enabled #673
+  - Fix the InternLM SmoothQuant build failure #705
+  - Fix Bloom int8_kv_cache functionality #741
+  - Fix a crash in `gptManagerBenchmark` #649
+  - Fix a Blip2 build error #695
+  - Add pickle support for `InferenceRequest` #701
+  - Fix a Mixtral-8x7b build failure with custom_all_reduce #825
+  - Fix the INT8 GEMM shape #935
+  - Minor bug fixes
+* Performance
+  - **[BREAKING CHANGES]** Increase the default `freeGpuMemoryFraction` parameter from 0.85 to 0.9 for higher throughput
+  - **[BREAKING CHANGES]** Disable the `enable_trt_overlap` argument for GPT manager by default
+  - Performance optimization of the beam search kernel
+  - Add bfloat16 and paged KV cache support for the optimized generation MQA/GQA kernels
+  - Custom AllReduce plugin performance optimization
+  - Top-P sampling performance optimization
+  - LoRA performance optimization
+  - Custom AllReduce performance optimization by introducing a ping-pong buffer to avoid an extra synchronization cost
+  - Integrate XQA kernels for GPT-J (beamWidth=4)
+* Documentation
+  - Batch manager arguments documentation updates
+  - Add documentation on best practices for tuning the performance of TensorRT-LLM (see docs/source/perf_best_practices.md)
+  - Add documentation for Falcon AWQ support (see examples/falcon/README.md)
+  - Update the `docs/source/checkpoint.md` documentation
+  - Update the AWQ INT4 weight-only quantization documentation for GPT-J
+  - Add blog: Speed up inference with SOTA quantization techniques in TRT-LLM
+  - Refine the TensorRT-LLM backend README structure #133
+  - Typo fix #739
+
 ## Versions 0.7.0 / 0.7.1
 
 * Models
@@ -34,7 +109,7 @@
   - Optimize AllReduce for parallel attention on Falcon and GPT-J
   - Enable split-k for weight-only cutlass kernel when SM>=75
 * Documentation
-  - Add [documentation for new builder workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/new_workflow.md)
+  - Add [documentation for the convert/build workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/checkpoint.md)
 
 ## Versions 0.6.0 / 0.6.1
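A side note on the `repetition_penalty`, `presence_penalty`, and `frequency_penalty` entries above: all three are conventionally applied to the logits of already-generated tokens before sampling. The sketch below shows the commonly used semantics (a multiplicative CTRL-style repetition penalty plus additive presence and frequency penalties); the function name and exact formulas are illustrative assumptions, not TensorRT-LLM's kernel code.

```cpp
#include <unordered_map>
#include <vector>

// Illustrative combination of sampling penalties (assumed semantics,
// not TensorRT-LLM's actual implementation).
void applyPenalties(std::vector<float>& logits, std::vector<int> const& outputIds,
    float repetitionPenalty, float presencePenalty, float frequencyPenalty)
{
    // Count how often each token id has been generated so far.
    std::unordered_map<int, int> counts;
    for (int id : outputIds)
        ++counts[id];

    for (auto const& [id, count] : counts)
    {
        float& logit = logits[id];
        // Repetition penalty (CTRL-style, multiplicative): shrink positive
        // logits and push negative logits further down for seen tokens.
        logit = logit > 0.0f ? logit / repetitionPenalty : logit * repetitionPenalty;
        // Presence penalty (additive): flat subtraction once a token appears.
        logit -= presencePenalty;
        // Frequency penalty (additive): scaled by the occurrence count.
        logit -= frequencyPenalty * static_cast<float>(count);
    }
}
```

Because the repetition penalty is multiplicative and the other two are additive, the three adjustments compose cleanly, which is what makes enabling them together well-defined.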
