Question regarding the nearly double GPU memory consumption. #241
-
When I loaded vicuna-7b-v1.3 with the default config, I found that GPU memory usage was around 23 GB, but FastChat's README reports a lower figure.
Anyway, the throughput is much faster. I just want to confirm whether the GPU memory usage in this scenario is normal, and if it is, why the consumption is almost doubled. Thanks a lot.
-
I just read the description of the LLM class. Is this extra memory usage because of the KV cache?
-
Thank you for bringing this up. Yes, the extra memory usage comes from the KV cache. vLLM pre-allocates and reserves the maximum possible amount of GPU memory for KV cache blocks, and the KV cache generated during inference is written into these reserved blocks. You can limit the GPU memory usage by setting the parameter gpu_memory_utilization.
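For example, a minimal sketch of capping the memory budget when constructing the LLM (the 0.5 fraction is just an illustration; the right value depends on your model and workload):

```python
from vllm import LLM, SamplingParams

# Cap vLLM's total GPU memory usage (weights + activations + KV cache)
# at ~50% of the device instead of the default 0.9. A lower fraction
# means fewer pre-allocated KV cache blocks, so fewer concurrent
# sequences / shorter contexts fit, but less memory is reserved.
llm = LLM(
    model="lmsys/vicuna-7b-v1.3",
    gpu_memory_utilization=0.5,
)

outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```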
-
How do I set gpu_memory_utilization on each GPU?
-
How can I allocate as much memory as possible for the KV cache? I am trying to run long-context inference with tp_size=8 on a 2B, 32-expert model. When I monitor GPU usage with 'watch -n 1 nvidia-smi', only 50 GB of the 80 GB per GPU is used, but when I increase max_model_len, OOM is encountered.
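No answer is recorded in the thread, but as a rough sketch under the setup described above (the model path and the specific numbers are placeholders): raising gpu_memory_utilization lets vLLM reserve more of each GPU for KV cache blocks, while max_model_len bounds the longest sequence the cache has to accommodate.

```python
from vllm import LLM

# gpu_memory_utilization is a per-GPU fraction; raising it from the
# default 0.9 toward 0.95 lets vLLM reserve more of each 80 GB card
# for KV cache blocks. max_model_len caps the longest sequence that
# must fit in the cache, so pick it to match the actual workload.
llm = LLM(
    model="/path/to/2b-32experts",   # placeholder path
    tensor_parallel_size=8,
    gpu_memory_utilization=0.95,
    max_model_len=32768,             # placeholder value
)
```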