Question regarding the nearly double GPU memory consumption. #241
-
When I loaded vicuna-7b-v1.3 with the default config, I found that GPU memory usage was around 23 GB, but FastChat's README reports a lower figure.
Anyway, the throughput is much faster. I just want to confirm whether the GPU memory usage in this scenario is normal, and if it is, why the consumption is almost doubled. Thanks a lot.
-
I just read the description of the LLM class. Is this extra memory usage because of the KV cache?
-
Thank you for bringing this up. Yes, the extra memory usage comes from the KV cache. vLLM pre-allocates and reserves the maximum possible amount of GPU memory for KV cache blocks, and the KV cache generated during inference is written into these reserved blocks. You can limit the GPU memory usage by setting the parameter gpu_memory_utilization.
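For example, a minimal sketch of capping the memory budget when constructing the LLM (the 0.5 fraction is just an illustration; the right value depends on your model and workload):

```python
from vllm import LLM, SamplingParams

# Cap vLLM's total GPU memory usage (weights + activations + KV cache)
# at ~50% of the device instead of the default 0.9. A lower fraction
# means fewer pre-allocated KV cache blocks, so fewer concurrent
# sequences / shorter contexts fit, but less memory is reserved.
llm = LLM(
    model="lmsys/vicuna-7b-v1.3",
    gpu_memory_utilization=0.5,
)

outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```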
-
How do I set gpu_memory_utilization on each GPU?
-
How can I allocate as much memory as possible for the KV cache? I am trying to run long-context inference with tp_size=8 on a 2B, 32-expert model. When I monitor GPU usage with 'watch -n 1 nvidia-smi', only 50 GB of the 80 GB per GPU is used, but when I increase max_model_len, OOM is encountered.
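No answer is recorded in the thread, but as a rough sketch under the setup described above (the model path and the specific numbers are placeholders): raising gpu_memory_utilization lets vLLM reserve more of each GPU for KV cache blocks, while max_model_len bounds the longest sequence the cache has to accommodate.

```python
from vllm import LLM

# gpu_memory_utilization is a per-GPU fraction; raising it from the
# default 0.9 toward 0.95 lets vLLM reserve more of each 80 GB card
# for KV cache blocks. max_model_len caps the longest sequence that
# must fit in the cache, so pick it to match the actual workload.
llm = LLM(
    model="/path/to/2b-32experts",   # placeholder path
    tensor_parallel_size=8,
    gpu_memory_utilization=0.95,
    max_model_len=32768,             # placeholder value
)
```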