.. {#llm_inference}

LLM Inference with Hugging Face and Optimum Intel
=====================================================

The steps below show how to load LLMs from Hugging Face using Optimum Intel and run inference with them.
They also show how to convert models into OpenVINO IR format so they can be optimized
by NNCF and used with other OpenVINO tools.

Prerequisites
############################################################

* Create a Python environment by following the instructions on the :doc:`Install OpenVINO PIP <openvino_docs_install_guides_overview>` page.
* Install the necessary dependencies for Optimum Intel:

.. code-block:: console

   pip install optimum[openvino,nncf]

Loading a Hugging Face Model to Optimum Intel
############################################################

To start using OpenVINO as a backend for Hugging Face, change the original Hugging Face code in two places:

.. code-block:: diff

   -from transformers import AutoModelForCausalLM
   +from optimum.intel import OVModelForCausalLM

    model_id = "meta-llama/Llama-2-7b-chat-hf"
   -model = AutoModelForCausalLM.from_pretrained(model_id)
   +model = OVModelForCausalLM.from_pretrained(model_id, export=True)

Instead of using ``AutoModelForCausalLM`` from the Hugging Face transformers library,
switch to ``OVModelForCausalLM`` from the optimum.intel library. This change enables
you to use OpenVINO's optimization features. You may also use other AutoModel types,
such as ``OVModelForSeq2SeqLM``, though this guide focuses on CausalLM.

By setting the parameter ``export=True``, the model is converted to OpenVINO IR format on the fly.

After that, you can call the ``save_pretrained()`` method to save the model to a folder in the
OpenVINO Intermediate Representation format and use it later:

.. code-block:: python

   model.save_pretrained("ov_model")

This will create a new folder called `ov_model` with the LLM in OpenVINO IR format inside.
You can provide a different directory name instead of `ov_model`.

Once the model is saved, you can load it with the following command:

.. code-block:: python

   model = OVModelForCausalLM.from_pretrained("ov_model")

Converting a Hugging Face Model to OpenVINO IR
############################################################

The optimum-cli tool allows you to convert models from Hugging Face to
the OpenVINO IR format:

.. code-block:: console

   optimum-cli export openvino --model <MODEL_NAME> <NEW_MODEL_NAME>

If you want to convert the `Llama 2` model from Hugging Face to an OpenVINO IR
model and name it `ov_llama_2`, the command would look like this:

.. code-block:: console

   optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf ov_llama_2

In this case, you can load the converted model in OpenVINO representation directly from the disk:

.. code-block:: python

   model_id = "ov_llama_2"
   model = OVModelForCausalLM.from_pretrained(model_id)

The Optimum-Intel API also provides out-of-the-box model optimization through weight compression
using NNCF, which substantially reduces the model footprint and inference latency:

.. code-block:: python

   model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)

   # or if the model has already been converted
   model = OVModelForCausalLM.from_pretrained(model_path, load_in_8bit=True)

   # save the model after optimization
   model.save_pretrained(optimized_model_path)


Weight compression is also available in the CLI interface via the ``--int8`` option.

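For example, the `Llama 2` export shown earlier could be repeated with 8-bit weight compression from
the command line (a sketch assuming the ``--int8`` flag described above; the output directory name is arbitrary):

.. code-block:: console

   optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --int8 ov_llama_2_int8
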
.. note::

   8-bit weight compression is enabled by default for models larger than 1 billion parameters.

`Optimum Intel <https://huggingface.co/docs/optimum/intel/inference>`__ also provides 4-bit weight
compression with the ``OVWeightQuantizationConfig`` class to control weight quantization parameters.

.. code-block:: python

   from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
   import nncf

   model = OVModelForCausalLM.from_pretrained(
       model_id,
       export=True,
       quantization_config=OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, dataset="ptb"),
   )

   # or if the model has already been converted
   model = OVModelForCausalLM.from_pretrained(
       model_path,
       quantization_config=OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, dataset="ptb"),
   )

   # save the model after optimization
   model.save_pretrained(optimized_model_path)


The optimized model can be saved as usual with a call to ``save_pretrained()``.
For more details on compression options, refer to the :doc:`weight compression guide <weight_compression>`.

.. note::

   OpenVINO also supports 4-bit models from the Hugging Face `Transformers <https://github.com/huggingface/transformers>`__
   library optimized with `GPTQ <https://github.com/PanQiWei/AutoGPTQ>`__. In this case, there is no need for an
   additional model optimization step because model conversion will automatically preserve the INT4 optimization
   results, allowing model inference to benefit from it.

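For example, a GPTQ-quantized model from the Hugging Face Hub can be loaded and exported the same way as any
other model (a sketch; the model name below is only an illustration):

.. code-block:: python

   # the INT4 GPTQ weights are preserved when the model is converted to OpenVINO IR
   model = OVModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ", export=True)
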
Below are some examples of using Optimum-Intel for model conversion and inference:

* `Instruction following using Databricks Dolly 2.0 and OpenVINO <https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/240-dolly-2-instruction-following/240-dolly-2-instruction-following.ipynb>`__
* `Create an LLM-powered Chatbot using OpenVINO <https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/254-llm-chatbot/254-llm-chatbot.ipynb>`__

.. note::

   Optimum-Intel can be used for other generative AI models. See `Stable Diffusion v2.1 using Optimum-Intel OpenVINO <https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/236-stable-diffusion-v2/236-stable-diffusion-v2-optimum-demo.ipynb>`__ and `Image generation with Stable Diffusion XL and OpenVINO <https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/248-stable-diffusion-xl/248-stable-diffusion-xl.ipynb>`__ for more examples.

Inference Example
############################################################

For Hugging Face models, the ``AutoTokenizer`` class is used to prepare model inputs, and text is generated
either with the model's ``generate()`` method or with the Transformers ``pipeline`` function. The example
below uses ``generate()`` directly:

.. code-block:: python

   from optimum.intel import OVModelForCausalLM
   # new imports for inference
   from transformers import AutoTokenizer

   # load the model
   model_id = "meta-llama/Llama-2-7b-chat-hf"
   model = OVModelForCausalLM.from_pretrained(model_id, export=True)

   # inference
   prompt = "The weather is:"
   tokenizer = AutoTokenizer.from_pretrained(model_id)
   inputs = tokenizer(prompt, return_tensors="pt")

   outputs = model.generate(**inputs, max_new_tokens=50)
   print(tokenizer.decode(outputs[0], skip_special_tokens=True))

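The same model and tokenizer can also be wrapped in a Transformers ``pipeline``. A minimal sketch, assuming
the ``text-generation`` task and reusing the objects created above:

.. code-block:: python

   from transformers import pipeline

   # the OpenVINO model plugs into the standard Transformers pipeline API
   pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
   print(pipe("The weather is:", max_new_tokens=50)[0]["generated_text"])
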
.. note::

   Converting LLMs to OpenVINO IR on the fly every time is a resource-intensive task.
   It is good practice to convert the model once, save it in a folder, and load it from there for inference.

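A minimal sketch of that pattern, reusing the `ov_model` directory name from the earlier example (any
directory name works):

.. code-block:: python

   # convert and save once
   model = OVModelForCausalLM.from_pretrained(model_id, export=True)
   model.save_pretrained("ov_model")
   tokenizer.save_pretrained("ov_model")

   # later, load the already converted model for inference
   model = OVModelForCausalLM.from_pretrained("ov_model")
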
By default, inference will run on CPU. To select a different inference device, for example, GPU,
add ``device="GPU"`` to the ``from_pretrained()`` call. To switch to a different device after
the model has been loaded, use the ``.to()`` method. The device naming convention is the same
as in the OpenVINO native API:

.. code-block:: python

   model.to("GPU")

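For example, a sketch of loading the previously saved model directly on GPU through the ``device`` argument
(the `ov_model` directory name is assumed from the earlier example):

.. code-block:: python

   # compile the model for GPU at load time instead of calling .to("GPU") afterwards
   model = OVModelForCausalLM.from_pretrained("ov_model", device="GPU")
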
Enabling OpenVINO Runtime Optimizations
############################################################

OpenVINO runtime provides a set of optimizations for more efficient LLM inference. This includes **Dynamic quantization**
of activations of 4/8-bit quantized MatMuls and **KV-cache quantization**.

* **Dynamic quantization** enables quantization of activations of MatMul operations that have 4- or 8-bit quantized
  weights (see :doc:`LLM Weight Compression <weight_compression>`). It improves inference latency and throughput of
  LLMs, though it may cause an insignificant deviation in generation accuracy. Quantization is performed in a
  group-wise manner, with a configurable group size, which means that values in a group share quantization
  parameters. Larger group sizes lead to faster inference but lower accuracy. Recommended group size values are
  ``32``, ``64``, or ``128``. To enable Dynamic quantization, use the corresponding inference property as follows:

  .. code-block:: python

     model = OVModelForCausalLM.from_pretrained(
         model_path,
         ov_config={"DYNAMIC_QUANTIZATION_GROUP_SIZE": "32", "PERFORMANCE_HINT": "LATENCY"}
     )

* **KV-cache quantization** allows lowering the precision of the Key and Value cache in LLMs. This helps reduce
  memory consumption during inference, improving latency and throughput. The KV-cache can be quantized into the
  following precisions: ``u8``, ``bf16``, ``f16``. If ``u8`` is used, KV-cache quantization is also applied in a
  group-wise manner and therefore uses the ``DYNAMIC_QUANTIZATION_GROUP_SIZE`` value if it is defined. Otherwise,
  a group size of ``32`` is used by default. KV-cache quantization can be enabled as follows:

  .. code-block:: python

     model = OVModelForCausalLM.from_pretrained(
         model_path,
         ov_config={"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32", "PERFORMANCE_HINT": "LATENCY"}
     )

.. note::

   Currently, both Dynamic quantization and KV-cache quantization are available for the CPU device.

Working with Models Tuned with LoRA
#########################################

Low-rank Adaptation (LoRA) is a popular method to tune Generative AI models to a downstream task
or custom data. However, it requires some extra steps for efficient deployment using
the Hugging Face API. Namely, the trained adapters should be fused into the baseline model to
avoid extra computation. This is how it can be done for LLMs:

.. code-block:: python

   from transformers import AutoModelForCausalLM
   from peft import PeftModelForCausalLM

   model_id = "meta-llama/Llama-2-7b-chat-hf"
   lora_adaptor = "./lora_adaptor"

   # load the base model and attach the trained LoRA adapter
   model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True)
   model = PeftModelForCausalLM.from_pretrained(model, lora_adaptor)

   # fuse the adapter weights into the base model and save the result
   model.merge_and_unload()
   model.get_base_model().save_pretrained("fused_lora_model")


Now the model can be converted to OpenVINO using the Optimum Intel Python API or the CLI interface mentioned above.

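For example, a sketch of loading and exporting the fused model saved above with the Python API:

.. code-block:: python

   # the fused model is a regular Transformers model, so it can be exported like any other
   model = OVModelForCausalLM.from_pretrained("fused_lora_model", export=True)
   model.save_pretrained("ov_fused_lora_model")
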

Additional Resources
#####################

* `Optimum Intel documentation <https://huggingface.co/docs/optimum/intel/inference>`__
* :doc:`LLM Weight Compression <weight_compression>`
* `Neural Network Compression Framework <https://github.com/openvinotoolkit/nncf>`__
* `Hugging Face Transformers <https://huggingface.co/docs/transformers/index>`__
* `Generation with LLMs <https://huggingface.co/docs/transformers/llm_tutorial>`__
* `Pipeline class <https://huggingface.co/docs/transformers/main_classes/pipelines>`__
* `GenAI Pipeline Repository <https://github.com/openvinotoolkit/openvino.genai>`__
* `OpenVINO Tokenizers <https://github.com/openvinotoolkit/openvino_contrib/tree/master/modules/custom_operations/user_ie_extensions/tokenizer/python>`__