
Commit 6027732

Merge branch 'master' into issue#22945
2 parents cf2fe3f + 090bf92 commit 6027732

File tree

393 files changed, +19681 -3068 lines changed

.github/workflows/build_doc.yml (+1 -1)

@@ -21,7 +21,7 @@ jobs:
          lfs: 'true'

      - name: Install apt-get dependencies
-        uses: awalsh128/cache-apt-pkgs-action@v1.4.1
+        uses: awalsh128/cache-apt-pkgs-action@v1.4.2
        with:
          packages: graphviz texlive liblua5.2-0 libclang1-9 libclang-cpp9
          version: 3.0

.github/workflows/code_snippets.yml (+1 -1)

@@ -30,7 +30,7 @@ jobs:
          submodules: 'true'

      - name: Install OpenCL
-        uses: awalsh128/cache-apt-pkgs-action@v1.4.1
+        uses: awalsh128/cache-apt-pkgs-action@v1.4.2
        if: runner.os == 'Linux'
        with:
          packages: ocl-icd-opencl-dev opencl-headers

.github/workflows/job_samples_tests.yml (+1 -1)

@@ -124,7 +124,7 @@ jobs:

          source ${INSTALL_DIR}/setupvars.sh

-          PYTHONCOERCECLOCALE=warn python3 -bb -W error -X dev -X warn_default_encoding -m pytest $INSTALL_TEST_DIR/smoke_tests \
+          PYTHONCOERCECLOCALE=warn python3 -bb -W error -X dev -m pytest $INSTALL_TEST_DIR/smoke_tests \
            --junitxml=$INSTALL_TEST_DIR/TEST-SamplesSmokeTests.xml

      - name: Upload Test Results

.github/workflows/linux.yml (+13 -8)

@@ -318,7 +318,7 @@ jobs:

  Conformance:
    needs: [ Build, Smart_CI ]
-    timeout-minutes: ${{ matrix.TEST_TYPE == 'API' && 5 || 30 }}
+    timeout-minutes: ${{ matrix.TEST_TYPE == 'API' && 5 || 20 }}
    defaults:
      run:
        shell: bash

@@ -511,18 +511,23 @@ jobs:
      runner: 'ubuntu-20.04-8-cores'
      model_scope: 'precommit'

-  TensorFlow_Models_Tests_Nightly:
-    name: TensorFlow Models tests
+  TensorFlow_Models_Tests_Nightly_TF_HUB:
+    name: TensorFlow TF Hub Models tests
+    if: ${{ github.event_name == 'schedule' }}
+    needs: [ Build, Smart_CI, Openvino_tokenizers ]
+    uses: ./.github/workflows/job_tensorflow_models_tests.yml
+    with:
+      runner: 'ubuntu-20.04-16-cores'
+      model_scope: 'nightly_tf_hub'
+
+  TensorFlow_Models_Tests_Nightly_HF:
+    name: TensorFlow Hugging Face Models tests
    if: ${{ github.event_name == 'schedule' }}
    needs: [ Build, Smart_CI, Openvino_tokenizers ]
-    strategy:
-      max-parallel: 2
-      matrix:
-        MODEL_SCOPE: ['nightly_hf', 'nightly_tf_hub']
    uses: ./.github/workflows/job_tensorflow_models_tests.yml
    with:
      runner: 'ubuntu-20.04-16-cores'
-      model_scope: ${{ matrix.MODEL_SCOPE }}
+      model_scope: 'nightly_hf'

  # TODO: Switch back to self-hosted runners
  # container:

.github/workflows/linux_arm64.yml (+6 -1)

@@ -116,8 +116,13 @@ jobs:
      - name: Install build dependencies
        run: |
          bash ${OPENVINO_REPO}/install_build_dependencies.sh
+
          # default-jdk - Java API
-          apt install --assume-yes --no-install-recommends default-jdk
+          apt install --assume-yes --no-install-recommends default-jdk gcc-10 g++-10
+
+          # Set gcc-10 as a default one
+          update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 30
+          update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-10 30

      - name: Install sccache
        uses: mozilla-actions/sccache-action@v0.0.4

docs/articles_en/learn-openvino.rst (+2 -2)

@@ -16,7 +16,7 @@ Learn OpenVINO

   Interactive Tutorials (Python) <tutorials>
   Sample Applications (Python & C++) <openvino_docs_OV_UG_Samples_Overview>
-   Generative AI Optimization and Deployment <gen_ai_guide>
+   Large Language Models Inference Guide <native_vs_hugging_face_api>


This section will help you get a hands-on experience with OpenVINO even if you are just starting

@@ -30,5 +30,5 @@ as well as an experienced user.
| :doc:`OpenVINO Samples <openvino_docs_OV_UG_Samples_Overview>`
| The OpenVINO samples (Python and C++) are simple console applications that show how to use specific OpenVINO API features. They can assist you in executing tasks such as loading a model, running inference, querying particular device capabilities, etc.

-| :doc:`Optimize and Deploy Generative AI Models <gen_ai_guide>`
+| :doc:`Large Language Models in OpenVINO <native_vs_hugging_face_api>`
| Detailed information on how OpenVINO accelerates Generative AI use cases and what models it supports. This tutorial provides instructions for running Generative AI models using Hugging Face Optimum Intel and Native OpenVINO APIs.

New documentation file added by this commit (+255 lines); its full content follows:

.. {#llm_inference}

LLM Inference with Hugging Face and Optimum Intel
=====================================================

The steps below show how to load and infer LLMs from Hugging Face using Optimum Intel.
They also show how to convert models into OpenVINO IR format so they can be optimized
by NNCF and used with other OpenVINO tools.

Prerequisites
############################################################

* Create a Python environment by following the instructions on the :doc:`Install OpenVINO PIP <openvino_docs_install_guides_overview>` page.
* Install the necessary dependencies for Optimum Intel:

.. code-block:: console

   pip install optimum[openvino,nncf]


Loading a Hugging Face Model to Optimum Intel
############################################################

To start using OpenVINO as a backend for Hugging Face, change the original Hugging Face code in two places:

.. code-block:: diff

   -from transformers import AutoModelForCausalLM
   +from optimum.intel import OVModelForCausalLM

   model_id = "meta-llama/Llama-2-7b-chat-hf"
   -model = AutoModelForCausalLM.from_pretrained(model_id)
   +model = OVModelForCausalLM.from_pretrained(model_id, export=True)


Instead of using ``AutoModelForCausalLM`` from the Hugging Face transformers library,
switch to ``OVModelForCausalLM`` from the optimum.intel library. This change enables
you to use OpenVINO's optimization features. You may also use other AutoModel types,
such as ``OVModelForSeq2SeqLM``, though this guide focuses on CausalLM.
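
For example, a sequence-to-sequence model can be loaded and run the same way. The snippet
below is a minimal sketch: the checkpoint name is only an illustration, and any seq2seq
model from Hugging Face can be substituted.

.. code-block:: python

   from optimum.intel import OVModelForSeq2SeqLM
   from transformers import AutoTokenizer

   # example seq2seq checkpoint; replace it with the model you actually use
   model_id = "google/flan-t5-small"
   model = OVModelForSeq2SeqLM.from_pretrained(model_id, export=True)

   tokenizer = AutoTokenizer.from_pretrained(model_id)
   inputs = tokenizer("Translate to German: How old are you?", return_tensors="pt")
   outputs = model.generate(**inputs, max_new_tokens=30)
   print(tokenizer.decode(outputs[0], skip_special_tokens=True))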

By setting the parameter ``export=True``, the model is converted to OpenVINO IR format on the fly.

After that, you can call the ``save_pretrained()`` method to save the model in the OpenVINO
Intermediate Representation to a folder and use it further.

.. code-block:: python

   model.save_pretrained("ov_model")

This will create a new folder called ``ov_model`` with the LLM in OpenVINO IR format inside.
You can change the folder name and provide another model directory instead of ``ov_model``.

Once the model is saved, you can load it with the following command:

.. code-block:: python

   model = OVModelForCausalLM.from_pretrained("ov_model")

Converting a Hugging Face Model to OpenVINO IR
############################################################

The optimum-cli tool allows you to convert models from Hugging Face to
the OpenVINO IR format:

.. code-block:: console

   optimum-cli export openvino --model <MODEL_NAME> <NEW_MODEL_NAME>

If you want to convert the `Llama 2` model from Hugging Face to an OpenVINO IR
model and name it ``ov_llama_2``, the command would look like this:

.. code-block:: console

   optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf ov_llama_2

In this case, you can load the converted model in OpenVINO representation directly from the disk:

.. code-block:: python

   model_id = "ov_llama_2"
   model = OVModelForCausalLM.from_pretrained(model_id)


The Optimum Intel API also provides out-of-the-box model optimization through weight compression
using NNCF, which substantially reduces the model footprint and inference latency:

.. code-block:: python

   model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)

   # or if the model was already converted
   model = OVModelForCausalLM.from_pretrained(model_path, load_in_8bit=True)

   # save the model after optimization
   model.save_pretrained(optimized_model_path)


Weight compression is applied by default to models larger than one billion parameters and is
also available for the CLI interface as the ``--int8`` option.
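
For example, a command-line export with 8-bit weight compression could look like the sketch
below, assuming the ``--int8`` option described above; the output folder name is arbitrary:

.. code-block:: console

   optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --int8 ov_llama_2_int8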

.. note::

   8-bit weight compression is enabled by default for models larger than 1 billion parameters.

`Optimum Intel <https://huggingface.co/docs/optimum/intel/inference>`__ also provides 4-bit weight
compression with the ``OVWeightQuantizationConfig`` class to control weight quantization parameters.

.. code-block:: python

   from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
   import nncf

   model = OVModelForCausalLM.from_pretrained(
       model_id,
       export=True,
       quantization_config=OVWeightQuantizationConfig(bits=4, asym=True, ratio=0.8, dataset="ptb"),
   )

   # or if the model was already converted
   model = OVModelForCausalLM.from_pretrained(
       model_path,
       quantization_config=OVWeightQuantizationConfig(bits=4, asym=True, ratio=0.8, dataset="ptb"),
   )

   # save the model after optimization
   model.save_pretrained(optimized_model_path)


The optimized model can be saved as usual with a call to ``save_pretrained()``.
For more details on compression options, refer to the :doc:`weight compression guide <weight_compression>`.

.. note::

   OpenVINO also supports 4-bit models from the Hugging Face `Transformers <https://github.com/huggingface/transformers>`__ library optimized
   with `GPTQ <https://github.com/PanQiWei/AutoGPTQ>`__. In this case, there is no need for an additional model optimization step because model conversion will automatically preserve the INT4 optimization results, allowing model inference to benefit from it.

Below are some examples of using Optimum-Intel for model conversion and inference:

* `Instruction following using Databricks Dolly 2.0 and OpenVINO <https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/240-dolly-2-instruction-following/240-dolly-2-instruction-following.ipynb>`__
* `Create an LLM-powered Chatbot using OpenVINO <https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/254-llm-chatbot/254-llm-chatbot.ipynb>`__

.. note::

   Optimum-Intel can be used for other generative AI models. See `Stable Diffusion v2.1 using Optimum-Intel OpenVINO <https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/236-stable-diffusion-v2/236-stable-diffusion-v2-optimum-demo.ipynb>`__ and `Image generation with Stable Diffusion XL and OpenVINO <https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/248-stable-diffusion-xl/248-stable-diffusion-xl.ipynb>`__ for more examples.

Inference Example
############################################################

For Hugging Face models, the ``AutoTokenizer`` and the ``pipeline`` function are used to create
an inference pipeline. This setup allows for easy text processing and model interaction:

.. code-block:: python

   from optimum.intel import OVModelForCausalLM
   # new imports for inference
   from transformers import AutoTokenizer

   # load the model
   model_id = "meta-llama/Llama-2-7b-chat-hf"
   model = OVModelForCausalLM.from_pretrained(model_id, export=True)

   # inference
   prompt = "The weather is:"
   tokenizer = AutoTokenizer.from_pretrained(model_id)
   inputs = tokenizer(prompt, return_tensors="pt")

   outputs = model.generate(**inputs, max_new_tokens=50)
   print(tokenizer.decode(outputs[0], skip_special_tokens=True))
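
The same model and tokenizer can also be wrapped into the Hugging Face ``pipeline`` helper
mentioned above. A minimal sketch, reusing the objects created in the example:

.. code-block:: python

   from transformers import pipeline

   # reuse the OVModelForCausalLM instance and tokenizer created above
   pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
   print(pipe(prompt, max_new_tokens=50)[0]["generated_text"])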

.. note::

   Converting LLMs to OpenVINO IR on the fly every time is a resource-intensive task.
   It is a good practice to convert the model once, save it in a folder, and load it for inference.

By default, inference will run on CPU. To select a different inference device, for example, GPU,
add ``device="GPU"`` to the ``from_pretrained()`` call. To switch to a different device after
the model has been loaded, use the ``.to()`` method. The device naming convention is the same
as in the OpenVINO native API:

.. code-block:: python

   model.to("GPU")
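
Alternatively, the target device can be set at load time, as described above. A minimal sketch,
assuming a system with a supported GPU plugin:

.. code-block:: python

   model = OVModelForCausalLM.from_pretrained(model_id, export=True, device="GPU")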

Enabling OpenVINO Runtime Optimizations
############################################################

The OpenVINO runtime provides a set of optimizations for more efficient LLM inference. This includes **Dynamic quantization** of activations of 4/8-bit quantized MatMuls and **KV-cache quantization**.

* **Dynamic quantization** enables quantization of activations of MatMul operations that have 4 or 8-bit quantized weights (see :doc:`LLM Weight Compression <weight_compression>`).
  It improves inference latency and throughput of LLMs, though it may cause an insignificant deviation in generation accuracy. Quantization is performed in a
  group-wise manner, with a configurable group size, meaning that values in a group share quantization parameters. Larger group sizes lead to faster inference but lower accuracy. Recommended group size values are ``32``, ``64``, or ``128``. To enable Dynamic quantization, use the corresponding
  inference property as follows:

  .. code-block:: python

     model = OVModelForCausalLM.from_pretrained(
         model_path,
         ov_config={"DYNAMIC_QUANTIZATION_GROUP_SIZE": "32", "PERFORMANCE_HINT": "LATENCY"}
     )

* **KV-cache quantization** makes it possible to lower the precision of the Key and Value cache in LLMs. This helps reduce memory consumption during inference, improving latency and throughput. The KV-cache can be quantized into the following precisions:
  ``u8``, ``bf16``, ``f16``. If ``u8`` is used, KV-cache quantization is also applied in a group-wise manner. Thus, it can use the ``DYNAMIC_QUANTIZATION_GROUP_SIZE`` value if defined.
  Otherwise, a group size of ``32`` is used by default. KV-cache quantization can be enabled as follows:

  .. code-block:: python

     model = OVModelForCausalLM.from_pretrained(
         model_path,
         ov_config={"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32", "PERFORMANCE_HINT": "LATENCY"}
     )

.. note::

   Currently, both Dynamic quantization and KV-cache quantization are available for the CPU device.


Working with Models Tuned with LoRA
#########################################

Low-rank Adaptation (LoRA) is a popular method to tune Generative AI models to a downstream task
or custom data. However, it requires some extra steps for efficient deployment using
the Hugging Face API. Namely, the trained adapters should be fused into the baseline model to
avoid extra computation. This is how it can be done for LLMs:

.. code-block:: python

   from transformers import AutoModelForCausalLM
   from peft import PeftModelForCausalLM

   model_id = "meta-llama/Llama-2-7b-chat-hf"
   lora_adaptor = "./lora_adaptor"

   model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True)
   model = PeftModelForCausalLM.from_pretrained(model, lora_adaptor)
   model.merge_and_unload()
   model.get_base_model().save_pretrained("fused_lora_model")


Now the model can be converted to OpenVINO using the Optimum Intel Python API or the CLI interface mentioned above.
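
For example, the fused model saved above could be exported from the command line. A sketch,
reusing the folder name from the snippet above; the output folder name is arbitrary:

.. code-block:: console

   optimum-cli export openvino --model ./fused_lora_model ov_fused_lora_model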

Additional Resources
#####################

* `Optimum Intel documentation <https://huggingface.co/docs/optimum/intel/inference>`__
* :doc:`LLM Weight Compression <weight_compression>`
* `Neural Network Compression Framework <https://github.com/openvinotoolkit/nncf>`__
* `Hugging Face Transformers <https://huggingface.co/docs/transformers/index>`__
* `Generation with LLMs <https://huggingface.co/docs/transformers/llm_tutorial>`__
* `Pipeline class <https://huggingface.co/docs/transformers/main_classes/pipelines>`__
* `GenAI Pipeline Repository <https://github.com/openvinotoolkit/openvino.genai>`__
* `OpenVINO Tokenizers <https://github.com/openvinotoolkit/openvino_contrib/tree/master/modules/custom_operations/user_ie_extensions/tokenizer/python>`__
