
Commit 79cd7e4

add qwen2.5vl model (#2764)
CVS-162142
1 parent b1e4ac7 commit 79cd7e4

4 files changed, +907 -0 lines changed


.ci/spellcheck/.pyspelling.wordlist.txt

+5
@@ -206,6 +206,7 @@ dGPU
 dGPUs
 DialoGPT
 diarization
+digitalized
 Diffusers
 diffusers
 dimensionality
@@ -395,6 +396,7 @@ intervaling
 im
 img
 ip
+IPs
 ir
 IRs
 iteratively
@@ -550,6 +552,7 @@ mpnet
 mpt
 MPT
 MRPC
+mRoPE
 MTVQA
 multiarchitecture
 Multiclass
@@ -807,6 +810,7 @@ Rinna
 rinna
 RLHF
 RMBG
+RMSNorm
 RoBERTa
 roberta
 ROI
@@ -1020,6 +1024,7 @@ vits
 VITS
 vitt
 VL
+VL’s
 vl
 vlm
 VLM

notebooks/qwen2.5-vl/README.md

+52
@@ -0,0 +1,52 @@
# Visual-language assistant with Qwen2.5VL and OpenVINO

Qwen2.5-VL is the latest addition to the QwenVL series of multimodal large language models.

![](https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2.5-vl-Capybara.png)

**Key Enhancements of Qwen2.5VL:**
* **Understand things visually**: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
* **Being agentic**: Qwen2.5-VL directly plays as a visual agent that can reason and dynamically direct tools; it is capable of computer use and phone use.
* **Understanding long videos and capturing events**: Qwen2.5-VL can comprehend videos of over 1 hour, and it has a new ability to capture events by pinpointing the relevant video segments.
* **Capable of visual localization in different formats**: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes (an illustrative output is sketched after this list).
* **Generating structured outputs**: for data like scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting usage in finance, commerce, and other fields.
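
As an illustration of the stable JSON output mentioned above, a grounding request such as "Detect all cats in the image and output their locations in JSON" might return something like the following (hypothetical values; the `bbox_2d`/`label` keys follow the grounding examples in Qwen's model card and cookbook):

```python
# Hypothetical grounding output with absolute pixel coordinates [x1, y1, x2, y2]:
[
    {"bbox_2d": [135, 74, 412, 490], "label": "cat"},
    {"bbox_2d": [520, 118, 790, 486], "label": "cat"},
]
```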

**Model Capabilities**:
* **World-wide Image Recognition**. Qwen2.5-VL has significantly enhanced its general image recognition capabilities, expanding the categories of images to an ultra-large number. It covers not only plants, animals, and landmarks of famous mountains and rivers, but also IPs from film and TV series, as well as a wide variety of products.
* **Precise Object Grounding**. Qwen2.5-VL utilizes bounding boxes and point-based representations for grounding, enabling hierarchical positioning and standardized JSON output. This enhanced localization capability serves as a foundation for visual reasoning.
* **Enhanced Text Recognition and Understanding**. Qwen2.5-VL has upgraded its OCR capabilities to a new level, with enhanced multi-scenario, multi-language and multi-orientation text recognition and text localization performance. Furthermore, it has been significantly enhanced in information extraction to meet the growing digitalized and intelligent demands in areas such as qualification review and financial business.
* **Powerful Document Parsing**. Qwen2.5-VL introduces a unique document parsing format called QwenVL HTML, which extracts layout information based on HTML. QwenVL HTML can perform document parsing in various scenarios, such as magazines, research papers, web pages, and even mobile screenshots.
* **Enhanced Video Comprehension Ability**. Qwen2.5-VL's video comprehension capabilities have been comprehensively upgraded. In terms of temporal processing, dynamic frame rate (FPS) training and absolute time encoding technology have been introduced. As a result, the model not only supports the understanding of ultra-long videos on an hourly scale but also achieves second-level event localization. It can accurately comprehend content from long videos spanning hours, search for specific events within videos, and summarize key points from different time segments, allowing users to quickly and efficiently extract crucial information embedded in videos.

**Model Architecture Details:**

Compared with Qwen2VL, the Qwen2.5VL architecture receives the following updates:
* **Dynamic Resolution and Frame Rate Training for Video Understanding**
Dynamic resolution is extended to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, mRoPE is updated in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments (see the sketch after this list).
![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-VL/qwen2.5vl_arc.jpeg)
* **Streamlined and Efficient Vision Encoder**
Qwen2.5VL enhances both training and inference speeds by strategically implementing window attention in the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
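
To make the absolute-time alignment concrete, here is a minimal, hypothetical sketch (not the actual Qwen2.5-VL implementation) of how temporal position IDs can be tied to wall-clock timestamps rather than frame indices, so that clips sampled at different FPS share one time grid:

```python
# Illustrative only: align temporal position IDs to absolute time, not frame index.
def temporal_ids(num_frames: int, fps: float, ticks_per_second: float = 2.0) -> list[int]:
    """Map each sampled frame's timestamp (in seconds) to an integer temporal position ID."""
    return [round(frame / fps * ticks_per_second) for frame in range(num_frames)]

print(temporal_ids(4, fps=2.0))  # frames at 0.0s, 0.5s, 1.0s, 1.5s -> [0, 1, 2, 3]
print(temporal_ids(2, fps=1.0))  # frames at 0.0s, 1.0s -> [0, 2]; 1.0s maps to ID 2 in both clips
```

Under a scheme like this the gap between IDs encodes elapsed time rather than frame count, which is what lets the model reason about speed and pinpoint specific moments.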

More details about the model can be found in the [model card](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct), [blog post](https://qwenlm.github.io/blog/qwen2.5-vl/), [technical report](https://arxiv.org/abs/2502.13923) and the original [repo](https://github.com/QwenLM/Qwen2.5-VL).

In this tutorial we consider how to convert and optimize the Qwen2.5VL model for creating a multimodal chatbot using [Optimum Intel](https://github.com/huggingface/optimum-intel). Additionally, we demonstrate how to apply model optimization techniques like weight compression using [NNCF](https://github.com/openvinotoolkit/nncf).
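
For example, the conversion and weight-compression step might look roughly like the sketch below (assumptions, not code from this commit: the use of `OVModelForVisualCausalLM` with `export=True`, the NNCF-backed `OVWeightQuantizationConfig` settings, and the output directory name):

```python
# Sketch: convert Qwen2.5-VL to OpenVINO IR and compress weights to 4 bits with NNCF.
from optimum.intel import OVModelForVisualCausalLM, OVWeightQuantizationConfig

compression = OVWeightQuantizationConfig(bits=4, group_size=128, ratio=1.0)  # NNCF INT4 settings
model = OVModelForVisualCausalLM.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    export=True,                      # convert the PyTorch checkpoint on the fly
    quantization_config=compression,  # apply weight compression during export
)
model.save_pretrained("Qwen2.5-VL-3B-Instruct-ov")
```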

## Notebook contents
The tutorial consists of the following steps:

- Install requirements
- Convert and optimize model
- Run OpenVINO model inference
- Launch interactive demo

In this demonstration, you'll create an interactive chatbot that can answer questions about the content of a provided image.

The image below illustrates an example of an input prompt and the model's answer.
![example.png](https://github.com/user-attachments/assets/7e12ac6c-12f8-43d8-9c0a-b63d6ecaf20b)

## Installation instructions
This is a self-contained example that relies solely on its own code.<br/>
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to [Installation Guide](../../README.md).

<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=5b5a4db0-7875-4bfb-bdbd-01698b5b1a77&file=notebooks/qwen2.5-vl/README.md" />

notebooks/qwen2.5-vl/gradio_helper.py

+205
@@ -0,0 +1,205 @@
import gradio as gr
import copy
import re
from threading import Thread
from transformers import TextIteratorStreamer
from qwen_vl_utils import process_vision_info


def _parse_text(text):
    # Convert model output into chat-safe HTML: fenced code blocks become
    # <pre><code> sections, and markdown-sensitive characters inside them are
    # escaped so Gradio renders them literally.
    lines = text.split("\n")
    lines = [line for line in lines if line != ""]
    count = 0
    for i, line in enumerate(lines):
        if "```" in line:
            count += 1
            items = line.split("`")
            if count % 2 == 1:
                lines[i] = f'<pre><code class="language-{items[-1]}">'
            else:
                lines[i] = "<br></code></pre>"
        else:
            if i > 0:
                if count % 2 == 1:
                    line = line.replace("`", r"\`")
                    line = line.replace("<", "&lt;")
                    line = line.replace(">", "&gt;")
                    line = line.replace(" ", "&nbsp;")
                    line = line.replace("*", "&ast;")
                    line = line.replace("_", "&lowbar;")
                    line = line.replace("-", "&#45;")
                    line = line.replace(".", "&#46;")
                    line = line.replace("!", "&#33;")
                    line = line.replace("(", "&#40;")
                    line = line.replace(")", "&#41;")
                    line = line.replace("$", "&#36;")
                lines[i] = "<br>" + line
    text = "".join(lines)
    return text


def _remove_image_special(text):
    # Strip <ref>/<box> grounding tags from the streamed response.
    text = text.replace("<ref>", "").replace("</ref>", "")
    return re.sub(r"<box>.*?(</box>|$)", "", text)


def is_video_file(filename):
    video_extensions = [".mp4", ".avi", ".mkv", ".mov", ".wmv", ".flv", ".webm", ".mpeg"]
    return any(filename.lower().endswith(ext) for ext in video_extensions)


def transform_messages(original_messages):
    # Normalize history items into the explicit {"type": ...} content format
    # expected by the processor's chat template.
    transformed_messages = []
    for message in original_messages:
        new_content = []
        for item in message["content"]:
            if "image" in item:
                new_item = {"type": "image", "image": item["image"]}
            elif "text" in item:
                new_item = {"type": "text", "text": item["text"]}
            elif "video" in item:
                new_item = {"type": "video", "video": item["video"]}
            else:
                continue
            new_content.append(new_item)

        new_message = {"role": message["role"], "content": new_content}
        transformed_messages.append(new_message)

    return transformed_messages
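
# For reference, transform_messages turns history entries like (illustrative):
#     [{"role": "user", "content": [{"image": "file:///path/img.png"}, {"text": "Describe this image."}]}]
# into the explicit typed form:
#     [{"role": "user", "content": [{"type": "image", "image": "file:///path/img.png"},
#                                   {"type": "text", "text": "Describe this image."}]}]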


def make_demo(model, processor):
    def call_local_model(model, processor, messages):
        messages = transform_messages(messages)

        text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        image_inputs, video_inputs = process_vision_info(messages)
        inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(model.device)

        tokenizer = processor.tokenizer
        streamer = TextIteratorStreamer(tokenizer, timeout=3600.0, skip_prompt=True, skip_special_tokens=True)

        gen_kwargs = {"max_new_tokens": 512, "streamer": streamer, **inputs}

        # Run generation in a background thread so tokens can be streamed
        # back to the UI as they are produced.
        thread = Thread(target=model.generate, kwargs=gen_kwargs)
        thread.start()

        generated_text = ""
        for new_text in streamer:
            generated_text += new_text
            yield generated_text

    def create_predict_fn():
        def predict(_chatbot, task_history):
            chat_query = _chatbot[-1][0]
            query = task_history[-1][0]
            if len(chat_query) == 0:
                _chatbot.pop()
                task_history.pop()
                return _chatbot
            print("User: " + _parse_text(query))
            history_cp = copy.deepcopy(task_history)
            full_response = ""
            messages = []
            content = []
            # Rebuild the message list from the chat history: consecutive
            # image/video uploads are grouped with the text query that follows them.
            for q, a in history_cp:
                if isinstance(q, (tuple, list)):
                    if is_video_file(q[0]):
                        content.append({"video": f"file://{q[0]}"})
                    else:
                        content.append({"image": f"file://{q[0]}"})
                else:
                    content.append({"text": q})
                    messages.append({"role": "user", "content": content})
                    messages.append({"role": "assistant", "content": [{"text": a}]})
                    content = []
            # Drop the placeholder assistant reply for the current turn.
            messages.pop()

            for response in call_local_model(model, processor, messages):
                _chatbot[-1] = (_parse_text(chat_query), _remove_image_special(_parse_text(response)))

                yield _chatbot
                full_response = _parse_text(response)

            task_history[-1] = (query, full_response)
            print("Qwen-VL-Chat: " + _parse_text(full_response))
            yield _chatbot

        return predict

    def create_regenerate_fn():
        def regenerate(_chatbot, task_history):
            if not task_history:
                return _chatbot
            item = task_history[-1]
            if item[1] is None:
                return _chatbot
            # Clear the last answer and re-run prediction for the same query.
            task_history[-1] = (item[0], None)
            chatbot_item = _chatbot.pop(-1)
            if chatbot_item[0] is None:
                _chatbot[-1] = (_chatbot[-1][0], None)
            else:
                _chatbot.append((chatbot_item[0], None))
            _chatbot_gen = predict(_chatbot, task_history)
            for _chatbot in _chatbot_gen:
                yield _chatbot

        return regenerate

    predict = create_predict_fn()
    regenerate = create_regenerate_fn()

    def add_text(history, task_history, text):
        task_text = text
        history = history if history is not None else []
        task_history = task_history if task_history is not None else []
        history = history + [(_parse_text(text), None)]
        task_history = task_history + [(task_text, None)]
        return history, task_history, ""

    def add_file(history, task_history, file):
        history = history if history is not None else []
        task_history = task_history if task_history is not None else []
        history = history + [((file.name,), None)]
        task_history = task_history + [((file.name,), None)]
        return history, task_history

    def reset_user_input():
        return gr.update(value="")

    def reset_state(task_history):
        task_history.clear()
        return []

    with gr.Blocks() as demo:
        gr.Markdown("""<center><font size=8>Qwen2.5-VL OpenVINO demo</center>""")

        chatbot = gr.Chatbot(label="Qwen2.5-VL", elem_classes="control-height", height=500)
        query = gr.Textbox(lines=2, label="Input")
        task_history = gr.State([])

        with gr.Row():
            addfile_btn = gr.UploadButton("📁 Upload (上传文件)", file_types=["image", "video"])
            submit_btn = gr.Button("🚀 Submit (发送)")
            regen_btn = gr.Button("🤔️ Regenerate (重试)")
            empty_bin = gr.Button("🧹 Clear History (清除历史)")

        submit_btn.click(add_text, [chatbot, task_history, query], [chatbot, task_history]).then(
            predict, [chatbot, task_history], [chatbot], show_progress=True
        )
        submit_btn.click(reset_user_input, [], [query])
        empty_bin.click(reset_state, [task_history], [chatbot], show_progress=True)
        regen_btn.click(regenerate, [chatbot, task_history], [chatbot], show_progress=True)
        addfile_btn.upload(add_file, [chatbot, task_history, addfile_btn], [chatbot, task_history], show_progress=True)

        gr.Markdown(
            """\
<font size=2>Note: This demo is governed by the original license of Qwen2.5-VL. \
We strongly advise users not to knowingly generate or allow others to knowingly generate harmful content, \
including hate speech, violence, pornography, deception, etc. \
(注:本演示受Qwen2-VL的许可协议限制。我们强烈建议,用户不应传播及不应允许他人传播以下内容,\
包括但不限于仇恨言论、暴力、色情、欺诈相关的有害信息。)"""
        )

    return demo
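
For context, here is a minimal, hypothetical sketch of how `make_demo` might be wired up after conversion (the model directory, device name, and use of `OVModelForVisualCausalLM`/`AutoProcessor` are assumptions based on Optimum Intel conventions, not code from this commit):

    from transformers import AutoProcessor
    from optimum.intel import OVModelForVisualCausalLM
    from gradio_helper import make_demo

    model_dir = "Qwen2.5-VL-3B-Instruct-ov"  # assumed output of the conversion step
    model = OVModelForVisualCausalLM.from_pretrained(model_dir, device="CPU")
    processor = AutoProcessor.from_pretrained(model_dir)

    demo = make_demo(model, processor)
    demo.launch()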
