
Commit 79cd7e4

add qwen2.5vl model (#2764)
CVS-162142
1 parent b1e4ac7 commit 79cd7e4

4 files changed, +907 -0 lines changed


.ci/spellcheck/.pyspelling.wordlist.txt

+5
@@ -206,6 +206,7 @@ dGPU
 dGPUs
 DialoGPT
 diarization
+digitalized
 Diffusers
 diffusers
 dimensionality
@@ -395,6 +396,7 @@ intervaling
 im
 img
 ip
+IPs
 ir
 IRs
 iteratively
@@ -550,6 +552,7 @@ mpnet
 mpt
 MPT
 MRPC
+mRoPE
 MTVQA
 multiarchitecture
 Multiclass
@@ -807,6 +810,7 @@ Rinna
 rinna
 RLHF
 RMBG
+RMSNorm
 RoBERTa
 roberta
 ROI
@@ -1020,6 +1024,7 @@ vits
 VITS
 vitt
 VL
+VL’s
 vl
 vlm
 VLM

notebooks/qwen2.5-vl/README.md

+52
@@ -0,0 +1,52 @@
# Visual-language assistant with Qwen2.5VL and OpenVINO

Qwen2.5-VL is the latest addition to the QwenVL series of multimodal large language models.

![](https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2.5-vl-Capybara.png)

**Key Enhancements of Qwen2.5VL:**
* **Understand things visually**: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
* **Being agentic**: Qwen2.5-VL directly plays as a visual agent that can reason and dynamically direct tools; it is capable of computer use and phone use.
* **Understanding long videos and capturing events**: Qwen2.5-VL can comprehend videos of over 1 hour, and it has a new ability to capture events by pinpointing the relevant video segments.
* **Capable of visual localization in different formats**: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes (an illustrative output is sketched after this list).
* **Generating structured outputs**: for data like scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting usage in finance, commerce, and other fields.
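
As an illustration of the stable JSON output mentioned above, a grounding request such as "Detect all cats in the image and output their locations in JSON" might return something like the following (hypothetical values; the `bbox_2d`/`label` keys follow the grounding examples in Qwen's model card and cookbook):

```python
# Hypothetical grounding output with absolute pixel coordinates [x1, y1, x2, y2]:
[
    {"bbox_2d": [135, 74, 412, 490], "label": "cat"},
    {"bbox_2d": [520, 118, 790, 486], "label": "cat"},
]
```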

**Model Capabilities**:
* **World-wide Image Recognition**. Qwen2.5-VL has significantly enhanced its general image recognition capabilities, expanding the categories of images to an ultra-large number. It covers not only plants, animals, and landmarks of famous mountains and rivers, but also IPs from film and TV series, as well as a wide variety of products.
* **Precise Object Grounding**. Qwen2.5-VL utilizes bounding boxes and point-based representations for grounding, enabling hierarchical positioning and standardized JSON output. This enhanced localization capability serves as a foundation for visual reasoning.
* **Enhanced Text Recognition and Understanding**. Qwen2.5-VL has upgraded its OCR capabilities to a new level, with enhanced multi-scenario, multi-language and multi-orientation text recognition and text localization performance. Furthermore, it has been significantly enhanced in information extraction to meet the growing digitalized and intelligent demands in areas such as qualification review and financial business.
* **Powerful Document Parsing**. Qwen2.5-VL introduces a unique document parsing format called QwenVL HTML, which extracts layout information based on HTML. QwenVL HTML can perform document parsing in various scenarios, such as magazines, research papers, web pages, and even mobile screenshots.
* **Enhanced Video Comprehension Ability**. Qwen2.5-VL's video comprehension capabilities have been comprehensively upgraded. In terms of temporal processing, dynamic frame rate (FPS) training and absolute time encoding technology have been introduced. As a result, the model not only supports the understanding of ultra-long videos on an hourly scale but also achieves second-level event localization. It can accurately comprehend content from long videos spanning hours, search for specific events within videos, and summarize key points from different time segments, allowing users to quickly and efficiently extract crucial information embedded in videos.

**Model Architecture Details:**

Compared with Qwen2VL, the Qwen2.5VL architecture receives the following updates:
* **Dynamic Resolution and Frame Rate Training for Video Understanding**
Dynamic resolution is extended to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, mRoPE is updated in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments (see the sketch after this list).
![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-VL/qwen2.5vl_arc.jpeg)
* **Streamlined and Efficient Vision Encoder**
Qwen2.5VL enhances both training and inference speeds by strategically implementing window attention in the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
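
To make the absolute-time alignment concrete, here is a minimal, hypothetical sketch (not the actual Qwen2.5-VL implementation) of how temporal position IDs can be tied to wall-clock timestamps rather than frame indices, so that clips sampled at different FPS share one time grid:

```python
# Illustrative only: align temporal position IDs to absolute time, not frame index.
def temporal_ids(num_frames: int, fps: float, ticks_per_second: float = 2.0) -> list[int]:
    """Map each sampled frame's timestamp (in seconds) to an integer temporal position ID."""
    return [round(frame / fps * ticks_per_second) for frame in range(num_frames)]

print(temporal_ids(4, fps=2.0))  # frames at 0.0s, 0.5s, 1.0s, 1.5s -> [0, 1, 2, 3]
print(temporal_ids(2, fps=1.0))  # frames at 0.0s, 1.0s -> [0, 2]; 1.0s maps to ID 2 in both clips
```

Under a scheme like this the gap between IDs encodes elapsed time rather than frame count, which is what lets the model reason about speed and pinpoint specific moments.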

More details about the model can be found in the [model card](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct), [blog post](https://qwenlm.github.io/blog/qwen2.5-vl/), [technical report](https://arxiv.org/abs/2502.13923) and the original [repo](https://github.com/QwenLM/Qwen2.5-VL).

In this tutorial we consider how to convert and optimize the Qwen2.5VL model for creating a multimodal chatbot using [Optimum Intel](https://github.com/huggingface/optimum-intel). Additionally, we demonstrate how to apply model optimization techniques like weight compression using [NNCF](https://github.com/openvinotoolkit/nncf).
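
For example, the conversion and weight-compression step might look roughly like the sketch below (assumptions, not code from this commit: the use of `OVModelForVisualCausalLM` with `export=True`, the NNCF-backed `OVWeightQuantizationConfig` settings, and the output directory name):

```python
# Sketch: convert Qwen2.5-VL to OpenVINO IR and compress weights to 4 bits with NNCF.
from optimum.intel import OVModelForVisualCausalLM, OVWeightQuantizationConfig

compression = OVWeightQuantizationConfig(bits=4, group_size=128, ratio=1.0)  # NNCF INT4 settings
model = OVModelForVisualCausalLM.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    export=True,                      # convert the PyTorch checkpoint on the fly
    quantization_config=compression,  # apply weight compression during export
)
model.save_pretrained("Qwen2.5-VL-3B-Instruct-ov")
```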

## Notebook contents
The tutorial consists of the following steps:

- Install requirements
- Convert and optimize model
- Run OpenVINO model inference
- Launch interactive demo

In this demonstration, you'll create an interactive chatbot that can answer questions about the content of a provided image.

The image below illustrates an example of an input prompt and the model's answer.
![example.png](https://github.com/user-attachments/assets/7e12ac6c-12f8-43d8-9c0a-b63d6ecaf20b)

## Installation instructions
This is a self-contained example that relies solely on its own code.<br/>
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to [Installation Guide](../../README.md).

<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=5b5a4db0-7875-4bfb-bdbd-01698b5b1a77&file=notebooks/qwen2.5-vl/README.md" />

notebooks/qwen2.5-vl/gradio_helper.py

+205
@@ -0,0 +1,205 @@
import gradio as gr
import copy
import re
from threading import Thread
from transformers import TextIteratorStreamer
from qwen_vl_utils import process_vision_info


def _parse_text(text):
    # Convert model output into chat-safe HTML: fenced code blocks become
    # <pre><code> sections, and markdown-sensitive characters inside them are
    # escaped so Gradio renders them literally.
    lines = text.split("\n")
    lines = [line for line in lines if line != ""]
    count = 0
    for i, line in enumerate(lines):
        if "```" in line:
            count += 1
            items = line.split("`")
            if count % 2 == 1:
                lines[i] = f'<pre><code class="language-{items[-1]}">'
            else:
                lines[i] = "<br></code></pre>"
        else:
            if i > 0:
                if count % 2 == 1:
                    line = line.replace("`", r"\`")
                    line = line.replace("<", "&lt;")
                    line = line.replace(">", "&gt;")
                    line = line.replace(" ", "&nbsp;")
                    line = line.replace("*", "&ast;")
                    line = line.replace("_", "&lowbar;")
                    line = line.replace("-", "&#45;")
                    line = line.replace(".", "&#46;")
                    line = line.replace("!", "&#33;")
                    line = line.replace("(", "&#40;")
                    line = line.replace(")", "&#41;")
                    line = line.replace("$", "&#36;")
                lines[i] = "<br>" + line
    text = "".join(lines)
    return text


def _remove_image_special(text):
    # Strip <ref>/<box> grounding tags from the streamed response.
    text = text.replace("<ref>", "").replace("</ref>", "")
    return re.sub(r"<box>.*?(</box>|$)", "", text)


def is_video_file(filename):
    video_extensions = [".mp4", ".avi", ".mkv", ".mov", ".wmv", ".flv", ".webm", ".mpeg"]
    return any(filename.lower().endswith(ext) for ext in video_extensions)


def transform_messages(original_messages):
    # Normalize history items into the explicit {"type": ...} content format
    # expected by the processor's chat template.
    transformed_messages = []
    for message in original_messages:
        new_content = []
        for item in message["content"]:
            if "image" in item:
                new_item = {"type": "image", "image": item["image"]}
            elif "text" in item:
                new_item = {"type": "text", "text": item["text"]}
            elif "video" in item:
                new_item = {"type": "video", "video": item["video"]}
            else:
                continue
            new_content.append(new_item)

        new_message = {"role": message["role"], "content": new_content}
        transformed_messages.append(new_message)

    return transformed_messages
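
# For reference, transform_messages turns history entries like (illustrative):
#     [{"role": "user", "content": [{"image": "file:///path/img.png"}, {"text": "Describe this image."}]}]
# into the explicit typed form:
#     [{"role": "user", "content": [{"type": "image", "image": "file:///path/img.png"},
#                                   {"type": "text", "text": "Describe this image."}]}]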


def make_demo(model, processor):
    def call_local_model(model, processor, messages):
        messages = transform_messages(messages)

        text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        image_inputs, video_inputs = process_vision_info(messages)
        inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(model.device)

        tokenizer = processor.tokenizer
        streamer = TextIteratorStreamer(tokenizer, timeout=3600.0, skip_prompt=True, skip_special_tokens=True)

        gen_kwargs = {"max_new_tokens": 512, "streamer": streamer, **inputs}

        # Run generation in a background thread so tokens can be streamed
        # back to the UI as they are produced.
        thread = Thread(target=model.generate, kwargs=gen_kwargs)
        thread.start()

        generated_text = ""
        for new_text in streamer:
            generated_text += new_text
            yield generated_text

    def create_predict_fn():
        def predict(_chatbot, task_history):
            chat_query = _chatbot[-1][0]
            query = task_history[-1][0]
            if len(chat_query) == 0:
                _chatbot.pop()
                task_history.pop()
                return _chatbot
            print("User: " + _parse_text(query))
            history_cp = copy.deepcopy(task_history)
            full_response = ""
            messages = []
            content = []
            # Rebuild the message list from the chat history: consecutive
            # image/video uploads are grouped with the text query that follows them.
            for q, a in history_cp:
                if isinstance(q, (tuple, list)):
                    if is_video_file(q[0]):
                        content.append({"video": f"file://{q[0]}"})
                    else:
                        content.append({"image": f"file://{q[0]}"})
                else:
                    content.append({"text": q})
                    messages.append({"role": "user", "content": content})
                    messages.append({"role": "assistant", "content": [{"text": a}]})
                    content = []
            # Drop the placeholder assistant reply for the current turn.
            messages.pop()

            for response in call_local_model(model, processor, messages):
                _chatbot[-1] = (_parse_text(chat_query), _remove_image_special(_parse_text(response)))

                yield _chatbot
                full_response = _parse_text(response)

            task_history[-1] = (query, full_response)
            print("Qwen-VL-Chat: " + _parse_text(full_response))
            yield _chatbot

        return predict

    def create_regenerate_fn():
        def regenerate(_chatbot, task_history):
            if not task_history:
                return _chatbot
            item = task_history[-1]
            if item[1] is None:
                return _chatbot
            # Clear the last answer and re-run prediction for the same query.
            task_history[-1] = (item[0], None)
            chatbot_item = _chatbot.pop(-1)
            if chatbot_item[0] is None:
                _chatbot[-1] = (_chatbot[-1][0], None)
            else:
                _chatbot.append((chatbot_item[0], None))
            _chatbot_gen = predict(_chatbot, task_history)
            for _chatbot in _chatbot_gen:
                yield _chatbot

        return regenerate

    predict = create_predict_fn()
    regenerate = create_regenerate_fn()

    def add_text(history, task_history, text):
        task_text = text
        history = history if history is not None else []
        task_history = task_history if task_history is not None else []
        history = history + [(_parse_text(text), None)]
        task_history = task_history + [(task_text, None)]
        return history, task_history, ""

    def add_file(history, task_history, file):
        history = history if history is not None else []
        task_history = task_history if task_history is not None else []
        history = history + [((file.name,), None)]
        task_history = task_history + [((file.name,), None)]
        return history, task_history

    def reset_user_input():
        return gr.update(value="")

    def reset_state(task_history):
        task_history.clear()
        return []

    with gr.Blocks() as demo:
        gr.Markdown("""<center><font size=8>Qwen2.5-VL OpenVINO demo</center>""")

        chatbot = gr.Chatbot(label="Qwen2.5-VL", elem_classes="control-height", height=500)
        query = gr.Textbox(lines=2, label="Input")
        task_history = gr.State([])

        with gr.Row():
            addfile_btn = gr.UploadButton("📁 Upload (上传文件)", file_types=["image", "video"])
            submit_btn = gr.Button("🚀 Submit (发送)")
            regen_btn = gr.Button("🤔️ Regenerate (重试)")
            empty_bin = gr.Button("🧹 Clear History (清除历史)")

        submit_btn.click(add_text, [chatbot, task_history, query], [chatbot, task_history]).then(
            predict, [chatbot, task_history], [chatbot], show_progress=True
        )
        submit_btn.click(reset_user_input, [], [query])
        empty_bin.click(reset_state, [task_history], [chatbot], show_progress=True)
        regen_btn.click(regenerate, [chatbot, task_history], [chatbot], show_progress=True)
        addfile_btn.upload(add_file, [chatbot, task_history, addfile_btn], [chatbot, task_history], show_progress=True)

        gr.Markdown(
            """\
<font size=2>Note: This demo is governed by the original license of Qwen2.5-VL. \
We strongly advise users not to knowingly generate or allow others to knowingly generate harmful content, \
including hate speech, violence, pornography, deception, etc. \
(注:本演示受Qwen2-VL的许可协议限制。我们强烈建议,用户不应传播及不应允许他人传播以下内容,\
包括但不限于仇恨言论、暴力、色情、欺诈相关的有害信息。)"""
        )

    return demo
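
For context, here is a minimal, hypothetical sketch of how `make_demo` might be wired up after conversion (the model directory, device name, and use of `OVModelForVisualCausalLM`/`AutoProcessor` are assumptions based on Optimum Intel conventions, not code from this commit):

    from transformers import AutoProcessor
    from optimum.intel import OVModelForVisualCausalLM
    from gradio_helper import make_demo

    model_dir = "Qwen2.5-VL-3B-Instruct-ov"  # assumed output of the conversion step
    model = OVModelForVisualCausalLM.from_pretrained(model_dir, device="CPU")
    processor = AutoProcessor.from_pretrained(model_dir)

    demo = make_demo(model, processor)
    demo.launch()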
