openvinotoolkit
diff --git a/.ci/ignore_treon_docker.txt b/.ci/ignore_treon_docker.txt
+2-1
diff --git a/.ci/skipped_notebooks.yml b/.ci/skipped_notebooks.yml
+8-1
diff --git a/notebooks/llm-chatbot/README.md b/notebooks/llm-chatbot/README.md
+1
diff --git a/notebooks/llm-chatbot/llm-chatbot-generate-api.ipynb b/notebooks/llm-chatbot/llm-chatbot-generate-api.ipynb
+217-224
diff --git a/notebooks/phi-4-multimodal/README.md b/notebooks/phi-4-multimodal/README.md
+21
diff --git a/notebooks/phi-4-multimodal/gradio_helper.py b/notebooks/phi-4-multimodal/gradio_helper.py
+245
@@ -82,4 +82,5 @@ notebooks/llm-agent-react/llm-agent-react-langchain.ipynb
 notebooks/multimodal-rag/multimodal-rag-llamaindex.ipynb
 notebooks/llm-rag-langchain/llm-rag-langchain-genai.ipynb
 notebooks/ltx-video/ltx-video.ipynb
-notebooks/outetts-text-to-speech/outetts-text-to-speech.ipynb
+notebooks/outetts-text-to-speech/outetts-text-to-speech.ipynb
+notebooks/phi-4-multimodal/phi-4-multimodal.ipynb
@@ -559,4 +559,11 @@
 - notebook: notebooks/paddle-to-openvino/paddle-to-openvino-classification.ipynb
   skips:
     - python:
-        - '3.12'
+        - '3.12'
+- notebook: notebooks/phi-4-multimodal/phi-4-multimodal.ipynb
+  skips:
+    - os:
+        - macos-13
+        - ubuntu-20.04
+        - ubuntu-22.04
+        - windows-2019
@@ -36,6 +36,7 @@ The available options are:
 * **red-pajama-3b-chat** - A 2.8B parameter pre-trained language model based on GPT-NEOX architecture. It was developed by Together Computer and leaders from the open-source AI community. The model is fine-tuned on OASST1 and Dolly2 datasets to enhance chatting ability. More details about model can be found in [HuggingFace model card](https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1).
 * **phi3-mini-instruct** - The Phi-3-Mini is a 3.8B parameters, lightweight, state-of-the-art open model trained with the Phi-3 datasets that includes both synthetic data and the filtered publicly available websites data with a focus on high-quality and reasoning dense properties. More details about model can be found in [model card](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct), [Microsoft blog](https://aka.ms/phi3blog-april) and [technical report](https://aka.ms/phi3-tech-report).
 * **phi-3.5-mini-instruct** - Phi-3.5-mini is a lightweight, state-of-the-art open model built upon datasets used for Phi-3 - synthetic data and filtered publicly available websites - with a focus on very high-quality, reasoning dense data. The model belongs to the Phi-3 model family and supports 128K token context length. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning, proximal policy optimization, and direct preference optimization to ensure precise instruction adherence and robust safety measures. More details about model can be found in [model card](https://huggingface.co/microsoft/Phi-3.5-mini-instruct), [Microsoft blog](https://aka.ms/phi3.5-techblog) and [technical report](https://arxiv.org/abs/2404.14219).
+* **phi-4-mini-instruct** - Phi-4-mini is a lightweight, state-of-the-art open model built upon a blend of synthetic datasets and data from filtered public domain websites, with a focus on high-quality, reasoning-dense data. More details about the model can be found in the model card.
 * **phi-4** -  Phi-4 is 14B model that built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning. Phi-4 underwent a rigorous enhancement and alignment process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures. More details about model can be found in [model_card](https://huggingface.co/microsoft/phi-4), [technical report](https://arxiv.org/pdf/2412.08905) and [Microsoft blog](https://techcommunity.microsoft.com/blog/aiplatformblog/introducing-phi-4-microsoft%E2%80%99s-newest-small-language-model-specializing-in-comple/4357090).
 *  **gemma-7b-it** - Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights, pre-trained variants, and instruction-tuned variants. Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. This model is instruction-tuned version of 7B parameters model. More details about model can be found in [model card](https://huggingface.co/google/gemma-7b-it).
 >**Note**: run model with demo, you will need to accept license agreement. 
 
@@ -0,0 +1,21 @@
+# Multimodal assistant with Phi-4-multimodal and OpenVINO
+
+Phi-4-multimodal-instruct is a lightweight open multimodal foundation model. It processes text, image, and audio inputs and generates text outputs. Phi-4-multimodal-instruct is a multimodal transformer model with 5.6B parameters. It uses the pretrained Phi-4-mini as its backbone language model, combined with advanced vision and speech encoders and adapters.
+In this tutorial, we will explore how to run the Phi-4-multimodal model using [OpenVINO](https://github.com/openvinotoolkit/openvino) and optimize it using [NNCF](https://github.com/openvinotoolkit/nncf).
+
+## Notebook contents
+The tutorial consists of the following steps:
+
+- Install requirements
+- Convert and Optimize model
+- Run OpenVINO model inference
+- Launch Interactive demo
+
+In this demonstration, you'll create an interactive chatbot that can answer questions about the content of a provided image and audio clip.
+![phi4](https://github.com/user-attachments/assets/8c0b8e50-417e-4579-b799-e9b9c15e8a87)
+
+## Installation instructions
+This is a self-contained example that relies solely on its own code.<br>
+We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
+For details, please refer to [Installation Guide](../../README.md).
+<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=5b5a4db0-7875-4bfb-bdbd-01698b5b1a77&file=notebooks/phi-4-multimodal/README.md" />
@@ -0,0 +1,245 @@
+from copy import deepcopy
+from typing import Dict, List
+from PIL import Image
+import librosa
+from transformers import TextIteratorStreamer
+from threading import Thread
+import gradio as gr
+
+IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".webp")
+AUDIO_EXTENSIONS = (".mp3", ".wav", ".flac", ".m4a", ".wma")
+
+IMAGE_SPECIAL = "<|endoftext10|>"
+AUDIO_SPECIAL = "<|endoftext11|>"
+
+DEFAULT_SAMPLING_PARAMS = {
+    "top_p": 0.0,
+    "top_k": 1,
+    "temperature": 0.0,
+    "do_sample": True,
+    "num_beams": 1,
+    "repetition_penalty": 1.2,
+}
+MAX_NEW_TOKENS = 512
+
+
+def history2messages(history: List[Dict]) -> tuple:
+    """
+    Transform gradio history into a (messages, images, audios) tuple.
+    """
+    print(history)
+    messages = []
+    cur_message = dict()
+    images = []
+    audios = []
+    cur_special_tags = ""
+    for item in history:
+        if item["role"] == "assistant":
+            if len(cur_message) > 0:
+                cur_message["content"] = cur_special_tags + cur_message["content"]
+                messages.append(deepcopy(cur_message))
+                cur_message = dict()
+                cur_special_tags = ""
+            messages.append({"role": "assistant", "content": item["content"]})
+            continue
+
+        if "role" not in cur_message:
+            cur_message["role"] = "user"
+        if "content" not in cur_message:
+            cur_message["content"] = ""
+
+        if "metadata" not in item:
+            item["metadata"] = {"title": None}
+        if item["metadata"].get("title") is None:
+            cur_message["content"] = item["content"]
+        elif item["metadata"]["title"] == "image":
+            cur_special_tags += IMAGE_SPECIAL
+            images.append(Image.open(item["content"][0]))
+        elif item["metadata"]["title"] == "audio":
+            cur_special_tags += AUDIO_SPECIAL
+            audios.append(librosa.load(item["content"][0]))
+    if len(cur_message) > 0:
+        cur_message["content"] = cur_special_tags + cur_message["content"]
+        messages.append(cur_message)
+    return messages, images, audios
+
+
+def check_messages(history, message, audio):
+    has_text = message["text"] and message["text"].strip()
+    has_files = len(message["files"]) > 0
+    has_audio = audio is not None
+
+    if not (has_text or has_files or has_audio):
+        raise gr.Error("Message is empty")
+
+    audios = []
+    images = []
+
+    for file_msg in message["files"]:
+        if file_msg.lower().endswith(AUDIO_EXTENSIONS):
+            duration = librosa.get_duration(filename=file_msg)
+            if duration > 60:
+                raise gr.Error("Audio file is too long. For efficiency, we recommend audio shorter than 60 seconds.")
+            if duration == 0:
+                raise gr.Error("Audio file is too short.")
+            audios.append(file_msg)
+        elif file_msg.lower().endswith(IMAGE_EXTENSIONS):
+            images.append(file_msg)
+        else:
+            filename = file_msg.split("/")[-1]
+            raise gr.Error(f"Unsupported file type: {filename}. It should be an image or audio file.")
+
+    if len(audios) > 1:
+        raise gr.Error("Please upload only one audio file.")
+
+    if len(images) > 1:
+        raise gr.Error("Please upload only one image file.")
+
+    if audio is not None:
+        if len(audios) > 0:
+            raise gr.Error("Please upload only one audio file or record audio.")
+        audios.append(audio)
+
+    # Append the message to the history
+    for image in images:
+        history.append({"role": "user", "content": (image,), "metadata": {"title": "image"}})
+
+    for audio in audios:
+        history.append({"role": "user", "content": (audio,), "metadata": {"title": "audio"}})
+
+    if message["text"]:
+        history.append({"role": "user", "content": message["text"], "metadata": {}})
+
+    return history, gr.MultimodalTextbox(value=None, interactive=False), None
+
+
+def make_demo(ov_model, processor):
+    def bot(
+        history: list,
+        top_p: float,
+        top_k: int,
+        temperature: float,
+        repetition_penalty: float,
+        max_new_tokens: int = MAX_NEW_TOKENS,
+        regenerate: bool = False,
+    ):
+
+        print(history)
+        if history and regenerate:
+            history = history[:-1]
+
+        if not history:
+            return history
+
+        msgs, images, audios = history2messages(history)
+        audios = audios if len(audios) > 0 else None
+        images = images if len(images) > 0 else None
+
+        print(msgs)
+        prompt = processor.tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
+        print(prompt)
+        inputs = processor(text=prompt, audios=audios, images=images)
+        streamer = TextIteratorStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
+        generation_params = {
+            "top_p": top_p,
+            "top_k": top_k,
+            "temperature": temperature,
+            "repetition_penalty": repetition_penalty,
+            "max_new_tokens": max_new_tokens,
+            "do_sample": temperature > 0,
+            "streamer": streamer,
+            **inputs,
+        }
+
+        history.append({"role": "assistant", "content": ""})
+
+        thread = Thread(target=ov_model.generate, kwargs=generation_params)
+        thread.start()
+
+        buffer = ""
+        for new_text in streamer:
+            buffer += new_text
+            history[-1]["content"] = buffer
+            yield history
+
+    def change_state(state):
+        return gr.update(visible=not state), not state
+
+    def reset_user_input():
+        return gr.update(value="")
+
+    with gr.Blocks(theme=gr.themes.Soft()) as demo:
+        gr.Markdown("# 🪐 Chat with OpenVINO Phi-4-multimodal")
+        chatbot = gr.Chatbot(elem_id="chatbot", bubble_full_width=False, type="messages", height="48vh")
+
+        sampling_params_group_hidden_state = gr.State(False)
+
+        with gr.Row(equal_height=True):
+            chat_input = gr.MultimodalTextbox(
+                file_count="multiple",
+                placeholder="Enter your prompt or upload image/audio here, then press ENTER...",
+                show_label=False,
+                scale=8,
+                file_types=["image", "audio"],
+                interactive=True,
+                # stop_btn=True,
+            )
+        with gr.Row(equal_height=True):
+            audio_input = gr.Audio(sources=["microphone", "upload"], type="filepath", scale=1, max_length=30)
+        with gr.Row(equal_height=True):
+            with gr.Column(scale=1, min_width=150):
+                with gr.Row(equal_height=True):
+                    regenerate_btn = gr.Button("Regenerate", variant="primary")
+                    clear_btn = gr.ClearButton([chat_input, audio_input, chatbot])
+
+        with gr.Row():
+            sampling_params_toggle_btn = gr.Button("Sampling Parameters")
+
+        with gr.Group(visible=False) as sampling_params_group:
+            with gr.Row():
+                temperature = gr.Slider(minimum=0, maximum=1, value=DEFAULT_SAMPLING_PARAMS["temperature"], label="Temperature")
+                repetition_penalty = gr.Slider(
+                    minimum=0,
+                    maximum=2,
+                    value=DEFAULT_SAMPLING_PARAMS["repetition_penalty"],
+                    label="Repetition Penalty",
+                )
+
+            with gr.Row():
+                top_p = gr.Slider(minimum=0, maximum=1, value=DEFAULT_SAMPLING_PARAMS["top_p"], label="Top-p")
+                top_k = gr.Slider(minimum=0, maximum=1000, value=DEFAULT_SAMPLING_PARAMS["top_k"], label="Top-k")
+
+            with gr.Row():
+                max_new_tokens = gr.Slider(
+                    minimum=1,
+                    maximum=MAX_NEW_TOKENS,
+                    value=MAX_NEW_TOKENS,
+                    label="Max New Tokens",
+                    interactive=True,
+                )
+
+        sampling_params_toggle_btn.click(
+            change_state,
+            sampling_params_group_hidden_state,
+            [sampling_params_group, sampling_params_group_hidden_state],
+        )
+        chat_msg = chat_input.submit(
+            check_messages,
+            [chatbot, chat_input, audio_input],
+            [chatbot, chat_input, audio_input],
+        )
+
+        bot_msg = chat_msg.then(
+            bot,
+            inputs=[chatbot, top_p, top_k, temperature, repetition_penalty, max_new_tokens],
+            outputs=chatbot,
+        )
+
+        bot_msg.then(lambda: gr.MultimodalTextbox(interactive=True), None, [chat_input])
+
+        regenerate_btn.click(
+            bot,
+            inputs=[chatbot, top_p, top_k, temperature, repetition_penalty, max_new_tokens, gr.State(True)],
+            outputs=chatbot,
+        )
+    return demo
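
The folding logic in `history2messages` can be exercised without Gradio, PIL, or librosa. The following is a minimal sketch rather than the PR's code: `fold_history` is a hypothetical name, and it collects file paths instead of loading images or audio. It assumes the history item shape that `check_messages` produces (a `role`, a `content`, and an optional `metadata.title` of `image` or `audio`), and it treats any other item as text.

```python
from typing import Dict, List, Tuple

# Special tokens copied from gradio_helper.py in this diff.
IMAGE_SPECIAL = "<|endoftext10|>"
AUDIO_SPECIAL = "<|endoftext11|>"


def fold_history(history: List[Dict]) -> Tuple[List[Dict], List[str], List[str]]:
    """Fold Gradio-style history items into chat messages (hypothetical helper)."""
    messages: List[Dict] = []
    image_paths: List[str] = []
    audio_paths: List[str] = []
    pending_text = None  # text of the user turn being assembled
    pending_tags = ""    # special tokens collected for that turn

    def flush() -> None:
        # Emit the pending user turn, with media tags placed before the text.
        nonlocal pending_text, pending_tags
        if pending_text is not None or pending_tags:
            messages.append({"role": "user", "content": pending_tags + (pending_text or "")})
        pending_text, pending_tags = None, ""

    for item in history:
        if item["role"] == "assistant":
            flush()
            messages.append({"role": "assistant", "content": item["content"]})
            continue
        title = (item.get("metadata") or {}).get("title")
        if title == "image":
            pending_tags += IMAGE_SPECIAL
            image_paths.append(item["content"][0])
        elif title == "audio":
            pending_tags += AUDIO_SPECIAL
            audio_paths.append(item["content"][0])
        else:
            pending_text = item["content"]
    flush()
    return messages, image_paths, audio_paths


history = [
    {"role": "user", "content": ("cat.png",), "metadata": {"title": "image"}},
    {"role": "user", "content": "What is in the picture?", "metadata": {}},
    {"role": "assistant", "content": "A cat."},
]
messages, image_paths, audio_paths = fold_history(history)
print(messages[0]["content"])
```

Consecutive user items (an image, then text) collapse into one user message whose content starts with the media tag, which is the shape the chat template call in `bot` receives.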