Commit 8e47bfb

Kosmos2 model (#1483)
* Kosmos2 model
* Table of contents
* Table of contents
* Changed workflow
* New openvino version with fixes
* Flake8 fixes
* Flake8 fixes
* Flake8 fixes
* Flake8 fixes
* Gradio example
* Fix misspelling
* Ignore treon docker
* Fix misspelling
* Display bboxes
* Display bboxes
* Display bboxes and description
* Change the number
* Spellchecking
* Change the number
* Improve interactive example
* Fix gradio launch
* Fix README image
* Fix README name
1 parent 36fd474 commit 8e47bfb

8 files changed (+1,134 −0 lines)

.ci/ignore_treon_docker.txt (+1)

```diff
@@ -45,6 +45,7 @@
 272-paint-by-example
 273-stable-zephyr-3b-chatbot
 276-stable-diffusion-torchdynamo-backend
+281-kosmos2-multimodal-large-language-model
 301-tensorflow-training-openvino
 305-tensorflow-quantization-aware-training
 404-style-transfer-webcam
```

.ci/ignore_treon_linux.txt (+1)

```diff
@@ -48,4 +48,5 @@
 272-paint-by-example
 273-stable-zephyr-3b-chatbot
 276-stable-diffusion-torchdynamo-backend
+281-kosmos2-multimodal-large-language-model
 404-style-transfer-webcam
```

.ci/ignore_treon_mac.txt (+1)

```diff
@@ -45,5 +45,6 @@
 272-paint-by-example
 273-stable-zephyr-3b-chatbot
 276-stable-diffusion-torchdynamo-backend
+281-kosmos2-multimodal-large-language-model
 279-mobilevlm-language-assistant
 404-style-transfer-webcam
```

.ci/ignore_treon_win.txt (+1)

```diff
@@ -47,3 +47,4 @@
 272-paint-by-example
 273-stable-zephyr-3b-chatbot
 276-stable-diffusion-torchdynamo-backend
+281-kosmos2-multimodal-large-language-model
```

.ci/spellcheck/.pyspelling.wordlist.txt (+7)

```diff
@@ -37,6 +37,7 @@ backend
 backends
 Baevski
 BasicUNet
+bboxes
 BEiT
 Belrose
 Benchmarking
@@ -98,6 +99,7 @@ ConvNeXt
 ConvNeXts
 Convolutional
 convolutional
+coreference
 CoSENT
 CPUs
 cpu
@@ -298,6 +300,9 @@ KiTS
 Koltun
 Kondate
 Kosaraju
+kosmos
+Kosmos
+KOSMOS
 KServe
 Kubernetes
 Kupyn
@@ -363,6 +368,8 @@ mistralai
 MLS
 mms
 MMS
+MLLM
+MLLMs
 MMVLM
 MLP
 MobileLLaMA
```

README.md (+3)

```diff
@@ -53,6 +53,7 @@ Check out the latest notebooks that show how to optimize and deploy popular mode
 |[Stable Diffusion with IP-Adapter](notebooks/278-stable-diffusion-ip-adapter)<br> | Image conditioning in Stable Diffusion pipeline using IP-Adapter | <img src=https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/182657d9-2aa3-40b3-9fc4-a90b803419fe width=300> |
 | [MobileVLM](notebooks/279-mobilevlm-language-assistant)<br> | Mobile language assistant with MobileVLM and OpenVINO | |
 | [DepthAnything](notebooks/280-depth-anything)<br>[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/openvinotoolkit/openvino_notebooks/HEAD?filepath=notebooks%2F280-depth-anythingh%2F280-depth-anything.ipynb)<br>[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openvinotoolkit/openvino_notebooks/blob/main/notebooks/280-depth-anything/280-depth-anything.ipynb) | Monocular Depth estimation with DepthAnything and OpenVINO | <img src=https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/a9a16658-512f-470c-a33c-0e1f9d0ae72c width=300> |
+| [Kosmos-2: Grounding Multimodal Large Language Models](notebooks/281-kosmos2-multimodal-large-language-model)<br> | Kosmos-2: Grounding Multimodal Large Language Model and OpenVINO™ | <img src=https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/annotated_snowman.jpg width=225> |
 
 ## Table of Contents
 
@@ -233,6 +234,8 @@ Demos that demonstrate inference on a particular model.
 | [278-stable-diffusion-ip-adapter](notebooks/278-stable-diffusion-ip-adapter)<br> | Image conditioning in Stable Diffusion pipeline using IP-Adapter | <img src=https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/182657d9-2aa3-40b3-9fc4-a90b803419fe width=300> |
 | [279-mobilevlm-language-assistant](notebooks/279-mobilevlm-language-assistant)<br> | Mobile language assistant with MobileVLM and OpenVINO | |
 | [280-depth-anything](notebooks/280-depth-anything)<br>[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/openvinotoolkit/openvino_notebooks/HEAD?filepath=notebooks%2F280-depth-anythingh%2F280-depth-anything.ipynb)<br>[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openvinotoolkit/openvino_notebooks/blob/main/notebooks/280-depth-anything/280-depth-anything.ipynb) | Monocular Depth Estimation with DepthAnything and OpenVINO | <img src=https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/a9a16658-512f-470c-a33c-0e1f9d0ae72c width=225> |
+| [281-kosmos2-multimodal-large-language-model](notebooks/281-kosmos2-multimodal-large-language-model)<br> | Kosmos-2: Multimodal Large Language Model and OpenVINO™ | <img src=https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/annotated_snowman.jpg width=225> |
+
 
 <div id='-model-training'></div>
 
```
notebooks/281-kosmos2-multimodal-large-language-model/281-kosmos2-multimodal-large-language-model.ipynb (+1,086)

Large diffs are not rendered by default.
@@ -0,0 +1,34 @@ (new file)

# Kosmos-2: Multimodal Large Language Model and OpenVINO

[KOSMOS-2](https://github.com/microsoft/unilm/tree/master/kosmos-2) is a multimodal large language model (MLLM) with new capabilities for multimodal grounding and referring. KOSMOS-2 can understand multimodal input, follow instructions, perceive object descriptions (e.g., bounding boxes), and ground language in the visual world.

Multimodal Large Language Models (MLLMs) have successfully served as a general-purpose interface across a wide range of tasks, including language, vision, and vision-language tasks. MLLMs can perceive general modalities such as text, images, and audio, and generate responses in free-form text under zero-shot and few-shot settings.

[In this work](https://arxiv.org/abs/2306.14824), the authors unlock the grounding capability for multimodal large language models. Grounding makes human-AI interaction more convenient and efficient for vision-language tasks: instead of typing a detailed text description to refer to an object, the user can point directly at an object or region in the image, and the model understands that region through its spatial location. Grounding also enables the model to respond with visual answers (i.e., bounding boxes), which supports further vision-language tasks such as referring expression comprehension. Visual answers are more precise and resolve coreference ambiguity compared with text-only responses. In addition, grounding can link noun phrases and referring expressions in the generated free-form text to image regions, yielding more accurate, informative, and comprehensive answers.

![image](https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/annotated_snowman.jpg)
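To illustrate what "grounded" text looks like, Kosmos-2 interleaves generated captions with location markup of the form `<phrase>…</phrase><object><patch_index_NNNN><patch_index_NNNN></object>`, as shown on the model card for the snowman image above. The sketch below (function name and sample string are illustrative, not part of this notebook) extracts phrase/location pairs from such output:

```python
import re

# Matches one grounded phrase and its pair of location-token indices,
# e.g. <phrase>a snowman</phrase><object><patch_index_0044><patch_index_0863></object>
_GROUNDING = re.compile(
    r"<phrase>(.*?)</phrase>"
    r"<object><patch_index_(\d{4})><patch_index_(\d{4})></object>"
)

def parse_grounding(text: str) -> list[tuple[str, int, int]]:
    """Return (phrase, start_patch_index, end_patch_index) triples."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in _GROUNDING.finditer(text)
    ]

sample = (
    "<phrase>a snowman</phrase><object><patch_index_0044><patch_index_0863></object>"
    " warming himself by "
    "<phrase>a fire</phrase><object><patch_index_0005><patch_index_0911></object>"
)
print(parse_grounding(sample))
# [('a snowman', 44, 863), ('a fire', 5, 911)]
```

In practice the `transformers` processor handles this decoding for you; the regex above only shows the shape of the markup.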
## Notebook contents

- Prerequisites
- Infer the original model
- Convert the model to OpenVINO IR
- Inference
- Interactive inference
## Installation instructions

This is a self-contained example that relies solely on its own code.<br/>
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to the [Installation Guide](../../README.md).
