Commit 8e47bfb

Kosmos2 model (#1483)
* Kosmos2 model
* Table of contents
* Table of contents
* Changed workflow
* New openvino version with fixes
* Flake8 fixes
* Flake8 fixes
* Flake8 fixes
* Flake8 fixes
* Gradio example
* Fix misspelling
* Ignore treon docker
* Fix misspelling
* Display bboxes
* Display bboxes
* Display bboxes and description
* Change the number
* Spellchecking
* Change the number
* Improve interactive example
* Fix gradio launch
* Fix README image
* Fix README name
1 parent 36fd474 commit 8e47bfb

8 files changed (+1,134 −0 lines)

.ci/ignore_treon_docker.txt (+1)

```diff
@@ -45,6 +45,7 @@
 272-paint-by-example
 273-stable-zephyr-3b-chatbot
 276-stable-diffusion-torchdynamo-backend
+281-kosmos2-multimodal-large-language-model
 301-tensorflow-training-openvino
 305-tensorflow-quantization-aware-training
 404-style-transfer-webcam
```

.ci/ignore_treon_linux.txt (+1)

```diff
@@ -48,4 +48,5 @@
 272-paint-by-example
 273-stable-zephyr-3b-chatbot
 276-stable-diffusion-torchdynamo-backend
+281-kosmos2-multimodal-large-language-model
 404-style-transfer-webcam
```

.ci/ignore_treon_mac.txt (+1)

```diff
@@ -45,5 +45,6 @@
 272-paint-by-example
 273-stable-zephyr-3b-chatbot
 276-stable-diffusion-torchdynamo-backend
+281-kosmos2-multimodal-large-language-model
 279-mobilevlm-language-assistant
 404-style-transfer-webcam
```

.ci/ignore_treon_win.txt (+1)

```diff
@@ -47,3 +47,4 @@
 272-paint-by-example
 273-stable-zephyr-3b-chatbot
 276-stable-diffusion-torchdynamo-backend
+281-kosmos2-multimodal-large-language-model
```

.ci/spellcheck/.pyspelling.wordlist.txt (+7)

```diff
@@ -37,6 +37,7 @@ backend
 backends
 Baevski
 BasicUNet
+bboxes
 BEiT
 Belrose
 Benchmarking
@@ -98,6 +99,7 @@ ConvNeXt
 ConvNeXts
 Convolutional
 convolutional
+coreference
 CoSENT
 CPUs
 cpu
@@ -298,6 +300,9 @@ KiTS
 Koltun
 Kondate
 Kosaraju
+kosmos
+Kosmos
+KOSMOS
 KServe
 Kubernetes
 Kupyn
@@ -363,6 +368,8 @@ mistralai
 MLS
 mms
 MMS
+MLLM
+MLLMs
 MMVLM
 MLP
 MobileLLaMA
```

README.md (+3)

```diff
@@ -53,6 +53,7 @@ Check out the latest notebooks that show how to optimize and deploy popular mode
 |[Stable Diffusion with IP-Adapter](notebooks/278-stable-diffusion-ip-adapter)<br> | Image conditioning in Stable Diffusion pipeline using IP-Adapter | <img src=https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/182657d9-2aa3-40b3-9fc4-a90b803419fe width=300> |
 | [MobileVLM](notebooks/279-mobilevlm-language-assistant)<br> | Mobile language assistant with MobileVLM and OpenVINO | |
 | [DepthAnything](notebooks/280-depth-anything)<br>[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/openvinotoolkit/openvino_notebooks/HEAD?filepath=notebooks%2F280-depth-anythingh%2F280-depth-anything.ipynb)<br>[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openvinotoolkit/openvino_notebooks/blob/main/notebooks/280-depth-anything/280-depth-anything.ipynb) | Monocular Depth estimation with DepthAnything and OpenVINO | <img src=https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/a9a16658-512f-470c-a33c-0e1f9d0ae72c width=300> |
+| [Kosmos-2: Grounding Multimodal Large Language Models](notebooks/281-kosmos2-multimodal-large-language-model)<br> | Kosmos-2: Grounding Multimodal Large Language Model and OpenVINO™ | <img src=https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/annotated_snowman.jpg width=225> |
 
 ## Table of Contents
 
@@ -233,6 +234,8 @@ Demos that demonstrate inference on a particular model.
 | [278-stable-diffusion-ip-adapter](notebooks/278-stable-diffusion-ip-adapter)<br> | Image conditioning in Stable Diffusion pipeline using IP-Adapter | <img src=https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/182657d9-2aa3-40b3-9fc4-a90b803419fe width=300> |
 | [279-mobilevlm-language-assistant](notebooks/279-mobilevlm-language-assistant)<br> | Mobile language assistant with MobileVLM and OpenVINO | |
 | [280-depth-anything](notebooks/280-depth-anything)<br>[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/openvinotoolkit/openvino_notebooks/HEAD?filepath=notebooks%2F280-depth-anythingh%2F280-depth-anything.ipynb)<br>[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openvinotoolkit/openvino_notebooks/blob/main/notebooks/280-depth-anything/280-depth-anything.ipynb) | Monocular Depth Estimation with DepthAnything and OpenVINO | <img src=https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/a9a16658-512f-470c-a33c-0e1f9d0ae72c width=225> |
+| [281-kosmos2-multimodal-large-language-model](notebooks/281-kosmos2-multimodal-large-language-model)<br> | Kosmos-2: Multimodal Large Language Model and OpenVINO™ | <img src=https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/annotated_snowman.jpg width=225> |
+
 
 <div id='-model-training'></div>
 
```
notebooks/281-kosmos2-multimodal-large-language-model/281-kosmos2-multimodal-large-language-model.ipynb (+1,086)

Large diffs are not rendered by default.
@@ -0,0 +1,34 @@ (new file)

# Kosmos-2: Multimodal Large Language Model and OpenVINO

[KOSMOS-2](https://github.com/microsoft/unilm/tree/master/kosmos-2) is a multimodal large language model (MLLM) with new capabilities for multimodal grounding and referring. KOSMOS-2 can understand multimodal input, follow instructions, perceive object descriptions (e.g., bounding boxes), and ground language in the visual world.

Multimodal Large Language Models (MLLMs) have successfully served as a general-purpose interface across a wide range of tasks, including language, vision, and vision-language tasks. MLLMs can perceive general modalities such as text, images, and audio, and generate responses in free-form text under zero-shot and few-shot settings.

[In this work](https://arxiv.org/abs/2306.14824), the authors unlock the grounding capability for multimodal large language models. Grounding makes human-AI interaction more convenient and efficient for vision-language tasks: instead of typing a detailed text description to refer to an object, the user can point directly at an object or region in the image, and the model understands that region through its spatial location. Grounding also enables the model to respond with visual answers (i.e., bounding boxes), which supports further vision-language tasks such as referring expression comprehension. Visual answers are more precise and resolve coreference ambiguity compared with text-only responses. In addition, grounding can link noun phrases and referring expressions in the generated free-form text to image regions, yielding more accurate, informative, and comprehensive answers.

![image](https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/annotated_snowman.jpg)
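To illustrate what "grounded" text looks like, Kosmos-2 interleaves generated captions with location markup of the form `<phrase>…</phrase><object><patch_index_NNNN><patch_index_NNNN></object>`, as shown on the model card for the snowman image above. The sketch below (function name and sample string are illustrative, not part of this notebook) extracts phrase/location pairs from such output:

```python
import re

# Matches one grounded phrase and its pair of location-token indices,
# e.g. <phrase>a snowman</phrase><object><patch_index_0044><patch_index_0863></object>
_GROUNDING = re.compile(
    r"<phrase>(.*?)</phrase>"
    r"<object><patch_index_(\d{4})><patch_index_(\d{4})></object>"
)

def parse_grounding(text: str) -> list[tuple[str, int, int]]:
    """Return (phrase, start_patch_index, end_patch_index) triples."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in _GROUNDING.finditer(text)
    ]

sample = (
    "<phrase>a snowman</phrase><object><patch_index_0044><patch_index_0863></object>"
    " warming himself by "
    "<phrase>a fire</phrase><object><patch_index_0005><patch_index_0911></object>"
)
print(parse_grounding(sample))
# [('a snowman', 44, 863), ('a fire', 5, 911)]
```

In practice the `transformers` processor handles this decoding for you; the regex above only shows the shape of the markup.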
## Notebook contents

- Prerequisites
- Infer the original model
- Convert the model to OpenVINO IR
- Inference
- Interactive inference
## Installation instructions

This is a self-contained example that relies solely on its own code.<br/>
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to the [Installation Guide](../../README.md).
