
Commit 9729894

siglip2 (#2768)
1 parent ec32ee6 commit 9729894

3 files changed (+636 / -125 lines)

.ci/spellcheck/.pyspelling.wordlist.txt

+2
@@ -565,6 +565,7 @@ multinomial
MusicGen
MuRAG
Müller
+naflex
Nakayosi
nano
nanoLLaVA
@@ -1028,6 +1029,7 @@ VL’s
vl
vlm
VLM
+VLMs
VLModel
VLMPipeline
VM

notebooks/siglip-zero-shot-image-classification/README.md

+22 -13
@@ -1,35 +1,44 @@
-# Zero-shot Image Classification with SigLIP
+# Zero-shot Image Classification with SigLIP2

[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/siglip-zero-shot-image-classification/siglip-zero-shot-image-classification.ipynb)

Zero-shot image classification is a computer vision task with the goal of classifying images into one of several classes without any prior training on or knowledge of these classes.

![zero-shot-pipeline](https://user-images.githubusercontent.com/29454499/207773481-d77cacf8-6cdc-4765-a31b-a1669476d620.png)

-In this tutorial, you will use the [SigLIP](https://huggingface.co/docs/transformers/main/en/model_doc/siglip) model to perform zero-shot image classification.
+In this tutorial, you will use the [SigLIP2](https://huggingface.co/blog/siglip2) model to perform zero-shot image classification.

## Notebook Contents

-This tutorial demonstrates how to perform zero-shot image classification using the open-source SigLIP model. The SigLIP model was proposed in the [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) paper. SigLIP suggests replacing the loss function used in [CLIP](https://github.com/openai/CLIP) (Contrastive Language–Image Pre-training) with a simple pairwise sigmoid loss. This results in better performance in terms of zero-shot classification accuracy on ImageNet.
+The SigLIP model was proposed in [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343). SigLIP proposes to replace the loss function used in [CLIP](https://github.com/openai/CLIP) (Contrastive Language–Image Pre-training) with a simple pairwise sigmoid loss, which results in better zero-shot classification accuracy on ImageNet.

-![siglip-performance-comparison](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/siglip_table.jpeg)
+The abstract from the paper is the following:

-[\*_image source_](https://arxiv.org/abs/2303.15343)
+> We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP). Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes.

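To make the loss concrete, here is a minimal, illustrative PyTorch sketch of the pairwise sigmoid objective described above. It is not the notebook's code; the tensor shapes and the learnable scalars `t` (logit scale) and `b` (logit bias) follow the paper's pseudocode.

```python
import torch
import torch.nn.functional as F

def pairwise_sigmoid_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          t: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Illustrative SigLIP loss for L2-normalized image/text embeddings of shape [n, d]."""
    logits = img_emb @ txt_emb.t() * t + b                # [n, n] pairwise similarities
    n = logits.shape[0]
    labels = 2 * torch.eye(n, device=logits.device) - 1   # +1 for matching pairs, -1 otherwise
    # Every image-text pair is scored independently with a sigmoid,
    # so no batch-wide softmax normalization is needed.
    return -F.logsigmoid(labels * logits).sum() / n
```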
You can find more information about this model in the [research paper](https://arxiv.org/abs/2303.15343), the [GitHub repository](https://github.com/google-research/big_vision), and the [Hugging Face model page](https://huggingface.co/docs/transformers/main/en/model_doc/siglip).

+[SigLIP 2](https://huggingface.co/papers/2502.14786) extends the pretraining objective of SigLIP with prior, independently developed techniques into a unified recipe for improved semantic understanding, localization, and dense features. SigLIP 2 models outperform the older SigLIP models at all scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). More details about SigLIP 2 can be found in the [blog post](https://huggingface.co/blog/siglip2).
+
+![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/sg2-blog/decoder.png)
+
+A cherry on top is the dynamic resolution (naflex) variant, which is useful for downstream tasks sensitive to aspect ratio and resolution.
+
+In this notebook, we will use [google/siglip2-base-patch16-224](https://huggingface.co/google/siglip2-base-patch16-224) by default, but the same steps are applicable to other models of the SigLIP family.
+
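Before the step-by-step list, here is a rough sketch of what the PyTorch baseline with the default checkpoint can look like using the Hugging Face `transformers` API. It assumes a `transformers` release with SigLIP2 support; the image URL (the pipeline illustration above) and candidate labels are placeholders to replace with your own data.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-base-patch16-224"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open(requests.get(
    "https://user-images.githubusercontent.com/29454499/207773481-d77cacf8-6cdc-4765-a31b-a1669476d620.png",
    stream=True).raw).convert("RGB")
texts = ["a diagram", "a photo of a cat", "a photo of a dog"]

# SigLIP-family checkpoints expect fixed-length padded text
# (64 is the padded length used in the SigLIP2 examples on Hugging Face; adjust if needed).
inputs = processor(text=texts, images=image, padding="max_length", max_length=64,
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# SigLIP scores each image-text pair independently with a sigmoid (no softmax over labels).
probs = torch.sigmoid(outputs.logits_per_image)
print({t: float(p) for t, p in zip(texts, probs[0])})
```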
The notebook contains the following steps:

1. Instantiate model.
-1. Run PyTorch model inference.
-1. Convert the model to OpenVINO Intermediate Representation (IR) format.
-1. Run OpenVINO model.
-1. Apply post-training quantization using [NNCF](https://github.com/openvinotoolkit/nncf):
+2. Run PyTorch model inference.
+3. Convert the model to OpenVINO Intermediate Representation (IR) format.
+4. Run OpenVINO model.
+5. Apply post-training quantization using [NNCF](https://github.com/openvinotoolkit/nncf):
    1. Prepare dataset.
-    1. Quantize model.
-    1. Run quantized OpenVINO model.
-    1. Compare File Size.
-    1. Compare inference time of the FP16 IR and quantized models.
+    2. Quantize model.
+    3. Run quantized OpenVINO model.
+    4. Compare File Size.
+    5. Compare inference time of the FP16 IR and quantized models.

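As a sketch of steps 3-4 (conversion to OpenVINO IR and inference), assuming `model` and `inputs` from the previous snippet are available; this is illustrative rather than the notebook's exact code, and the output name/order of the converted model should be verified.

```python
import openvino as ov
import torch

# Convert the PyTorch model by tracing it with example inputs, then save the IR to disk.
ov_model = ov.convert_model(model, example_input=dict(inputs))
ov.save_model(ov_model, "siglip2-base-patch16-224.xml")

# Compile for a target device and run inference on the same preprocessed inputs.
core = ov.Core()
compiled_model = core.compile_model(ov_model, device_name="CPU")
result = compiled_model({k: v.numpy() for k, v in inputs.items()})

# Assumption: the first output holds the image-text logits; check the converted
# model's outputs to confirm for your transformers/OpenVINO versions.
probs = torch.sigmoid(torch.from_numpy(result[0]))
print(probs)
```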
The results of the SigLIP model's performance in zero-shot image classification from this notebook are demonstrated in the image below.
![image](https://github.com/openvinotoolkit/openvino_notebooks/assets/67365453/c4eb782c-0fef-4a89-a5c6-5cc43518490b)
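For step 5, post-training quantization is done with NNCF. Below is a minimal sketch of that flow, assuming `ov_model` from the previous snippet; `calibration_items` is a placeholder for an iterable of preprocessed samples (the notebook prepares a real calibration dataset).

```python
import nncf
import openvino as ov

def transform_fn(item):
    # Convert one calibration sample into the dict of inputs expected by the OpenVINO model.
    return {k: v.numpy() for k, v in item.items()}

# `calibration_items` is a placeholder iterable of preprocessed samples.
calibration_dataset = nncf.Dataset(calibration_items, transform_fn)
quantized_model = nncf.quantize(ov_model, calibration_dataset)
ov.save_model(quantized_model, "siglip2-base-patch16-224-int8.xml")
```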

notebooks/siglip-zero-shot-image-classification/siglip-zero-shot-image-classification.ipynb

+612 -112
Large diffs are not rendered by default.

0 commit comments
