[CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"


Demo username & password: osprey


Demo examples: a part of Along the River During the Qingming Festival (清明上河图), and Spirited Away (千与千寻).
💡 Some of our other multimodal-LLM projects may interest you ✨.

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing

TokenPacker: Efficient Visual Projector for Multimodal LLM
Wentong Li*, Yuqian Yuan*, Jian Liu, Dongqi Tang, Song Wang, Jianke Zhu, Lei Zhang

Updates 📌

[2025/2/27]🔥 Our new work, VideoRefer Suite, has been accepted to CVPR2025! This project focuses on video referring.

[2024/3/29]🔥 We released the Osprey-Chat model, which exhibits better conversation and image-level understanding & reasoning capabilities.

[2024/2/27]🔥 Osprey has been accepted to CVPR2024!

[2024/1/15]🔥 We released the evaluation code.

[2023/12/29]🔥 We released the training code and Osprey-724K dataset.

[2023/12/18]🔥 We released the code, the Osprey-7b model, and an online demo for Osprey.

What is Osprey 👀

Osprey is a mask-text instruction tuning approach that extends MLLMs by incorporating pixel-wise mask regions into language instructions, enabling fine-grained visual understanding. Given an input mask region, Osprey generates semantic descriptions, including both a short description and a detailed description.

Our Osprey can seamlessly integrate with SAM in point-prompt, box-prompt, and segment-everything modes to generate the semantics associated with specific parts or objects.
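
As a rough illustration of the segment-everything mode, the sketch below uses SAM's automatic mask generator to produce candidate mask regions of the kind Osprey consumes; the checkpoint path follows the demo layout described later, the image path is a placeholder, and the Osprey-side inference call is intentionally omitted.

# Sketch only: produce candidate masks with SAM's "everything" mode.
# The Osprey inference call that would consume these masks is not shown here.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="demo/checkpoints/sam_vit_b_01ec64.pth")
# sam.to("cuda")  # optional: move SAM to GPU

mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
masks = mask_generator.generate(image)  # list of dicts; "segmentation" is an HxW bool mask
print(f"SAM proposed {len(masks)} candidate regions")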

Watch Video Demo 🎥

Try Our Demo 🕹️

Online demo

Click 👇 to try our demo online.

web demo

username: osprey
password: osprey

Demo modes: Point, Box, and Everything.

Offline demo

💻 Requirements: this demo needs about 17GB of GPU memory in total: ~15GB for Osprey and ~2GB for SAM.

  1. Install Gradio-Osprey-Demo.
  2. Install Segment Anything:
pip install git+https://github.com/facebookresearch/segment-anything.git
  3. Download all the checkpoints:

The default path of all the checkpoints:

├── demo
    ├── checkpoints
    │   ├── Osprey_7b
    │   └── sam_vit_b_01ec64.pth 
    └── open_clip_pytorch_model.bin

Alternatively, change "mm_vision_tower" in the config.json of the Osprey-7b model to the absolute path of open_clip_pytorch_model.bin (a small scripted example is shown after these steps).

  4. Run app.py:
cd demo
python app.py --model checkpoints/Osprey_7b
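
If you prefer to script the config edit from step 3 rather than do it by hand, the minimal sketch below rewrites "mm_vision_tower" with Python's json module; the config path assumes the default checkpoint layout shown above.

# Sketch, assuming the default layout above: point "mm_vision_tower" in the
# Osprey-7b config.json to the absolute path of open_clip_pytorch_model.bin.
import json
import os

config_path = "demo/checkpoints/Osprey_7b/config.json"
with open(config_path) as f:
    config = json.load(f)

config["mm_vision_tower"] = os.path.abspath("demo/open_clip_pytorch_model.bin")

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)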

Install 🛠️

  1. Clone this repository and navigate to the Osprey folder:
git clone https://github.com/CircleRadon/Osprey.git
cd Osprey
  2. Install packages:
conda create -n osprey python=3.10 -y
conda activate osprey
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training:
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
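
As a quick, optional sanity check of the environment (assuming the editable install exposes the package as osprey, which matches the repository layout but is an assumption here):

# Optional sanity check. "osprey" as the import name is an assumption based on
# the repository's package directory; adjust if your install differs.
import torch
import flash_attn  # installed via "pip install flash-attn"
import osprey      # assumption: provided by "pip install -e ."

print("CUDA available:", torch.cuda.is_available())
print("Environment looks ready for Osprey training/inference.")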

Dataset 🌟

All the datasets for training can be found in Dataset preparation.

Osprey-724K: 🤗Hugging Face

Osprey-724K is an instruction dataset of mask-text pairs, containing around 724K GPT-generated multimodal dialogues that encourage MLLMs toward fine-grained, pixel-level image understanding. It contains object-level, part-level, and additional instruction samples for robustness and flexibility.
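
For convenience, the dataset files can be fetched with huggingface_hub; the repo_id below is a placeholder, so substitute the actual id from the Hugging Face link above.

# Sketch: download the Osprey-724K files from the Hugging Face Hub.
# "REPLACE/Osprey-724K" is a placeholder repo_id; use the one from the link above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="REPLACE/Osprey-724K",  # placeholder
    repo_type="dataset",
    local_dir="data/Osprey-724K",
)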

Training 🚀

  • Stage1: Image-Text Alignment Pre-training

    • The pretrained projector weights for Convnext-large-CLIP can be found in projector weights.
  • Stage2: Mask-Text Alignment Pre-training

    • Download vicuna-7b-v1.5.
    • Download projector weights trained in stage1: projector weights.
    • Set model_name_or_path in stage2.sh to the path of vicuna-7b-v1.5.
    • Set pretrain_mm_mlp_adapter in stage2.sh to the path of mm_projector.
    • Set vision_tower in stage2.sh to the path of Convnext-large-CLIP-model.
    • Run sh scripts/stage2.sh.
  • Stage3: End-to-End Fine-tuning

    • Set model_name_or_path in stage3.sh to the path of the stage2 checkpoint.
    • Set vision_tower in stage3.sh to the path of Convnext-large-CLIP-model.
    • Run sh scripts/stage3.sh.

Checkpoints 🤖

Osprey-7b model🤗: model

We also provide the intermediate stage2 checkpoint; please check model.

Evaluation 🔎

See evaluation for details.

TODO List 📝

  • Release the checkpoints, inference codes and demo.
  • Release the dataset and training scripts.
  • Release the evaluation code.
  • Release the code for data generation pipeline.

Acknowledgement 💌

  • LLaVA-v1.5: the codebase we built upon.
  • SAM: the demo uses segmentation results from SAM as input to Osprey.

BibTeX 🖊️

@misc{Osprey,
  title={Osprey: Pixel Understanding with Visual Instruction Tuning},
  author={Yuqian Yuan and Wentong Li and Jian Liu and Dongqi Tang and Xinjie Luo and Chi Qin and Lei Zhang and Jianke Zhu},
  year={2023},
  eprint={2312.10032},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
