[CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"


Demo username & password: osprey


Demo examples: a part of Along the River During the Qingming Festival (清明上河图), and Spirited Away (千与千寻).
💡 Some of our other multimodal-LLM projects may interest you ✨.

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing

TokenPacker: Efficient Visual Projector for Multimodal LLM
Wentong Li*, Yuqian Yuan*, Jian Liu, Dongqi Tang, Song Wang, Jianke Zhu, Lei Zhang

Updates 📌

[2025/2/27]🔥 Our new work, VideoRefer Suite, has been accepted to CVPR2025! This project focuses on video referring.

[2024/3/29]🔥 We released the Osprey-Chat model, which exhibits better conversation and image-level understanding & reasoning capabilities.

[2024/2/27]🔥 Osprey has been accepted to CVPR2024!

[2024/1/15]🔥 We released the evaluation code.

[2023/12/29]🔥 We released the training code and Osprey-724K dataset.

[2023/12/18]🔥 We released the code, the Osprey-7b model, and an online demo for Osprey.

What is Osprey 👀

Osprey is a mask-text instruction tuning approach that extends MLLMs by incorporating pixel-wise mask regions into language instructions, enabling fine-grained visual understanding. Given an input mask region, Osprey generates semantic descriptions, including both a short description and a detailed description.

Our Osprey can seamlessly integrate with SAM in point-prompt, box-prompt, and segment-everything modes to generate the semantics associated with specific parts or objects.
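
As a rough illustration of the segment-everything mode, the sketch below uses SAM's automatic mask generator to produce candidate mask regions of the kind Osprey consumes; the checkpoint path follows the demo layout described later, the image path is a placeholder, and the Osprey-side inference call is intentionally omitted.

# Sketch only: produce candidate masks with SAM's "everything" mode.
# The Osprey inference call that would consume these masks is not shown here.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="demo/checkpoints/sam_vit_b_01ec64.pth")
# sam.to("cuda")  # optional: move SAM to GPU

mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
masks = mask_generator.generate(image)  # list of dicts; "segmentation" is an HxW bool mask
print(f"SAM proposed {len(masks)} candidate regions")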

Watch Video Demo 🎥

Try Our Demo 🕹️

Online demo

Click 👇 to try our demo online.

web demo

username: osprey
password: osprey

Demo modes: Point, Box, and Everything.

Offline demo

💻 Requirements: this demo needs about 17GB of GPU memory in total: ~15GB for Osprey and ~2GB for SAM.

  1. Install Gradio-Osprey-Demo.
  2. Install Segment Anything:
pip install git+https://github.com/facebookresearch/segment-anything.git
  3. Download all the checkpoints:

The default path of all the checkpoints:

├── demo
    ├── checkpoints
    │   ├── Osprey_7b
    │   └── sam_vit_b_01ec64.pth 
    └── open_clip_pytorch_model.bin

Alternatively, change "mm_vision_tower" in the config.json of the Osprey-7b model to the absolute path of open_clip_pytorch_model.bin (a small scripted example is shown after these steps).

  4. Run app.py:
cd demo
python app.py --model checkpoints/Osprey_7b
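
If you prefer to script the config edit from step 3 rather than do it by hand, the minimal sketch below rewrites "mm_vision_tower" with Python's json module; the config path assumes the default checkpoint layout shown above.

# Sketch, assuming the default layout above: point "mm_vision_tower" in the
# Osprey-7b config.json to the absolute path of open_clip_pytorch_model.bin.
import json
import os

config_path = "demo/checkpoints/Osprey_7b/config.json"
with open(config_path) as f:
    config = json.load(f)

config["mm_vision_tower"] = os.path.abspath("demo/open_clip_pytorch_model.bin")

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)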

Install 🛠️

  1. Clone this repository and navigate to the Osprey folder:
git clone https://github.com/CircleRadon/Osprey.git
cd Osprey
  2. Install packages:
conda create -n osprey python=3.10 -y
conda activate osprey
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training:
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
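
As a quick, optional sanity check of the environment (assuming the editable install exposes the package as osprey, which matches the repository layout but is an assumption here):

# Optional sanity check. "osprey" as the import name is an assumption based on
# the repository's package directory; adjust if your install differs.
import torch
import flash_attn  # installed via "pip install flash-attn"
import osprey      # assumption: provided by "pip install -e ."

print("CUDA available:", torch.cuda.is_available())
print("Environment looks ready for Osprey training/inference.")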

Dataset 🌟

All the datasets for training can be found in Dataset preparation.

Osprey-724K: 🤗Hugging Face

Osprey-724K is an instruction dataset of mask-text pairs, containing around 724K GPT-generated multimodal dialogues that encourage MLLMs toward fine-grained, pixel-level image understanding. It contains object-level, part-level, and additional instruction samples for robustness and flexibility.
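
For convenience, the dataset files can be fetched with huggingface_hub; the repo_id below is a placeholder, so substitute the actual id from the Hugging Face link above.

# Sketch: download the Osprey-724K files from the Hugging Face Hub.
# "REPLACE/Osprey-724K" is a placeholder repo_id; use the one from the link above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="REPLACE/Osprey-724K",  # placeholder
    repo_type="dataset",
    local_dir="data/Osprey-724K",
)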

Training 🚀

  • Stage1: Image-Text Alignment Pre-training

    • The pretrained projector weights for Convnext-large-CLIP can be found in projector weights.
  • Stage2: Mask-Text Alignment Pre-training

    • Download vicuna-7b-v1.5.
    • Download projector weights trained in stage1: projector weights.
    • Set model_name_or_path in stage2.sh to the path of vicuna-7b-v1.5.
    • Set pretrain_mm_mlp_adapter in stage2.sh to the path of mm_projector.
    • Set vision_tower in stage2.sh to the path of Convnext-large-CLIP-model.
    • Run sh scripts/stage2.sh.
  • Stage3: End-to-End Fine-tuning

    • Set model_name_or_path in stage3.sh to the path of the stage2 checkpoint.
    • Set vision_tower in stage3.sh to the path of Convnext-large-CLIP-model.
    • Run sh scripts/stage3.sh.

Checkpoints 🤖

Osprey-7b model🤗: model

We also provide the intermediate stage2 checkpoint; please check model.

Evaluation 🔎

See evaluation for details.

TODO List 📝

  • Release the checkpoints, inference codes and demo.
  • Release the dataset and training scripts.
  • Release the evaluation code.
  • Release the code for data generation pipeline.

Acknowledgement 💌

  • LLaVA-v1.5: the codebase we built upon.
  • SAM: the demo uses segmentation results from SAM as input to Osprey.

BibTeX 🖊️

@misc{Osprey,
  title={Osprey: Pixel Understanding with Visual Instruction Tuning},
  author={Yuqian Yuan and Wentong Li and Jian Liu and Dongqi Tang and Xinjie Luo and Chi Qin and Lei Zhang and Jianke Zhu},
  year={2023},
  eprint={2312.10032},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
