All datasets should be under a dataset root folder (`--data-path`). An example of a dataset root folder is here. Some directories already contain processed data. We fill the empty or partial directories by following the instructions below.
```
├── activitynet-captions
│   └── activitynet_frames
│       └── v_00Dk03Jr70M
│       ...
└── youcook2
    └── youcook2_frames
        └── 01lB162koHA
        ...
```
- Download ActivityNet frames from the ActivityNet Challenge (link) and put them under `activitynet-captions`.
- From the dataset root:

```
cd activitynet-captions
tar -xf frames_activitynet_5fps_res_320x240.tar
wget https://cs.stanford.edu/people/ranjaykrishna/densevid/captions.zip
unzip -qq captions.zip
```
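To confirm the extraction matches the tree above, here is an optional sanity-check sketch, run from the dataset root; it only assumes the frame layout shown in the listing above:

```
# Count the extracted per-video frame folders and spot-check one of them.
cd activitynet-captions
echo "video folders: $(find activitynet_frames -mindepth 1 -maxdepth 1 -type d | wc -l)"
# Frames should be numbered JPEGs (5 fps, 320x240).
ls activitynet_frames/v_00Dk03Jr70M | head
```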
- Download videos from YouCook2.
- Extract video frames and put them under `youcook2/youcook2_frames`. For `ffmpeg`, we use the following command (a batch-extraction sketch follows below):

```
ffmpeg -i <video_path> -y -an -qscale 0 -vf "fps=5,scale=320:240" <output_folder>/%06d.jpg
```
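The command above extracts frames for a single video. Below is a minimal batch sketch, run from the dataset root. It assumes the downloaded videos sit in a hypothetical `youcook2/raw_videos` folder as `.mp4` files; adjust the glob and paths to your actual layout:

```
# Extract frames for every downloaded video into youcook2_frames/<video_id>/.
cd youcook2
for video in raw_videos/*.mp4; do
  name=$(basename "$video" .mp4)   # e.g. 01lB162koHA
  mkdir -p "youcook2_frames/$name"
  ffmpeg -i "$video" -y -an -qscale 0 -vf "fps=5,scale=320:240" \
    "youcook2_frames/$name/%06d.jpg"
done
```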
```
└── nextqa
```
- Download videos for NExT-QA and put them under `nextqa`.
- From the dataset root:

```
cd nextqa
unzip -qq NExTVideo.zip
```
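Optionally, a quick check from the dataset root that the archive unpacked; the `NExTVideo` folder name is inferred from the zip file name:

```
# List a few extracted video files to confirm the unzip succeeded.
cd nextqa
find NExTVideo -type f | head
```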
```
└── temporal_reasoning
```
The processed data for our Reasoning Temporal Localization (RTL) task is already under the `temporal_reasoning` folder. We use the ActivityNet videos prepared above, so no further preparation is needed.
```
├── LLaVA-Instruct-150K
└── coco
```
Follow the LLaVA data preparation. From the dataset root:

```
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K
mkdir coco
cd coco
wget http://images.cocodataset.org/zips/train2017.zip
unzip -qq train2017.zip
```
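As an optional sanity-check sketch, run from the dataset root: COCO train2017 contains 118,287 images, and the cloned repository should hold the LLaVA instruction JSON files:

```
# Verify the COCO image count and that the instruction JSONs were cloned.
find coco/train2017 -name '*.jpg' | wc -l   # expect 118287
ls LLaVA-Instruct-150K/*.json
```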