diff --git a/docs/README.md b/docs/README.md
index 020e351bc..2c9123632 100755
--- a/docs/README.md
+++ b/docs/README.md
@@ -8,4 +8,5 @@ Majority of this documentation is adapted from [lm-eval-harness](https://github.
* To learn about the command line flags, see the [commands](commands.md)
 * To learn how to add a new model, see the [Model Guide](model_guide.md).
-* For a crash course on adding new tasks to the library, see our [Task Guide](task_guide.md).
\ No newline at end of file
+* For a crash course on adding new tasks to the library, see our [Task Guide](task_guide.md).
+* To upload your datasets to the Hugging Face Hub in the correct format, with dataset viewer support, see the [tools](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/pufanyi/hf_dataset_docs/tools) directory.
diff --git a/tools/make_image_hf_dataset.ipynb b/tools/make_image_hf_dataset.ipynb
new file mode 100755
index 000000000..89fa0a71f
--- /dev/null
+++ b/tools/make_image_hf_dataset.ipynb
@@ -0,0 +1,292 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Contribute Your Models and Datasets\n",
+ "\n",
+ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/EvolvingLMMs-Lab/lmms-eval/blob/main/tools/make_image_hf_dataset.ipynb)\n",
+ "\n",
+ "This notebook walks you through creating a correctly formatted Hugging Face dataset, stored as Parquet and viewable in the Hugging Face dataset hub.\n",
+ "\n",
+ "We will take the example of the dataset [`lmms-lab/VQAv2_TOY`](https://huggingface.co/datasets/lmms-lab/VQAv2_TOY) and convert it to the proper format."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Preparation\n",
+ "\n",
+ "We need to install the `datasets` library to create the dataset and `Pillow` to handle images."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "vscode": {
+ "languageId": "shellscript"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "!pip install datasets Pillow"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next, log in to Hugging Face so you can upload the dataset. Go to the [Hugging Face settings page](https://huggingface.co/settings/tokens) to get your API token."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "vscode": {
+ "languageId": "shellscript"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "!huggingface-cli login --token hf_YOUR_HF_TOKEN # replace hf_YOUR_HF_TOKEN with your own Hugging Face token."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "plaintext"
+ }
+ },
+ "source": [
+ "## Download Dataset\n",
+ "\n",
+ "We have uploaded the zip file of the dataset to [Hugging Face](https://huggingface.co/datasets/pufanyi/VQAv2_TOY/tree/main/source_data) for download. This dataset is a subset of the [VQAv2](https://visualqa.org/) dataset, containing $20$ entries from each of the `val`, `test`, and `test-dev` splits, to keep the download small."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "vscode": {
+ "languageId": "shellscript"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "!wget https://huggingface.co/datasets/lmms-lab/VQAv2_TOY/resolve/main/source_data/sample_data.zip -P data\n",
+ "!unzip data/sample_data.zip -d data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can open `data/questions` to inspect how the dataset is organized. The toy `VQAv2` dataset is structured as follows:\n",
+ "\n",
+ "```json\n",
+ "{\n",
+ " \"info\": { /* some information */ },\n",
+ " \"task_type\": \"TASK_TYPE\",\n",
+ " \"data_type\": \"mscoco\",\n",
+ " \"license\": { /* some license */ },\n",
+ " \"questions\": [\n",
+ " {\n",
+ " \"image_id\": 262144, // integer id of the image\n",
+ " \"question\": \"Is the ball flying towards the batter?\",\n",
+ " \"question_id\": 262144000\n",
+ " },\n",
+ " /* ... */\n",
+ " ]\n",
+ "}\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Define Dataset Features _(Optional*)_\n",
+ "\n",
+ "You can define the features of the dataset. For more details, please refer to the [official documentation](https://huggingface.co/docs/datasets/en/about_dataset_features).\n",
+ "\n",
+ "* _Note that if the dataset features are consistent and all entries in your dataset table are non-null **for all splits of data**, you can skip this step._"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import datasets\n",
+ "\n",
+ "features = datasets.Features(\n",
+ " {\n",
+ " \"question_id\": datasets.Value(\"int64\"),\n",
+ " \"question\": datasets.Value(\"string\"),\n",
+ " \"image_id\": datasets.Value(\"string\"),\n",
+ " \"image\": datasets.Image(),\n",
+ " }\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Define Data Generator\n",
+ "\n",
+ "We use [`datasets.Dataset.from_generator`](https://huggingface.co/docs/datasets/v2.20.0/en/package_reference/main_classes#datasets.Dataset.from_generator) to create the dataset.\n",
+ "\n",
+ "The generator function should `yield` dictionaries with the keys corresponding to the dataset features. This can save memory when loading large datasets.\n",
+ "\n",
+ "For the image data, we can convert the image to [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html) object.\n",
+ "\n",
+ "Note that if some columns are missing in some splits of the dataset (for example, the `answer` column is usually missing in the `test` split), we need to set these columns to null to ensure that all splits have the same features."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import json\n",
+ "from PIL import Image\n",
+ "\n",
+ "\n",
+ "def generator(qa_file, image_folder, image_prefix):\n",
+ " with open(qa_file, \"r\") as f:\n",
+ " data = json.load(f)\n",
+ " qa = data[\"questions\"]\n",
+ "\n",
+ " for q in qa:\n",
+ " image_id = q[\"image_id\"]\n",
+ " image_path = os.path.join(image_folder, f\"{image_prefix}_{image_id:012}.jpg\")\n",
+ " q[\"image\"] = Image.open(image_path)\n",
+ " yield q"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Generate Dataset\n",
+ "\n",
+ "We generate the dataset using the generator function.\n",
+ "\n",
+ "Note that if you skipped the step of defining dataset features, there is no need to pass the `features` argument; `datasets` will infer the features from the data automatically."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "NUM_PROC = 32 # number of processes to use for multiprocessing, set to 1 for no multiprocessing\n",
+ "\n",
+ "data_val = datasets.Dataset.from_generator(\n",
+ " generator,\n",
+ " gen_kwargs={\n",
+ " \"qa_file\": \"data/questions/vqav2_toy_questions_val2014.json\",\n",
+ " \"image_folder\": \"data/images\",\n",
+ " \"image_prefix\": \"COCO_val2014\",\n",
+ " },\n",
+ " # For this dataset, there is no need to specify the features, as all cells are non-null and all splits have the same schema\n",
+ " # features=features,\n",
+ " num_proc=NUM_PROC,\n",
+ ")\n",
+ "\n",
+ "data_test = datasets.Dataset.from_generator(\n",
+ " generator,\n",
+ " gen_kwargs={\n",
+ " \"qa_file\": \"data/questions/vqav2_toy_questions_test2015.json\",\n",
+ " \"image_folder\": \"data/images\",\n",
+ " \"image_prefix\": \"COCO_test2015\",\n",
+ " },\n",
+ " # features=features,\n",
+ " num_proc=NUM_PROC,\n",
+ ")\n",
+ "\n",
+ "data_test_dev = datasets.Dataset.from_generator(\n",
+ " generator,\n",
+ " gen_kwargs={\n",
+ " \"qa_file\": \"data/questions/vqav2_toy_questions_test-dev2015.json\",\n",
+ " \"image_folder\": \"data/images\",\n",
+ " \"image_prefix\": \"COCO_test2015\",\n",
+ " },\n",
+ " # features=features,\n",
+ " num_proc=NUM_PROC,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Dataset Upload\n",
+ "\n",
+ "Finally, we group the splits into a `DatasetDict` and upload it to the Hugging Face dataset hub."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data = datasets.DatasetDict({\"val\": data_val, \"test\": data_test, \"test_dev\": data_test_dev})"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data.push_to_hub(\"lmms-lab/VQAv2_TOY\") # replace lmms-lab with your own username"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now, you can check the dataset on the [Hugging Face dataset hub](https://huggingface.co/datasets/lmms-lab/VQAv2_TOY)."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "lmms-eval",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.9"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
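The generator in the notebook above resolves each question's image path from its integer id. The essential logic can be sketched with the standard library alone; `iter_records` and the `image_path` field are hypothetical names for illustration (the real notebook opens each file with `PIL` and yields an image object instead). The key detail is the COCO file-naming convention: image ids are zero-padded to 12 digits.

```python
# Stdlib-only sketch of the notebook's generator pattern.
# Assumption: iter_records and image_path are illustrative names,
# not part of the notebook itself.
import json
import os


def iter_records(qa_file, image_folder, image_prefix):
    """Yield one record per question, attaching the matching image path."""
    with open(qa_file, "r") as f:
        data = json.load(f)
    for q in data["questions"]:
        # COCO convention: zero-pad the integer image id to 12 digits,
        # e.g. 262144 -> COCO_val2014_000000262144.jpg
        q["image_path"] = os.path.join(
            image_folder, f"{image_prefix}_{q['image_id']:012}.jpg"
        )
        yield q
```

Because it is a generator, records are produced lazily, which is what lets `Dataset.from_generator` build large datasets without holding everything in memory at once.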
diff --git a/tools/make_hf_dataset.ipynb b/tools/make_video_hf_dataset.ipynb
similarity index 93%
rename from tools/make_hf_dataset.ipynb
rename to tools/make_video_hf_dataset.ipynb
index 9336331a2..dd801eb81 100755
--- a/tools/make_hf_dataset.ipynb
+++ b/tools/make_video_hf_dataset.ipynb
@@ -162,26 +162,6 @@
"dataset = Dataset.from_pandas(df_items, features=features)"
]
},
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Token is valid (permission: write).\n",
- "Your token has been saved in your configured git credential helpers (store).\n",
- "Your token has been saved to /home/tiger/.cache/huggingface/token\n",
- "Login successful\n"
- ]
- }
- ],
- "source": [
- "!huggingface-cli login --token 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 --add-to-git-credential"
- ]
- },
{
"cell_type": "code",
"execution_count": 9,
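The image-dataset notebook above notes that every split must expose the same features, padding columns a split lacks (such as answers in a test split) with nulls. A minimal stdlib sketch of that alignment step, under the assumption that `align_schema` and the sample column names are hypothetical:

```python
# Hypothetical helper (not from the notebooks): give every split an
# identical set of columns, filling missing values with None so all
# splits share one schema before the Hugging Face dataset is built.
def align_schema(records, columns):
    """Return records restricted/padded to exactly `columns` keys."""
    return [{col: rec.get(col) for col in columns} for rec in records]


columns = ["question_id", "question", "answer"]
val = [{"question_id": 1, "question": "Is it red?", "answer": "yes"}]
test = [{"question_id": 2, "question": "Is it blue?"}]  # answers withheld

val_aligned = align_schema(val, columns)
test_aligned = align_schema(test, columns)
```

With every split aligned this way, `datasets` can infer one consistent schema, and a `DatasetDict` built from the splits will upload without feature mismatches.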