Merge pull request #120 from EvolvingLMMs-Lab/pufanyi/hf_dataset_docs
Add docs for datasets upload to HF
Showing 3 changed files with 294 additions and 21 deletions.
@@ -0,0 +1,292 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Contribute Your Models and Datasets\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/EvolvingLMMs-Lab/lmms-eval/blob/main/tools/make_image_hf_dataset.ipynb)\n",
"\n",
"This notebook will guide you through building a Hugging Face dataset in the correct format: stored as Parquet files and visualizable on the Hugging Face dataset hub.\n",
"\n",
"We will take the dataset [`lmms-lab/VQAv2_TOY`](https://huggingface.co/datasets/lmms-lab/VQAv2_TOY) as an example and convert it to the proper format."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparation\n",
"\n",
"We need to install the `datasets` library to create the dataset and `Pillow` to handle images."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"!pip install datasets Pillow"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also need to log in to Hugging Face to upload the dataset. Go to the [Hugging Face token settings](https://huggingface.co/settings/tokens) to get your API token."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"!huggingface-cli login --token hf_YOUR_HF_TOKEN  # replace hf_YOUR_HF_TOKEN with your own Hugging Face token"
]
},
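{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, you can log in programmatically with `huggingface_hub.login` (a minimal sketch; `hf_YOUR_HF_TOKEN` is a placeholder for your own token):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from huggingface_hub import login\n",
"\n",
"# Equivalent to the shell command above; the token string is a placeholder.\n",
"login(token=\"hf_YOUR_HF_TOKEN\")"
]
},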
{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"source": [
"## Download Dataset\n",
"\n",
"We have uploaded a zip file of the dataset to [Hugging Face](https://huggingface.co/datasets/pufanyi/VQAv2_TOY/tree/main/source_data) for download. This dataset is a subset of [VQAv2](https://visualqa.org/), with 20 entries from each of the `val`, `test`, and `test-dev` splits, to keep the download small."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"!wget https://huggingface.co/datasets/lmms-lab/VQAv2_TOY/resolve/main/source_data/sample_data.zip -P data\n",
"!unzip data/sample_data.zip -d data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can open `data/questions` to see how the dataset is organized. The toy `VQAv2` dataset is structured as follows:\n",
"\n",
"```json\n",
"{\n",
"    \"info\": { /* some information */ },\n",
"    \"task_type\": \"TASK_TYPE\",\n",
"    \"data_type\": \"mscoco\",\n",
"    \"license\": { /* some license */ },\n",
"    \"questions\": [\n",
"        {\n",
"            \"image_id\": 262144, // integer id of the image\n",
"            \"question\": \"Is the ball flying towards the batter?\",\n",
"            \"question_id\": 262144000\n",
"        },\n",
"        /* ... */\n",
"    ]\n",
"}\n",
"```"
]
},
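{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, we can load one of the question files and print its top-level keys and the first question (a minimal sketch assuming the `val` split file downloaded above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Inspect the raw question file to confirm the structure described above.\n",
"with open(\"data/questions/vqav2_toy_questions_val2014.json\", \"r\") as f:\n",
"    raw = json.load(f)\n",
"\n",
"print(list(raw.keys()))\n",
"print(raw[\"questions\"][0])"
]
},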
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define Dataset Features _(Optional<sup>*</sup>)_\n",
"\n",
"You can define the features of the dataset. For more details, please refer to the [official documentation](https://huggingface.co/docs/datasets/en/about_dataset_features).\n",
"\n",
"<sup>*</sup> _Note that if the dataset features are consistent and all entries in your dataset table are non-null **for all splits of data**, you can skip this step._"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import datasets\n",
"\n",
"features = datasets.Features(\n",
"    {\n",
"        \"question_id\": datasets.Value(\"int64\"),\n",
"        \"question\": datasets.Value(\"string\"),\n",
"        \"image_id\": datasets.Value(\"string\"),\n",
"        \"image\": datasets.Image(),\n",
"    }\n",
")"
]
},
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Define Data Generator\n", | ||
"\n", | ||
"We use [`datasets.Dataset.from_generator`](https://huggingface.co/docs/datasets/v2.20.0/en/package_reference/main_classes#datasets.Dataset.from_generator) to create the dataset.\n", | ||
"\n", | ||
"The generator function should `yield` dictionaries with the keys corresponding to the dataset features. This can save memory when loading large datasets.\n", | ||
"\n", | ||
"For the image data, we can convert the image to [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html) object.\n", | ||
"\n", | ||
"Note that if some columns are missing in some splits of the dataset (for example, the `answer` column is usually missing in the `test` split), we need to set these columns to null to ensure that all splits have the same features." | ||
] | ||
}, | ||
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"from PIL import Image\n",
"\n",
"\n",
"def generator(qa_file, image_folder, image_prefix):\n",
"    with open(qa_file, \"r\") as f:\n",
"        data = json.load(f)\n",
"        qa = data[\"questions\"]\n",
"\n",
"    for q in qa:\n",
"        image_id = q[\"image_id\"]\n",
"        image_path = os.path.join(image_folder, f\"{image_prefix}_{image_id:012}.jpg\")\n",
"        q[\"image\"] = Image.open(image_path)\n",
"        yield q"
]
},
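{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before generating the full dataset, it can help to preview a single example from the generator (a quick check using the `val` split paths used below):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pull one example to verify the keys and the image loading.\n",
"example = next(\n",
"    generator(\n",
"        qa_file=\"data/questions/vqav2_toy_questions_val2014.json\",\n",
"        image_folder=\"data/images\",\n",
"        image_prefix=\"COCO_val2014\",\n",
"    )\n",
")\n",
"print(example[\"question_id\"], example[\"question\"])\n",
"example[\"image\"]  # displayed inline by the notebook"
]
},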
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate Dataset\n",
"\n",
"We generate the dataset using the generator function.\n",
"\n",
"Note that if you skipped the step of defining dataset features, there is no need to pass the `features` argument; the features are inferred from the data automatically."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"NUM_PROC = 32  # number of processes to use for multiprocessing; set to 1 for no multiprocessing\n",
"\n",
"data_val = datasets.Dataset.from_generator(\n",
"    generator,\n",
"    gen_kwargs={\n",
"        \"qa_file\": \"data/questions/vqav2_toy_questions_val2014.json\",\n",
"        \"image_folder\": \"data/images\",\n",
"        \"image_prefix\": \"COCO_val2014\",\n",
"    },\n",
"    # For this dataset, there is no need to specify the features, as all cells are non-null and all splits have the same schema\n",
"    # features=features,\n",
"    num_proc=NUM_PROC,\n",
")\n",
"\n",
"data_test = datasets.Dataset.from_generator(\n",
"    generator,\n",
"    gen_kwargs={\n",
"        \"qa_file\": \"data/questions/vqav2_toy_questions_test2015.json\",\n",
"        \"image_folder\": \"data/images\",\n",
"        \"image_prefix\": \"COCO_test2015\",\n",
"    },\n",
"    # features=features,\n",
"    num_proc=NUM_PROC,\n",
")\n",
"\n",
"data_test_dev = datasets.Dataset.from_generator(\n",
"    generator,\n",
"    gen_kwargs={\n",
"        \"qa_file\": \"data/questions/vqav2_toy_questions_test-dev2015.json\",\n",
"        \"image_folder\": \"data/images\",\n",
"        \"image_prefix\": \"COCO_test2015\",\n",
"    },\n",
"    # features=features,\n",
"    num_proc=NUM_PROC,\n",
")"
]
},
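{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can inspect the generated splits to confirm the schema and row counts before uploading:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Each split should report the same features and 20 rows.\n",
"print(data_val)\n",
"print(data_test)\n",
"print(data_test_dev)\n",
"print(data_val.features)"
]
},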
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataset Upload\n",
"\n",
"Finally, we group the splits into a single dataset and upload it to the Hugging Face dataset hub."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = datasets.DatasetDict({\"val\": data_val, \"test\": data_test, \"test_dev\": data_test_dev})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.push_to_hub(\"lmms-lab/VQAv2_TOY\")  # replace lmms-lab with your username"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, you can check the dataset on the [Hugging Face dataset hub](https://huggingface.co/datasets/lmms-lab/VQAv2_TOY)."
]
},
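{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a final (optional) check, you can load the uploaded dataset back from the hub:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Loads the dataset back from the hub; replace lmms-lab with your username.\n",
"ds = datasets.load_dataset(\"lmms-lab/VQAv2_TOY\")\n",
"print(ds)"
]
}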
],
"metadata": {
"kernelspec": {
"display_name": "lmms-eval",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}