diff --git a/docs/README.md b/docs/README.md
index 020e351bc..2c9123632 100755
--- a/docs/README.md
+++ b/docs/README.md
@@ -8,4 +8,5 @@ Majority of this documentation is adapted from [lm-eval-harness](https://github.
 * To learn about the command line flags, see the [commands](commands.md)
 * To learn how to add a new model, see the [Model Guide](model_guide.md).
-* For a crash course on adding new tasks to the library, see our [Task Guide](task_guide.md).
\ No newline at end of file
+* For a crash course on adding new tasks to the library, see our [Task Guide](task_guide.md).
+* To upload your datasets to the Hugging Face Hub in the correct format, with dataset viewer support, please refer to [tools](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/pufanyi/hf_dataset_docs/tools).
diff --git a/tools/make_image_hf_dataset.ipynb b/tools/make_image_hf_dataset.ipynb
new file mode 100755
index 000000000..89fa0a71f
--- /dev/null
+++ b/tools/make_image_hf_dataset.ipynb
@@ -0,0 +1,292 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Contribute Your Models and Datasets\n",
+    "\n",
+    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/EvolvingLMMs-Lab/lmms-eval/blob/main/tools/make_image_hf_dataset.ipynb)\n",
+    "\n",
+    "This notebook walks you through building a Hugging Face dataset in the proper Parquet format so that it can be viewed in the Hugging Face dataset hub.\n",
+    "\n",
+    "As a running example, we will convert the dataset [`lmms-lab/VQAv2_TOY`](https://huggingface.co/datasets/lmms-lab/VQAv2_TOY) into this format."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Preparation\n",
+    "\n",
+    "We need to install the `datasets` library to create the dataset and `Pillow` to handle images."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "shellscript"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "!pip install datasets Pillow"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We also need to log in to Hugging Face to upload the dataset. You can get your API token from the [Hugging Face settings page](https://huggingface.co/settings/tokens)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "shellscript"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "!huggingface-cli login --token hf_YOUR_HF_TOKEN  # replace hf_YOUR_HF_TOKEN with your own Hugging Face token"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "source": [
+    "## Download Dataset\n",
+    "\n",
+    "We have uploaded a zip file of the dataset to [Hugging Face](https://huggingface.co/datasets/pufanyi/VQAv2_TOY/tree/main/source_data) for download. This dataset is a subset of the [VQAv2](https://visualqa.org/) dataset, with 20 entries each from the `val`, `test`, and `test-dev` splits, to keep the download small."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "shellscript"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "!wget https://huggingface.co/datasets/lmms-lab/VQAv2_TOY/resolve/main/source_data/sample_data.zip -P data\n",
+    "!unzip data/sample_data.zip -d data"
+   ]
+  },
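+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before going further, it is worth a quick look at what was extracted. The listing below assumes the archive unpacks into `data/questions` (the question JSON files) and `data/images` (the images)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "shellscript"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "!ls data/questions\n",
+    "!ls data/images | head -n 5"
+   ]
+  },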
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can open `data/questions` to take a look at how the dataset is organized. The toy `VQAv2` dataset is structured as follows:\n",
+    "\n",
+    "```json\n",
+    "{\n",
+    "    \"info\": { /* some information */ },\n",
+    "    \"task_type\": \"TASK_TYPE\",\n",
+    "    \"data_type\": \"mscoco\",\n",
+    "    \"license\": { /* some license */ },\n",
+    "    \"questions\": [\n",
+    "        {\n",
+    "            \"image_id\": 262144, // integer id of the image\n",
+    "            \"question\": \"Is the ball flying towards the batter?\",\n",
+    "            \"question_id\": 262144000\n",
+    "        },\n",
+    "        /* ... */\n",
+    "    ]\n",
+    "}\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Define Dataset Features _(Optional*)_\n",
+    "\n",
+    "You can explicitly define the features of the dataset. For more details, please refer to the [official documentation](https://huggingface.co/docs/datasets/en/about_dataset_features).\n",
+    "\n",
+    "* _Note that if the dataset features are consistent and all entries in your dataset table are non-null **for all splits of data**, you can skip this step._"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import datasets\n",
+    "\n",
+    "features = datasets.Features(\n",
+    "    {\n",
+    "        \"question_id\": datasets.Value(\"int64\"),\n",
+    "        \"question\": datasets.Value(\"string\"),\n",
+    "        \"image_id\": datasets.Value(\"int64\"),  # the source JSON stores image ids as integers\n",
+    "        \"image\": datasets.Image(),\n",
+    "    }\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Define Data Generator\n",
+    "\n",
+    "We use [`datasets.Dataset.from_generator`](https://huggingface.co/docs/datasets/v2.20.0/en/package_reference/main_classes#datasets.Dataset.from_generator) to create the dataset.\n",
+    "\n",
+    "The generator function should `yield` dictionaries whose keys correspond to the dataset features. This streams examples one at a time, which saves memory when loading large datasets.\n",
+    "\n",
+    "For the image data, we load each image as a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html) object.\n",
+    "\n",
+    "Note that if some columns are missing in some splits of the dataset (for example, the `answer` column is usually missing in the `test` split), we need to set these columns to null to ensure that all splits share the same features; see the sketch after the generator below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import json\n",
+    "from PIL import Image\n",
+    "\n",
+    "\n",
+    "def generator(qa_file, image_folder, image_prefix):\n",
+    "    # Load the question file and iterate over its entries,\n",
+    "    # attaching the corresponding image to each question.\n",
+    "    with open(qa_file, \"r\") as f:\n",
+    "        data = json.load(f)\n",
+    "    qa = data[\"questions\"]\n",
+    "\n",
+    "    for q in qa:\n",
+    "        image_id = q[\"image_id\"]\n",
+    "        # COCO-style file names, e.g. COCO_val2014_000000262144.jpg\n",
+    "        image_path = os.path.join(image_folder, f\"{image_prefix}_{image_id:012}.jpg\")\n",
+    "        q[\"image\"] = Image.open(image_path)\n",
+    "        yield q"
+   ]
+  },
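+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a minimal sketch of the null-filling mentioned above: the toy question files contain no answers, so the `annotation_file` argument and the annotation keys (`annotations`, `multiple_choice_answer`) below are assumptions based on the public VQAv2 annotation format. Adapt them to your own data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def generator_with_answers(qa_file, image_folder, image_prefix, annotation_file=None):\n",
+    "    # Hypothetical variant of the generator above, for datasets where some\n",
+    "    # splits are annotated and others (e.g. the test split) are not.\n",
+    "    with open(qa_file, \"r\") as f:\n",
+    "        questions = json.load(f)[\"questions\"]\n",
+    "\n",
+    "    # Map question_id -> answer when annotations exist (assumed VQAv2 format).\n",
+    "    answers = {}\n",
+    "    if annotation_file is not None:\n",
+    "        with open(annotation_file, \"r\") as f:\n",
+    "            for ann in json.load(f)[\"annotations\"]:\n",
+    "                answers[ann[\"question_id\"]] = ann[\"multiple_choice_answer\"]\n",
+    "\n",
+    "    for q in questions:\n",
+    "        image_path = os.path.join(image_folder, f\"{image_prefix}_{q['image_id']:012}.jpg\")\n",
+    "        q[\"image\"] = Image.open(image_path)\n",
+    "        # Always yield the same keys: unannotated splits get None here,\n",
+    "        # so every split ends up with an identical schema.\n",
+    "        q[\"answer\"] = answers.get(q[\"question_id\"])\n",
+    "        yield q"
+   ]
+  },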
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Generate Dataset\n",
+    "\n",
+    "We generate the dataset using the generator function.\n",
+    "\n",
+    "Note that if you skipped the step of defining the dataset features, there is no need to pass the `features` argument; the features will be inferred from the data automatically."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "NUM_PROC = 32  # number of processes to use for multiprocessing, set to 1 for no multiprocessing\n",
+    "\n",
+    "data_val = datasets.Dataset.from_generator(\n",
+    "    generator,\n",
+    "    gen_kwargs={\n",
+    "        \"qa_file\": \"data/questions/vqav2_toy_questions_val2014.json\",\n",
+    "        \"image_folder\": \"data/images\",\n",
+    "        \"image_prefix\": \"COCO_val2014\",\n",
+    "    },\n",
+    "    # For this dataset, there is no need to specify the features, as all cells are non-null and all splits have the same schema\n",
+    "    # features=features,\n",
+    "    num_proc=NUM_PROC,\n",
+    ")\n",
+    "\n",
+    "data_test = datasets.Dataset.from_generator(\n",
+    "    generator,\n",
+    "    gen_kwargs={\n",
+    "        \"qa_file\": \"data/questions/vqav2_toy_questions_test2015.json\",\n",
+    "        \"image_folder\": \"data/images\",\n",
+    "        \"image_prefix\": \"COCO_test2015\",\n",
+    "    },\n",
+    "    # features=features,\n",
+    "    num_proc=NUM_PROC,\n",
+    ")\n",
+    "\n",
+    "data_test_dev = datasets.Dataset.from_generator(\n",
+    "    generator,\n",
+    "    gen_kwargs={\n",
+    "        \"qa_file\": \"data/questions/vqav2_toy_questions_test-dev2015.json\",\n",
+    "        \"image_folder\": \"data/images\",\n",
+    "        \"image_prefix\": \"COCO_test2015\",\n",
+    "    },\n",
+    "    # features=features,\n",
+    "    num_proc=NUM_PROC,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Dataset Upload\n",
+    "\n",
+    "Finally, we group the splits into a `DatasetDict` and upload it to the Hugging Face dataset hub."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = datasets.DatasetDict({\"val\": data_val, \"test\": data_test, \"test_dev\": data_test_dev})"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data.push_to_hub(\"lmms-lab/VQAv2_TOY\")  # replace lmms-lab with your username"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, you can check the dataset on the [Hugging Face dataset hub](https://huggingface.co/datasets/lmms-lab/VQAv2_TOY)."
+   ]
+  },
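+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a final sanity check (a minimal sketch, assuming the push above succeeded), we can reload the dataset from the hub and inspect one example."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loaded = datasets.load_dataset(\"lmms-lab/VQAv2_TOY\")  # use your own repo id here\n",
+    "print(loaded)  # shows the val, test, and test_dev splits\n",
+    "print(loaded[\"val\"][0][\"question\"])"
+   ]
+  },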
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "lmms-eval",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/tools/make_hf_dataset.ipynb b/tools/make_video_hf_dataset.ipynb
similarity index 93%
rename from tools/make_hf_dataset.ipynb
rename to tools/make_video_hf_dataset.ipynb
index 9336331a2..dd801eb81 100755
--- a/tools/make_hf_dataset.ipynb
+++ b/tools/make_video_hf_dataset.ipynb
@@ -162,26 +162,6 @@
     "dataset = Dataset.from_pandas(df_items, features=features)"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Token is valid (permission: write).\n",
-      "Your token has been saved in your configured git credential helpers (store).\n",
-      "Your token has been saved to /home/tiger/.cache/huggingface/token\n",
-      "Login successful\n"
-     ]
-    }
-   ],
-   "source": [
-    "!huggingface-cli login --token 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 --add-to-git-credential"
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": 9,