# InfoSeek Data Download

This document collects ready-to-run scripts for downloading the InfoSeek dataset into:

`/workspace/xiaobin/RL_dataset/data`

It covers:

- InfoSeek annotations
- InfoSeek KB mapping files
- InfoSeek human set
- Wiki6M text files
- OVEN image snapshot on Hugging Face
- OVEN original-source image download workflow

InfoSeek images are derived from OVEN, so image download is handled through the OVEN release pipeline.

## 1. Recommended Directory Layout

```bash
mkdir -p /workspace/xiaobin/RL_dataset/data/infoseek
mkdir -p /workspace/xiaobin/RL_dataset/data/oven_hf
mkdir -p /workspace/xiaobin/RL_dataset/data/oven_source
```

Suggested usage:

- `/workspace/xiaobin/RL_dataset/data/infoseek`: InfoSeek jsonl files
- `/workspace/xiaobin/RL_dataset/data/oven_hf`: Hugging Face image snapshot files
- `/workspace/xiaobin/RL_dataset/data/oven_source`: upstream OVEN repo for original-source image download

## 2. Proxy Workaround

If your shell is configured with an invalid local proxy such as `127.0.0.1:7890`, use one of these patterns.

Temporarily disable the proxy for a single command:

```bash
env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY wget -c URL
```

Or disable the proxy for the current shell session:

```bash
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
```

## 3. Download All InfoSeek Text Data With `wget`

This is the simplest full download for the released InfoSeek jsonl files.
```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl

ls -lh "${TARGET_DIR}"
```

## 4. Download All InfoSeek Text Data With `curl`

Use this if `wget` is not available.
```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

curl -L -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
curl -L -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
curl -L -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
curl -L -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
curl -L -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
curl -L -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
curl -L -O http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
curl -L -O http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl

ls -lh "${TARGET_DIR}"
```

## 5. Download Only Core InfoSeek Splits

If you only need the standard train/val/test annotations:

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
```

## 6. Download Only KB Mapping Files

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
```

## 7. Download Only Human Eval Set

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
```

## 8. Download Only Wiki6M Files

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl
```

Optional decompression:

```bash
gunzip -k /workspace/xiaobin/RL_dataset/data/infoseek/Wiki6M_ver_1_0.jsonl.gz
```

## 9. Download OVEN Image Snapshot From Hugging Face

Upstream OVEN now points image snapshot downloads to the gated dataset `ychenNLP/oven` on Hugging Face. Before downloading:

1. Open `https://huggingface.co/datasets/ychenNLP/oven`
2. Accept the dataset access conditions
3. Log in with the Hugging Face CLI

Install the CLI if needed:

```bash
python -m pip install -U "huggingface_hub[cli]"
```

Log in:

```bash
hf auth login
```

Download the image snapshot and mapping file into `/workspace/xiaobin/RL_dataset/data/oven_hf`:

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
mkdir -p "${TARGET_DIR}"

hf download ychenNLP/oven \
  --repo-type dataset \
  --local-dir "${TARGET_DIR}" \
  --include "shard*.tar" \
  --include "all_wikipedia_images.tar" \
  --include "ovenid2impath.csv"
```

Extract the snapshot tar files:

```bash
#!/usr/bin/env bash
set -euo pipefail

HF_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
IMG_DIR="/workspace/xiaobin/RL_dataset/data/infoseek/images"
mkdir -p "${IMG_DIR}"

for f in "${HF_DIR}"/shard*.tar; do
  tar -xf "${f}" -C "${IMG_DIR}"
done
tar -xf "${HF_DIR}/all_wikipedia_images.tar" -C "${IMG_DIR}"
```

Notes:

- The Hugging Face file listing shows `shard01.tar` to `shard08.tar` plus `all_wikipedia_images.tar`
- The compressed download is very large, roughly 293 GB based on the published file sizes
- You need additional free space for extraction

## 10. Download OVEN Images From Original Sources

This follows the upstream `oven_eval/image_downloads` workflow.
### 10.1 Clone the Upstream Repo

```bash
git clone https://github.com/edchengg/oven_eval /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval
```

### 10.2 Run All Source Download Scripts

The upstream image download directory contains these scripts:

- `download_aircraft.sh`
- `download_car196.sh`
- `download_coco.sh`
- `download_food101.sh`
- `download_gldv2.sh`
- `download_imagenet.sh`
- `download_inat.sh`
- `download_oxfordflower.sh`
- `download_sports100.sh`
- `download_sun397.sh`
- `download_textvqa.sh`
- `download_v7w.sh`
- `download_vg.sh`

Run them one by one:

```bash
#!/usr/bin/env bash
set -euo pipefail

cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads
bash download_aircraft.sh
bash download_car196.sh
bash download_coco.sh
bash download_food101.sh
bash download_gldv2.sh
bash download_imagenet.sh
bash download_inat.sh
bash download_oxfordflower.sh
bash download_sports100.sh
bash download_sun397.sh
bash download_textvqa.sh
bash download_v7w.sh
bash download_vg.sh
```

Or run them in a loop:

```bash
#!/usr/bin/env bash
set -euo pipefail

cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads
for script in download_*.sh; do
  bash "${script}"
done
```

### 10.3 Download `ovenid2impath.csv`

You need `ovenid2impath.csv` for the merge step. The current recommended source is the Hugging Face dataset:

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
mkdir -p "${TARGET_DIR}"

hf download ychenNLP/oven \
  --repo-type dataset \
  --local-dir "${TARGET_DIR}" \
  --include "ovenid2impath.csv"
```

### 10.4 Merge Into the Final OVEN Image Layout

Run the upstream merge script after all downloads finish:

```bash
cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads
python merge_oven_images.py
```

The upstream documentation states that `merge_oven_images.py` should be run after all image download scripts complete and after `ovenid2impath.csv` is available.

## 11. Verify the Downloaded Files

Check text files:

```bash
ls -lh /workspace/xiaobin/RL_dataset/data/infoseek
```

Check Hugging Face snapshot files:

```bash
ls -lh /workspace/xiaobin/RL_dataset/data/oven_hf
```

Check extracted images:

```bash
find /workspace/xiaobin/RL_dataset/data/infoseek/images -type f | wc -l
```

## 12. Upstream References

- InfoSeek release page: `https://github.com/open-vision-language/infoseek`
- OVEN image download page: `https://github.com/edchengg/oven_eval/tree/main/image_downloads`
- Hugging Face OVEN dataset: `https://huggingface.co/datasets/ychenNLP/oven`
- Hugging Face CLI download docs: `https://huggingface.co/docs/huggingface_hub/guides/cli`
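
## 13. Optional: Record-Count Sanity Check

Beyond file listings, line counts of the jsonl files give a slightly deeper check that downloads were not truncated. A hedged sketch; no reference totals are hard-coded here, since published record counts may change between releases, so compare counts across machines or against the release page instead:

```shell
#!/usr/bin/env bash
# Report line (= record) counts for each downloaded InfoSeek jsonl file.
# DIR can be overridden via the environment.
set -euo pipefail

DIR="${DIR:-/workspace/xiaobin/RL_dataset/data/infoseek}"

for f in "${DIR}"/*.jsonl; do
  printf '%12d  %s\n' "$(wc -l < "${f}")" "${f}"
done

# The gzipped Wiki6M file can be counted without decompressing to disk:
gz="${DIR}/Wiki6M_ver_1_0.jsonl.gz"
if [ -f "${gz}" ]; then
  printf '%12d  %s\n' "$(gunzip -c "${gz}" | wc -l)" "${gz}"
fi
```

A truncated `wget -c`/`curl` download typically shows up as a count well below what the other splits would suggest, so rerunning the resume-capable download command for that file is usually enough.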