InfoSeek Data Download
This document collects ready-to-run scripts for downloading the InfoSeek dataset into:
/workspace/xiaobin/RL_dataset/data
It covers:
- InfoSeek annotations
- InfoSeek KB mapping files
- InfoSeek human set
- Wiki6M text files
- OVEN image snapshot on Hugging Face
- OVEN original-source image download workflow
InfoSeek images are derived from OVEN, so image download is handled through the OVEN release pipeline.
1. Recommended Directory Layout
mkdir -p /workspace/xiaobin/RL_dataset/data/infoseek
mkdir -p /workspace/xiaobin/RL_dataset/data/oven_hf
mkdir -p /workspace/xiaobin/RL_dataset/data/oven_source
Suggested usage:
- /workspace/xiaobin/RL_dataset/data/infoseek: InfoSeek jsonl files
- /workspace/xiaobin/RL_dataset/data/oven_hf: Hugging Face image snapshot files
- /workspace/xiaobin/RL_dataset/data/oven_source: upstream OVEN repo for original-source image download
2. Proxy Workaround
If your shell is configured with an invalid local proxy such as 127.0.0.1:7890, use one of these patterns.
Temporarily disable proxy for a single command:
env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY wget -c URL
Or disable proxy for the current shell session:
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
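Before unsetting anything, it can help to see which proxy variables are actually set. The following is a minimal sketch; `show_proxy_vars` is a hypothetical helper name, and the inclusion of `all_proxy`/`ALL_PROXY` (which curl also reads) is an assumption beyond the four variables listed above.

```shell
#!/usr/bin/env bash
# show_proxy_vars (hypothetical helper): print every proxy-related variable
# that is currently set and non-empty, one NAME=value pair per line.
show_proxy_vars() {
  local name
  # all_proxy / ALL_PROXY are an assumption added for curl users.
  for name in http_proxy https_proxy HTTP_PROXY HTTPS_PROXY all_proxy ALL_PROXY; do
    # ${!name-} is bash indirect expansion: the value of the variable named by $name.
    if [ -n "${!name-}" ]; then
      printf '%s=%s\n' "${name}" "${!name}"
    fi
  done
}
```

Calling `show_proxy_vars` with no output means no proxy variable is set in the current shell, so the workaround above is probably not needed.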
3. Download All InfoSeek Text Data With wget
This is the simplest full download for the released InfoSeek jsonl files.
#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl
ls -lh "${TARGET_DIR}"
4. Download All InfoSeek Text Data With curl
Use this if wget is not available. The -f flag makes curl exit with an error on an HTTP failure instead of saving the error page as if it were the data file.
#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl
ls -lh "${TARGET_DIR}"
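If the same script has to run on machines that may have only one of the two tools, the choice can be made at run time. This is a sketch under that assumption; `pick_downloader` and `fetch` are hypothetical helper names, not part of wget, curl, or the InfoSeek release.

```shell
#!/usr/bin/env bash
# pick_downloader (hypothetical helper): print "wget", "curl", or "none",
# preferring wget when both are installed.
pick_downloader() {
  if command -v wget >/dev/null 2>&1; then
    echo "wget"
  elif command -v curl >/dev/null 2>&1; then
    echo "curl"
  else
    echo "none"
  fi
}

# fetch (hypothetical helper): download one URL into the current directory
# with whichever tool is available. Only the wget branch resumes partial files.
fetch() {
  local url="$1"
  case "$(pick_downloader)" in
    wget) wget -c "${url}" ;;
    curl) curl -fL -O "${url}" ;;
    *) echo "neither wget nor curl is installed" >&2; return 1 ;;
  esac
}
```

Usage would then be one `fetch URL` line per file, run from inside the target directory.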
5. Download Only Core InfoSeek Splits
If you only need the standard train/val/test annotations:
#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
6. Download Only KB Mapping Files
#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
7. Download Only Human Eval Set
#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
8. Download Only Wiki6M Files
#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl
Optional decompression:
gunzip -k /workspace/xiaobin/RL_dataset/data/infoseek/Wiki6M_ver_1_0.jsonl.gz
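An interrupted download can leave a truncated .gz file, so it may be worth testing the archive before decompressing it. The sketch below assumes GNU gzip; `safe_gunzip` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# safe_gunzip (hypothetical helper): verify the archive with gzip -t first,
# then decompress it while keeping the original .gz file (-k).
safe_gunzip() {
  local archive="$1"
  if ! gzip -t "${archive}" 2>/dev/null; then
    echo "corrupt or incomplete archive: ${archive}" >&2
    return 1
  fi
  gunzip -k "${archive}"
}
```

For example, `safe_gunzip /workspace/xiaobin/RL_dataset/data/infoseek/Wiki6M_ver_1_0.jsonl.gz` fails early with a message instead of writing a partial jsonl file.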
9. Download OVEN Image Snapshot From Hugging Face
Upstream OVEN now points image snapshot downloads to the gated dataset ychenNLP/oven on Hugging Face. Before downloading:
- Open https://huggingface.co/datasets/ychenNLP/oven
- Accept the dataset access conditions
- Log in with the Hugging Face CLI
Install the CLI if needed:
python -m pip install -U "huggingface_hub[cli]"
Login:
hf auth login
Download the image snapshot and mapping file into /workspace/xiaobin/RL_dataset/data/oven_hf:
#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
mkdir -p "${TARGET_DIR}"
hf download ychenNLP/oven \
  --repo-type dataset \
  --local-dir "${TARGET_DIR}" \
  --include "shard*.tar" \
  --include "all_wikipedia_images.tar" \
  --include "ovenid2impath.csv"
Extract the snapshot tar files:
#!/usr/bin/env bash
set -euo pipefail
HF_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
IMG_DIR="/workspace/xiaobin/RL_dataset/data/infoseek/images"
mkdir -p "${IMG_DIR}"
for f in "${HF_DIR}"/shard*.tar; do
  tar -xf "${f}" -C "${IMG_DIR}"
done
tar -xf "${HF_DIR}/all_wikipedia_images.tar" -C "${IMG_DIR}"
Notes:
- The Hugging Face file listing shows shard01.tar to shard08.tar plus all_wikipedia_images.tar
- The compressed download is very large, roughly 293 GB based on the published file sizes
- You need additional free space for extraction
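Given the sizes in the notes above, a free-space check before starting the download can save a failed multi-hour run. This is a rough sketch; `free_gb` and `has_space` are hypothetical helper names, and any threshold passed to them is the caller's own estimate.

```shell
#!/usr/bin/env bash
# free_gb (hypothetical helper): print the free space, in whole GiB, of the
# filesystem that holds the given path.
free_gb() {
  local path="$1"
  # df -Pk: POSIX output format, sizes in 1024-byte blocks; column 4 is "Available".
  df -Pk "${path}" | awk 'NR == 2 { printf "%d\n", $4 / 1048576 }'
}

# has_space (hypothetical helper): succeed when at least need_gb GiB are free.
has_space() {
  local path="$1" need_gb="$2"
  [ "$(free_gb "${path}")" -ge "${need_gb}" ]
}
```

For example, `has_space /workspace/xiaobin/RL_dataset/data 600 || echo "not enough disk space"`, where 600 is an illustrative assumption (about twice the published archive size, to leave room for extraction), not a figure from the upstream documentation.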
10. Download OVEN Images From Original Sources
This follows the upstream oven_eval/image_downloads workflow.
10.1 Clone the Upstream Repo
git clone https://github.com/edchengg/oven_eval /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval
10.2 Run All Source Download Scripts
The upstream image download directory contains these scripts:
- download_aircraft.sh
- download_car196.sh
- download_coco.sh
- download_food101.sh
- download_gldv2.sh
- download_imagenet.sh
- download_inat.sh
- download_oxfordflower.sh
- download_sports100.sh
- download_sun397.sh
- download_textvqa.sh
- download_v7w.sh
- download_vg.sh
Run them one by one:
#!/usr/bin/env bash
set -euo pipefail
cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads
bash download_aircraft.sh
bash download_car196.sh
bash download_coco.sh
bash download_food101.sh
bash download_gldv2.sh
bash download_imagenet.sh
bash download_inat.sh
bash download_oxfordflower.sh
bash download_sports100.sh
bash download_sun397.sh
bash download_textvqa.sh
bash download_v7w.sh
bash download_vg.sh
Or run them in a loop:
#!/usr/bin/env bash
set -euo pipefail
cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads
for script in download_*.sh; do
  bash "${script}"
done
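With set -e, the loop above stops at the first script that fails, which can be inconvenient when one source is temporarily unreachable. A possible variant, sketched here, keeps going and reports the failures at the end; `run_all_downloads` is a hypothetical helper name and is not part of the upstream repo.

```shell
#!/usr/bin/env bash
# run_all_downloads (hypothetical helper): run every download_*.sh in a
# directory, continue after a failure, and print the name of each failed script.
# Returns non-zero if at least one script failed.
run_all_downloads() {
  local dir="$1" script failed=0
  for script in "${dir}"/download_*.sh; do
    [ -e "${script}" ] || continue   # no matches: the glob stays literal
    # Run each script from inside its directory, in a subshell.
    if ! (cd "${dir}" && bash "$(basename "${script}")"); then
      echo "FAILED: $(basename "${script}")"
      failed=$((failed + 1))
    fi
  done
  [ "${failed}" -eq 0 ]
}
```

Calling `run_all_downloads /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads` prints one FAILED line per script that needs to be re-run.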
10.3 Download ovenid2impath.csv
You need ovenid2impath.csv for the merge step. The current recommended source is the Hugging Face dataset:
#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
mkdir -p "${TARGET_DIR}"
hf download ychenNLP/oven \
  --repo-type dataset \
  --local-dir "${TARGET_DIR}" \
  --include "ovenid2impath.csv"
10.4 Merge Into the Final OVEN Image Layout
Run the upstream merge script after all downloads finish:
cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads
python merge_oven_images.py
The upstream documentation states that merge_oven_images.py should be run after all image download scripts complete and after ovenid2impath.csv is available.
11. Verify the Downloaded Files
Check text files:
ls -lh /workspace/xiaobin/RL_dataset/data/infoseek
Check Hugging Face snapshot files:
ls -lh /workspace/xiaobin/RL_dataset/data/oven_hf
Check extracted images:
find /workspace/xiaobin/RL_dataset/data/infoseek/images -type f | wc -l
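Beyond listing the directory, the text download can be checked file by file. The sketch below only tests that each file named in the download scripts exists and is non-empty; it does not validate contents. `check_infoseek_files` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# check_infoseek_files (hypothetical helper): print each expected text file
# that is missing or empty; return non-zero if any is.
check_infoseek_files() {
  local dir="$1" name missing=0
  for name in \
    infoseek_train.jsonl infoseek_val.jsonl infoseek_test.jsonl \
    infoseek_train_withkb.jsonl infoseek_val_withkb.jsonl \
    infoseek_human.jsonl \
    Wiki6M_ver_1_0.jsonl.gz Wiki6M_ver_1_0_title_only.jsonl; do
    # -s is true only for files that exist and have a size greater than zero.
    if [ ! -s "${dir}/${name}" ]; then
      echo "MISSING: ${name}"
      missing=$((missing + 1))
    fi
  done
  [ "${missing}" -eq 0 ]
}
```

Running `check_infoseek_files /workspace/xiaobin/RL_dataset/data/infoseek` prints nothing and returns success when all eight files are present.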
12. Upstream References
- InfoSeek release page: https://github.com/open-vision-language/infoseek
- OVEN image download page: https://github.com/edchengg/oven_eval/tree/main/image_downloads
- Hugging Face OVEN dataset: https://huggingface.co/datasets/ychenNLP/oven
- Hugging Face CLI download docs: https://huggingface.co/docs/huggingface_hub/guides/cli