
InfoSeek Data Download

This document collects ready-to-run scripts for downloading the InfoSeek dataset into:

/workspace/xiaobin/RL_dataset/data

It covers:

  • InfoSeek annotations
  • InfoSeek KB mapping files
  • InfoSeek human set
  • Wiki6M text files
  • OVEN image snapshot on Hugging Face
  • OVEN original-source image download workflow

InfoSeek images are derived from OVEN, so image download is handled through the OVEN release pipeline.

1. Recommended Directory Layout

mkdir -p /workspace/xiaobin/RL_dataset/data/infoseek
mkdir -p /workspace/xiaobin/RL_dataset/data/oven_hf
mkdir -p /workspace/xiaobin/RL_dataset/data/oven_source

Suggested usage:

  • /workspace/xiaobin/RL_dataset/data/infoseek: InfoSeek jsonl files, Wiki6M text files, and the extracted images (under images/)
  • /workspace/xiaobin/RL_dataset/data/oven_hf: Hugging Face image snapshot files
  • /workspace/xiaobin/RL_dataset/data/oven_source: upstream OVEN repo for original-source image download
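
The layout above can also be created from a single root variable, which makes it easier to relocate the data later. This is a minimal sketch: DATA_ROOT and make_layout are hypothetical names introduced here for illustration, not part of the original scripts.

```shell
# DATA_ROOT is a hypothetical override; it defaults to the path used in this document.
DATA_ROOT="${DATA_ROOT:-/workspace/xiaobin/RL_dataset/data}"

# Create the three subdirectories under the given root (defaults to DATA_ROOT).
make_layout() {
  local root="${1:-${DATA_ROOT}}"
  mkdir -p "${root}/infoseek" "${root}/oven_hf" "${root}/oven_source"
}

# Usage:
#   make_layout              # uses DATA_ROOT
#   make_layout /tmp/data    # uses an explicit root
```

The scripts in the following sections hard-code the default path, so a different root would need the same change applied to each TARGET_DIR.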

2. Proxy Workaround

If your shell is configured with an invalid local proxy such as 127.0.0.1:7890, use one of these patterns.

Temporarily disable proxy for a single command:

env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY wget -c URL

Or disable proxy for the current shell session:

unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
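
Before choosing one of these patterns, it can help to see which proxy variables are actually set. The sketch below is one way to do that. show_proxy_vars is a hypothetical helper name, and it also reports all_proxy and no_proxy, which are not mentioned above but are honored by curl.

```shell
# Print every proxy-related environment variable, in either case,
# or a short message when none is set.
show_proxy_vars() {
  env | grep -iE '^(http|https|all|no)_proxy=' || echo "no proxy variables set"
}

# Usage:
#   show_proxy_vars
```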

3. Download All InfoSeek Text Data With wget

This is the simplest way to fetch every released InfoSeek jsonl file together with the Wiki6M text files.

#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl

ls -lh "${TARGET_DIR}"
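
Because wget -c resumes partial files, an interrupted run can leave a truncated jsonl file behind. The following sketch is a cheap sanity check rather than a full JSON validation. check_jsonl is a hypothetical helper, and it assumes every record is a JSON object on its own line, so the file should start with { and its last line should end with }.

```shell
# Heuristic check: the file is non-empty, starts with "{",
# and its last line ends with "}".
check_jsonl() {
  local f="$1" first last
  [ -s "${f}" ] || { echo "missing or empty: ${f}" >&2; return 1; }
  first="$(head -c 1 "${f}")"
  [ "${first}" = "{" ] || { echo "does not start with a JSON object: ${f}" >&2; return 1; }
  last="$(tail -n 1 "${f}")"
  case "${last}" in
    *"}") ;;
    *) echo "last line looks truncated: ${f}" >&2; return 1 ;;
  esac
  echo "ok: ${f} ($(wc -l < "${f}") lines)"
}

# Usage:
#   for f in /workspace/xiaobin/RL_dataset/data/infoseek/*.jsonl; do check_jsonl "${f}"; done
```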

4. Download All InfoSeek Text Data With curl

Use this if wget is not available.

#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

# -f makes curl exit with an error on HTTP failures instead of saving the error page
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl

ls -lh "${TARGET_DIR}"
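
The wget and curl scripts repeat the same eight URLs. One possible way to keep a single list is sketched below. BASE_URL, infoseek_urls, and fetch are hypothetical names introduced for illustration; the URLs themselves are the ones used in the scripts above.

```shell
BASE_URL="http://storage.googleapis.com/gresearch/open-vision-language"

# Print the eight download URLs, one per line.
infoseek_urls() {
  local name
  for name in infoseek_train infoseek_val infoseek_test \
              infoseek_train_withkb infoseek_val_withkb infoseek_human; do
    echo "${BASE_URL}/infoseek/${name}.jsonl"
  done
  echo "${BASE_URL}/Wiki6M_ver_1_0.jsonl.gz"
  echo "${BASE_URL}/Wiki6M_ver_1_0_title_only.jsonl"
}

# Download one URL into the current directory with whichever tool is installed.
fetch() {
  if command -v wget >/dev/null 2>&1; then
    wget -c "$1"
  else
    curl -fL -O "$1"
  fi
}

# Usage (run inside the target directory):
#   infoseek_urls | while read -r url; do fetch "${url}"; done
```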

5. Download Only Core InfoSeek Splits

If you only need the standard train/val/test annotations:

#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl

6. Download Only KB Mapping Files

#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl

7. Download Only Human Eval Set

#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl

8. Download Only Wiki6M Files

#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl

Optional decompression (-k keeps the original .gz file):

gunzip -k /workspace/xiaobin/RL_dataset/data/infoseek/Wiki6M_ver_1_0.jsonl.gz
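
Since the archive is large, it may be worth testing its integrity before decompressing. A minimal sketch follows. safe_gunzip is a hypothetical helper name; it relies only on gzip -t, which tests an archive without writing output, and on gunzip -k as used above.

```shell
# Decompress only if the archive passes gzip's integrity test.
safe_gunzip() {
  local archive="$1"
  if gzip -t "${archive}"; then
    gunzip -k "${archive}"
  else
    echo "corrupt or incomplete archive: ${archive}" >&2
    return 1
  fi
}

# Usage:
#   safe_gunzip /workspace/xiaobin/RL_dataset/data/infoseek/Wiki6M_ver_1_0.jsonl.gz
```

If the test fails, rerunning the wget -c command from this section is one way to resume the download.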

9. Download OVEN Image Snapshot From Hugging Face

Upstream OVEN now points image snapshot downloads to the gated dataset ychenNLP/oven on Hugging Face. Before downloading:

  1. Open https://huggingface.co/datasets/ychenNLP/oven
  2. Accept the dataset access conditions
  3. Log in with the Hugging Face CLI

Install the CLI if needed:

python -m pip install -U "huggingface_hub[cli]"

Login:

hf auth login

Download the image snapshot and mapping file into /workspace/xiaobin/RL_dataset/data/oven_hf:

#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
mkdir -p "${TARGET_DIR}"

hf download ychenNLP/oven \
  --repo-type dataset \
  --local-dir "${TARGET_DIR}" \
  --include "shard*.tar" \
  --include "all_wikipedia_images.tar" \
  --include "ovenid2impath.csv"

Extract the snapshot tar files:

#!/usr/bin/env bash
set -euo pipefail

HF_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
IMG_DIR="/workspace/xiaobin/RL_dataset/data/infoseek/images"
mkdir -p "${IMG_DIR}"

for f in "${HF_DIR}"/shard*.tar; do
  tar -xf "${f}" -C "${IMG_DIR}"
done

tar -xf "${HF_DIR}/all_wikipedia_images.tar" -C "${IMG_DIR}"

Notes:

  • The Hugging Face file listing shows shard01.tar to shard08.tar plus all_wikipedia_images.tar
  • The download is very large, roughly 293 GB based on the published file sizes
  • Extraction needs additional free space on top of the download, because the tar files stay in oven_hf while the images are written to infoseek/images
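
Given the sizes in these notes, a free-space check before downloading can save a failed run. The sketch below uses df for that. require_free_gb is a hypothetical helper, and the 600 GB figure in the usage comment is an assumption (about twice the 293 GB download), not a published requirement.

```shell
# Fail unless the filesystem holding DIR has at least NEEDED_GB free.
require_free_gb() {
  local dir="$1" needed_gb="$2" avail_kb avail_gb
  [ -d "${dir}" ] || { echo "no such directory: ${dir}" >&2; return 1; }
  # df -Pk prints one POSIX-format line per filesystem; column 4 is available KB.
  avail_kb="$(df -Pk "${dir}" | awk 'NR==2 {print $4}')"
  avail_gb=$((avail_kb / 1024 / 1024))
  if [ "${avail_gb}" -lt "${needed_gb}" ]; then
    echo "only ${avail_gb} GB free under ${dir}, need ${needed_gb} GB" >&2
    return 1
  fi
  echo "${avail_gb} GB free under ${dir}"
}

# Usage (600 is an assumed headroom, adjust as needed):
#   require_free_gb /workspace/xiaobin/RL_dataset/data 600
```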

10. Download OVEN Images From Original Sources

This follows the upstream oven_eval/image_downloads workflow.

10.1 Clone the Upstream Repo

git clone https://github.com/edchengg/oven_eval /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval

10.2 Run All Source Download Scripts

The upstream image download directory contains these scripts:

  • download_aircraft.sh
  • download_car196.sh
  • download_coco.sh
  • download_food101.sh
  • download_gldv2.sh
  • download_imagenet.sh
  • download_inat.sh
  • download_oxfordflower.sh
  • download_sports100.sh
  • download_sun397.sh
  • download_textvqa.sh
  • download_v7w.sh
  • download_vg.sh

Run them one by one:

#!/usr/bin/env bash
set -euo pipefail

cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads

bash download_aircraft.sh
bash download_car196.sh
bash download_coco.sh
bash download_food101.sh
bash download_gldv2.sh
bash download_imagenet.sh
bash download_inat.sh
bash download_oxfordflower.sh
bash download_sports100.sh
bash download_sun397.sh
bash download_textvqa.sh
bash download_v7w.sh
bash download_vg.sh

Or run them in a loop (with set -e, the loop stops at the first script that fails):

#!/usr/bin/env bash
set -euo pipefail

cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads

for script in download_*.sh; do
  bash "${script}"
done
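
If you would rather let the remaining sources continue when one of them fails, a variant along these lines may help. It is a sketch under assumptions: run_download_scripts is a hypothetical helper, and writing one log file per script next to the scripts is a choice made here, not something the upstream repo prescribes.

```shell
# Run every download_*.sh in SCRIPT_DIR, keep going on failure,
# and return non-zero if any script failed.
run_download_scripts() {
  local script_dir="$1" script name failed=""
  for script in "${script_dir}"/download_*.sh; do
    [ -e "${script}" ] || continue
    name="$(basename "${script}" .sh)"
    # Run from inside the script directory and capture output in <name>.log there.
    if (cd "${script_dir}" && bash "./${name}.sh" > "${name}.log" 2>&1); then
      echo "ok: ${name}"
    else
      echo "FAILED: ${name} (see ${script_dir}/${name}.log)"
      failed="${failed} ${name}"
    fi
  done
  [ -z "${failed}" ]
}

# Usage:
#   run_download_scripts /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads
```

Rerun any script reported as FAILED before moving on to the merge step.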

10.3 Download ovenid2impath.csv

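You need ovenid2impath.csv for the merge step. If you already ran the download in section 9, the file is already in /workspace/xiaobin/RL_dataset/data/oven_hf. Otherwise, fetch it from the Hugging Face dataset, which is the current recommended source: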

#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
mkdir -p "${TARGET_DIR}"

hf download ychenNLP/oven \
  --repo-type dataset \
  --local-dir "${TARGET_DIR}" \
  --include "ovenid2impath.csv"

10.4 Merge Into the Final OVEN Image Layout

Run the upstream merge script after all downloads finish:

cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads
python merge_oven_images.py

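The upstream documentation states that merge_oven_images.py should be run after all image download scripts complete and after ovenid2impath.csv is available. This guide saves ovenid2impath.csv under /workspace/xiaobin/RL_dataset/data/oven_hf, so check which path the script reads and copy or link the file there if needed.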

11. Verify the Downloaded Files

Check text files:

ls -lh /workspace/xiaobin/RL_dataset/data/infoseek

Check Hugging Face snapshot files:

ls -lh /workspace/xiaobin/RL_dataset/data/oven_hf

Check extracted images:

find /workspace/xiaobin/RL_dataset/data/infoseek/images -type f | wc -l
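
Beyond listing the directory, the text files can be checked against the names downloaded in section 3. A minimal sketch: check_infoseek_files is a hypothetical helper, and it only confirms that each file exists and is non-empty, not that its content is complete.

```shell
# Report which of the eight expected text files are present in DIR;
# return non-zero if any is missing or empty.
check_infoseek_files() {
  local dir="$1" name missing=0
  for name in infoseek_train.jsonl infoseek_val.jsonl infoseek_test.jsonl \
              infoseek_train_withkb.jsonl infoseek_val_withkb.jsonl \
              infoseek_human.jsonl Wiki6M_ver_1_0.jsonl.gz \
              Wiki6M_ver_1_0_title_only.jsonl; do
    if [ -s "${dir}/${name}" ]; then
      echo "present: ${name}"
    else
      echo "missing: ${name}"
      missing=$((missing + 1))
    fi
  done
  [ "${missing}" -eq 0 ]
}

# Usage:
#   check_infoseek_files /workspace/xiaobin/RL_dataset/data/infoseek
```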

12. Upstream References

  • InfoSeek release page: https://github.com/open-vision-language/infoseek
  • OVEN image download page: https://github.com/edchengg/oven_eval/tree/main/image_downloads
  • Hugging Face OVEN dataset: https://huggingface.co/datasets/ychenNLP/oven
  • Hugging Face CLI download docs: https://huggingface.co/docs/huggingface_hub/guides/cli