InfoSeek Data Download

This document collects ready-to-run scripts for downloading the InfoSeek dataset into:

/workspace/xiaobin/RL_dataset/data

It covers:

InfoSeek annotations
InfoSeek KB mapping files
InfoSeek human set
Wiki6M text files
OVEN image snapshot on Hugging Face
OVEN original-source image download workflow

InfoSeek images are derived from OVEN, so image download is handled through the OVEN release pipeline.

1. Recommended Directory Layout

mkdir -p /workspace/xiaobin/RL_dataset/data/infoseek
mkdir -p /workspace/xiaobin/RL_dataset/data/oven_hf
mkdir -p /workspace/xiaobin/RL_dataset/data/oven_source

Suggested usage:

/workspace/xiaobin/RL_dataset/data/infoseek: InfoSeek jsonl files, plus the extracted images under images/ (see section 9)
/workspace/xiaobin/RL_dataset/data/oven_hf: Hugging Face image snapshot files
/workspace/xiaobin/RL_dataset/data/oven_source: upstream OVEN repo for original-source image download

2. Proxy Workaround

If your shell exports a local proxy that is not actually reachable, such as a stale 127.0.0.1:7890 setting, downloads will fail to connect. Use one of these patterns.

Temporarily disable proxy for a single command:

env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY wget -c URL

Or disable proxy for the current shell session:

unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
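
If several commands need to bypass the proxy, the single-command pattern above can be wrapped in a small shell function. This is a minimal sketch; the function name without_proxy is hypothetical and not part of any upstream tooling.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (name chosen for illustration): run any command with the
# four proxy variables removed from that command's environment only. The
# calling shell keeps its own proxy settings.
without_proxy() {
  env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY "$@"
}

# Example (not executed here):
#   without_proxy wget -c URL
```

Because the variables are only dropped for the wrapped command, this is closer to the first pattern than to the session-wide unset.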

3. Download All InfoSeek Text Data With `wget`

This is the simplest full download for the released InfoSeek jsonl files.

#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl

ls -lh "${TARGET_DIR}"
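
The eight wget calls above can also be generated from a file list, which makes it easier to subset or retry the download. The sketch below uses the same URLs as the script above; the function names infoseek_urls and download_infoseek_text are illustrative only.

```shell
#!/usr/bin/env bash
set -euo pipefail

BASE_URL="http://storage.googleapis.com/gresearch/open-vision-language"

# Print one URL per line for every file fetched by the script above.
infoseek_urls() {
  local name
  for name in train val test train_withkb val_withkb human; do
    echo "${BASE_URL}/infoseek/infoseek_${name}.jsonl"
  done
  echo "${BASE_URL}/Wiki6M_ver_1_0.jsonl.gz"
  echo "${BASE_URL}/Wiki6M_ver_1_0_title_only.jsonl"
}

# Hypothetical wrapper: download every URL into the given directory,
# resuming partially downloaded files (-c) and saving under that directory (-P).
download_infoseek_text() {
  local target_dir="$1" url
  mkdir -p "${target_dir}"
  infoseek_urls | while read -r url; do
    wget -c -P "${target_dir}" "${url}"
  done
}

# Example (not executed here):
#   download_infoseek_text /workspace/xiaobin/RL_dataset/data/infoseek
```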

4. Download All InfoSeek Text Data With `curl`

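Use this if wget is not available. Unlike wget -c, these curl calls do not resume a partial file; each run downloads every file from the beginning.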

#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

# -f makes curl exit non-zero on an HTTP error instead of saving the error page as the target file
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl

ls -lh "${TARGET_DIR}"

5. Download Only Core InfoSeek Splits

If you only need the standard train/val/test annotations:

#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl

6. Download Only KB Mapping Files

#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl

7. Download Only Human Eval Set

#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl

8. Download Only Wiki6M Files

#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl

Optional decompression:

gunzip -k /workspace/xiaobin/RL_dataset/data/infoseek/Wiki6M_ver_1_0.jsonl.gz
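
An interrupted download can leave a truncated archive, so it may be worth testing the file before decompressing it. A minimal sketch using gzip -t; the function name check_and_gunzip is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: verify a .gz archive, then decompress it while keeping
# the original file (the same effect as `gunzip -k` on a valid archive).
check_and_gunzip() {
  local archive="$1"
  if ! gzip -t "${archive}"; then
    echo "Corrupt or incomplete archive: ${archive}" >&2
    return 1
  fi
  gunzip -k "${archive}"
}

# Example (not executed here):
#   check_and_gunzip /workspace/xiaobin/RL_dataset/data/infoseek/Wiki6M_ver_1_0.jsonl.gz
```

If the check fails, rerunning the wget -c command for that file is the natural next step.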

9. Download OVEN Image Snapshot From Hugging Face

Upstream OVEN now points image snapshot downloads to the gated dataset ychenNLP/oven on Hugging Face. Before downloading:

Open https://huggingface.co/datasets/ychenNLP/oven
Accept the dataset access conditions
Log in with the Hugging Face CLI

Install the CLI if needed, then log in:

python -m pip install -U "huggingface_hub[cli]"

hf auth login

Download the image snapshot and mapping file into /workspace/xiaobin/RL_dataset/data/oven_hf:

#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
mkdir -p "${TARGET_DIR}"

hf download ychenNLP/oven \
  --repo-type dataset \
  --local-dir "${TARGET_DIR}" \
  --include "shard*.tar" \
  --include "all_wikipedia_images.tar" \
  --include "ovenid2impath.csv"

Extract the snapshot tar files:

#!/usr/bin/env bash
set -euo pipefail

HF_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
IMG_DIR="/workspace/xiaobin/RL_dataset/data/infoseek/images"
mkdir -p "${IMG_DIR}"

for f in "${HF_DIR}"/shard*.tar; do
  tar -xf "${f}" -C "${IMG_DIR}"
done

tar -xf "${HF_DIR}/all_wikipedia_images.tar" -C "${IMG_DIR}"

Notes:

The Hugging Face file listing shows shard01.tar to shard08.tar plus all_wikipedia_images.tar
The compressed download is very large, roughly 293 GB based on the published file sizes
You need additional free space for extraction
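
Given the sizes noted above, a quick free-space check before starting can save a failed run. The sketch below relies on POSIX df; the function name has_free_gb and the 600 GB figure in the example are assumptions for illustration (roughly the compressed size plus room for extraction), not numbers from upstream.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeed if the filesystem holding DIR has at least
# NEED_GB gibibytes available.
has_free_gb() {
  local dir="$1" need_gb="$2" avail_kb
  # `df -Pk` prints one POSIX-format line per filesystem in 1024-byte blocks;
  # the fourth column of the data line is the available space.
  avail_kb="$(df -Pk "${dir}" | awk 'NR == 2 { print $4 }')"
  [ "${avail_kb}" -ge "$(( need_gb * 1024 * 1024 ))" ]
}

# Example (not executed here); 600 is an assumed safety margin:
#   has_free_gb /workspace/xiaobin/RL_dataset/data 600 || echo "Not enough free space" >&2
```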

10. Download OVEN Images From Original Sources

This follows the upstream oven_eval/image_downloads workflow.

10.1 Clone the Upstream Repo

git clone https://github.com/edchengg/oven_eval /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval

10.2 Run All Source Download Scripts

The upstream image download directory contains these scripts:

download_aircraft.sh
download_car196.sh
download_coco.sh
download_food101.sh
download_gldv2.sh
download_imagenet.sh
download_inat.sh
download_oxfordflower.sh
download_sports100.sh
download_sun397.sh
download_textvqa.sh
download_v7w.sh
download_vg.sh

Run them one by one:

#!/usr/bin/env bash
set -euo pipefail

cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads

bash download_aircraft.sh
bash download_car196.sh
bash download_coco.sh
bash download_food101.sh
bash download_gldv2.sh
bash download_imagenet.sh
bash download_inat.sh
bash download_oxfordflower.sh
bash download_sports100.sh
bash download_sun397.sh
bash download_textvqa.sh
bash download_v7w.sh
bash download_vg.sh

Or run them in a loop:

#!/usr/bin/env bash
set -euo pipefail

cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads

for script in download_*.sh; do
  bash "${script}"
done
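
With set -e, the loop above stops at the first failing script. If one source fails partway through, a variant that keeps going and reports failures at the end may be more practical. This is a sketch only; run_download_scripts and the logs directory are hypothetical names, not part of the upstream repo.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: run every download_*.sh in DIR, continue after a
# failure, and write each script's output to DIR/logs/<script>.log.
# Returns non-zero if at least one script failed.
run_download_scripts() {
  local dir="$1" script name failed=0
  mkdir -p "${dir}/logs"
  for script in "${dir}"/download_*.sh; do
    [ -e "${script}" ] || continue   # no matching scripts in DIR
    name="$(basename "${script}")"
    # Run from inside DIR, as the loop above does, and capture all output.
    if (cd "${dir}" && bash "${name}") > "${dir}/logs/${name}.log" 2>&1; then
      echo "OK      ${name}"
    else
      echo "FAILED  ${name}"
      failed=$(( failed + 1 ))
    fi
  done
  [ "${failed}" -eq 0 ]
}

# Example (not executed here):
#   run_download_scripts /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads
```

Scripts reported as FAILED can then be rerun individually before moving on to the merge step.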

10.3 Download `ovenid2impath.csv`

You need ovenid2impath.csv for the merge step. The current recommended source is the Hugging Face dataset:

#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
mkdir -p "${TARGET_DIR}"

hf download ychenNLP/oven \
  --repo-type dataset \
  --local-dir "${TARGET_DIR}" \
  --include "ovenid2impath.csv"

10.4 Merge Into the Final OVEN Image Layout

Run the upstream merge script after all downloads finish:

cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads
python merge_oven_images.py

The upstream documentation states that merge_oven_images.py should be run after all image download scripts complete and after ovenid2impath.csv is available.

11. Verify the Downloaded Files

Check text files:

ls -lh /workspace/xiaobin/RL_dataset/data/infoseek

Check Hugging Face snapshot files:

ls -lh /workspace/xiaobin/RL_dataset/data/oven_hf

Check extracted images:

find /workspace/xiaobin/RL_dataset/data/infoseek/images -type f | wc -l
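
Beyond listing files, a per-file line count gives a quick sanity check on the jsonl downloads, since each line holds one record. A sketch; count_jsonl_records is a hypothetical helper, and no expected record counts are assumed here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print "<lines> <file name>" for every .jsonl file in
# DIR (counting newline-terminated lines) and fail if any file is empty.
count_jsonl_records() {
  local dir="$1" file lines status=0
  for file in "${dir}"/*.jsonl; do
    [ -e "${file}" ] || continue   # directory has no .jsonl files
    lines="$(wc -l < "${file}" | tr -d ' ')"
    echo "${lines} $(basename "${file}")"
    if [ "${lines}" -eq 0 ]; then
      status=1
    fi
  done
  return "${status}"
}

# Example (not executed here):
#   count_jsonl_records /workspace/xiaobin/RL_dataset/data/infoseek
```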

12. Upstream References

InfoSeek release page: https://github.com/open-vision-language/infoseek
OVEN image download page: https://github.com/edchengg/oven_eval/tree/main/image_downloads
Hugging Face OVEN dataset: https://huggingface.co/datasets/ychenNLP/oven
Hugging Face CLI download docs: https://huggingface.co/docs/huggingface_hub/guides/cli