# Data Sources

Reference for what data SQLEnv uses, where it comes from, and how to regenerate it. All data lives under `data/` and is checked into the repo, so a fresh clone works offline after `uv sync`.

## Summary

| Artifact | Path | Origin | Count |
| --- | --- | --- | --- |
| SQLite databases | `data/databases/<db_id>/<db_id>.sqlite` | Spider (`taoyds/spider` on GitHub) | 10 databases |
| Training questions | `data/questions/questions_train.json` | Spider train split (`xlangai/spider`) | 473 questions |
| Eval questions | `data/questions/questions_eval.json` | Spider validation split (`xlangai/spider`) | 203 questions |
| DB allowlist | `data/questions/db_list.json` | hand-curated subset | 10 db_ids |
| SFT trajectories | `data/sft/sft_trajectories.json` | generated from gold SQL | 120 trajectories |

Total: 676 questions across 10 Spider databases, plus 120 multi-turn SFT warmup trajectories.

## Upstream: Spider

Spider (Yu et al., EMNLP 2018) is a cross-domain text-to-SQL benchmark with ~200 databases and ~10k question/gold-SQL pairs. Every question has a natural-language prompt, a gold SQL query, and a target database. We use two mirrors:

1. Questions via HuggingFace Datasets: `xlangai/spider`, loaded with `datasets.load_dataset("xlangai/spider", split=...)` in `scripts/download_spider_questions.py`.
2. SQLite databases via the Spider GitHub mirror:
   - `https://raw.githubusercontent.com/taoyds/spider/master/database/{db_id}/{db_id}.sqlite`
   - Fallback: the official Google Drive archive (`1403EGqzIDoHMdQF4c9Bkyl7dZLZ5Wt6J`), fetched by `scripts/download_spider_databases.py`.
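The mirror URL scheme can be sketched in a few lines. This is a simplified stand-in for `scripts/download_spider_databases.py`; the constant and helper names here are assumptions, not the script's actual API:

```python
# Build the raw-GitHub URL for one Spider SQLite file (helper name assumed;
# the real download/fallback logic lives in scripts/download_spider_databases.py).
SPIDER_DB_BASE = "https://raw.githubusercontent.com/taoyds/spider/master/database"

def sqlite_url(db_id: str) -> str:
    """Return the mirror URL for a single database's SQLite file."""
    return f"{SPIDER_DB_BASE}/{db_id}/{db_id}.sqlite"

print(sqlite_url("concert_singer"))
# https://raw.githubusercontent.com/taoyds/spider/master/database/concert_singer/concert_singer.sqlite
```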

Spider's license is CC BY-SA 4.0.

## The 10-database subset (`db_list.json`)

We do not ship all ~200 Spider databases. The allowlist in `data/questions/db_list.json` pins 10:

```
student_assessment, concert_singer, world_1, car_1, employee_hire_evaluation,
pets_1, cre_Doc_Template_Mgt, dog_kennels, flight_2, poker_player
```

These were chosen for schema variety (single-table aggregates, 2-table joins, 3-table FK chains, naming-convention quirks) while keeping the repo small and the training loop fast.

Train/eval split is by database, not random sampling within a database. This prevents train/eval leakage at the schema level:

- Train databases (7): `car_1`, `concert_singer`, `cre_Doc_Template_Mgt`, `dog_kennels`, `employee_hire_evaluation`, `flight_2`, `student_assessment`
- Eval databases (4): `flight_2`, `pets_1`, `poker_player`, `world_1`

`flight_2` appears in both. The other eval DBs are schemas the model never saw during training. `sql_env.training.data_loading.validate_no_data_leak` asserts zero question-text overlap between the two files at load time.
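A minimal sketch of that leak check. The real implementation is `sql_env.training.data_loading.validate_no_data_leak` and may differ in detail; this version only assumes each record has a `question_text` field:

```python
def validate_no_data_leak(train_records, eval_records):
    """Assert zero question-text overlap between train and eval splits.
    Sketch only; the real check lives in sql_env.training.data_loading."""
    train_texts = {r["question_text"] for r in train_records}
    eval_texts = {r["question_text"] for r in eval_records}
    overlap = train_texts & eval_texts
    if overlap:
        raise ValueError(f"{len(overlap)} question(s) appear in both splits")

validate_no_data_leak(
    [{"question_text": "How many singers do we have?"}],
    [{"question_text": "What is the average age of all singers?"}],
)  # passes: no shared question text
```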

## Question files

Both `questions_train.json` and `questions_eval.json` are lists of records with this shape (actual sample from `car_1` train):

```json
{
  "question_text": "How many cars have a larger accelerate than the car with the largest horsepower?",
  "database_name": "car_1",
  "gold_sql": "SELECT COUNT(*) FROM CARS_DATA WHERE Accelerate > (SELECT Accelerate FROM CARS_DATA ORDER BY Horsepower DESC LIMIT 1)",
  "gold_answer": 39,
  "answer_type": "integer",
  "difficulty": "easy",
  "tables_involved": ["CARS_DATA"],
  "split": "train",
  "question_id": "car_1_train_000"
}
```

### Counts and difficulty mix

| Split | Total | easy | medium | hard |
| --- | --- | --- | --- | --- |
| train | 473 | 435 | 32 | 6 |
| eval | 203 | 185 | 18 | 0 |
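These counts are just a tally over the `difficulty` field of the curated records; sketched (field name from the record shape above, helper name an assumption):

```python
from collections import Counter

def difficulty_mix(records):
    """Tally the per-split difficulty distribution of curated question records."""
    return Counter(r["difficulty"] for r in records)

sample = [{"difficulty": "easy"}, {"difficulty": "easy"}, {"difficulty": "medium"}]
print(difficulty_mix(sample))  # Counter({'easy': 2, 'medium': 1})
```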

The easy-heavy distribution is deliberate given the 0.6B capacity ceiling. Extended GRPO training on harder questions produced no accuracy gain, which indicates the ceiling comes from pretraining knowledge rather than training budget. Medium and hard questions stay in the mix for Phase 2 exposure but are not where this model size gains accuracy. See the "Limitations at 0.6B Parameters" section of the blog post.

## Curation pipeline

`scripts/curate_questions.py` turns raw Spider records into the format above. Per question, it:

1. Filters to databases in `db_list.json`.
2. Executes `gold_sql` against the real SQLite database to produce `gold_answer`; this is what the environment grades agents against, not a string match on the SQL.
3. Normalizes the answer into a typed shape (integer, string, list[...], table) via `answer_type`.
4. Parses `FROM` and `JOIN` tokens to fill `tables_involved` (used by SFT generation to decide which tables to `describe()`).
5. Assigns the Spider-provided difficulty label.
6. Writes train and eval to separate files with a per-question `question_id` derived from `{db_id}_{split}_{index}`.
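Steps 2 and 3 (execute the gold SQL, normalize the result) are the core of the pipeline. A minimal sketch under assumed normalization rules; the real `scripts/curate_questions.py` handles more answer shapes:

```python
import sqlite3

def grade_answer(conn, gold_sql):
    """Execute the gold SQL and normalize the result into (answer, answer_type).
    Sketch of steps 2-3 only; the real script covers more types (list[...], etc.)."""
    rows = conn.execute(gold_sql).fetchall()
    if len(rows) == 1 and len(rows[0]) == 1:
        value = rows[0][0]
        kind = "integer" if isinstance(value, int) else "string"
        return value, kind
    return [list(r) for r in rows], "table"

# In-memory demo standing in for data/databases/<db_id>/<db_id>.sqlite
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE CARS_DATA (Id INTEGER, Horsepower INTEGER);"
    "INSERT INTO CARS_DATA VALUES (1, 100), (2, 200), (3, 150);"
)
print(grade_answer(conn, "SELECT COUNT(*) FROM CARS_DATA WHERE Horsepower > 120"))
# -> (2, 'integer')
```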

Re-running the script is idempotent given the same Spider snapshot.

## SFT warmup trajectories (`data/sft/sft_trajectories.json`)

120 multi-turn trajectories used as a supervised warmup before GRPO. Each record has `messages` (tool-calling chat format) and `tools` (the four SQLEnv tool schemas).

### How they are generated

`scripts/generate_sft_data.py` walks the training questions and, for each one, runs the real `SQLEnvironment` programmatically:

1. `describe(table)` for every table in `tables_involved`
2. `query(gold_sql)`, capturing real tabular output from the environment
3. `answer(gold_answer)`, the terminal step

The captured sequence becomes an assistant-labelled trajectory. This is not synthetic text. The assistant turns wrap the actual environment responses the model will see at training and inference time, so the SFT-warmed reference policy already expects real env output when GRPO takes over.
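A sketch of how one captured sequence becomes a chat-format trajectory. The message and tool-call field names here are assumptions; the actual format is whatever `scripts/generate_sft_data.py` emits:

```python
def build_trajectory(question, steps):
    """Wrap (tool_name, arguments, env_output) triples in a chat-format message
    list. Sketch of the describe -> query -> answer assembly; field names are
    illustrative, not the real schema."""
    messages = [{"role": "user", "content": question}]
    for name, args, env_output in steps:
        messages.append({
            "role": "assistant",
            "tool_calls": [{"function": {"name": name, "arguments": args}}],
        })
        messages.append({"role": "tool", "content": env_output})
    return messages

traj = build_trajectory(
    "How many singers do we have?",
    [
        ("describe", {"table": "singer"}, "singer(Singer_ID, Name, Age)"),
        ("query", {"sql": "SELECT COUNT(*) FROM singer"}, "COUNT(*)\n6"),
        ("answer", {"value": 6}, "final"),
    ],
)
```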

SFT uses 120 trajectories rather than one per training question. The subset is chosen to cover each database and difficulty bucket. See `scripts/generate_sft_data.py` for the selection logic.
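The coverage idea can be sketched as stratified picking over `(database, difficulty)` buckets. This is illustrative only; the actual selection logic and bucket sizes live in `scripts/generate_sft_data.py`:

```python
from collections import defaultdict

def select_warmup_subset(questions, per_bucket=2):
    """Pick up to `per_bucket` questions per (database, difficulty) bucket so
    every database and difficulty level is represented. Sketch only; the
    per_bucket value is an assumption, not the script's real parameter."""
    buckets = defaultdict(list)
    for q in questions:
        buckets[(q["database_name"], q["difficulty"])].append(q)
    return [q for bucket in buckets.values() for q in bucket[:per_bucket]]

qs = [
    {"database_name": "car_1", "difficulty": "easy", "question_id": "a"},
    {"database_name": "car_1", "difficulty": "easy", "question_id": "b"},
    {"database_name": "car_1", "difficulty": "easy", "question_id": "c"},
    {"database_name": "dog_kennels", "difficulty": "medium", "question_id": "d"},
]
print([q["question_id"] for q in select_warmup_subset(qs)])  # ['a', 'b', 'd']
```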

Why multi-turn matters: an earlier per-turn SFT (347 single-turn examples) taught the model to always call `describe`. Half those examples were `describe` calls, so the model learned "when asked a question, call `describe`." Under a KL penalty during GRPO, every rollout stayed identical, the advantage between rollouts was zero, and no policy gradient could form. Multi-turn SFT (120 full trajectories trained with `assistant_only_loss`) instead teaches the full `describe` → `query` → `answer` sequence as a coherent strategy, which GRPO then refines into error recovery and answer formatting. See the "Training" section of the blog post.
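The zero-gradient failure is easy to see from the group-relative advantage itself. A toy calculation, simplified: real GRPO also normalizes by the group's standard deviation and adds the KL term:

```python
def group_advantages(rewards):
    """Group-relative advantage: each rollout's reward minus the group mean.
    Simplified GRPO; the real objective also divides by the group std."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Identical rollouts -> identical rewards -> zero advantage, zero gradient.
print(group_advantages([0.5, 0.5, 0.5, 0.5]))  # [0.0, 0.0, 0.0, 0.0]
# Diverse rollouts give nonzero advantages and a usable learning signal.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # [0.5, -0.5, -0.5, 0.5]
```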

## How to regenerate from scratch

```bash
# 1. Databases (SQLite files from the Spider mirror)
uv run python scripts/download_spider_databases.py --db-id all

# 2. Raw Spider questions (via HF Datasets)
uv run python scripts/download_spider_questions.py --db-id all --split train
uv run python scripts/download_spider_questions.py --db-id all --split validation

# 3. Curate into questions_train.json / questions_eval.json
uv run python scripts/curate_questions.py

# 4. Regenerate SFT trajectories from gold SQL
uv run python scripts/generate_sft_data.py
```

You should not need to do this in normal operation; the curated files are committed. Regenerate only when updating the database allowlist, changing the answer-type taxonomy, or rerunning against a new Spider snapshot.

## What we deliberately do not use

- BIRD (Li et al., 2023): a larger, harder text-to-SQL benchmark. Out of scope for a 0.6B model; revisit for a larger-model follow-up.
- WikiSQL: single-table only, so it doesn't exercise the multi-turn exploration the environment is built for.
- Synthetic LLM-generated questions: we want Spider's human-written prompts so eval results are comparable to published work.
- Spider databases outside `db_list.json`: kept out to keep the repo small and training fast. Easy to widen by editing the list and rerunning the regeneration pipeline.

## References