# Data Sources
Reference for what data SQLEnv uses, where it comes from, and how to
regenerate it. All data lives under `data/` and is checked into the repo,
so a fresh clone works offline after `uv sync`.
## Summary
| Artifact | Path | Origin | Count |
|---|---|---|---|
| SQLite databases | `data/databases/<db_id>/<db_id>.sqlite` | [Spider](https://yale-lily.github.io/spider) (taoyds/spider on GitHub) | 10 databases |
| Training questions | `data/questions/questions_train.json` | Spider train split (`xlangai/spider`) | 473 questions |
| Eval questions | `data/questions/questions_eval.json` | Spider validation split (`xlangai/spider`) | 203 questions |
| DB allowlist | `data/questions/db_list.json` | hand-curated subset | 10 db_ids |
| SFT trajectories | `data/sft/sft_trajectories.json` | generated from gold SQL | 120 trajectories |
Total: 676 questions across 10 Spider databases, plus 120 multi-turn SFT
warmup trajectories.
## Upstream: Spider
[Spider](https://yale-lily.github.io/spider) (Yu et al., EMNLP 2018) is a
cross-domain text-to-SQL benchmark with ~200 databases and ~10k
question/gold-SQL pairs. Every question has a natural-language prompt, a
gold SQL query, and a target database. We use two mirrors:
1. **Questions** via HuggingFace Datasets: [`xlangai/spider`](https://huggingface.co/datasets/xlangai/spider),
loaded with `datasets.load_dataset("xlangai/spider", split=...)` in
`scripts/download_spider_questions.py`.
2. **SQLite databases** via the Spider GitHub mirror:
- `https://raw.githubusercontent.com/taoyds/spider/master/database/{db_id}/{db_id}.sqlite`
- Fallback: the official Google Drive archive
(`1403EGqzIDoHMdQF4c9Bkyl7dZLZ5Wt6J`)
Fetched by `scripts/download_spider_databases.py`.
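The mirror URL scheme above can be sketched as a small helper (the function name is illustrative, not the script's actual API):

```python
# Sketch of the per-database mirror URL used when fetching SQLite files.
# `spider_sqlite_url` is a hypothetical helper, not code from the repo.
MIRROR = "https://raw.githubusercontent.com/taoyds/spider/master/database"

def spider_sqlite_url(db_id: str) -> str:
    """Build the raw-GitHub URL for one Spider SQLite file."""
    return f"{MIRROR}/{db_id}/{db_id}.sqlite"
```

If the GitHub fetch fails, the script falls back to the Google Drive archive noted above.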
Spider's license is CC BY-SA 4.0.
## The 10-database subset (`db_list.json`)
We do not ship all ~200 Spider databases. The allowlist in
`data/questions/db_list.json` pins 10:
```
student_assessment, concert_singer, world_1, car_1, employee_hire_evaluation,
pets_1, cre_Doc_Template_Mgt, dog_kennels, flight_2, poker_player
```
These were chosen for schema variety (single-table aggregates, 2-table
joins, 3-table FK chains, naming-convention quirks) while keeping the
repo small and the training loop fast.
**Train/eval split is by database**, not random sampling within a
database, which limits train/eval leakage at the schema level:
- **Train databases** (7): `car_1, concert_singer, cre_Doc_Template_Mgt,
dog_kennels, employee_hire_evaluation, flight_2, student_assessment`
- **Eval databases** (4): `flight_2, pets_1, poker_player, world_1`
`flight_2` appears in both. The other eval DBs are schemas the model
never saw during training. `sql_env.training.data_loading.validate_no_data_leak`
asserts zero question-text overlap between the two files at load time.
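A minimal standalone sketch of what such a leak check can look like (the real `validate_no_data_leak` lives in `sql_env.training.data_loading`; this version is illustrative):

```python
def validate_no_data_leak(train_records, eval_records):
    """Raise if any question text appears in both splits."""
    train_texts = {r["question_text"] for r in train_records}
    eval_texts = {r["question_text"] for r in eval_records}
    overlap = train_texts & eval_texts
    if overlap:
        raise ValueError(f"{len(overlap)} question(s) appear in both splits")
```

Because the check compares exact question text, it catches accidental duplication across files but not paraphrases; the by-database split is what guards against schema-level leakage.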
## Question files
Both `questions_train.json` and `questions_eval.json` are lists of records
with this shape (actual sample from `car_1` train):
```json
{
"question_text": "How many cars have a larger accelerate than the car with the largest horsepower?",
"database_name": "car_1",
"gold_sql": "SELECT COUNT(*) FROM CARS_DATA WHERE Accelerate > (SELECT Accelerate FROM CARS_DATA ORDER BY Horsepower DESC LIMIT 1)",
"gold_answer": 39,
"answer_type": "integer",
"difficulty": "easy",
"tables_involved": ["CARS_DATA"],
"split": "train",
"question_id": "car_1_train_000"
}
```
### Counts and difficulty mix
| Split | Total | easy | medium | hard |
|---|---|---|---|---|
| train | 473 | 435 | 32 | 6 |
| eval | 203 | 185 | 18 | 0 |
The easy-heavy distribution is deliberate for the 0.6B capacity ceiling.
Extended GRPO training on harder questions produced identical accuracy,
which indicates the ceiling comes from pretraining knowledge rather than
training budget. Medium and hard questions stay in the mix for Phase 2
exposure but are not where this model size gains accuracy. See the
"Limitations at 0.6B Parameters" section of the
[blog post](https://hjerpe-sqlenv-blog.static.hf.space).
### Curation pipeline
`scripts/curate_questions.py` turns raw Spider records into the format
above. Per question, it:
1. Filters to databases in `db_list.json`.
2. Executes `gold_sql` against the real SQLite database to produce
`gold_answer`; this is what the environment grades agents against,
not a string match on the SQL.
3. Normalizes the answer into a typed shape (`integer`, `string`,
`list[...]`, `table`) via `answer_type`.
4. Parses `FROM` and `JOIN` tokens to fill `tables_involved` (used by SFT
generation to decide which tables to `describe()`).
5. Assigns the Spider-provided difficulty label.
6. Writes train and eval to separate files with a per-question
`question_id` derived from `{db_id}_{split}_{index}`.
Re-running the script is idempotent given the same Spider snapshot.
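Steps 2 and 3 can be sketched against an in-memory database (the table, query, and `normalize_answer` helper here are illustrative, not the script's actual code):

```python
import sqlite3

def normalize_answer(rows):
    """Collapse a SQL result set into a (answer, answer_type) pair."""
    if len(rows) == 1 and len(rows[0]) == 1:
        value = rows[0][0]
        return (value, "integer") if isinstance(value, int) else (value, "string")
    if all(len(r) == 1 for r in rows):
        return [r[0] for r in rows], "list[string]"
    return [list(r) for r in rows], "table"

# Execute a gold query against a toy schema and derive the typed answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CARS_DATA (Id INTEGER, Horsepower INTEGER)")
conn.executemany("INSERT INTO CARS_DATA VALUES (?, ?)", [(1, 90), (2, 110)])
gold_sql = "SELECT COUNT(*) FROM CARS_DATA WHERE Horsepower > 100"
rows = conn.execute(gold_sql).fetchall()
gold_answer, answer_type = normalize_answer(rows)
```

The point of executing the SQL rather than storing only the query is that grading happens on the result value, so an agent can earn reward with any query that produces the same answer.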
## SFT warmup trajectories (`data/sft/sft_trajectories.json`)
120 multi-turn trajectories used as a supervised warmup before GRPO.
Each record has `messages` (tool-calling chat format) and `tools` (the
four SQLEnv tool schemas).
### How they are generated
`scripts/generate_sft_data.py` walks the training questions and, for each
one, runs the real `SQLEnvironment` programmatically:
1. `describe(table)` for every table in `tables_involved`
2. `query(gold_sql)` β€” captures real tabular output from the environment
3. `answer(gold_answer)` β€” terminal step
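Rendered in a tool-calling chat format, the three steps above might be assembled roughly like this (tool names match this doc; the exact message schema is an assumption, not the script's actual output):

```python
def build_trajectory(question, tables, gold_sql, gold_answer, env_outputs):
    """Assemble one multi-turn SFT record from captured env responses."""
    messages = [{"role": "user", "content": question}]
    for table, schema in zip(tables, env_outputs["describe"]):
        messages.append({"role": "assistant", "tool_calls": [
            {"name": "describe", "arguments": {"table": table}}]})
        messages.append({"role": "tool", "content": schema})
    messages.append({"role": "assistant", "tool_calls": [
        {"name": "query", "arguments": {"sql": gold_sql}}]})
    messages.append({"role": "tool", "content": env_outputs["query"]})
    messages.append({"role": "assistant", "tool_calls": [
        {"name": "answer", "arguments": {"value": gold_answer}}]})
    return {"messages": messages}

traj = build_trajectory(
    "How many singers do we have?", ["singer"],
    "SELECT COUNT(*) FROM singer", 5,
    {"describe": ["singer(id, name, age)"], "query": "COUNT(*)\n5"},
)
```

The `tool` turns carry real environment output captured at generation time, which is what makes the trajectories non-synthetic.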
The captured sequence becomes an assistant-labelled trajectory. This is
**not synthetic text**. The assistant turns wrap the actual environment
responses the model will see at training and inference time, so the
SFT-warmed reference policy already expects real env output when GRPO
takes over.
SFT uses 120 trajectories rather than one per training question. The
subset is chosen to cover each database and difficulty bucket. See
`scripts/generate_sft_data.py` for the selection logic.
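One way such a coverage-first subset can be picked is round-robin over (database, difficulty) buckets; this is an illustrative sketch under that assumption, not the script's actual selection logic:

```python
from collections import defaultdict
from itertools import cycle

def select_subset(questions, k=120):
    """Pick k questions, cycling over (db, difficulty) buckets so every
    non-empty bucket contributes before any bucket contributes twice."""
    buckets = defaultdict(list)
    for q in questions:
        buckets[(q["database_name"], q["difficulty"])].append(q)
    picked, keys = [], cycle(sorted(buckets))
    while len(picked) < k and any(buckets.values()):
        key = next(keys)
        if buckets[key]:
            picked.append(buckets[key].pop(0))
    return picked
```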
Why multi-turn matters: an earlier per-turn SFT (347 single-turn
examples) taught the model to always call `describe`. Half those
examples were describe calls, so the model learned "when asked a
question, call describe." Under a KL penalty during GRPO, every rollout
stayed identical, the advantage between rollouts was zero, and no policy
gradient could form. Multi-turn SFT (120 full trajectories trained with
`assistant_only_loss`) instead teaches the full
`describe → query → answer` sequence as a coherent strategy, which GRPO
then refines into error recovery and answer formatting. See the
"Training" section of the
[blog post](https://hjerpe-sqlenv-blog.static.hf.space).
## How to regenerate from scratch
```bash
# 1. Databases (SQLite files from the Spider mirror)
uv run python scripts/download_spider_databases.py --db-id all
# 2. Raw Spider questions (via HF Datasets)
uv run python scripts/download_spider_questions.py --db-id all --split train
uv run python scripts/download_spider_questions.py --db-id all --split validation
# 3. Curate into questions_train.json / questions_eval.json
uv run python scripts/curate_questions.py
# 4. Regenerate SFT trajectories from gold SQL
uv run python scripts/generate_sft_data.py
```
You should not need to do this in normal operation; the curated files
are committed. Regenerate only when updating the database allowlist,
changing the answer-type taxonomy, or rerunning against a new Spider
snapshot.
## What we deliberately do not use
- **BIRD** (Li et al., 2023): a larger, harder text-to-SQL benchmark. Out
  of scope for a 0.6B model. Revisit for a larger-model follow-up.
- **WikiSQL**: single-table only; it doesn't exercise the multi-turn
  exploration the environment is built for.
- **Synthetic LLM-generated questions**: we want Spider's human-written
  prompts so eval results are comparable to published work.
- **Spider databases outside `db_list.json`**: excluded to keep the repo
  small and training fast. Easy to widen by editing the list and
  rerunning the regeneration pipeline.
## References
- [Spider dataset (Yale LILY)](https://yale-lily.github.io/spider)
- [taoyds/spider GitHub mirror](https://github.com/taoyds/spider)
- [xlangai/spider on HuggingFace](https://huggingface.co/datasets/xlangai/spider)
- Yu et al. (2018). *Spider: A Large-Scale Human-Labeled Dataset for
Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task.* EMNLP.