# Data Sources


Reference for what data SQLEnv uses, where it comes from, and how to
regenerate it. All data lives under `data/` and is checked into the repo,
so a fresh clone works offline after `uv sync`.


## Summary


| Artifact | Path | Origin | Count |
|---|---|---|---|
| SQLite databases | `data/databases/<db_id>/<db_id>.sqlite` | [Spider](https://yale-lily.github.io/spider) (taoyds/spider on GitHub) | 10 databases |
| Training questions | `data/questions/questions_train.json` | Spider train split (`xlangai/spider`) | 473 questions |
| Eval questions | `data/questions/questions_eval.json` | Spider validation split (`xlangai/spider`) | 203 questions |
| DB allowlist | `data/questions/db_list.json` | hand-curated subset | 10 db_ids |
| SFT trajectories | `data/sft/sft_trajectories.json` | generated from gold SQL | 120 trajectories |


Total: 676 questions across 10 Spider databases, plus 120 multi-turn SFT
warmup trajectories.


## Upstream: Spider


[Spider](https://yale-lily.github.io/spider) (Yu et al., EMNLP 2018) is a
cross-domain text-to-SQL benchmark with ~200 databases and ~10k
question/gold-SQL pairs. Every question has a natural-language prompt, a
gold SQL query, and a target database. We use two mirrors:


1. **Questions** via HuggingFace Datasets: [`xlangai/spider`](https://huggingface.co/datasets/xlangai/spider),
   loaded with `datasets.load_dataset("xlangai/spider", split=...)` in
   `scripts/download_spider_questions.py`.
2. **SQLite databases** via the Spider GitHub mirror:
   - `https://raw.githubusercontent.com/taoyds/spider/master/database/{db_id}/{db_id}.sqlite`
   - Fallback: the official Google Drive archive
     (`1403EGqzIDoHMdQF4c9Bkyl7dZLZ5Wt6J`)

   Both are fetched by `scripts/download_spider_databases.py`.

Spider's license is CC BY-SA 4.0.
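
The mirror URL layout described above is simple enough to sketch. A minimal helper, assuming the `raw.githubusercontent.com` path pattern shown in the list (the function name `mirror_url` is ours for illustration, not part of the script's API):

```python
# Build the raw-GitHub mirror URL for one Spider SQLite database.
# Illustrative helper; scripts/download_spider_databases.py may
# structure this differently internally.
MIRROR_BASE = "https://raw.githubusercontent.com/taoyds/spider/master/database"

def mirror_url(db_id: str) -> str:
    """Return the raw-GitHub URL for one Spider SQLite database."""
    return f"{MIRROR_BASE}/{db_id}/{db_id}.sqlite"

print(mirror_url("car_1"))
# https://raw.githubusercontent.com/taoyds/spider/master/database/car_1/car_1.sqlite
```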


## The 10-database subset (`db_list.json`)

We do not ship all ~200 Spider databases. The allowlist in
`data/questions/db_list.json` pins 10:


```
student_assessment, concert_singer, world_1, car_1, employee_hire_evaluation,
pets_1, cre_Doc_Template_Mgt, dog_kennels, flight_2, poker_player
```


These were chosen for schema variety (single-table aggregates, 2-table
joins, 3-table FK chains, naming-convention quirks) while keeping the
repo small and the training loop fast.


**Train/eval split is by database**, not random sampling within a
database. This limits train/eval leakage at the schema level:


- **Train databases** (7): `car_1, concert_singer, cre_Doc_Template_Mgt,
  dog_kennels, employee_hire_evaluation, flight_2, student_assessment`
- **Eval databases** (4): `flight_2, pets_1, poker_player, world_1`


`flight_2` appears in both. The other three eval DBs are schemas the model
never saw during training. `sql_env.training.data_loading.validate_no_data_leak`
asserts zero question-text overlap between the two files at load time.
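
A minimal sketch of what such a load-time check can look like. The real `validate_no_data_leak` lives in `sql_env.training.data_loading`; this standalone version only illustrates the invariant and assumes records shaped like the question files below:

```python
# Illustrative stand-in for validate_no_data_leak: assert that no
# question text appears in both the train and eval files.
def validate_no_data_leak(train_records, eval_records):
    train_texts = {r["question_text"] for r in train_records}
    eval_texts = {r["question_text"] for r in eval_records}
    overlap = train_texts & eval_texts
    if overlap:
        raise ValueError(f"{len(overlap)} question(s) appear in both splits")

train = [{"question_text": "How many singers do we have?"}]
evals = [{"question_text": "What is the population of each continent?"}]
validate_no_data_leak(train, evals)  # disjoint texts: passes silently
```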


## Question files


Both `questions_train.json` and `questions_eval.json` are lists of records
with this shape (actual sample from `car_1` train):


```json
{
  "question_text": "How many cars have a larger accelerate than the car with the largest horsepower?",
  "database_name": "car_1",
  "gold_sql": "SELECT COUNT(*) FROM CARS_DATA WHERE Accelerate > (SELECT Accelerate FROM CARS_DATA ORDER BY Horsepower DESC LIMIT 1)",
  "gold_answer": 39,
  "answer_type": "integer",
  "difficulty": "easy",
  "tables_involved": ["CARS_DATA"],
  "split": "train",
  "question_id": "car_1_train_000"
}
```


### Counts and difficulty mix


| Split | Total | easy | medium | hard |
|---|---|---|---|---|
| train | 473 | 435 | 32 | 6 |
| eval | 203 | 185 | 18 | 0 |


The easy-heavy distribution is deliberate given the 0.6B capacity ceiling.
Extended GRPO training on harder questions produced identical accuracy,
which indicates the ceiling comes from pretraining knowledge rather than
training budget. Medium and hard questions stay in the mix for Phase 2
exposure but are not where this model size gains accuracy. See the
"Limitations at 0.6B Parameters" section of the
[blog post](https://hjerpe-sqlenv-blog.static.hf.space).


### Curation pipeline


`scripts/curate_questions.py` turns raw Spider records into the format
above. Per question, it:


1. Filters to databases in `db_list.json`.
2. Executes `gold_sql` against the real SQLite database to produce
   `gold_answer`; this is what the environment grades agents against,
   not a string match on the SQL.
3. Normalizes the answer into a typed shape (`integer`, `string`,
   `list[...]`, `table`) via `answer_type`.
4. Parses `FROM` and `JOIN` tokens to fill `tables_involved` (used by SFT
   generation to decide which tables to `describe()`).
5. Assigns the Spider-provided difficulty label.
6. Writes train and eval to separate files with a per-question
   `question_id` derived from `{db_id}_{split}_{index}`.


Re-running the script is idempotent given the same Spider snapshot.
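
Steps 2 and 4 can be sketched in a few lines. This is a hypothetical condensed version, not the actual `curate_questions.py` internals; the helper names and the naive regex are ours:

```python
import re
import sqlite3

def execute_gold_sql(conn: sqlite3.Connection, gold_sql: str):
    """Run the gold SQL and normalize: a 1x1 result collapses to a scalar."""
    rows = conn.execute(gold_sql).fetchall()
    if len(rows) == 1 and len(rows[0]) == 1:
        return rows[0][0]  # e.g. a COUNT(*) becomes an integer gold_answer
    return rows

def tables_involved(gold_sql: str) -> list[str]:
    """Naive FROM/JOIN token parse, as described in step 4."""
    return sorted(set(re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_]\w*)", gold_sql, re.I)))

# Tiny in-memory demo mirroring the car_1 sample record above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CARS_DATA (Accelerate REAL, Horsepower INT)")
conn.executemany("INSERT INTO CARS_DATA VALUES (?, ?)",
                 [(12.0, 90), (20.5, 50), (9.0, 200)])
sql = ("SELECT COUNT(*) FROM CARS_DATA WHERE Accelerate > "
       "(SELECT Accelerate FROM CARS_DATA ORDER BY Horsepower DESC LIMIT 1)")
print(execute_gold_sql(conn, sql))  # 2 rows accelerate faster than 9.0 -> 2
print(tables_involved(sql))         # ['CARS_DATA']
```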


## SFT warmup trajectories (`data/sft/sft_trajectories.json`)

120 multi-turn trajectories used as a supervised warmup before GRPO.
Each record has `messages` (tool-calling chat format) and `tools` (the
four SQLEnv tool schemas).

### How they are generated

`scripts/generate_sft_data.py` walks the training questions and, for each
one, runs the real `SQLEnvironment` programmatically:

1. `describe(table)` for every table in `tables_involved`
2. `query(gold_sql)`, which captures real tabular output from the environment
3. `answer(gold_answer)`, the terminal step


The captured sequence becomes an assistant-labelled trajectory. This is
**not synthetic text**. The assistant turns wrap the actual environment
responses the model will see at training and inference time, so the
SFT-warmed reference policy already expects real env output when GRPO
takes over.
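
To make the record shape concrete, here is a hypothetical sketch of how one captured sequence could be assembled into a `messages` list. The `messages`/`tools` keys come from the repo; the role layout and tool-call encoding below are simplified assumptions, not the actual chat template:

```python
# Simplified assembly of one multi-turn SFT record. Real env outputs are
# passed in; nothing here is synthetic text.
def build_trajectory(question, tables, gold_sql,
                     describe_outputs, query_output, gold_answer):
    messages = [{"role": "user", "content": question}]
    for table, output in zip(tables, describe_outputs):
        messages.append({"role": "assistant", "content": f"describe({table!r})"})
        messages.append({"role": "tool", "content": output})  # real env output
    messages.append({"role": "assistant", "content": f"query({gold_sql!r})"})
    messages.append({"role": "tool", "content": query_output})
    messages.append({"role": "assistant", "content": f"answer({gold_answer!r})"})
    return {"messages": messages}

traj = build_trajectory(
    question="How many singers do we have?",
    tables=["singer"],
    describe_outputs=["singer(Singer_ID, Name, Country, Age)"],
    gold_sql="SELECT COUNT(*) FROM singer",
    query_output="COUNT(*)\n6",
    gold_answer=6,
)
# 1 user turn + 2 per describe + 2 for the query + 1 answer = 6 messages
print(len(traj["messages"]))
```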


SFT uses 120 trajectories rather than one per training question. The
subset is chosen to cover each database and difficulty bucket. See
`scripts/generate_sft_data.py` for the selection logic.


Why multi-turn matters: an earlier per-turn SFT (347 single-turn
examples) taught the model to always call `describe`. Half of those
examples were describe calls, so the model learned "when asked a
question, call describe." Under a KL penalty during GRPO, every rollout
stayed identical, the advantage between rollouts was zero, and no policy
gradient could form. Multi-turn SFT (120 full trajectories trained with
`assistant_only_loss`) instead teaches the full
`describe → query → answer` sequence as a coherent strategy, which GRPO
then refines into error recovery and answer formatting. See the
"Training" section of the
[blog post](https://hjerpe-sqlenv-blog.static.hf.space).
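
The zero-gradient failure mode is easy to see numerically: GRPO computes each rollout's advantage relative to its group, so identical rollouts earn identical rewards and every advantage is exactly zero. A toy calculation, assuming plain mean-centering (actual GRPO implementations often also divide by the group's reward std):

```python
# Toy group-relative advantage: reward minus the group mean. When every
# rollout in a group collapses to the same behavior (e.g. always calling
# describe), rewards are identical, all advantages are zero, and no
# policy gradient forms.
def group_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # [0.5, -0.5, -0.5, 0.5]
```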


## How to regenerate from scratch


```bash
# 1. Databases (SQLite files from the Spider mirror)
uv run python scripts/download_spider_databases.py --db-id all

# 2. Raw Spider questions (via HF Datasets)
uv run python scripts/download_spider_questions.py --db-id all --split train
uv run python scripts/download_spider_questions.py --db-id all --split validation

# 3. Curate into questions_train.json / questions_eval.json
uv run python scripts/curate_questions.py

# 4. Regenerate SFT trajectories from gold SQL
uv run python scripts/generate_sft_data.py
```


You should not need to do this in normal operation: the curated files
are committed. Regenerate only when updating the database allowlist,
changing the answer-type taxonomy, or rerunning against a new Spider
snapshot.


## What we deliberately do not use


- **BIRD** (Li et al., 2023): a larger, harder text-to-SQL benchmark, out
  of scope for a 0.6B model. Revisit for a larger-model follow-up.
- **WikiSQL**: single-table only, so it doesn't exercise the multi-turn
  exploration the environment is built for.
- **Synthetic LLM-generated questions**: we want Spider's human-written
  prompts so eval results are comparable to published work.
- **Spider databases outside `db_list.json`**: kept out to keep the
  repo small and training fast. Easy to widen by editing the list and
  rerunning the regeneration pipeline.

## References

- [Spider dataset (Yale LILY)](https://yale-lily.github.io/spider)
- [taoyds/spider GitHub mirror](https://github.com/taoyds/spider)
- [xlangai/spider on HuggingFace](https://huggingface.co/datasets/xlangai/spider)
- Yu et al. (2018). *Spider: A Large-Scale Human-Labeled Dataset for
  Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task.* EMNLP.