# Data Sources

Reference for what data SQLEnv uses, where it comes from, and how to regenerate it. All data lives under `data/` and is checked into the repo, so a fresh clone works offline after `uv sync`.

## Summary

| Artifact | Path | Origin | Count |
|---|---|---|---|
| SQLite databases | `data/databases/{db_id}/{db_id}.sqlite` | [Spider](https://yale-lily.github.io/spider) (taoyds/spider on GitHub) | 10 databases |
| Training questions | `data/questions/questions_train.json` | Spider train split (`xlangai/spider`) | 473 questions |
| Eval questions | `data/questions/questions_eval.json` | Spider validation split (`xlangai/spider`) | 203 questions |
| DB allowlist | `data/questions/db_list.json` | hand-curated subset | 10 db_ids |
| SFT trajectories | `data/sft/sft_trajectories.json` | generated from gold SQL | 120 trajectories |

Total: 676 questions across 10 Spider databases, plus 120 multi-turn SFT warmup trajectories.

## Upstream: Spider

[Spider](https://yale-lily.github.io/spider) (Yu et al., EMNLP 2018) is a cross-domain text-to-SQL benchmark with ~200 databases and ~10k question/gold-SQL pairs. Every question has a natural-language prompt, a gold SQL query, and a target database.

We use two mirrors:

1. **Questions** via HuggingFace Datasets: [`xlangai/spider`](https://huggingface.co/datasets/xlangai/spider) — loaded with `datasets.load_dataset("xlangai/spider", split=...)` in `scripts/download_spider_questions.py`.
2. **SQLite databases** via the Spider GitHub mirror:
   - `https://raw.githubusercontent.com/taoyds/spider/master/database/{db_id}/{db_id}.sqlite`
   - Fallback: the official Google Drive archive (`1403EGqzIDoHMdQF4c9Bkyl7dZLZ5Wt6J`) — fetched by `scripts/download_spider_databases.py`.

Spider's license is CC BY-SA 4.0.

## The 10-database subset (`db_list.json`)

We do not ship all ~200 Spider databases.
The allowlist in `data/questions/db_list.json` pins 10:

```
student_assessment, concert_singer, world_1, car_1, employee_hire_evaluation,
pets_1, cre_Doc_Template_Mgt, dog_kennels, flight_2, poker_player
```

These were chosen for schema variety (single-table aggregates, 2-table joins, 3-table FK chains, naming-convention quirks) while keeping the repo small and the training loop fast.

**Train/eval split is by database**, not random sampling within a database. This limits train/eval leakage at the schema level:

- **Train databases** (7): `car_1, concert_singer, cre_Doc_Template_Mgt, dog_kennels, employee_hire_evaluation, flight_2, student_assessment`
- **Eval databases** (4): `flight_2, pets_1, poker_player, world_1`

`flight_2` appears in both. The other eval DBs are schemas the model never saw during training. `sql_env.training.data_loading.validate_no_data_leak` asserts zero question-text overlap between the two files at load time.

## Question files

Both `questions_train.json` and `questions_eval.json` are lists of records with this shape (actual sample from `car_1` train):

```json
{
  "question_text": "How many cars have a larger accelerate than the car with the largest horsepower?",
  "database_name": "car_1",
  "gold_sql": "SELECT COUNT(*) FROM CARS_DATA WHERE Accelerate > (SELECT Accelerate FROM CARS_DATA ORDER BY Horsepower DESC LIMIT 1)",
  "gold_answer": 39,
  "answer_type": "integer",
  "difficulty": "easy",
  "tables_involved": ["CARS_DATA"],
  "split": "train",
  "question_id": "car_1_train_000"
}
```

### Counts and difficulty mix

| Split | Total | easy | medium | hard |
|---|---|---|---|---|
| train | 473 | 435 | 32 | 6 |
| eval | 203 | 185 | 18 | 0 |

The easy-heavy distribution is deliberate for the 0.6B capacity ceiling. Extended GRPO training on harder questions produced identical accuracy, which indicates the ceiling comes from pretraining knowledge rather than training budget.
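The load-time leak check can be sketched as below. This is a standalone illustration, not the actual `sql_env.training.data_loading.validate_no_data_leak` implementation; it assumes only the `question_text` field from the record shape shown in this document.

```python
import json


def validate_no_data_leak(train_records, eval_records):
    """Assert zero question-text overlap between the train and eval files."""
    train_texts = {r["question_text"] for r in train_records}
    eval_texts = {r["question_text"] for r in eval_records}
    overlap = train_texts & eval_texts
    if overlap:
        raise ValueError(
            f"{len(overlap)} question(s) appear in both splits, "
            f"e.g. {sorted(overlap)[:3]}"
        )


def load_split(path):
    """Each question file is a plain JSON list of records."""
    with open(path) as f:
        return json.load(f)


# Usage against the committed files:
# validate_no_data_leak(
#     load_split("data/questions/questions_train.json"),
#     load_split("data/questions/questions_eval.json"),
# )
```

A set intersection on raw question text is deliberately strict: even with `flight_2` shared between splits, no individual question may repeat.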
Medium and hard questions stay in the mix for Phase 2 exposure but are not where this model size gains accuracy. See the "Limitations at 0.6B Parameters" section of the [blog post](https://hjerpe-sqlenv-blog.static.hf.space).

### Curation pipeline

`scripts/curate_questions.py` turns raw Spider records into the format above. Per question, it:

1. Filters to databases in `db_list.json`.
2. Executes `gold_sql` against the real SQLite database to produce `gold_answer` — this is what the environment grades agents against, not a string match on the SQL.
3. Normalizes the answer into a typed shape (`integer`, `string`, `list[...]`, `table`) via `answer_type`.
4. Parses `FROM` and `JOIN` tokens to fill `tables_involved` (used by SFT generation to decide which tables to `describe()`).
5. Assigns the Spider-provided difficulty label.
6. Writes train and eval to separate files with a per-question `question_id` derived from `{db_id}_{split}_{index}`.

Re-running the script is idempotent given the same Spider snapshot.

## SFT warmup trajectories (`data/sft/sft_trajectories.json`)

120 multi-turn trajectories used as a supervised warmup before GRPO. Each record has `messages` (tool-calling chat format) and `tools` (the four SQLEnv tool schemas).

### How they are generated

`scripts/generate_sft_data.py` walks the training questions and, for each one, runs the real `SQLEnvironment` programmatically:

1. `describe(table)` for every table in `tables_involved`
2. `query(gold_sql)` — captures real tabular output from the environment
3. `answer(gold_answer)` — terminal step

The captured sequence becomes an assistant-labelled trajectory. This is **not synthetic text**: the assistant turns wrap the actual environment responses the model will see at training and inference time, so the SFT-warmed reference policy already expects real env output when GRPO takes over.

SFT uses 120 trajectories rather than one per training question.
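The describe → query → answer capture can be illustrated with a minimal sketch. It uses `sqlite3` directly against an in-memory toy table rather than the real `SQLEnvironment`, and the message wrapping here is hypothetical; only the overall shape (real query output embedded in the tool turns) mirrors what `scripts/generate_sft_data.py` produces.

```python
import sqlite3


def capture_trajectory(conn, question, tables, gold_sql, gold_answer):
    """Build a multi-turn trajectory whose tool turns hold real query output."""
    messages = [{"role": "user", "content": question}]
    cur = conn.cursor()
    # 1. describe(table) for every table involved
    for table in tables:
        schema = cur.execute(f"PRAGMA table_info({table})").fetchall()
        messages.append({"role": "assistant", "content": f"describe({table})"})
        messages.append({"role": "tool", "content": str(schema)})
    # 2. query(gold_sql): capture real tabular output, not synthetic text
    rows = cur.execute(gold_sql).fetchall()
    messages.append({"role": "assistant", "content": f"query({gold_sql})"})
    messages.append({"role": "tool", "content": str(rows)})
    # 3. answer(gold_answer): terminal step
    messages.append({"role": "assistant", "content": f"answer({gold_answer})"})
    return messages


# Toy database standing in for a Spider .sqlite file
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CARS_DATA (Id INTEGER, Horsepower INTEGER)")
conn.executemany("INSERT INTO CARS_DATA VALUES (?, ?)", [(1, 90), (2, 110)])
traj = capture_trajectory(
    conn, "How many cars are there?", ["CARS_DATA"],
    "SELECT COUNT(*) FROM CARS_DATA", 2,
)
```

Because the tool turns are produced by actually executing the SQL, training on these trajectories exposes the model to the exact output format it will see from the environment at inference time.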
The subset is chosen to cover each database and difficulty bucket. See `scripts/generate_sft_data.py` for the selection logic.

Why multi-turn matters: an earlier per-turn SFT (347 single-turn examples) taught the model to always call `describe`. Half those examples were describe calls, so the model learned "when asked a question, call describe." Under a KL penalty during GRPO, every rollout stayed identical, the advantage between rollouts was zero, and no policy gradient could form. Multi-turn SFT (120 full trajectories trained with `assistant_only_loss`) instead teaches the full `describe → query → answer` sequence as a coherent strategy, which GRPO then refines into error recovery and answer formatting. See the "Training" section of the [blog post](https://hjerpe-sqlenv-blog.static.hf.space).

## How to regenerate from scratch

```bash
# 1. Databases (SQLite files from the Spider mirror)
uv run python scripts/download_spider_databases.py --db-id all

# 2. Raw Spider questions (via HF Datasets)
uv run python scripts/download_spider_questions.py --db-id all --split train
uv run python scripts/download_spider_questions.py --db-id all --split validation

# 3. Curate into questions_train.json / questions_eval.json
uv run python scripts/curate_questions.py

# 4. Regenerate SFT trajectories from gold SQL
uv run python scripts/generate_sft_data.py
```

You should not need to do this in normal operation — the curated files are committed. Regenerate only when updating the database allowlist, changing the answer-type taxonomy, or rerunning against a new Spider snapshot.

## What we deliberately do not use

- **BIRD** (Li et al., 2023) — larger, harder text-to-SQL benchmark. Out of scope for a 0.6B model. Revisit for a larger-model follow-up.
- **WikiSQL** — single-table only, doesn't exercise the multi-turn exploration the environment is built for.
- **Synthetic LLM-generated questions** — we want Spider's human-written prompts so eval results are comparable to published work.
- **Spider databases outside `db_list.json`** — kept out to keep the repo small and training fast. Easy to widen by editing the list and rerunning the regeneration pipeline.

## References

- [Spider dataset (Yale LILY)](https://yale-lily.github.io/spider)
- [taoyds/spider GitHub mirror](https://github.com/taoyds/spider)
- [xlangai/spider on HuggingFace](https://huggingface.co/datasets/xlangai/spider)
- Yu et al. (2018). *Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task.* EMNLP.