# Data Sources
Reference for what data SQLEnv uses, where it comes from, and how to
regenerate it. All data lives under `data/` and is checked into the repo,
so a fresh clone works offline after `uv sync`.
## Summary
| Artifact | Path | Origin | Count |
|---|---|---|---|
| SQLite databases | `data/databases/<db_id>/<db_id>.sqlite` | [Spider](https://yale-lily.github.io/spider) (taoyds/spider on GitHub) | 10 databases |
| Training questions | `data/questions/questions_train.json` | Spider train split (`xlangai/spider`) | 473 questions |
| Eval questions | `data/questions/questions_eval.json` | Spider validation split (`xlangai/spider`) | 203 questions |
| DB allowlist | `data/questions/db_list.json` | hand-curated subset | 10 db_ids |
| SFT trajectories | `data/sft/sft_trajectories.json` | generated from gold SQL | 120 trajectories |
Total: 676 questions across 10 Spider databases, plus 120 multi-turn SFT
warmup trajectories.
## Upstream: Spider
[Spider](https://yale-lily.github.io/spider) (Yu et al., EMNLP 2018) is a
cross-domain text-to-SQL benchmark with ~200 databases and ~10k
question/gold-SQL pairs. Every question has a natural-language prompt, a
gold SQL query, and a target database. We use two mirrors:
1. **Questions** via HuggingFace Datasets: [`xlangai/spider`](https://huggingface.co/datasets/xlangai/spider),
loaded with `datasets.load_dataset("xlangai/spider", split=...)` in
`scripts/download_spider_questions.py`.
2. **SQLite databases** via the Spider GitHub mirror:
- `https://raw.githubusercontent.com/taoyds/spider/master/database/{db_id}/{db_id}.sqlite`
- Fallback: the official Google Drive archive
(`1403EGqzIDoHMdQF4c9Bkyl7dZLZ5Wt6J`)
Fetched by `scripts/download_spider_databases.py`.
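The mirror URL scheme above can be sketched as a small helper (the function name is illustrative, not the script's actual API):

```python
# Sketch of the per-database mirror URL used when fetching SQLite files.
# `spider_sqlite_url` is a hypothetical helper, not code from the repo.
MIRROR = "https://raw.githubusercontent.com/taoyds/spider/master/database"

def spider_sqlite_url(db_id: str) -> str:
    """Build the raw-GitHub URL for one Spider SQLite file."""
    return f"{MIRROR}/{db_id}/{db_id}.sqlite"
```

If the GitHub fetch fails, the script falls back to the Google Drive archive noted above.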
Spider's license is CC BY-SA 4.0.
## The 10-database subset (`db_list.json`)
We do not ship all ~200 Spider databases. The allowlist in
`data/questions/db_list.json` pins 10:
```
student_assessment, concert_singer, world_1, car_1, employee_hire_evaluation,
pets_1, cre_Doc_Template_Mgt, dog_kennels, flight_2, poker_player
```
These were chosen for schema variety (single-table aggregates, 2-table
joins, 3-table FK chains, naming-convention quirks) while keeping the
repo small and the training loop fast.
**Train/eval split is by database**, not random sampling within a
database, which limits train/eval leakage at the schema level:
- **Train databases** (7): `car_1, concert_singer, cre_Doc_Template_Mgt,
dog_kennels, employee_hire_evaluation, flight_2, student_assessment`
- **Eval databases** (4): `flight_2, pets_1, poker_player, world_1`
`flight_2` appears in both. The other eval DBs are schemas the model
never saw during training. `sql_env.training.data_loading.validate_no_data_leak`
asserts zero question-text overlap between the two files at load time.
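A minimal standalone sketch of what such a leak check can look like (the real `validate_no_data_leak` lives in `sql_env.training.data_loading`; this version is illustrative):

```python
def validate_no_data_leak(train_records, eval_records):
    """Raise if any question text appears in both splits."""
    train_texts = {r["question_text"] for r in train_records}
    eval_texts = {r["question_text"] for r in eval_records}
    overlap = train_texts & eval_texts
    if overlap:
        raise ValueError(f"{len(overlap)} question(s) appear in both splits")
```

Because the check compares exact question text, it catches accidental duplication across files but not paraphrases; the by-database split is what guards against schema-level leakage.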
## Question files
Both `questions_train.json` and `questions_eval.json` are lists of records
with this shape (actual sample from `car_1` train):
```json
{
"question_text": "How many cars have a larger accelerate than the car with the largest horsepower?",
"database_name": "car_1",
"gold_sql": "SELECT COUNT(*) FROM CARS_DATA WHERE Accelerate > (SELECT Accelerate FROM CARS_DATA ORDER BY Horsepower DESC LIMIT 1)",
"gold_answer": 39,
"answer_type": "integer",
"difficulty": "easy",
"tables_involved": ["CARS_DATA"],
"split": "train",
"question_id": "car_1_train_000"
}
```
### Counts and difficulty mix
| Split | Total | easy | medium | hard |
|---|---|---|---|---|
| train | 473 | 435 | 32 | 6 |
| eval | 203 | 185 | 18 | 0 |
The easy-heavy distribution is deliberate for the 0.6B capacity ceiling.
Extended GRPO training on harder questions produced identical accuracy,
which indicates the ceiling comes from pretraining knowledge rather than
training budget. Medium and hard questions stay in the mix for Phase 2
exposure but are not where this model size gains accuracy. See the
"Limitations at 0.6B Parameters" section of the
[blog post](https://hjerpe-sqlenv-blog.static.hf.space).
### Curation pipeline
`scripts/curate_questions.py` turns raw Spider records into the format
above. Per question, it:
1. Filters to databases in `db_list.json`.
2. Executes `gold_sql` against the real SQLite database to produce
`gold_answer`; this is what the environment grades agents against,
not a string match on the SQL.
3. Normalizes the answer into a typed shape (`integer`, `string`,
`list[...]`, `table`) via `answer_type`.
4. Parses `FROM` and `JOIN` tokens to fill `tables_involved` (used by SFT
generation to decide which tables to `describe()`).
5. Assigns the Spider-provided difficulty label.
6. Writes train and eval to separate files with a per-question
`question_id` derived from `{db_id}_{split}_{index}`.
Re-running the script is idempotent given the same Spider snapshot.
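Steps 2 and 3 can be sketched against an in-memory database (the table, query, and `normalize_answer` helper here are illustrative, not the script's actual code):

```python
import sqlite3

def normalize_answer(rows):
    """Collapse a SQL result set into a (answer, answer_type) pair."""
    if len(rows) == 1 and len(rows[0]) == 1:
        value = rows[0][0]
        return (value, "integer") if isinstance(value, int) else (value, "string")
    if all(len(r) == 1 for r in rows):
        return [r[0] for r in rows], "list[string]"
    return [list(r) for r in rows], "table"

# Execute a gold query against a toy schema and derive the typed answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CARS_DATA (Id INTEGER, Horsepower INTEGER)")
conn.executemany("INSERT INTO CARS_DATA VALUES (?, ?)", [(1, 90), (2, 110)])
gold_sql = "SELECT COUNT(*) FROM CARS_DATA WHERE Horsepower > 100"
rows = conn.execute(gold_sql).fetchall()
gold_answer, answer_type = normalize_answer(rows)
```

The point of executing the SQL rather than storing only the query is that grading happens on the result value, so an agent can earn reward with any query that produces the same answer.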
## SFT warmup trajectories (`data/sft/sft_trajectories.json`)
120 multi-turn trajectories used as a supervised warmup before GRPO.
Each record has `messages` (tool-calling chat format) and `tools` (the
four SQLEnv tool schemas).
### How they are generated
`scripts/generate_sft_data.py` walks the training questions and, for each
one, runs the real `SQLEnvironment` programmatically:
1. `describe(table)` for every table in `tables_involved`
2. `query(gold_sql)` β€” captures real tabular output from the environment
3. `answer(gold_answer)` β€” terminal step
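Rendered in a tool-calling chat format, the three steps above might be assembled roughly like this (tool names match this doc; the exact message schema is an assumption, not the script's actual output):

```python
def build_trajectory(question, tables, gold_sql, gold_answer, env_outputs):
    """Assemble one multi-turn SFT record from captured env responses."""
    messages = [{"role": "user", "content": question}]
    for table, schema in zip(tables, env_outputs["describe"]):
        messages.append({"role": "assistant", "tool_calls": [
            {"name": "describe", "arguments": {"table": table}}]})
        messages.append({"role": "tool", "content": schema})
    messages.append({"role": "assistant", "tool_calls": [
        {"name": "query", "arguments": {"sql": gold_sql}}]})
    messages.append({"role": "tool", "content": env_outputs["query"]})
    messages.append({"role": "assistant", "tool_calls": [
        {"name": "answer", "arguments": {"value": gold_answer}}]})
    return {"messages": messages}

traj = build_trajectory(
    "How many singers do we have?", ["singer"],
    "SELECT COUNT(*) FROM singer", 5,
    {"describe": ["singer(id, name, age)"], "query": "COUNT(*)\n5"},
)
```

The `tool` turns carry real environment output captured at generation time, which is what makes the trajectories non-synthetic.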
The captured sequence becomes an assistant-labelled trajectory. This is
**not synthetic text**. The assistant turns wrap the actual environment
responses the model will see at training and inference time, so the
SFT-warmed reference policy already expects real env output when GRPO
takes over.
SFT uses 120 trajectories rather than one per training question. The
subset is chosen to cover each database and difficulty bucket. See
`scripts/generate_sft_data.py` for the selection logic.
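One way such a coverage-first subset can be picked is round-robin over (database, difficulty) buckets; this is an illustrative sketch under that assumption, not the script's actual selection logic:

```python
from collections import defaultdict
from itertools import cycle

def select_subset(questions, k=120):
    """Pick k questions, cycling over (db, difficulty) buckets so every
    non-empty bucket contributes before any bucket contributes twice."""
    buckets = defaultdict(list)
    for q in questions:
        buckets[(q["database_name"], q["difficulty"])].append(q)
    picked, keys = [], cycle(sorted(buckets))
    while len(picked) < k and any(buckets.values()):
        key = next(keys)
        if buckets[key]:
            picked.append(buckets[key].pop(0))
    return picked
```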
Why multi-turn matters: an earlier per-turn SFT (347 single-turn
examples) taught the model to always call `describe`. Half those
examples were describe calls, so the model learned "when asked a
question, call describe." Under a KL penalty during GRPO, every rollout
stayed identical, the advantage between rollouts was zero, and no policy
gradient could form. Multi-turn SFT (120 full trajectories trained with
`assistant_only_loss`) instead teaches the full
`describe → query → answer` sequence as a coherent strategy, which GRPO
then refines into error recovery and answer formatting. See the
"Training" section of the
[blog post](https://hjerpe-sqlenv-blog.static.hf.space).
## How to regenerate from scratch
```bash
# 1. Databases (SQLite files from the Spider mirror)
uv run python scripts/download_spider_databases.py --db-id all
# 2. Raw Spider questions (via HF Datasets)
uv run python scripts/download_spider_questions.py --db-id all --split train
uv run python scripts/download_spider_questions.py --db-id all --split validation
# 3. Curate into questions_train.json / questions_eval.json
uv run python scripts/curate_questions.py
# 4. Regenerate SFT trajectories from gold SQL
uv run python scripts/generate_sft_data.py
```
You should not need to do this in normal operation; the curated files
are committed. Regenerate only when updating the database allowlist,
changing the answer-type taxonomy, or rerunning against a new Spider
snapshot.
## What we deliberately do not use
- **BIRD** (Li et al., 2023): a larger, harder text-to-SQL benchmark. Out
  of scope for a 0.6B model. Revisit for a larger-model follow-up.
- **WikiSQL**: single-table only; it doesn't exercise the multi-turn
  exploration the environment is built for.
- **Synthetic LLM-generated questions**: we want Spider's human-written
  prompts so eval results are comparable to published work.
- **Spider databases outside `db_list.json`**: excluded to keep the repo
  small and training fast. Easy to widen by editing the list and
  rerunning the regeneration pipeline.
## References
- [Spider dataset (Yale LILY)](https://yale-lily.github.io/spider)
- [taoyds/spider GitHub mirror](https://github.com/taoyds/spider)
- [xlangai/spider on HuggingFace](https://huggingface.co/datasets/xlangai/spider)
- Yu et al. (2018). *Spider: A Large-Scale Human-Labeled Dataset for
Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task.* EMNLP.