
# System Behavior: Dataset Curation

Living document, updated by `/archive-spec` when features are completed. Last archived: F004 on 2026-03-24.


## ADDED

### Curation script produces enriched question dataset

Running `python scripts/curate_questions.py` produces two JSON files (`data/questions/questions_train.json` and `data/questions/questions_eval.json`) containing 100+ enriched questions across 10 Spider databases. Each question record includes `question_id`, `question_text`, `database_name`, `gold_sql`, `gold_answer`, `answer_type`, and `split` fields, plus `difficulty` and `tables_involved`.
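For illustration, one enriched record might look like the sketch below. The field names follow the spec, but the concrete values (the ID format, the `answer_type` label, the shape of `gold_answer`) are assumptions, not taken from the actual dataset:

```python
import json

# Hypothetical example of one enriched question record; field names match
# the spec, but all concrete values here are illustrative assumptions.
record = {
    "question_id": "q_0001",                      # assumed ID format
    "question_text": "How many singers are there?",
    "database_name": "concert_singer",            # a Spider db_id
    "gold_sql": "SELECT count(*) FROM singer",
    "gold_answer": [[6]],                         # shape is an assumption
    "answer_type": "count",                       # assumed type label
    "difficulty": "easy",
    "tables_involved": ["singer"],
    "split": "train",
}

# A record is complete when every required field is present.
REQUIRED_FIELDS = {
    "question_id", "question_text", "database_name", "gold_sql",
    "gold_answer", "answer_type", "difficulty", "tables_involved", "split",
}
assert REQUIRED_FIELDS <= record.keys()
print(json.dumps(record, indent=2))
```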

### Curation script downloads Spider SQLite databases on demand

Running `python scripts/curate_questions.py` downloads Spider SQLite database files into `data/databases/{db_id}/{db_id}.sqlite` for each configured database. Existing files are skipped.
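The download-on-demand behavior can be sketched as a check-then-fetch helper. This is a minimal sketch, assuming a direct-download URL per database; `BASE_URL` and the function name are placeholders, not the real script's internals:

```python
from pathlib import Path
import urllib.request

# Placeholder URL; the real script's download source may differ.
BASE_URL = "https://example.com/spider/databases"

def ensure_database(db_id: str, root: Path = Path("data/databases")) -> Path:
    """Download {db_id}.sqlite into data/databases/{db_id}/ unless present."""
    target = root / db_id / f"{db_id}.sqlite"
    if target.exists():
        # Existing files are skipped, so re-runs are cheap and idempotent.
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(f"{BASE_URL}/{db_id}.sqlite", target)
    return target
```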

### Curation script accepts validate-only mode

Running `python scripts/curate_questions.py --validate` validates the existing dataset files without downloading or regenerating anything. It checks field completeness, gold SQL execution, answer correctness, split integrity, and difficulty distribution, and exits with code 0 if the dataset is valid and 1 otherwise.
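The validation checks could be structured roughly as below. This is a hedged sketch of the checks the spec names, not the script's actual implementation; the difficulty-distribution check is omitted for brevity, and the `validate` function name is an assumption:

```python
import json
import sqlite3
from pathlib import Path

REQUIRED = {"question_id", "question_text", "database_name", "gold_sql",
            "gold_answer", "answer_type", "difficulty", "tables_involved",
            "split"}

def validate(train_path: str, eval_path: str) -> int:
    """Return 0 if all checks pass, 1 otherwise (sketch, not the real script)."""
    train = json.loads(Path(train_path).read_text())
    eval_ = json.loads(Path(eval_path).read_text())

    # Field completeness: every record has every required field.
    for q in train + eval_:
        if not REQUIRED <= q.keys():
            return 1

    # Split integrity: no question ID appears in both files.
    if {q["question_id"] for q in train} & {q["question_id"] for q in eval_}:
        return 1

    # Gold SQL execution and answer correctness against the local SQLite file.
    for q in train + eval_:
        db = Path("data/databases") / q["database_name"] / f'{q["database_name"]}.sqlite'
        with sqlite3.connect(db) as conn:
            rows = conn.execute(q["gold_sql"]).fetchall()
        if [list(r) for r in rows] != q["gold_answer"]:
            return 1
    return 0
```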

### Dataset provides train/eval split

The dataset is split into `questions_train.json` (approximately 70%) and `questions_eval.json` (approximately 30%), with no question IDs overlapping between the two files.
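A disjoint 70/30 split can be sketched as a shuffle-and-cut, as below. The seed and the exact splitting strategy are assumptions; the real script may split differently (e.g. stratified by database):

```python
import random

def split_questions(questions, train_frac=0.7, seed=0):
    """Shuffle and cut into train/eval; IDs are disjoint by construction."""
    rng = random.Random(seed)          # fixed seed is an assumption
    shuffled = questions[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    train, eval_ = shuffled[:cut], shuffled[cut:]
    # Each question lands in exactly one file, so IDs cannot overlap.
    assert {q["question_id"] for q in train}.isdisjoint(
        {q["question_id"] for q in eval_})
    return train, eval_
```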

### Dataset covers multiple domains and difficulty levels

Questions span 10 Spider databases from diverse domains (education, entertainment, geography, automotive, HR, etc.), with a difficulty distribution targeting approximately 40% easy, 40% medium, and 20% hard, based on the number of tables involved in each query.
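The table-count heuristic might look like the following. The exact thresholds are an assumption; the spec only says difficulty is based on the number of tables involved:

```python
def difficulty_for(tables_involved: list[str]) -> str:
    """Map table count to a difficulty label (thresholds are assumptions)."""
    n = len(tables_involved)
    if n <= 1:
        return "easy"    # single-table query
    if n == 2:
        return "medium"  # one join
    return "hard"        # multi-way join
```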