System Behavior: Dataset Curation
Living document. Updated by /archive-spec when features are completed. Last archived: F004 on 2026-03-24
ADDED
Curation script produces enriched question dataset
Running python scripts/curate_questions.py produces two JSON files (data/questions/questions_train.json and data/questions/questions_eval.json) containing 100+ enriched questions across 10 Spider databases. Each question record includes question_id, question_text, database_name, gold_sql, gold_answer, answer_type, difficulty, tables_involved, and split fields.
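The field list above can be illustrated with a single record. The field names come from the spec; the example values (question text, database, enum values like "number" and "easy") are plausible placeholders, not taken from the actual dataset.

```python
import json

# One enriched question record with all nine required fields.
# Values are illustrative assumptions; only the field names are from the spec.
record = {
    "question_id": "q_0001",            # unique across both split files
    "question_text": "How many singers are there?",
    "database_name": "concert_singer",  # one of the 10 Spider databases
    "gold_sql": "SELECT count(*) FROM singer",
    "gold_answer": "6",
    "answer_type": "number",            # assumed enum value
    "difficulty": "easy",               # easy / medium / hard
    "tables_involved": ["singer"],
    "split": "train",                   # "train" or "eval"
}

# The curation script writes lists of such records to the two JSON files.
print(json.dumps(record, indent=2))
```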
Curation script downloads Spider SQLite databases on demand
Running python scripts/curate_questions.py downloads Spider SQLite database files into data/databases/{db_id}/{db_id}.sqlite for each configured database. Existing files are skipped.
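A minimal sketch of this download-on-demand behavior, assuming a `fetch` callable stands in for the real HTTP download (the actual download URL and helper names are not specified here):

```python
from pathlib import Path

def ensure_database(db_id: str, fetch, root: Path = Path("data/databases")) -> bool:
    """Place db_id's SQLite file at root/{db_id}/{db_id}.sqlite unless it exists.

    Returns True if a download happened, False if the existing file was skipped.
    `fetch` is a hypothetical callable(db_id) -> bytes standing in for the
    real HTTP download.
    """
    dest = root / db_id / f"{db_id}.sqlite"
    if dest.exists():
        return False  # existing files are skipped
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(fetch(db_id))
    return True
```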
Curation script accepts validate-only mode
Running python scripts/curate_questions.py --validate validates the existing dataset files without downloading or re-generating. It checks field completeness, gold SQL execution, answer correctness, split integrity, and difficulty distribution. Returns exit code 0 if valid, 1 if invalid.
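Two of the listed checks, field completeness and split integrity, can be sketched without the databases on disk (gold SQL execution and answer correctness would need them, so they are omitted). The field names are from the spec; the helper name is hypothetical.

```python
REQUIRED_FIELDS = {
    "question_id", "question_text", "database_name", "gold_sql",
    "gold_answer", "answer_type", "difficulty", "tables_involved", "split",
}

def validate_records(train: list[dict], eval_: list[dict]) -> int:
    """Return exit code 0 if the structural checks pass, 1 otherwise."""
    for rec in train + eval_:
        if not REQUIRED_FIELDS <= rec.keys():
            return 1  # field completeness: a required field is missing
    train_ids = {r["question_id"] for r in train}
    eval_ids = {r["question_id"] for r in eval_}
    if train_ids & eval_ids:
        return 1  # split integrity: no question ID may appear in both files
    return 0
```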
Dataset provides train/eval split
The dataset is split into questions_train.json (approximately 70%) and questions_eval.json (approximately 30%) with no overlapping question IDs between the two files.
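One way such a split could be produced is a seeded shuffle followed by a 70/30 cut; the seed and the use of random.shuffle are assumptions about the implementation, not confirmed by the spec.

```python
import random

def split_questions(questions: list[dict], train_frac: float = 0.7, seed: int = 0):
    """Deterministically split records ~train_frac/1-train_frac with no ID overlap."""
    rng = random.Random(seed)
    shuffled = questions[:]       # leave the caller's list untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```

Because every record appears in exactly one slice of the shuffled list, the two outputs are disjoint by construction.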
Dataset covers multiple domains and difficulty levels
Questions span 10 Spider databases from diverse domains (education, entertainment, geography, automotive, HR, etc.) with difficulty distribution targeting approximately 40% easy, 40% medium, 20% hard based on the number of tables involved in each query.
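The spec ties difficulty to the number of tables involved. The exact thresholds below (1 table = easy, 2 = medium, 3+ = hard) are an assumed mapping consistent with that rule, not confirmed by the spec.

```python
def difficulty_for(tables_involved: list[str]) -> str:
    """Assign difficulty from table count; thresholds are assumptions."""
    n = len(tables_involved)
    if n <= 1:
        return "easy"
    if n == 2:
        return "medium"
    return "hard"
```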