# System Behavior: Dataset Curation

> Living document. Updated by `/archive-spec` when features are completed.
> Last archived: F004 on 2026-03-24

---

## ADDED

### Curation script produces enriched question dataset

Running `python scripts/curate_questions.py` produces two JSON files (`data/questions/questions_train.json` and `data/questions/questions_eval.json`) containing 100+ enriched questions across 10 Spider databases. Each question record includes `question_id`, `question_text`, `database_name`, `gold_sql`, `gold_answer`, `answer_type`, `difficulty`, `tables_involved`, and `split` fields.

### Curation script downloads Spider SQLite databases on demand

Running `python scripts/curate_questions.py` downloads Spider SQLite database files into `data/databases/{db_id}/{db_id}.sqlite` for each configured database. Existing files are skipped.

### Curation script accepts validate-only mode

Running `python scripts/curate_questions.py --validate` validates the existing dataset files without downloading or re-generating anything. It checks field completeness, gold SQL execution, answer correctness, split integrity, and difficulty distribution. The script returns exit code 0 if the dataset is valid, 1 if invalid.

### Dataset provides train/eval split

The dataset is split into `questions_train.json` (approximately 70%) and `questions_eval.json` (approximately 30%) with no overlapping question IDs between the two files.

### Dataset covers multiple domains and difficulty levels

Questions span 10 Spider databases from diverse domains (education, entertainment, geography, automotive, HR, etc.), with a difficulty distribution targeting approximately 40% easy, 40% medium, and 20% hard, based on the number of tables involved in each query.