# Feature Demo: F004 — Question Dataset Expansion

> **Generated:** 2026-03-24T21:07:31Z
> **Context source:** spec + discovery only (implementation not read)
> **Feature entry:** [FEATURES.json #F004](./FEATURES.json)

---

## What This Feature Does

Before this feature, training data came from a single database and could overfit to one schema. F004 expands that into a curated multi-database dataset, so training and evaluation reflect more realistic SQL variety.

From a user perspective, this is a repeatable CLI workflow: generate the enriched train/eval JSON once, then validate it quickly before downstream training. You get precomputed gold answers, answer types, difficulty labels, and deterministic splits.

---

## What Is Already Proven

### Verified in This Demo Run

- Ran the full curation pipeline locally and observed the generated outputs: 473 train + 203 eval (676 total).
- Ran `--validate` mode locally and observed successful validation for all 676 records.
- Verified the split ratio and database coverage from generated artifacts (`train_ratio=0.6997`, `eval_ratio=0.3003`, `db_count=10`).
- Ran an invalid CLI input case (`--db-list` pointing at a missing path) and captured the real failure output.
- Ran the repository smoke tests (`21 passed`).

### Previously Verified Evidence

- `specs/FEATURES.json` (`verification_evidence` for F004): verifier approved, `uv run pytest tests/ -v`, 21/21 passed at `2026-03-24T21:04:54Z`.
- `specs/F004-IMPLEMENTATION_SPEC.md` (Step 2.3): prior validation evidence recorded for 676 curated records and a ~70/30 split.

---

## What Still Needs User Verification

None for local CLI proof. Optional product check: decide whether the current MVP difficulty-skew warnings are acceptable for your training goals.
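If you want to spot-check individual curated records while making that call, a minimal per-record check might look like the sketch below. This is illustrative only: every field name except `database_name` (which appears in the boundary-check output) is an assumption, and `check_record` is a hypothetical helper, not part of the project's `scripts/curate_questions.py`.

```python
# Illustrative sketch only; the real schema lives in scripts/curate_questions.py.
# All field names except database_name are assumptions for demonstration.
REQUIRED_FIELDS = {"question", "database_name", "gold_answer", "answer_type", "difficulty"}
VALID_DIFFICULTIES = {"easy", "medium", "hard"}  # labels seen in the warning output

def check_record(record: dict) -> list[str]:
    """Return a list of problems found in one curated record (empty means OK)."""
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - record.keys())]
    if record.get("difficulty") not in VALID_DIFFICULTIES:
        problems.append(f"unexpected difficulty: {record.get('difficulty')!r}")
    return problems

# A hypothetical record in the assumed shape:
sample = {
    "question": "How many heads of the departments are older than 56?",
    "database_name": "department_management",
    "gold_answer": "5",
    "answer_type": "number",
    "difficulty": "easy",
}
print(check_record(sample))  # prints [] when the record is well-formed
```

A check like this is cheap to run over `questions_train.json` / `questions_eval.json` before training, and complements (but does not replace) the bundled `--validate` pass.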
---

## Quickstart / Verification Steps

> Run these commands to see the feature in action:

```bash
uv run python scripts/curate_questions.py
uv run python scripts/curate_questions.py --validate
```

Requires a local Python/uv environment and access to the existing project data directories.

---

## Live Local Proof

### Generate the Curated Train/Eval Datasets

This runs the user-facing curation pipeline end-to-end.

```bash
uv run python scripts/curate_questions.py
```

```
WARNING: Difficulty distribution off target: easy=91.72% (target 40%)
WARNING: Difficulty distribution off target: medium=7.40% (target 40%)
WARNING: Difficulty distribution off target: hard=0.89% (target 20%)
Prepared 10 databases in data/databases
Loaded 676 Spider questions
Curated 676 questions (skipped 0)
Validation passed
Wrote 473 train records to data/questions/questions_train.json
Wrote 203 eval records to data/questions/questions_eval.json
```

Notice that the pipeline completes successfully and writes both split files.

### Validate Existing Curated Outputs

This is the fast re-check path users can run before training.

```bash
uv run python scripts/curate_questions.py --validate
```

```
WARNING: Difficulty distribution off target: easy=91.72% (target 40%)
WARNING: Difficulty distribution off target: medium=7.40% (target 40%)
WARNING: Difficulty distribution off target: hard=0.89% (target 20%)
Validation passed for 676 curated records
```

Notice that validation passes while surfacing the non-blocking MVP warnings.

---

## Existing Evidence

- F004 `verification_evidence` in `specs/FEATURES.json`: 21/21 smoke tests passed, verifier status `approved`.
- `specs/F004-IMPLEMENTATION_SPEC.md` Step 2.3: prior recorded split metrics (`473/203`) and validation pass.

---

## Manual Verification Checklist

1. Run the full curation command and confirm both JSON files are written.
2. Run `--validate` and confirm the command exits successfully with the `Validation passed` message.
3. Confirm the split counts are close to 70/30.
4. Confirm any warnings match your accepted MVP quality bar.

---

## Edge Cases Exercised

### Boundary Check: Split Ratio and DB Coverage

```bash
uv run python -c "import json; from pathlib import Path; train=json.loads(Path('data/questions/questions_train.json').read_text()); eval_=json.loads(Path('data/questions/questions_eval.json').read_text()); total=len(train)+len(eval_); dbs=sorted({q['database_name'] for q in train+eval_}); print(f'train={len(train)} eval={len(eval_)} total={total} train_ratio={len(train)/total:.4f} eval_ratio={len(eval_)/total:.4f} db_count={len(dbs)}')"
```

```
train=473 eval=203 total=676 train_ratio=0.6997 eval_ratio=0.3003 db_count=10
```

This confirms the split target and multi-database coverage from the actual artifacts.

### Error Case: Missing `--db-list` Path

```bash
uv run python scripts/curate_questions.py --db-list data/questions/does_not_exist.json
```

```
Traceback (most recent call last):
...
FileNotFoundError: [Errno 2] No such file or directory: 'data/questions/does_not_exist.json'
```

This shows the current behavior for an invalid input path (real failure output, captured from the run).

---

## Test Evidence (Optional)

> Supplementary proof that the repository remains healthy.

| Test Suite | Tests | Status |
|---|---|---|
| `uv run pytest tests/ -v` | 21 | All passed |

Representative command run:

```bash
uv run pytest tests/ -v
```

Result summary:

```
============================== 21 passed in 8.48s ==============================
```

---

## Feature Links

- Implementation spec: `specs/F004-IMPLEMENTATION_SPEC.md`
- Verification spec: `specs/F004-VERIFICATION_SPEC.md`

---

*Demo generated by the `feature-demo` agent. Re-run with `/feature-demo F004` to refresh.*