Feature Demo: F004 — Question Dataset Expansion
Generated: 2026-03-24T21:07:31Z
Context source: spec + discovery only (implementation not read)
Feature entry: FEATURES.json #F004
What This Feature Does
Before this feature, training data came from a single database and could overfit to one schema. F004 expands that into a curated multi-database dataset so training and evaluation reflect more realistic SQL variety.
From a user perspective, this feels like a repeatable CLI workflow: generate enriched train/eval JSON once, then validate it quickly before downstream training. You get precomputed gold answers, answer types, difficulty labels, and deterministic splits.
What Is Already Proven
Verified in This Demo Run
- Ran full curation pipeline locally and observed generated outputs: 473 train + 203 eval (676 total).
- Ran `--validate` mode locally and observed successful validation for all 676 records.
- Verified split ratio and database coverage from generated artifacts (`train_ratio=0.6997`, `eval_ratio=0.3003`, `db_count=10`).
- Ran an invalid CLI input case (`--db-list` with a missing path) and captured the real failure output.
- Ran repository smoke tests (21 passed).
Previously Verified Evidence
- `specs/FEATURES.json` (`verification_evidence` for F004): verifier approved, `uv run pytest tests/ -v`, 21/21 passed at 2026-03-24T21:04:54Z.
- `specs/F004-IMPLEMENTATION_SPEC.md` (Step 2.3): prior validation evidence recorded for 676 curated records and ~70/30 split.
What Still Needs User Verification
None required for the local CLI proof.
Optional product check: decide whether the current MVP difficulty-skew warnings are acceptable for your training goals.
Quickstart / Verification Steps
Run these commands to see the feature in action:
```
uv run python scripts/curate_questions.py
uv run python scripts/curate_questions.py --validate
```
Requires local Python/uv environment and access to existing project data directories.
Live Local Proof
Generate the Curated Train/Eval Datasets
This runs the user-facing curation pipeline end-to-end.
```
uv run python scripts/curate_questions.py
```

```
WARNING: Difficulty distribution off target: easy=91.72% (target 40%)
WARNING: Difficulty distribution off target: medium=7.40% (target 40%)
WARNING: Difficulty distribution off target: hard=0.89% (target 20%)
Prepared 10 databases in data/databases
Loaded 676 Spider questions
Curated 676 questions (skipped 0)
Validation passed
Wrote 473 train records to data/questions/questions_train.json
Wrote 203 eval records to data/questions/questions_eval.json
```
Notice the pipeline completes successfully and writes both split files.
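The spec calls for deterministic splits, so re-running the pipeline reproduces the same 473/203 partition. Since the implementation was not read for this demo, here is only a minimal sketch of one common way to achieve that: hash a stable record ID instead of drawing random numbers (the function name and ID format are illustrative, not the script's actual code):

```python
import hashlib

def assign_split(record_id: str, train_ratio: float = 0.7) -> str:
    """Deterministically assign a record to 'train' or 'eval'.

    Hashing a stable ID (rather than calling random.random) means the
    same record always lands in the same split across re-runs.
    """
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "train" if bucket < train_ratio else "eval"

# Counts land near (not exactly on) the 70/30 target for any fixed ID set.
splits = [assign_split(f"q{i}") for i in range(676)]
print(splits.count("train"), splits.count("eval"))
```

Because assignment depends only on the ID, adding or removing records never reshuffles the records that remain.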
Validate Existing Curated Outputs
This is the fast re-check path users can run before training.
```
uv run python scripts/curate_questions.py --validate
```

```
WARNING: Difficulty distribution off target: easy=91.72% (target 40%)
WARNING: Difficulty distribution off target: medium=7.40% (target 40%)
WARNING: Difficulty distribution off target: hard=0.89% (target 20%)
Validation passed for 676 curated records
```
Notice validation passes while surfacing non-blocking MVP warnings.
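The warnings compare each difficulty label's observed share against a 40/40/20 target (the targets appear in the log output above). A minimal sketch of that check, with an illustrative function name and tolerance band not taken from the implementation:

```python
from collections import Counter

def difficulty_warnings(labels, targets=None, tolerance=0.05):
    """Return warning strings for difficulty shares that miss their targets.

    `targets` defaults to the 40/40/20 split shown in the pipeline logs;
    `tolerance` is an assumed slack band, not a value from the real script.
    """
    if targets is None:
        targets = {"easy": 0.40, "medium": 0.40, "hard": 0.20}
    counts = Counter(labels)
    total = len(labels)
    warnings = []
    for level, target in targets.items():
        share = counts.get(level, 0) / total if total else 0.0
        if abs(share - target) > tolerance:
            warnings.append(
                f"Difficulty distribution off target: "
                f"{level}={share:.2%} (target {target:.0%})"
            )
    return warnings

# A 9-easy/1-hard sample flags all three levels as off target.
for line in difficulty_warnings(["easy"] * 9 + ["hard"]):
    print("WARNING:", line)
```

Note the check is advisory: it reports skew but never fails the run, matching the non-blocking behavior above.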
Existing Evidence
- F004 `verification_evidence` in `specs/FEATURES.json`: 21/21 smoke tests passed, verifier status `approved`.
- `specs/F004-IMPLEMENTATION_SPEC.md` Step 2.3: prior recorded split metrics (473/203) and validation pass.
Manual Verification Checklist
- Run the full curation command and confirm both JSON files are written.
- Run `--validate` and confirm the command exits successfully with a `Validation passed` message.
- Confirm split counts are close to 70/30.
- Confirm warnings (if any) match your accepted MVP quality bar.
Edge Cases Exercised
Boundary Check: Split Ratio and DB Coverage
```
uv run python -c "import json; from pathlib import Path; train=json.loads(Path('data/questions/questions_train.json').read_text()); eval_=json.loads(Path('data/questions/questions_eval.json').read_text()); total=len(train)+len(eval_); dbs=sorted({q['database_name'] for q in train+eval_}); print(f'train={len(train)} eval={len(eval_)} total={total} train_ratio={len(train)/total:.4f} eval_ratio={len(eval_)/total:.4f} db_count={len(dbs)}')"
```

```
train=473 eval=203 total=676 train_ratio=0.6997 eval_ratio=0.3003 db_count=10
```
This confirms the split target and multi-database coverage from actual artifacts.
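The one-liner above is hard to scan. The same logic expanded into a small helper (the `database_name` field matches the artifacts checked above; the function itself is illustrative, and the demo below stubs records down to that one field rather than reading the real JSON files):

```python
def split_stats(train: list[dict], eval_records: list[dict]) -> dict:
    """Summarize split counts, ratios, and database coverage."""
    total = len(train) + len(eval_records)
    dbs = {q["database_name"] for q in train + eval_records}
    return {
        "train": len(train),
        "eval": len(eval_records),
        "total": total,
        "train_ratio": round(len(train) / total, 4),
        "eval_ratio": round(len(eval_records) / total, 4),
        "db_count": len(dbs),
    }

# Stub records with the counts observed in this run:
# 473/676 rounds to train_ratio=0.6997, 203/676 to eval_ratio=0.3003.
train = [{"database_name": f"db{i % 7}"} for i in range(473)]
eval_ = [{"database_name": f"db{i % 7}"} for i in range(203)]
print(split_stats(train, eval_))
```

To run against the real artifacts, load `data/questions/questions_train.json` and `data/questions/questions_eval.json` with `json.loads` and pass the two lists in.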
Error Case: Missing --db-list Path
```
uv run python scripts/curate_questions.py --db-list data/questions/does_not_exist.json
```

```
Traceback (most recent call last):
...
FileNotFoundError: [Errno 2] No such file or directory: 'data/questions/does_not_exist.json'
```
This shows current behavior for invalid input path (real failure output captured).
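If the bare traceback is considered too rough for users, a common pattern is to validate the path before loading and exit with a one-line message. This is a sketch only (function name and wording are assumptions), not the script's actual code:

```python
import json
import sys
from pathlib import Path

def load_db_list(path_str: str):
    """Load a --db-list JSON file, exiting with a readable error if missing.

    Illustrative wrapper only; the real script currently surfaces a raw
    FileNotFoundError traceback instead.
    """
    path = Path(path_str)
    if not path.is_file():
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"error: --db-list file not found: {path}")
    return json.loads(path.read_text())
```

Either behavior signals failure via a nonzero exit code; the difference is purely what the user sees on stderr.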
Test Evidence (Optional)
Supplementary proof that the repository remains healthy.
| Test Suite | Tests | Status |
|---|---|---|
| `uv run pytest tests/ -v` | 21 | All passed |
Representative command run:

```
uv run pytest tests/ -v
```

Result summary:

```
============================== 21 passed in 8.48s ==============================
```
Feature Links
- Implementation spec: `specs/F004-IMPLEMENTATION_SPEC.md`
- Verification spec: `specs/F004-VERIFICATION_SPEC.md`
Demo generated by feature-demo agent. Re-run with `/feature-demo F004` to refresh.