
Feature Demo: F004 — Question Dataset Expansion

Generated: 2026-03-24T21:07:31Z
Context source: spec + discovery only (implementation not read)
Feature entry: FEATURES.json #F004


What This Feature Does

Before this feature, training data came from a single database and could overfit to one schema. F004 expands that into a curated multi-database dataset so training and evaluation reflect more realistic SQL variety.

From a user perspective, this feels like a repeatable CLI workflow: generate enriched train/eval JSON once, then validate it quickly before downstream training. You get precomputed gold answers, answer types, difficulty labels, and deterministic splits.
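For intuition, here is a hedged sketch of what one curated record might look like. Only the difficulty and database_name fields are confirmed by the demo transcripts below; every other field name is a hypothetical stand-in for the gold answers and answer types described above, not the actual schema.

```python
import json

# A minimal sketch of a curated record. "difficulty" and "database_name" are
# confirmed by the pipeline output in this demo; the remaining field names are
# illustrative assumptions, not the real schema.
record = {
    "question": "How many departments are there?",  # natural-language question (assumed field)
    "sql": "SELECT COUNT(*) FROM department;",      # gold SQL (assumed field)
    "gold_answer": [[4]],                           # precomputed gold answer (assumed field)
    "answer_type": "count",                         # answer-type label (assumed field)
    "difficulty": "easy",                           # difficulty label
    "database_name": "department_management",       # source database
}

print(json.dumps(record, indent=2))
```

Treat this as a reading aid only; the authoritative schema lives in the curated JSON files themselves.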


What Is Already Proven

Verified in This Demo Run

  • Ran full curation pipeline locally and observed generated outputs: 473 train + 203 eval (676 total).
  • Ran --validate mode locally and observed successful validation for all 676 records.
  • Verified split ratio and database coverage from generated artifacts (train_ratio=0.6997, eval_ratio=0.3003, db_count=10).
  • Ran an invalid CLI input case (--db-list missing path) and captured the real failure output.
  • Ran repository smoke tests (21 passed).

Previously Verified Evidence

  • specs/FEATURES.json (verification_evidence for F004): verifier approved, uv run pytest tests/ -v, 21/21 passed at 2026-03-24T21:04:54Z.
  • specs/F004-IMPLEMENTATION_SPEC.md (Step 2.3): prior validation evidence recorded for 676 curated records and ~70/30 split.

What Still Needs User Verification

None required for the local CLI proof.
Optional product check: decide whether the current MVP difficulty-skew warnings are acceptable for your training goals.


Quickstart / Verification Steps

Run these commands to see the feature in action:

uv run python scripts/curate_questions.py
uv run python scripts/curate_questions.py --validate

Requires a local Python/uv environment and access to the existing project data directories.


Live Local Proof

Generate the Curated Train/Eval Datasets

This runs the user-facing curation pipeline end-to-end.

uv run python scripts/curate_questions.py
WARNING: Difficulty distribution off target: easy=91.72% (target 40%)
WARNING: Difficulty distribution off target: medium=7.40% (target 40%)
WARNING: Difficulty distribution off target: hard=0.89% (target 20%)
Prepared 10 databases in data/databases
Loaded 676 Spider questions
Curated 676 questions (skipped 0)
Validation passed
Wrote 473 train records to data/questions/questions_train.json
Wrote 203 eval records to data/questions/questions_eval.json

Notice the pipeline completes successfully and writes both split files.
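Deterministic splits like the one above are commonly produced by hashing a stable record identifier. The sketch below shows that approach for intuition only; the actual mechanism in scripts/curate_questions.py was not read for this demo, so the function name and hashing scheme are assumptions.

```python
import hashlib

# Illustrative sketch: assign each record to train/eval by hashing a stable id,
# so the split is reproducible across runs. NOT the confirmed implementation.
def split_bucket(record_id: str, train_ratio: float = 0.7) -> str:
    digest = hashlib.sha256(record_id.encode()).hexdigest()
    # Map the first 8 hex digits to a stable fraction in [0, 1].
    fraction = int(digest[:8], 16) / 0xFFFFFFFF
    return "train" if fraction < train_ratio else "eval"

ids = [f"q{i}" for i in range(676)]
buckets = [split_bucket(i) for i in ids]
print(buckets.count("train"), buckets.count("eval"))
```

Because the bucket depends only on the record id, re-running the pipeline yields the same 473/203-style partition every time.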

Validate Existing Curated Outputs

This is the fast re-check path users can run before training.

uv run python scripts/curate_questions.py --validate
WARNING: Difficulty distribution off target: easy=91.72% (target 40%)
WARNING: Difficulty distribution off target: medium=7.40% (target 40%)
WARNING: Difficulty distribution off target: hard=0.89% (target 20%)
Validation passed for 676 curated records

Notice validation passes while surfacing non-blocking MVP warnings.
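The warning logic implied by that transcript can be sketched as a non-blocking check against the 40/40/20 targets printed by the pipeline. The real implementation was not read; the label counts below (620/50/6 out of 676) are inferred from the printed percentages, and the tolerance value is an assumption.

```python
# Illustrative sketch of the non-blocking difficulty-distribution check.
# Targets mirror those printed by the pipeline (40/40/20); the tolerance and
# sample counts are assumptions reconstructed from the printed percentages.
TARGETS = {"easy": 0.40, "medium": 0.40, "hard": 0.20}

def distribution_warnings(difficulties, tolerance=0.05):
    total = len(difficulties)
    warnings = []
    for label, target in TARGETS.items():
        observed = difficulties.count(label) / total
        if abs(observed - target) > tolerance:
            warnings.append(
                f"Difficulty distribution off target: "
                f"{label}={observed:.2%} (target {target:.0%})"
            )
    return warnings  # warnings only; validation itself still passes

sample = ["easy"] * 620 + ["medium"] * 50 + ["hard"] * 6
for w in distribution_warnings(sample):
    print("WARNING:", w)
```

The key design point the transcript demonstrates: skew is surfaced loudly but does not fail validation, which matches the stated MVP quality bar.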


Existing Evidence

  • F004 verification_evidence in specs/FEATURES.json: 21/21 smoke tests passed, verifier status approved.
  • specs/F004-IMPLEMENTATION_SPEC.md Step 2.3: prior recorded split metrics (473/203) and validation pass.

Manual Verification Checklist

  1. Run full curation command and confirm both JSON files are written.
  2. Run --validate and confirm exit succeeds with Validation passed message.
  3. Confirm split counts are close to 70/30.
  4. Confirm warnings (if any) match your accepted MVP quality bar.
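Checks 1 through 3 of the checklist can be automated with a short script. The sketch below demonstrates the idea against stand-in files carrying the recorded counts (473/203); the helper name and the 0.65-0.75 "close to 70/30" tolerance are assumptions.

```python
import json
import tempfile
from pathlib import Path

# Illustrative sketch automating checklist items 1-3: files exist, both parse,
# and the split ratio is close to 70/30. Tolerance band is an assumption.
def check_outputs(train_path: Path, eval_path: Path) -> dict:
    train = json.loads(train_path.read_text())
    eval_records = json.loads(eval_path.read_text())
    total = len(train) + len(eval_records)
    ratio = len(train) / total
    return {
        "train": len(train),
        "eval": len(eval_records),
        "ratio": round(ratio, 4),
        "ratio_ok": 0.65 <= ratio <= 0.75,
    }

# Demonstrate against stand-in files with the recorded counts (473/203).
tmp = Path(tempfile.mkdtemp())
(tmp / "questions_train.json").write_text(json.dumps([{}] * 473))
(tmp / "questions_eval.json").write_text(json.dumps([{}] * 203))
result = check_outputs(tmp / "questions_train.json", tmp / "questions_eval.json")
print(result)  # ratio comes out to 0.6997, matching the boundary check below
```

Pointing the helper at data/questions/questions_train.json and data/questions/questions_eval.json after a real run would exercise the same checks on actual artifacts.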

Edge Cases Exercised

Boundary Check: Split Ratio and DB Coverage

uv run python -c "import json; from pathlib import Path; train=json.loads(Path('data/questions/questions_train.json').read_text()); eval_=json.loads(Path('data/questions/questions_eval.json').read_text()); total=len(train)+len(eval_); dbs=sorted({q['database_name'] for q in train+eval_}); print(f'train={len(train)} eval={len(eval_)} total={total} train_ratio={len(train)/total:.4f} eval_ratio={len(eval_)/total:.4f} db_count={len(dbs)}')"
train=473 eval=203 total=676 train_ratio=0.6997 eval_ratio=0.3003 db_count=10

This confirms the split target and multi-database coverage from actual artifacts.

Error Case: Missing --db-list Path

uv run python scripts/curate_questions.py --db-list data/questions/does_not_exist.json
Traceback (most recent call last):
  ...
FileNotFoundError: [Errno 2] No such file or directory: 'data/questions/does_not_exist.json'

This shows current behavior for invalid input path (real failure output captured).


Test Evidence (Optional)

Supplementary proof that the repository remains healthy.

Test Suite                 Tests   Status
uv run pytest tests/ -v    21      All passed

Representative command run:

uv run pytest tests/ -v

Result summary:

============================== 21 passed in 8.48s ==============================


Feature Links

  • Implementation spec: specs/F004-IMPLEMENTATION_SPEC.md
  • Verification spec: specs/F004-VERIFICATION_SPEC.md

Demo generated by feature-demo agent. Re-run with /feature-demo F004 to refresh.