Verification Specification
Feature: F004
Generated from: specs/F004-VERIFICATION_INPUT.json
Generated: 2026-03-24
1. Unit Tests
EnrichedQuestionRecord (Type)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_record_all_fields_present | All 9 required fields populated | Full valid record dict | All fields accessible, no missing keys | happy |
| test_record_question_id_format | question_id matches {db_id}_{split}_{index:03d} | "concert_singer_train_007" | Passes regex `^[a-z_]+_(train\|eval)_\d{3}$` | happy |
| test_record_question_id_invalid | Rejects malformed question_id | "bad-id" | Validation error or detectable as invalid | error |
| test_record_answer_type_enum | answer_type is one of allowed values | "integer", "float", "string", "list", "table" | Each accepted | happy |
| test_record_answer_type_invalid | Rejects unknown answer_type | "boolean" | Rejected or flagged | error |
| test_record_difficulty_enum | difficulty is one of allowed values | "easy", "medium", "hard" | Each accepted | happy |
| test_record_difficulty_invalid | Rejects unknown difficulty | "extreme" | Rejected or flagged | error |
| test_record_split_enum | split is one of allowed values | "train", "eval" | Each accepted | happy |
| test_record_split_invalid | Rejects unknown split | "test" | Rejected or flagged | error |
| test_record_tables_involved_nonempty | tables_involved has at least one entry | ["students"] | Accepted | happy |
| test_record_tables_involved_empty | Empty tables_involved is rejected | [] | Rejected or flagged by validation | edge |
| test_record_gold_sql_nonempty | gold_sql is a non-empty string | "SELECT COUNT(*) FROM students" | Accepted | happy |
| test_record_gold_sql_empty | Empty gold_sql is rejected | "" | Rejected or flagged | edge |
| test_record_gold_answer_types | gold_answer can hold int, float, str, list, list-of-lists | 42, 3.14, "Alice", [1,2], [[1,"a"]] | Each stored and retrievable | happy |
Run: pytest tests/test_f004_dataset.py::TestEnrichedQuestionRecord -v
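The nine required fields are not enumerated in this section; the field set below is a plausible reconstruction from the tests and from names used elsewhere in this spec (question_text, database_name). A minimal sketch of the record type and the validation rules the table exercises:

```python
import re
from dataclasses import dataclass, field
from typing import Any

QUESTION_ID_RE = re.compile(r"^[a-z_]+_(train|eval)_\d{3}$")
ANSWER_TYPES = {"integer", "float", "string", "list", "table"}
DIFFICULTIES = {"easy", "medium", "hard"}
SPLITS = {"train", "eval"}

@dataclass
class EnrichedQuestionRecord:
    # Field names are illustrative; the canonical set comes from
    # specs/F004-VERIFICATION_INPUT.json.
    question_id: str
    database_name: str
    question_text: str
    gold_sql: str
    gold_answer: Any
    answer_type: str
    difficulty: str
    split: str
    tables_involved: list[str] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of validation errors (empty if the record is valid)."""
        errors = []
        if not QUESTION_ID_RE.match(self.question_id):
            errors.append(f"bad question_id: {self.question_id!r}")
        if self.answer_type not in ANSWER_TYPES:
            errors.append(f"bad answer_type: {self.answer_type!r}")
        if self.difficulty not in DIFFICULTIES:
            errors.append(f"bad difficulty: {self.difficulty!r}")
        if self.split not in SPLITS:
            errors.append(f"bad split: {self.split!r}")
        if not self.tables_involved:
            errors.append("tables_involved is empty")
        if not self.gold_sql:
            errors.append("gold_sql is empty")
        return errors
```

A record built from the happy-path inputs above should validate cleanly, while the malformed inputs each produce one error string.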
classify_answer_type
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_classify_integer | Single integer answer | 42 | "integer" | happy |
| test_classify_float | Single float answer | 3.14 | "float" | happy |
| test_classify_string | Single string answer | "Alice" | "string" | happy |
| test_classify_list | Flat list (single column, multiple rows) | [1, 2, 3] | "list" | happy |
| test_classify_table | List of tuples/lists (multi-column result) | [(1, "a"), (2, "b")] | "table" | happy |
| test_classify_none | None/null answer | None | Defined behavior (error or specific type) | edge |
| test_classify_empty_list | Empty list | [] | "list" or defined behavior | edge |
| test_classify_single_row_tuple | Single-element tuple | (42,) | "integer" (unwrapped) or "list" | edge |
| test_classify_nested_single | Single-row multi-column | [(1, "a")] | "table" | edge |
| test_classify_boolean | Boolean answer | True | Defined fallback behavior | edge |
Run: pytest tests/test_f004_dataset.py::TestClassifyAnswerType -v
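One way the table's cases could be satisfied is sketched below. The edge-case policy (raising on None and booleans, classifying a single-element tuple as "list") is an assumption; the spec only requires that the behavior be defined:

```python
def classify_answer_type(answer):
    """Map a query result to one of the five answer_type values.

    Policy for None, booleans, and single-element tuples is an
    assumption here, not mandated by the spec.
    """
    if answer is None:
        raise ValueError("cannot classify a null answer")
    if isinstance(answer, bool):          # check before int: bool subclasses int
        raise ValueError("boolean answers are not supported")
    if isinstance(answer, int):
        return "integer"
    if isinstance(answer, float):
        return "float"
    if isinstance(answer, str):
        return "string"
    if isinstance(answer, (list, tuple)):
        if any(isinstance(item, (list, tuple)) for item in answer):
            return "table"                # nested rows => multi-column result
        return "list"                     # flat sequence => single-column result
    raise TypeError(f"unclassifiable answer: {type(answer).__name__}")
```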
extract_tables_involved
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_extract_single_table | Simple SELECT from one table | "SELECT * FROM students" | ["students"] | happy |
| test_extract_join | JOIN with two tables | "SELECT s.name FROM students s JOIN courses c ON s.id = c.student_id" | ["courses", "students"] (sorted) | happy |
| test_extract_subquery | Subquery referencing different table | "SELECT * FROM students WHERE id IN (SELECT student_id FROM enrollments)" | ["enrollments", "students"] (sorted) | happy |
| test_extract_deduplication | Same table referenced multiple times | "SELECT a.x, b.y FROM t1 a JOIN t1 b ON a.id = b.id" | ["t1"] (deduplicated) | happy |
| test_extract_case_insensitive | Mixed-case SQL keywords | "select * FROM Students" | ["Students"] or ["students"] (consistent) | edge |
| test_extract_with_alias | Table alias should resolve to table name | "SELECT s.name FROM students AS s" | ["students"] | edge |
| test_extract_multiple_joins | Three or more tables joined | "SELECT * FROM a JOIN b ON a.id=b.id JOIN c ON b.id=c.id" | ["a", "b", "c"] (sorted) | happy |
| test_extract_empty_sql | Empty SQL string | "" | [] or error | edge |
| test_extract_no_from | SQL without FROM clause | "SELECT 1+1" | [] | edge |
Run: pytest tests/test_f004_dataset.py::TestExtractTablesInvolved -v
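A regex-based sketch that covers the tabulated cases is shown below. It is deliberately minimal: it would miss CTEs, quoted identifiers, and comma-separated FROM lists, for which a real implementation would use a SQL parser:

```python
import re

def extract_tables_involved(sql: str) -> list[str]:
    """Extract table names following FROM/JOIN, sorted and deduplicated.

    Lowercasing is one way to satisfy the "consistent casing" edge case.
    """
    names = re.findall(
        r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", sql, re.IGNORECASE
    )
    return sorted({name.lower() for name in names})
```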
classify_difficulty
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_difficulty_easy_1_table | 1 table involved | ["students"] | "easy" | happy |
| test_difficulty_easy_2_tables | 2 tables involved | ["students", "courses"] | "easy" | happy |
| test_difficulty_medium_3_tables | 3 tables involved | ["a", "b", "c"] | "medium" | happy |
| test_difficulty_hard_4_tables | 4 tables involved | ["a", "b", "c", "d"] | "hard" | happy |
| test_difficulty_hard_many_tables | 6+ tables involved | ["a", "b", "c", "d", "e", "f"] | "hard" | happy |
| test_difficulty_empty_tables | 0 tables (edge case) | [] | Defined behavior (error or "easy") | edge |
Run: pytest tests/test_f004_dataset.py::TestClassifyDifficulty -v
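The thresholds above pin down the classifier almost completely; only the empty-list behavior is a choice. Raising is the option assumed here:

```python
def classify_difficulty(tables_involved: list[str]) -> str:
    """Bucket difficulty by join breadth: 1-2 tables easy, 3 medium, 4+ hard.

    Raising on an empty list is one of the two behaviors the spec permits.
    """
    n = len(tables_involved)
    if n == 0:
        raise ValueError("tables_involved must be non-empty")
    if n <= 2:
        return "easy"
    if n == 3:
        return "medium"
    return "hard"
```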
compute_gold_answer
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_compute_valid_select | Valid SELECT on real SQLite DB | "SELECT COUNT(*) FROM singer", valid db_path | Integer result | happy |
| test_compute_multirow | Query returning multiple rows | "SELECT * FROM singer LIMIT 3", valid db_path | List/table result | happy |
| test_compute_invalid_sql | Syntactically invalid SQL | "SELCT * FORM x", valid db_path | Raises sqlite3.Error | error |
| test_compute_missing_table | SQL references non-existent table | "SELECT * FROM nonexistent", valid db_path | Raises sqlite3.Error | error |
| test_compute_missing_db | Database file does not exist | "SELECT 1", /tmp/nonexistent.sqlite | Raises sqlite3.Error or FileNotFoundError | error |
| test_compute_empty_result | Query returns no rows | "SELECT * FROM singer WHERE 1=0", valid db_path | Empty result (e.g., []) | edge |
| test_compute_null_result | Query returning NULL | "SELECT NULL", valid db_path | None or defined null handling | edge |
Run: pytest tests/test_f004_dataset.py::TestComputeGoldAnswer -v
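A sketch consistent with these cases follows. The result-unwrapping policy (scalar for a 1x1 result, flat list for a single column, list of row tuples otherwise) is an assumption chosen to line up with the answer_type taxonomy above:

```python
import sqlite3
from pathlib import Path

def compute_gold_answer(gold_sql: str, db_path: str):
    """Execute gold_sql against a SQLite file and unwrap the result.

    Invalid SQL or a missing table raises sqlite3.Error from execute();
    a missing file raises FileNotFoundError explicitly, since sqlite3
    would otherwise create an empty database.
    """
    if not Path(db_path).exists():
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(gold_sql).fetchall()
    finally:
        conn.close()
    if not rows:
        return []
    if len(rows) == 1 and len(rows[0]) == 1:
        return rows[0][0]                   # 1x1 result -> scalar
    if all(len(row) == 1 for row in rows):
        return [row[0] for row in rows]     # single column -> flat list
    return rows                             # multi-column -> table (list of tuples)
```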
assign_splits
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_assign_train_from_spider_train | Spider train split maps to train | Questions with spider_split="train" | split="train" | happy |
| test_assign_eval_from_spider_validation | Spider validation split maps to eval | Questions with spider_split="validation" | split="eval" | happy |
| test_assign_preserves_all_questions | No questions are dropped | 10 input questions | 10 output questions | happy |
| test_assign_mixed_splits | Mix of train and validation | 7 train + 3 validation | 7 train + 3 eval | happy |
| test_assign_all_train | All questions from train split | 5 train questions | All split="train" | edge |
| test_assign_all_eval | All questions from validation split | 5 validation questions | All split="eval" | edge |
Run: pytest tests/test_f004_dataset.py::TestAssignSplits -v
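The mapping these tests describe is a pure rename, which a small sketch makes concrete (the spider_split field name is taken from the test inputs above):

```python
SPIDER_SPLIT_MAP = {"train": "train", "validation": "eval"}

def assign_splits(questions: list[dict]) -> list[dict]:
    """Map Spider's split names onto this dataset's train/eval vocabulary.

    Every question is preserved; only the split field is added/rewritten.
    """
    return [
        {**q, "split": SPIDER_SPLIT_MAP[q["spider_split"]]}
        for q in questions
    ]
```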
download_spider_databases
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_download_creates_files | Download for known db_ids produces SQLite files | ["concert_singer"], temp dir | Dict mapping db_id to valid Path, file exists | happy |
| test_download_skips_existing | Existing database is not re-downloaded | Pre-existing file, same db_id | File unchanged, no download attempt | happy |
| test_download_unknown_db | Unknown db_id | ["nonexistent_db_xyz"] | Raises FileNotFoundError | error |
| test_download_empty_list | Empty db_ids list | [] | Returns empty dict | edge |
| test_download_returns_correct_paths | Paths follow {output_dir}/{db_id}/{db_id}.sqlite | ["concert_singer"] | Path matches expected pattern | happy |
Run: pytest tests/test_f004_dataset.py::TestDownloadSpiderDatabases -v
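The path layout and skip-existing behavior can be sketched independently of the transport. Here `fetch` is a hypothetical injected callback (the actual download mechanism is not specified in this document):

```python
from pathlib import Path
from typing import Callable

def download_spider_databases(
    db_ids: list[str],
    output_dir: Path,
    fetch: Callable[[str, Path], None],
) -> dict[str, Path]:
    """Resolve each db_id to {output_dir}/{db_id}/{db_id}.sqlite,
    downloading via `fetch` only when the file is missing.

    Raises FileNotFoundError when no database can be produced for a db_id.
    """
    paths: dict[str, Path] = {}
    for db_id in db_ids:
        target = output_dir / db_id / f"{db_id}.sqlite"
        if not target.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            fetch(db_id, target)
        if not target.exists():
            raise FileNotFoundError(f"no database available for {db_id!r}")
        paths[db_id] = target
    return paths
```

Injecting `fetch` keeps the skip-existing logic unit-testable without network access, which matches how the table's tests are framed.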
load_spider_questions
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_load_returns_questions | Valid db_ids produce question dicts | ["concert_singer"] | Non-empty list of dicts with query and db_id fields | happy |
| test_load_multiple_dbs | Multiple db_ids returns questions from all | ["concert_singer", "pets_1"] | Questions from both databases present | happy |
| test_load_includes_both_splits | Both train and validation splits loaded | ["concert_singer"] | Questions with both spider splits present | happy |
| test_load_connection_failure | Network unavailable (mocked) | Any db_ids, no network | Raises ConnectionError |error |
| test_load_empty_list | Empty db_ids list | [] | Returns empty list | edge |
Run: pytest tests/test_f004_dataset.py::TestLoadSpiderQuestions -v
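A sketch of the filtering logic follows. The split loader is injected so the logic is testable without the network; in the real script it would presumably wrap the HuggingFace dataset download, and the row field names (db_id, query) are taken from the table above:

```python
from typing import Callable, Iterable

def load_spider_questions(
    db_ids: list[str],
    load_split: Callable[[str], Iterable[dict]],
) -> list[dict]:
    """Collect Spider questions for the target databases from both splits.

    `load_split` is a hypothetical loader returning row dicts for a
    Spider split name; each kept row is tagged with its spider_split.
    """
    wanted = set(db_ids)
    questions = []
    for spider_split in ("train", "validation"):
        for row in load_split(spider_split):
            if row["db_id"] in wanted:
                questions.append({**row, "spider_split": spider_split})
    return questions
```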
validate_dataset
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_validate_clean_dataset | Valid dataset passes all checks | Well-formed dataset | Empty error list | happy |
| test_validate_missing_field | Record missing required field | Record without gold_sql | Error list includes missing-field message | error |
| test_validate_duplicate_ids | Two records share same question_id | Duplicate "concert_singer_train_001" | Error list includes duplicate message | error |
| test_validate_gold_sql_fails | gold_sql that does not execute | Record with broken SQL | Error list includes SQL execution message | error |
| test_validate_gold_answer_mismatch | gold_answer does not match re-execution | Record with wrong gold_answer | Error list includes mismatch message | error |
| test_validate_difficulty_distribution | Distribution check ~40/40/20 | Dataset with extreme skew (100% easy) | Warning or error about distribution | edge |
| test_validate_clean_splits | No question appears in both splits | Dataset with clean splits | No split errors | happy |
| test_validate_cross_split_leak | Same question in train and eval | Duplicate across splits | Error detected | error |
Run: pytest tests/test_f004_dataset.py::TestValidateDataset -v
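The validator accumulates error strings rather than failing fast, so one pass can report every problem. A partial sketch covering the structural checks (the SQL re-execution, distribution, and cross-split-leak checks would be added in the same accumulate-and-return style):

```python
from collections import Counter

REQUIRED_FIELDS = ("question_id", "gold_sql", "gold_answer", "answer_type",
                   "difficulty", "split", "tables_involved")

def validate_dataset(records: list[dict]) -> list[str]:
    """Return human-readable error strings; an empty list means valid."""
    errors: list[str] = []
    for rec in records:
        for field in REQUIRED_FIELDS:
            if field not in rec:
                qid = rec.get("question_id", "?")
                errors.append(f"{qid}: missing field {field}")
    counts = Counter(rec.get("question_id") for rec in records)
    for qid, n in counts.items():
        if n > 1:
            errors.append(f"duplicate question_id: {qid}")
    return errors
```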
2. Integration Tests
Flow: Primary Curation Pipeline
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Read db_list.json for target database IDs | Returns list of 10 db_ids | len(db_ids) >= 10 |
| 2 | Download Spider SQLite databases | All db_ids have corresponding .sqlite files | All paths exist, files > 0 bytes |
| 3 | Load raw Spider questions from HuggingFace | Questions loaded for all target db_ids | len(questions) > 0, each has db_id and query |
| 4 | Compute gold_answer for each question | gold_answer populated, failed queries skipped | No None gold_answers in output; skip count logged |
| 5 | Classify answer_type for each question | answer_type is one of 5 valid values | All values in {"integer","float","string","list","table"} |
| 6 | Extract tables_involved from gold_sql | Each question has non-empty tables_involved | All lists non-empty, all entries are strings |
| 7 | Classify difficulty from tables_involved | difficulty is one of 3 valid values | All values in {"easy","medium","hard"} |
| 8 | Assign train/eval split | Each question has valid split | All values in {"train","eval"} |
| 9 | Generate question_id | All IDs unique and match format | Regex match, len(set(ids)) == len(ids) |
| 10 | Validate full dataset | validate_dataset returns empty error list | len(errors) == 0 |
| 11 | Write output JSON files | questions_train.json and questions_eval.json exist | Files parseable as JSON, combined count >= 100 |
Run: pytest tests/test_f004_integration.py::TestCurationPipeline -v
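Step 9's ID generation and uniqueness check are small enough to pin down with a sketch (a minimal illustration; function names are assumptions):

```python
import re

QUESTION_ID_RE = re.compile(r"^[a-z_]+_(train|eval)_\d{3}$")

def generate_question_id(db_id: str, split: str, index: int) -> str:
    """Build IDs in the {db_id}_{split}_{index:03d} format from step 9."""
    return f"{db_id}_{split}_{index:03d}"

def all_ids_unique(ids: list[str]) -> bool:
    """Step 9 verification: len(set(ids)) == len(ids)."""
    return len(set(ids)) == len(ids)
```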
Flow: Gold SQL Execution Failure (Alternative)
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Provide question with broken gold_sql | gold_sql raises sqlite3.Error | Exception caught, not propagated |
| 2 | Check warning logged | Log contains db_id and error details | Log output includes warning |
| 3 | Question excluded from final dataset | Output does not contain the broken question | question_id absent from output |
| 4 | Remaining questions processed | Pipeline continues without interruption | Other questions have valid gold_answer |
Run: pytest tests/test_f004_integration.py::TestGoldSqlFailure -v
Flow: Validate-Only Mode (Alternative)
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Invoke script with --validate flag | Does not download or regenerate data | No network calls made |
| 2 | Load existing questions_train.json and questions_eval.json | Files read successfully | No FileNotFoundError |
| 3 | Locate SQLite databases in data/databases/ | All referenced databases found | All db_paths valid |
| 4 | Run validate_dataset() | Returns list of errors (may be empty) | Return type is list[str] |
| 5 | Exit code reflects validation result | 0 if valid, 1 if invalid | Process exit code matches |
Run: pytest tests/test_f004_integration.py::TestValidateOnlyMode -v
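The exit-code contract in steps 4-5 can be sketched as follows. The directory layout and file names come from this spec; the injected `validate` callable stands in for validate_dataset so the path is testable in isolation:

```python
import json
from pathlib import Path
from typing import Callable

def run_validate_only(
    questions_dir: Path,
    validate: Callable[[list], list[str]],
) -> int:
    """--validate code path: load existing outputs, run the validator,
    print any errors, and return the process exit code (0 valid, 1 not).
    """
    records: list = []
    for name in ("questions_train.json", "questions_eval.json"):
        records.extend(json.loads((questions_dir / name).read_text()))
    errors = validate(records)
    for err in errors:
        print(err)
    return 1 if errors else 0
```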
3. API Tests
No API endpoints defined for F004. This feature is a standalone curation script.
4. E2E Tests
Scenario: Full Dataset Generation
Setup: Clean environment with no existing output files. Network access available. data/questions/db_list.json contains 10 target database IDs.
Actions:
- Run the curation script end-to-end (no flags)
- Wait for completion
Expected:
- data/questions/questions_train.json exists and contains a valid JSON array
- data/questions/questions_eval.json exists and contains a valid JSON array
- Combined question count >= 100
- All questions pass validate_dataset() with zero errors
- At least 8 distinct database_name values represented
- Train/eval split approximately 70/30 (+/- 10%)
- All three difficulty levels present
- All five answer_type values present (or at least 3)
Run: python scripts/curate_dataset.py && python scripts/curate_dataset.py --validate
Scenario: Validate-Only on Pre-Generated Data
Setup: Output JSON files already exist from a prior run. SQLite databases present in data/databases/.
Actions:
- Run the curation script with the --validate flag only
Expected:
- No new files created or modified
- Validation output printed to stdout
- Exit code 0 if data is valid
Run: python scripts/curate_dataset.py --validate
Scenario: Idempotent Re-Run
Setup: Output JSON files already exist from a prior run.
Actions:
- Run the curation script again (full mode)
- Compare output files
Expected:
- Output files are regenerated
- Same question count (deterministic for same input)
- Database files not re-downloaded (skip existing)
Run: python scripts/curate_dataset.py
5. Edge Cases Checklist
- Null/None gold_answer values handled gracefully
- Empty string gold_sql skipped or rejected
- SQL with unicode characters in table/column names
- Very large query results (1000+ rows) handled by compute_gold_answer
- Database file that exists but is corrupt (0 bytes or invalid SQLite)
- db_list.json missing or empty
- db_list.json with duplicate db_ids
- Question with gold_sql containing multiple statements (semicolons)
- Question where gold_sql returns different results on re-execution (non-deterministic)
- Tables_involved extraction with SQL using CTEs (WITH clause)
- Tables_involved extraction with SQL using UNION across different tables
- Extremely long gold_sql (> 1000 chars)
- Database with no tables (empty schema)
- Retry behavior on transient network failure during HuggingFace download
- Retry behavior on transient network failure during database download
- Concurrent access to same SQLite file (if parallelized)
- Output JSON file encoding (UTF-8) with special characters in question_text
6. Evidence Requirements
| Category | Evidence Type | Example |
|---|---|---|
| Unit tests | pytest output | X passed, Y skipped |
| Integration | pytest output | X passed |
| E2E | Script output + file inspection | Generated 105 questions, Validation passed |
| Output files | JSON structure inspection | jq length questions_train.json returns count |
| Skip handling | Log output | Warning messages for skipped questions |
| Validation | Exit code | echo $? returns 0 |