# Verification Specification

**Feature:** F004
**Generated from:** specs/F004-VERIFICATION_INPUT.json
**Generated:** 2026-03-24

---

## 1. Unit Tests

### EnrichedQuestionRecord (Type)

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_record_all_fields_present | All 9 required fields populated | Full valid record dict | All fields accessible, no missing keys | happy |
| test_record_question_id_format | question_id matches `{db_id}_{split}_{index:03d}` | `"concert_singer_train_007"` | Passes regex `^[a-z0-9_]+_(train\|eval)_\d{3}$` | happy |
| test_record_question_id_invalid | Rejects malformed question_id | `"bad-id"` | Validation error or detectable as invalid | error |
| test_record_answer_type_enum | answer_type is one of the allowed values | `"integer"`, `"float"`, `"string"`, `"list"`, `"table"` | Each accepted | happy |
| test_record_answer_type_invalid | Rejects unknown answer_type | `"boolean"` | Rejected or flagged | error |
| test_record_difficulty_enum | difficulty is one of the allowed values | `"easy"`, `"medium"`, `"hard"` | Each accepted | happy |
| test_record_difficulty_invalid | Rejects unknown difficulty | `"extreme"` | Rejected or flagged | error |
| test_record_split_enum | split is one of the allowed values | `"train"`, `"eval"` | Each accepted | happy |
| test_record_split_invalid | Rejects unknown split | `"test"` | Rejected or flagged | error |
| test_record_tables_involved_nonempty | tables_involved has at least one entry | `["students"]` | Accepted | happy |
| test_record_tables_involved_empty | Empty tables_involved is rejected | `[]` | Rejected or flagged by validation | edge |
| test_record_gold_sql_nonempty | gold_sql is a non-empty string | `"SELECT COUNT(*) FROM students"` | Accepted | happy |
| test_record_gold_sql_empty | Empty gold_sql is rejected | `""` | Rejected or flagged | edge |
| test_record_gold_answer_types | gold_answer can hold int, float, str, list, list-of-lists | `42`, `3.14`, `"Alice"`, `[1,2]`, `[[1,"a"]]` | Each stored and retrievable | happy |

**Run:** `pytest tests/test_f004_dataset.py::TestEnrichedQuestionRecord -v`

---

### classify_answer_type

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_classify_integer | Single integer answer | `42` | `"integer"` | happy |
| test_classify_float | Single float answer | `3.14` | `"float"` | happy |
| test_classify_string | Single string answer | `"Alice"` | `"string"` | happy |
| test_classify_list | Flat list (single column, multiple rows) | `[1, 2, 3]` | `"list"` | happy |
| test_classify_table | List of tuples/lists (multi-column result) | `[(1, "a"), (2, "b")]` | `"table"` | happy |
| test_classify_none | None/null answer | `None` | Defined behavior (error or specific type) | edge |
| test_classify_empty_list | Empty list | `[]` | `"list"` or defined behavior | edge |
| test_classify_single_row_tuple | Single-element tuple | `(42,)` | `"integer"` (unwrapped) or `"list"` | edge |
| test_classify_nested_single | Single-row multi-column result | `[(1, "a")]` | `"table"` | edge |
| test_classify_boolean | Boolean answer | `True` | Defined fallback behavior | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestClassifyAnswerType -v`

---

### extract_tables_involved

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_extract_single_table | Simple SELECT from one table | `"SELECT * FROM students"` | `["students"]` | happy |
| test_extract_join | JOIN with two tables | `"SELECT s.name FROM students s JOIN courses c ON s.id = c.student_id"` | `["courses", "students"]` (sorted) | happy |
| test_extract_subquery | Subquery referencing a different table | `"SELECT * FROM students WHERE id IN (SELECT student_id FROM enrollments)"` | `["enrollments", "students"]` (sorted) | happy |
| test_extract_deduplication | Same table referenced multiple times | `"SELECT a.x, b.y FROM t1 a JOIN t1 b ON a.id = b.id"` | `["t1"]` (deduplicated) | happy |
| test_extract_case_insensitive | Mixed-case SQL keywords | `"select * FROM Students"` | `["Students"]` or `["students"]` (consistent) | edge |
| test_extract_with_alias | Table alias resolves to the table name | `"SELECT s.name FROM students AS s"` | `["students"]` | edge |
| test_extract_multiple_joins | Three or more tables joined | `"SELECT * FROM a JOIN b ON a.id=b.id JOIN c ON b.id=c.id"` | `["a", "b", "c"]` (sorted) | happy |
| test_extract_empty_sql | Empty SQL string | `""` | `[]` or error | edge |
| test_extract_no_from | SQL without a FROM clause | `"SELECT 1+1"` | `[]` | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestExtractTablesInvolved -v`

---

### classify_difficulty

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_difficulty_easy_1_table | 1 table involved | `["students"]` | `"easy"` | happy |
| test_difficulty_easy_2_tables | 2 tables involved | `["students", "courses"]` | `"easy"` | happy |
| test_difficulty_medium_3_tables | 3 tables involved | `["a", "b", "c"]` | `"medium"` | happy |
| test_difficulty_hard_4_tables | 4 tables involved | `["a", "b", "c", "d"]` | `"hard"` | happy |
| test_difficulty_hard_many_tables | 6+ tables involved | `["a", "b", "c", "d", "e", "f"]` | `"hard"` | happy |
| test_difficulty_empty_tables | 0 tables (edge case) | `[]` | Defined behavior (error or `"easy"`) | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestClassifyDifficulty -v`

---

### compute_gold_answer

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_compute_valid_select | Valid SELECT on a real SQLite DB | `"SELECT COUNT(*) FROM singer"`, valid db_path | Integer result | happy |
| test_compute_multirow | Query returning multiple rows | `"SELECT * FROM singer LIMIT 3"`, valid db_path | List/table result | happy |
| test_compute_invalid_sql | Syntactically invalid SQL | `"SELCT * FORM x"`, valid db_path | Raises `sqlite3.Error` | error |
| test_compute_missing_table | SQL references a non-existent table | `"SELECT * FROM nonexistent"`, valid db_path | Raises `sqlite3.Error` | error |
| test_compute_missing_db | Database file does not exist | `"SELECT 1"`, `/tmp/nonexistent.sqlite` | Raises `sqlite3.Error` or `FileNotFoundError` | error |
| test_compute_empty_result | Query returns no rows | `"SELECT * FROM singer WHERE 1=0"`, valid db_path | Empty result (e.g., `[]`) | edge |
| test_compute_null_result | Query returning NULL | `"SELECT NULL"`, valid db_path | `None` or defined null handling | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestComputeGoldAnswer -v`

---

### assign_splits

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_assign_train_from_spider_train | Spider train split maps to train | Questions with `spider_split="train"` | `split="train"` | happy |
| test_assign_eval_from_spider_validation | Spider validation split maps to eval | Questions with `spider_split="validation"` | `split="eval"` | happy |
| test_assign_preserves_all_questions | No questions are dropped | 10 input questions | 10 output questions | happy |
| test_assign_mixed_splits | Mix of train and validation | 7 train + 3 validation | 7 train + 3 eval | happy |
| test_assign_all_train | All questions from the train split | 5 train questions | All `split="train"` | edge |
| test_assign_all_eval | All questions from the validation split | 5 validation questions | All `split="eval"` | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestAssignSplits -v`

---

### download_spider_databases

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_download_creates_files | Download for known db_ids produces SQLite files | `["concert_singer"]`, temp dir | Dict mapping db_id to valid Path; file exists | happy |
| test_download_skips_existing | Existing database is not re-downloaded | Pre-existing file, same db_id | File unchanged, no download attempt | happy |
| test_download_unknown_db | Unknown db_id | `["nonexistent_db_xyz"]` | Raises `FileNotFoundError` | error |
| test_download_empty_list | Empty db_ids list | `[]` | Returns empty dict | edge |
| test_download_returns_correct_paths | Paths follow `{output_dir}/{db_id}/{db_id}.sqlite` | `["concert_singer"]` | Path matches expected pattern | happy |

**Run:** `pytest tests/test_f004_dataset.py::TestDownloadSpiderDatabases -v`

---

### load_spider_questions

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_load_returns_questions | Valid db_ids produce question dicts | `["concert_singer"]` | Non-empty list of dicts with `query` and `db_id` fields | happy |
| test_load_multiple_dbs | Multiple db_ids return questions from all | `["concert_singer", "pets_1"]` | Questions from both databases present | happy |
| test_load_includes_both_splits | Both train and validation splits loaded | `["concert_singer"]` | Questions from both Spider splits present | happy |
| test_load_connection_failure | Network unavailable (mocked) | Any db_ids, no network | Raises `ConnectionError` | error |
| test_load_empty_list | Empty db_ids list | `[]` | Returns empty list | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestLoadSpiderQuestions -v`

---

### validate_dataset

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_validate_clean_dataset | Valid dataset passes all checks | Well-formed dataset | Empty error list | happy |
| test_validate_missing_field | Record missing a required field | Record without `gold_sql` | Error list includes missing-field message | error |
| test_validate_duplicate_ids | Two records share the same question_id | Duplicate `"concert_singer_train_001"` | Error list includes duplicate message | error |
| test_validate_gold_sql_fails | gold_sql that does not execute | Record with broken SQL | Error list includes SQL execution message | error |
| test_validate_gold_answer_mismatch | gold_answer does not match re-execution | Record with wrong gold_answer | Error list includes mismatch message | error |
| test_validate_difficulty_distribution | Distribution check (~40/40/20) | Dataset with extreme skew (100% easy) | Warning or error about distribution | edge |
| test_validate_clean_splits | No question appears in both splits | Dataset with clean splits | No split errors | happy |
| test_validate_cross_split_leak | Same question in train and eval | Duplicate across splits | Error detected | error |

**Run:** `pytest tests/test_f004_dataset.py::TestValidateDataset -v`

---

## 2. Integration Tests

### Flow: Primary Curation Pipeline

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Read db_list.json for target database IDs | Returns list of 10 db_ids | `len(db_ids) >= 10` |
| 2 | Download Spider SQLite databases | All db_ids have corresponding .sqlite files | All paths exist, files > 0 bytes |
| 3 | Load raw Spider questions from HuggingFace | Questions loaded for all target db_ids | `len(questions) > 0`, each has `db_id` and `query` |
| 4 | Compute gold_answer for each question | gold_answer populated, failed queries skipped | No None gold_answers in output; skip count logged |
| 5 | Classify answer_type for each question | answer_type is one of 5 valid values | All values in `{"integer","float","string","list","table"}` |
| 6 | Extract tables_involved from gold_sql | Each question has non-empty tables_involved | All lists non-empty, all entries are strings |
| 7 | Classify difficulty from tables_involved | difficulty is one of 3 valid values | All values in `{"easy","medium","hard"}` |
| 8 | Assign train/eval split | Each question has a valid split | All values in `{"train","eval"}` |
| 9 | Generate question_id | All IDs unique and match format | Regex match, `len(set(ids)) == len(ids)` |
| 10 | Validate full dataset | validate_dataset returns empty error list | `len(errors) == 0` |
| 11 | Write output JSON files | questions_train.json and questions_eval.json exist | Files parseable as JSON, combined count >= 100 |

**Run:** `pytest tests/test_f004_integration.py::TestCurationPipeline -v`

---

### Flow: Gold SQL Execution Failure (Alternative)

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Provide question with broken gold_sql | gold_sql raises sqlite3.Error | Exception caught, not propagated |
| 2 | Check warning logged | Log contains db_id and error details | Log output includes warning |
| 3 | Question excluded from final dataset | Output does not contain the broken question | question_id absent from output |
| 4 | Remaining questions processed | Pipeline continues without interruption | Other questions have valid gold_answer |

**Run:** `pytest tests/test_f004_integration.py::TestGoldSqlFailure -v`

---

### Flow: Validate-Only Mode (Alternative)

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Invoke script with `--validate` flag | Does not download or regenerate data | No network calls made |
| 2 | Load existing questions_train.json and questions_eval.json | Files read successfully | No FileNotFoundError |
| 3 | Locate SQLite databases in data/databases/ | All referenced databases found | All db_paths valid |
| 4 | Run validate_dataset() | Returns list of errors (may be empty) | Return type is list[str] |
| 5 | Exit code reflects validation result | 0 if valid, 1 if invalid | Process exit code matches |

**Run:** `pytest tests/test_f004_integration.py::TestValidateOnlyMode -v`

---

## 3. API Tests

No API endpoints are defined for F004; this feature is a standalone curation script.

---

## 4. E2E Tests

### Scenario: Full Dataset Generation

**Setup:** Clean environment with no existing output files. Network access available. `data/questions/db_list.json` contains 10 target database IDs.

**Actions:**

1. Run the curation script end-to-end (no flags)
2. Wait for completion

**Expected:**

- `data/questions/questions_train.json` exists and contains a valid JSON array
- `data/questions/questions_eval.json` exists and contains a valid JSON array
- Combined question count >= 100
- All questions pass `validate_dataset()` with zero errors
- At least 8 distinct `database_name` values represented
- Train/eval split approximately 70/30 (±10%)
- All three difficulty levels present
- All five answer_type values present (or at least 3)

**Run:** `python scripts/curate_dataset.py && python scripts/curate_dataset.py --validate`

---

### Scenario: Validate-Only on Pre-Generated Data

**Setup:** Output JSON files already exist from a prior run. SQLite databases present in `data/databases/`.

**Actions:**

1. Run the curation script with the `--validate` flag only

**Expected:**

- No new files created or modified
- Validation output printed to stdout
- Exit code 0 if data is valid

**Run:** `python scripts/curate_dataset.py --validate`

---

### Scenario: Idempotent Re-Run

**Setup:** Output JSON files already exist from a prior run.

**Actions:**

1. Run the curation script again (full mode)
2. Compare output files

**Expected:**

- Output files are regenerated
- Same question count (deterministic for the same input)
- Database files not re-downloaded (existing files are skipped)

**Run:** `python scripts/curate_dataset.py`

---

## 5. Edge Cases Checklist

- [ ] Null/None gold_answer values handled gracefully
- [ ] Empty-string gold_sql skipped or rejected
- [ ] SQL with unicode characters in table/column names
- [ ] Very large query results (1000+ rows) handled by compute_gold_answer
- [ ] Database file that exists but is corrupt (0 bytes or invalid SQLite)
- [ ] db_list.json missing or empty
- [ ] db_list.json with duplicate db_ids
- [ ] Question with gold_sql containing multiple statements (semicolons)
- [ ] Question where gold_sql returns different results on re-execution (non-deterministic)
- [ ] tables_involved extraction for SQL using CTEs (WITH clause)
- [ ] tables_involved extraction for SQL using UNION across different tables
- [ ] Extremely long gold_sql (> 1000 chars)
- [ ] Database with no tables (empty schema)
- [ ] Retry behavior on transient network failure during HuggingFace download
- [ ] Retry behavior on transient network failure during database download
- [ ] Concurrent access to the same SQLite file (if parallelized)
- [ ] Output JSON file encoding (UTF-8) with special characters in question_text

---

## 6. Evidence Requirements

| Category | Evidence Type | Example |
|----------|---------------|---------|
| Unit tests | pytest output | `X passed, Y skipped` |
| Integration | pytest output | `X passed` |
| E2E | Script output + file inspection | `Generated 105 questions`, `Validation passed` |
| Output files | JSON structure inspection | `jq length questions_train.json` returns the count |
| Skip handling | Log output | Warning messages for skipped questions |
| Validation | Exit code | `echo $?` returns `0` |
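---

As a non-normative companion to the `classify_answer_type` unit-test table in §1, the contract can be sketched in Python. The edge-case choices here (rejecting booleans, classifying `[]` as `"list"`, and leaving `(42,)` as `"list"` rather than unwrapping) are assumptions picked from the "defined behavior" options in the table; the tested module may choose differently:

```python
def classify_answer_type(answer):
    """Classify a gold answer into one of the five answer_type values."""
    # bool is a subclass of int in Python, so it must be checked first;
    # rejecting it outright is one of the fallbacks the spec allows.
    if isinstance(answer, bool):
        raise ValueError("boolean answers have no defined answer_type")
    if isinstance(answer, int):
        return "integer"
    if isinstance(answer, float):
        return "float"
    if isinstance(answer, str):
        return "string"
    if isinstance(answer, (list, tuple)):
        # Any nested row means a multi-column result: "table".
        if any(isinstance(row, (list, tuple)) for row in answer):
            return "table"
        return "list"  # flat single-column result, including the empty list
    raise ValueError(f"unclassifiable answer: {answer!r}")
```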
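---

The `extract_tables_involved` table in §1 can be satisfied by a deliberately naive regex sketch like the one below. This is an illustration only: a production implementation should use a real SQL parser, and this version does not handle the CTE or UNION cases from the edge-cases checklist (a `WITH` alias would be misreported as a table):

```python
import re

def extract_tables_involved(sql):
    """Collect table names following FROM/JOIN keywords (naive regex sketch)."""
    # Matches identifiers immediately after FROM or JOIN, case-insensitively;
    # aliases ("students s", "students AS s") are not captured.
    names = re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)",
                       sql, re.IGNORECASE)
    # Deduplicate and sort, matching the "(sorted)"/"(deduplicated)" expectations.
    return sorted(set(names))
```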
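---

The `classify_difficulty` thresholds read directly off the unit-test table (1-2 tables easy, 3 medium, 4+ hard); a minimal sketch follows. Raising on an empty list is an assumption (the table allows either an error or `"easy"`), and the `QUESTION_ID_RE` pattern here allows digits in db_ids (e.g. `pets_1`), which is an assumed widening of the documented format:

```python
import re

# Assumed pattern for {db_id}_{split}_{index:03d}; digits allowed for ids like pets_1.
QUESTION_ID_RE = re.compile(r"^[a-z0-9_]+_(train|eval)_\d{3}$")

def classify_difficulty(tables_involved):
    """Map the number of tables involved onto easy/medium/hard."""
    n = len(tables_involved)
    if n == 0:
        # Assumption: treat an empty table list as a data error.
        raise ValueError("tables_involved must be non-empty")
    if n <= 2:
        return "easy"
    if n == 3:
        return "medium"
    return "hard"
```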
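---

A sketch of the `compute_gold_answer` behavior exercised in §1, using only the standard-library `sqlite3` module. The result-shaping rules (scalar for a 1x1 result, flat list for one column, list of lists otherwise) are assumptions consistent with the unit-test table; opening the database read-only via a `file:` URI makes a missing file raise `sqlite3.Error` rather than silently creating an empty database:

```python
import sqlite3

def compute_gold_answer(gold_sql, db_path):
    """Execute gold_sql against an SQLite file and shape the result."""
    # mode=ro: a missing or unreadable file raises sqlite3.OperationalError.
    con = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = con.execute(gold_sql).fetchall()
    finally:
        con.close()
    if not rows:
        return []                        # empty result set
    if len(rows) == 1 and len(rows[0]) == 1:
        return rows[0][0]                # single scalar (None for SELECT NULL)
    if all(len(r) == 1 for r in rows):
        return [r[0] for r in rows]      # single column -> flat list
    return [list(r) for r in rows]       # multi-column -> table
```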
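---

The `assign_splits` mapping described in §1 (Spider `train` stays `train`, Spider `validation` becomes `eval`, no questions dropped) is simple enough to sketch directly; the `spider_split` input key is taken from the unit-test table, and copying each dict rather than mutating in place is an assumed design choice:

```python
def assign_splits(questions):
    """Map Spider's split names onto this dataset's train/eval split."""
    mapping = {"train": "train", "validation": "eval"}
    out = []
    for q in questions:
        q = dict(q)  # work on a copy; do not mutate the caller's records
        q["split"] = mapping[q["spider_split"]]
        out.append(q)
    return out
```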