# Verification Specification

**Feature:** F004
**Generated from:** specs/F004-VERIFICATION_INPUT.json
**Generated:** 2026-03-24

---

## 1. Unit Tests

### EnrichedQuestionRecord (Type)

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_record_all_fields_present | All 9 required fields populated | Full valid record dict | All fields accessible, no missing keys | happy |
| test_record_question_id_format | question_id matches `{db_id}_{split}_{index:03d}` | `"concert_singer_train_007"` | Passes regex `^[a-z0-9_]+_(train\|eval)_\d{3}$` | happy |
| test_record_question_id_invalid | Rejects malformed question_id | `"bad-id"` | Validation error or detectable as invalid | error |
| test_record_answer_type_enum | answer_type is one of the allowed values | `"integer"`, `"float"`, `"string"`, `"list"`, `"table"` | Each accepted | happy |
| test_record_answer_type_invalid | Rejects unknown answer_type | `"boolean"` | Rejected or flagged | error |
| test_record_difficulty_enum | difficulty is one of the allowed values | `"easy"`, `"medium"`, `"hard"` | Each accepted | happy |
| test_record_difficulty_invalid | Rejects unknown difficulty | `"extreme"` | Rejected or flagged | error |
| test_record_split_enum | split is one of the allowed values | `"train"`, `"eval"` | Each accepted | happy |
| test_record_split_invalid | Rejects unknown split | `"test"` | Rejected or flagged | error |
| test_record_tables_involved_nonempty | tables_involved has at least one entry | `["students"]` | Accepted | happy |
| test_record_tables_involved_empty | Empty tables_involved is rejected | `[]` | Rejected or flagged by validation | edge |
| test_record_gold_sql_nonempty | gold_sql is a non-empty string | `"SELECT COUNT(*) FROM students"` | Accepted | happy |
| test_record_gold_sql_empty | Empty gold_sql is rejected | `""` | Rejected or flagged | edge |
| test_record_gold_answer_types | gold_answer can hold int, float, str, list, list-of-lists | `42`, `3.14`, `"Alice"`, `[1,2]`, `[[1,"a"]]` | Each stored and retrievable | happy |

**Run:** `pytest tests/test_f004_dataset.py::TestEnrichedQuestionRecord -v`

---

### classify_answer_type

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_classify_integer | Single integer answer | `42` | `"integer"` | happy |
| test_classify_float | Single float answer | `3.14` | `"float"` | happy |
| test_classify_string | Single string answer | `"Alice"` | `"string"` | happy |
| test_classify_list | Flat list (single column, multiple rows) | `[1, 2, 3]` | `"list"` | happy |
| test_classify_table | List of tuples/lists (multi-column result) | `[(1, "a"), (2, "b")]` | `"table"` | happy |
| test_classify_none | None/null answer | `None` | Defined behavior (error or specific type) | edge |
| test_classify_empty_list | Empty list | `[]` | `"list"` or defined behavior | edge |
| test_classify_single_row_tuple | Single-element tuple | `(42,)` | `"integer"` (unwrapped) or `"list"` | edge |
| test_classify_nested_single | Single-row multi-column result | `[(1, "a")]` | `"table"` | edge |
| test_classify_boolean | Boolean answer | `True` | Defined fallback behavior | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestClassifyAnswerType -v`

---

### extract_tables_involved

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_extract_single_table | Simple SELECT from one table | `"SELECT * FROM students"` | `["students"]` | happy |
| test_extract_join | JOIN with two tables | `"SELECT s.name FROM students s JOIN courses c ON s.id = c.student_id"` | `["courses", "students"]` (sorted) | happy |
| test_extract_subquery | Subquery referencing a different table | `"SELECT * FROM students WHERE id IN (SELECT student_id FROM enrollments)"` | `["enrollments", "students"]` (sorted) | happy |
| test_extract_deduplication | Same table referenced multiple times | `"SELECT a.x, b.y FROM t1 a JOIN t1 b ON a.id = b.id"` | `["t1"]` (deduplicated) | happy |
| test_extract_case_insensitive | Mixed-case SQL keywords | `"select * FROM Students"` | `["Students"]` or `["students"]` (consistent) | edge |
| test_extract_with_alias | Table alias resolves to the table name | `"SELECT s.name FROM students AS s"` | `["students"]` | edge |
| test_extract_multiple_joins | Three or more tables joined | `"SELECT * FROM a JOIN b ON a.id=b.id JOIN c ON b.id=c.id"` | `["a", "b", "c"]` (sorted) | happy |
| test_extract_empty_sql | Empty SQL string | `""` | `[]` or error | edge |
| test_extract_no_from | SQL without a FROM clause | `"SELECT 1+1"` | `[]` | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestExtractTablesInvolved -v`

---

### classify_difficulty

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_difficulty_easy_1_table | 1 table involved | `["students"]` | `"easy"` | happy |
| test_difficulty_easy_2_tables | 2 tables involved | `["students", "courses"]` | `"easy"` | happy |
| test_difficulty_medium_3_tables | 3 tables involved | `["a", "b", "c"]` | `"medium"` | happy |
| test_difficulty_hard_4_tables | 4 tables involved | `["a", "b", "c", "d"]` | `"hard"` | happy |
| test_difficulty_hard_many_tables | 6+ tables involved | `["a", "b", "c", "d", "e", "f"]` | `"hard"` | happy |
| test_difficulty_empty_tables | 0 tables (edge case) | `[]` | Defined behavior (error or `"easy"`) | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestClassifyDifficulty -v`

---

### compute_gold_answer

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_compute_valid_select | Valid SELECT on a real SQLite DB | `"SELECT COUNT(*) FROM singer"`, valid db_path | Integer result | happy |
| test_compute_multirow | Query returning multiple rows | `"SELECT * FROM singer LIMIT 3"`, valid db_path | List/table result | happy |
| test_compute_invalid_sql | Syntactically invalid SQL | `"SELCT * FORM x"`, valid db_path | Raises `sqlite3.Error` | error |
| test_compute_missing_table | SQL references a non-existent table | `"SELECT * FROM nonexistent"`, valid db_path | Raises `sqlite3.Error` | error |
| test_compute_missing_db | Database file does not exist | `"SELECT 1"`, `/tmp/nonexistent.sqlite` | Raises `sqlite3.Error` or `FileNotFoundError` | error |
| test_compute_empty_result | Query returns no rows | `"SELECT * FROM singer WHERE 1=0"`, valid db_path | Empty result (e.g., `[]`) | edge |
| test_compute_null_result | Query returning NULL | `"SELECT NULL"`, valid db_path | `None` or defined null handling | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestComputeGoldAnswer -v`

---

### assign_splits

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_assign_train_from_spider_train | Spider train split maps to train | Questions with `spider_split="train"` | `split="train"` | happy |
| test_assign_eval_from_spider_validation | Spider validation split maps to eval | Questions with `spider_split="validation"` | `split="eval"` | happy |
| test_assign_preserves_all_questions | No questions are dropped | 10 input questions | 10 output questions | happy |
| test_assign_mixed_splits | Mix of train and validation | 7 train + 3 validation | 7 train + 3 eval | happy |
| test_assign_all_train | All questions from the train split | 5 train questions | All `split="train"` | edge |
| test_assign_all_eval | All questions from the validation split | 5 validation questions | All `split="eval"` | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestAssignSplits -v`

---

### download_spider_databases

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_download_creates_files | Download for known db_ids produces SQLite files | `["concert_singer"]`, temp dir | Dict mapping db_id to valid Path; file exists | happy |
| test_download_skips_existing | Existing database is not re-downloaded | Pre-existing file, same db_id | File unchanged, no download attempt | happy |
| test_download_unknown_db | Unknown db_id | `["nonexistent_db_xyz"]` | Raises `FileNotFoundError` | error |
| test_download_empty_list | Empty db_ids list | `[]` | Returns empty dict | edge |
| test_download_returns_correct_paths | Paths follow `{output_dir}/{db_id}/{db_id}.sqlite` | `["concert_singer"]` | Path matches expected pattern | happy |

**Run:** `pytest tests/test_f004_dataset.py::TestDownloadSpiderDatabases -v`

---

### load_spider_questions

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_load_returns_questions | Valid db_ids produce question dicts | `["concert_singer"]` | Non-empty list of dicts with `query` and `db_id` fields | happy |
| test_load_multiple_dbs | Multiple db_ids return questions from all | `["concert_singer", "pets_1"]` | Questions from both databases present | happy |
| test_load_includes_both_splits | Both train and validation splits loaded | `["concert_singer"]` | Questions from both Spider splits present | happy |
| test_load_connection_failure | Network unavailable (mocked) | Any db_ids, no network | Raises `ConnectionError` | error |
| test_load_empty_list | Empty db_ids list | `[]` | Returns empty list | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestLoadSpiderQuestions -v`

---

### validate_dataset

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_validate_clean_dataset | Valid dataset passes all checks | Well-formed dataset | Empty error list | happy |
| test_validate_missing_field | Record missing a required field | Record without `gold_sql` | Error list includes missing-field message | error |
| test_validate_duplicate_ids | Two records share the same question_id | Duplicate `"concert_singer_train_001"` | Error list includes duplicate message | error |
| test_validate_gold_sql_fails | gold_sql that does not execute | Record with broken SQL | Error list includes SQL execution message | error |
| test_validate_gold_answer_mismatch | gold_answer does not match re-execution | Record with wrong gold_answer | Error list includes mismatch message | error |
| test_validate_difficulty_distribution | Distribution check (~40/40/20) | Dataset with extreme skew (100% easy) | Warning or error about distribution | edge |
| test_validate_clean_splits | No question appears in both splits | Dataset with clean splits | No split errors | happy |
| test_validate_cross_split_leak | Same question in train and eval | Duplicate across splits | Error detected | error |

**Run:** `pytest tests/test_f004_dataset.py::TestValidateDataset -v`

---

## 2. Integration Tests

### Flow: Primary Curation Pipeline

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Read db_list.json for target database IDs | Returns list of 10 db_ids | `len(db_ids) >= 10` |
| 2 | Download Spider SQLite databases | All db_ids have corresponding .sqlite files | All paths exist, files > 0 bytes |
| 3 | Load raw Spider questions from HuggingFace | Questions loaded for all target db_ids | `len(questions) > 0`, each has `db_id` and `query` |
| 4 | Compute gold_answer for each question | gold_answer populated, failed queries skipped | No None gold_answers in output; skip count logged |
| 5 | Classify answer_type for each question | answer_type is one of 5 valid values | All values in `{"integer","float","string","list","table"}` |
| 6 | Extract tables_involved from gold_sql | Each question has non-empty tables_involved | All lists non-empty, all entries are strings |
| 7 | Classify difficulty from tables_involved | difficulty is one of 3 valid values | All values in `{"easy","medium","hard"}` |
| 8 | Assign train/eval split | Each question has a valid split | All values in `{"train","eval"}` |
| 9 | Generate question_id | All IDs unique and match format | Regex match, `len(set(ids)) == len(ids)` |
| 10 | Validate full dataset | validate_dataset returns empty error list | `len(errors) == 0` |
| 11 | Write output JSON files | questions_train.json and questions_eval.json exist | Files parseable as JSON, combined count >= 100 |

**Run:** `pytest tests/test_f004_integration.py::TestCurationPipeline -v`

---

### Flow: Gold SQL Execution Failure (Alternative)

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Provide question with broken gold_sql | gold_sql raises sqlite3.Error | Exception caught, not propagated |
| 2 | Check warning logged | Log contains db_id and error details | Log output includes warning |
| 3 | Question excluded from final dataset | Output does not contain the broken question | question_id absent from output |
| 4 | Remaining questions processed | Pipeline continues without interruption | Other questions have valid gold_answer |

**Run:** `pytest tests/test_f004_integration.py::TestGoldSqlFailure -v`

---

### Flow: Validate-Only Mode (Alternative)

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Invoke script with `--validate` flag | Does not download or regenerate data | No network calls made |
| 2 | Load existing questions_train.json and questions_eval.json | Files read successfully | No FileNotFoundError |
| 3 | Locate SQLite databases in data/databases/ | All referenced databases found | All db_paths valid |
| 4 | Run validate_dataset() | Returns list of errors (may be empty) | Return type is list[str] |
| 5 | Exit code reflects validation result | 0 if valid, 1 if invalid | Process exit code matches |

**Run:** `pytest tests/test_f004_integration.py::TestValidateOnlyMode -v`

---

## 3. API Tests

No API endpoints are defined for F004; this feature is a standalone curation script.

---

## 4. E2E Tests

### Scenario: Full Dataset Generation

**Setup:** Clean environment with no existing output files. Network access available. `data/questions/db_list.json` contains 10 target database IDs.

**Actions:**

1. Run the curation script end-to-end (no flags)
2. Wait for completion

**Expected:**

- `data/questions/questions_train.json` exists and contains a valid JSON array
- `data/questions/questions_eval.json` exists and contains a valid JSON array
- Combined question count >= 100
- All questions pass `validate_dataset()` with zero errors
- At least 8 distinct `database_name` values represented
- Train/eval split approximately 70/30 (±10%)
- All three difficulty levels present
- All five answer_type values present (or at least 3)

**Run:** `python scripts/curate_dataset.py && python scripts/curate_dataset.py --validate`

---

### Scenario: Validate-Only on Pre-Generated Data

**Setup:** Output JSON files already exist from a prior run. SQLite databases present in `data/databases/`.

**Actions:**

1. Run the curation script with the `--validate` flag only

**Expected:**

- No new files created or modified
- Validation output printed to stdout
- Exit code 0 if data is valid

**Run:** `python scripts/curate_dataset.py --validate`

---

### Scenario: Idempotent Re-Run

**Setup:** Output JSON files already exist from a prior run.

**Actions:**

1. Run the curation script again (full mode)
2. Compare output files

**Expected:**

- Output files are regenerated
- Same question count (deterministic for the same input)
- Database files not re-downloaded (existing files are skipped)

**Run:** `python scripts/curate_dataset.py`

---

## 5. Edge Cases Checklist

- [ ] Null/None gold_answer values handled gracefully
- [ ] Empty-string gold_sql skipped or rejected
- [ ] SQL with unicode characters in table/column names
- [ ] Very large query results (1000+ rows) handled by compute_gold_answer
- [ ] Database file that exists but is corrupt (0 bytes or invalid SQLite)
- [ ] db_list.json missing or empty
- [ ] db_list.json with duplicate db_ids
- [ ] Question with gold_sql containing multiple statements (semicolons)
- [ ] Question where gold_sql returns different results on re-execution (non-deterministic)
- [ ] tables_involved extraction for SQL using CTEs (WITH clause)
- [ ] tables_involved extraction for SQL using UNION across different tables
- [ ] Extremely long gold_sql (> 1000 chars)
- [ ] Database with no tables (empty schema)
- [ ] Retry behavior on transient network failure during HuggingFace download
- [ ] Retry behavior on transient network failure during database download
- [ ] Concurrent access to the same SQLite file (if parallelized)
- [ ] Output JSON file encoding (UTF-8) with special characters in question_text

---

## 6. Evidence Requirements

| Category | Evidence Type | Example |
|----------|---------------|---------|
| Unit tests | pytest output | `X passed, Y skipped` |
| Integration | pytest output | `X passed` |
| E2E | Script output + file inspection | `Generated 105 questions`, `Validation passed` |
| Output files | JSON structure inspection | `jq length questions_train.json` returns the count |
| Skip handling | Log output | Warning messages for skipped questions |
| Validation | Exit code | `echo $?` returns `0` |
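---

As a non-normative companion to the `classify_answer_type` unit-test table in §1, the contract can be sketched in Python. The edge-case choices here (rejecting booleans, classifying `[]` as `"list"`, and leaving `(42,)` as `"list"` rather than unwrapping) are assumptions picked from the "defined behavior" options in the table; the tested module may choose differently:

```python
def classify_answer_type(answer):
    """Classify a gold answer into one of the five answer_type values."""
    # bool is a subclass of int in Python, so it must be checked first;
    # rejecting it outright is one of the fallbacks the spec allows.
    if isinstance(answer, bool):
        raise ValueError("boolean answers have no defined answer_type")
    if isinstance(answer, int):
        return "integer"
    if isinstance(answer, float):
        return "float"
    if isinstance(answer, str):
        return "string"
    if isinstance(answer, (list, tuple)):
        # Any nested row means a multi-column result: "table".
        if any(isinstance(row, (list, tuple)) for row in answer):
            return "table"
        return "list"  # flat single-column result, including the empty list
    raise ValueError(f"unclassifiable answer: {answer!r}")
```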
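---

The `extract_tables_involved` table in §1 can be satisfied by a deliberately naive regex sketch like the one below. This is an illustration only: a production implementation should use a real SQL parser, and this version does not handle the CTE or UNION cases from the edge-cases checklist (a `WITH` alias would be misreported as a table):

```python
import re

def extract_tables_involved(sql):
    """Collect table names following FROM/JOIN keywords (naive regex sketch)."""
    # Matches identifiers immediately after FROM or JOIN, case-insensitively;
    # aliases ("students s", "students AS s") are not captured.
    names = re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)",
                       sql, re.IGNORECASE)
    # Deduplicate and sort, matching the "(sorted)"/"(deduplicated)" expectations.
    return sorted(set(names))
```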
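---

The `classify_difficulty` thresholds read directly off the unit-test table (1-2 tables easy, 3 medium, 4+ hard); a minimal sketch follows. Raising on an empty list is an assumption (the table allows either an error or `"easy"`), and the `QUESTION_ID_RE` pattern here allows digits in db_ids (e.g. `pets_1`), which is an assumed widening of the documented format:

```python
import re

# Assumed pattern for {db_id}_{split}_{index:03d}; digits allowed for ids like pets_1.
QUESTION_ID_RE = re.compile(r"^[a-z0-9_]+_(train|eval)_\d{3}$")

def classify_difficulty(tables_involved):
    """Map the number of tables involved onto easy/medium/hard."""
    n = len(tables_involved)
    if n == 0:
        # Assumption: treat an empty table list as a data error.
        raise ValueError("tables_involved must be non-empty")
    if n <= 2:
        return "easy"
    if n == 3:
        return "medium"
    return "hard"
```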
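---

A sketch of the `compute_gold_answer` behavior exercised in §1, using only the standard-library `sqlite3` module. The result-shaping rules (scalar for a 1x1 result, flat list for one column, list of lists otherwise) are assumptions consistent with the unit-test table; opening the database read-only via a `file:` URI makes a missing file raise `sqlite3.Error` rather than silently creating an empty database:

```python
import sqlite3

def compute_gold_answer(gold_sql, db_path):
    """Execute gold_sql against an SQLite file and shape the result."""
    # mode=ro: a missing or unreadable file raises sqlite3.OperationalError.
    con = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = con.execute(gold_sql).fetchall()
    finally:
        con.close()
    if not rows:
        return []                        # empty result set
    if len(rows) == 1 and len(rows[0]) == 1:
        return rows[0][0]                # single scalar (None for SELECT NULL)
    if all(len(r) == 1 for r in rows):
        return [r[0] for r in rows]      # single column -> flat list
    return [list(r) for r in rows]       # multi-column -> table
```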
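---

The `assign_splits` mapping described in §1 (Spider `train` stays `train`, Spider `validation` becomes `eval`, no questions dropped) is simple enough to sketch directly; the `spider_split` input key is taken from the unit-test table, and copying each dict rather than mutating in place is an assumed design choice:

```python
def assign_splits(questions):
    """Map Spider's split names onto this dataset's train/eval split."""
    mapping = {"train": "train", "validation": "eval"}
    out = []
    for q in questions:
        q = dict(q)  # work on a copy; do not mutate the caller's records
        q["split"] = mapping[q["spider_split"]]
        out.append(q)
    return out
```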