
Verification Specification

Feature: F004
Generated from: specs/F004-VERIFICATION_INPUT.json
Generated: 2026-03-24


1. Unit Tests

EnrichedQuestionRecord (Type)

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_record_all_fields_present | All 9 required fields populated | Full valid record dict | All fields accessible, no missing keys | happy |
| test_record_question_id_format | question_id matches `{db_id}_{split}_{index:03d}` | "concert_singer_train_007" | Passes regex `^[a-z_]+_(train\|eval)_\d{3}$` | happy |
| test_record_question_id_invalid | Rejects malformed question_id | "bad-id" | Validation error or detectable as invalid | error |
| test_record_answer_type_enum | answer_type is one of allowed values | "integer", "float", "string", "list", "table" | Each accepted | happy |
| test_record_answer_type_invalid | Rejects unknown answer_type | "boolean" | Rejected or flagged | error |
| test_record_difficulty_enum | difficulty is one of allowed values | "easy", "medium", "hard" | Each accepted | happy |
| test_record_difficulty_invalid | Rejects unknown difficulty | "extreme" | Rejected or flagged | error |
| test_record_split_enum | split is one of allowed values | "train", "eval" | Each accepted | happy |
| test_record_split_invalid | Rejects unknown split | "test" | Rejected or flagged | error |
| test_record_tables_involved_nonempty | tables_involved has at least one entry | ["students"] | Accepted | happy |
| test_record_tables_involved_empty | Empty tables_involved is rejected | [] | Rejected or flagged by validation | edge |
| test_record_gold_sql_nonempty | gold_sql is a non-empty string | "SELECT COUNT(*) FROM students" | Accepted | happy |
| test_record_gold_sql_empty | Empty gold_sql is rejected | "" | Rejected or flagged | edge |
| test_record_gold_answer_types | gold_answer can hold int, float, str, list, list-of-lists | 42, 3.14, "Alice", [1, 2], [[1, "a"]] | Each stored and retrievable | happy |

Run: pytest tests/test_f004_dataset.py::TestEnrichedQuestionRecord -v
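The question_id format tested above reduces to a single compiled regex; a minimal sketch (the helper name is illustrative):

```python
import re

# Assumed format from the spec: {db_id}_{split}_{index:03d},
# e.g. "concert_singer_train_007".
QUESTION_ID_RE = re.compile(r"^[a-z_]+_(train|eval)_\d{3}$")

def is_valid_question_id(question_id: str) -> bool:
    """Return True if question_id matches the expected format."""
    return QUESTION_ID_RE.fullmatch(question_id) is not None
```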


classify_answer_type

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_classify_integer | Single integer answer | 42 | "integer" | happy |
| test_classify_float | Single float answer | 3.14 | "float" | happy |
| test_classify_string | Single string answer | "Alice" | "string" | happy |
| test_classify_list | Flat list (single column, multiple rows) | [1, 2, 3] | "list" | happy |
| test_classify_table | List of tuples/lists (multi-column result) | [(1, "a"), (2, "b")] | "table" | happy |
| test_classify_none | None/null answer | None | Defined behavior (error or specific type) | edge |
| test_classify_empty_list | Empty list | [] | "list" or defined behavior | edge |
| test_classify_single_row_tuple | Single-element tuple | (42,) | "integer" (unwrapped) or "list" | edge |
| test_classify_nested_single | Single-row multi-column | [(1, "a")] | "table" | edge |
| test_classify_boolean | Boolean answer | True | Defined fallback behavior | edge |

Run: pytest tests/test_f004_dataset.py::TestClassifyAnswerType -v
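A classification function consistent with the cases above might look like the following sketch. Where the spec leaves edge behavior open, the choices here (rejecting booleans, classifying both `[]` and `(42,)` as "list") are assumptions, not requirements:

```python
def classify_answer_type(answer):
    """Map an executed query result onto one of the five answer types."""
    # bool must be checked before int: bool is a subclass of int in Python.
    if isinstance(answer, bool):
        raise ValueError("boolean answers are not supported")  # assumed fallback
    if isinstance(answer, int):
        return "integer"
    if isinstance(answer, float):
        return "float"
    if isinstance(answer, str):
        return "string"
    if isinstance(answer, (list, tuple)):
        # Any nested row makes it a multi-column table; otherwise a flat list.
        if any(isinstance(item, (list, tuple)) for item in answer):
            return "table"
        return "list"
    raise ValueError(f"unclassifiable answer: {answer!r}")
```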


extract_tables_involved

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_extract_single_table | Simple SELECT from one table | "SELECT * FROM students" | ["students"] | happy |
| test_extract_join | JOIN with two tables | "SELECT s.name FROM students s JOIN courses c ON s.id = c.student_id" | ["courses", "students"] (sorted) | happy |
| test_extract_subquery | Subquery referencing different table | "SELECT * FROM students WHERE id IN (SELECT student_id FROM enrollments)" | ["enrollments", "students"] (sorted) | happy |
| test_extract_deduplication | Same table referenced multiple times | "SELECT a.x, b.y FROM t1 a JOIN t1 b ON a.id = b.id" | ["t1"] (deduplicated) | happy |
| test_extract_case_insensitive | Mixed-case SQL keywords | "select * FROM Students" | ["Students"] or ["students"] (consistent) | edge |
| test_extract_with_alias | Table alias should resolve to table name | "SELECT s.name FROM students AS s" | ["students"] | edge |
| test_extract_multiple_joins | Three or more tables joined | "SELECT * FROM a JOIN b ON a.id=b.id JOIN c ON b.id=c.id" | ["a", "b", "c"] (sorted) | happy |
| test_extract_empty_sql | Empty SQL string | "" | [] or error | edge |
| test_extract_no_from | SQL without FROM clause | "SELECT 1+1" | [] | edge |

Run: pytest tests/test_f004_dataset.py::TestExtractTablesInvolved -v
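A regex-based sketch that satisfies the happy-path rows above. It is deliberately naive: quoted identifiers, CTEs, and UNIONs across derived tables (all listed in the edge-cases checklist) would need a real SQL parser:

```python
import re

# Capture the identifier immediately following FROM or JOIN.
_TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)

def extract_tables_involved(sql: str) -> list[str]:
    """Collect table names after FROM/JOIN, lowercased, deduplicated, sorted."""
    return sorted({name.lower() for name in _TABLE_RE.findall(sql)})
```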


classify_difficulty

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_difficulty_easy_1_table | 1 table involved | ["students"] | "easy" | happy |
| test_difficulty_easy_2_tables | 2 tables involved | ["students", "courses"] | "easy" | happy |
| test_difficulty_medium_3_tables | 3 tables involved | ["a", "b", "c"] | "medium" | happy |
| test_difficulty_hard_4_tables | 4 tables involved | ["a", "b", "c", "d"] | "hard" | happy |
| test_difficulty_hard_many_tables | 6+ tables involved | ["a", "b", "c", "d", "e", "f"] | "hard" | happy |
| test_difficulty_empty_tables | 0 tables (edge case) | [] | Defined behavior (error or "easy") | edge |

Run: pytest tests/test_f004_dataset.py::TestClassifyDifficulty -v
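The thresholds implied by the table (1–2 easy, 3 medium, 4+ hard) fit in a few lines; raising on an empty list is one of the two behaviors the edge row allows:

```python
def classify_difficulty(tables_involved: list[str]) -> str:
    """Bucket difficulty by table count: <= 2 easy, 3 medium, >= 4 hard."""
    n = len(tables_involved)
    if n == 0:
        # Assumed choice: fail loudly rather than default to "easy".
        raise ValueError("tables_involved must not be empty")
    if n <= 2:
        return "easy"
    if n == 3:
        return "medium"
    return "hard"
```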


compute_gold_answer

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_compute_valid_select | Valid SELECT on real SQLite DB | "SELECT COUNT(*) FROM singer", valid db_path | Integer result | happy |
| test_compute_multirow | Query returning multiple rows | "SELECT * FROM singer LIMIT 3", valid db_path | List/table result | happy |
| test_compute_invalid_sql | Syntactically invalid SQL | "SELCT * FORM x", valid db_path | Raises sqlite3.Error | error |
| test_compute_missing_table | SQL references non-existent table | "SELECT * FROM nonexistent", valid db_path | Raises sqlite3.Error | error |
| test_compute_missing_db | Database file does not exist | "SELECT 1", /tmp/nonexistent.sqlite | Raises sqlite3.Error or FileNotFoundError | error |
| test_compute_empty_result | Query returns no rows | "SELECT * FROM singer WHERE 1=0", valid db_path | Empty result (e.g., []) | edge |
| test_compute_null_result | Query returning NULL | "SELECT NULL", valid db_path | None or defined null handling | edge |

Run: pytest tests/test_f004_dataset.py::TestComputeGoldAnswer -v
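A sketch of the executor under test. The result shaping (scalar unwrapping, flat list for single columns) and the explicit FileNotFoundError are assumptions; note that `sqlite3.connect` would otherwise silently create an empty database at a nonexistent path:

```python
import sqlite3
from pathlib import Path

def compute_gold_answer(gold_sql: str, db_path: str):
    """Execute gold_sql against a SQLite file and return a shaped result."""
    if not Path(db_path).exists():
        # Fail loudly instead of letting sqlite3 create an empty file.
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(gold_sql).fetchall()  # raises sqlite3.Error on bad SQL
    finally:
        conn.close()
    if not rows:
        return []
    if len(rows) == 1 and len(rows[0]) == 1:
        return rows[0][0]                    # single scalar (NULL becomes None)
    if all(len(row) == 1 for row in rows):
        return [row[0] for row in rows]      # single column -> flat list
    return rows                              # multi-column -> list of tuples
```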


assign_splits

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_assign_train_from_spider_train | Spider train split maps to train | Questions with spider_split="train" | split="train" | happy |
| test_assign_eval_from_spider_validation | Spider validation split maps to eval | Questions with spider_split="validation" | split="eval" | happy |
| test_assign_preserves_all_questions | No questions are dropped | 10 input questions | 10 output questions | happy |
| test_assign_mixed_splits | Mix of train and validation | 7 train + 3 validation | 7 train + 3 eval | happy |
| test_assign_all_train | All questions from train split | 5 train questions | All split="train" | edge |
| test_assign_all_eval | All questions from validation split | 5 validation questions | All split="eval" | edge |

Run: pytest tests/test_f004_dataset.py::TestAssignSplits -v
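The mapping reduces to a dictionary lookup; a sketch assuming each raw question carries a `spider_split` key (the key name is an assumption):

```python
SPLIT_MAP = {"train": "train", "validation": "eval"}

def assign_splits(questions: list[dict]) -> list[dict]:
    """Return a copy of every question with a split field added.

    No questions are dropped or reordered; an unknown spider_split
    raises KeyError rather than being silently passed through.
    """
    return [{**q, "split": SPLIT_MAP[q["spider_split"]]} for q in questions]
```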


download_spider_databases

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_download_creates_files | Download for known db_ids produces SQLite files | ["concert_singer"], temp dir | Dict mapping db_id to valid Path; file exists | happy |
| test_download_skips_existing | Existing database is not re-downloaded | Pre-existing file, same db_id | File unchanged, no download attempt | happy |
| test_download_unknown_db | Unknown db_id | ["nonexistent_db_xyz"] | Raises FileNotFoundError | error |
| test_download_empty_list | Empty db_ids list | [] | Returns empty dict | edge |
| test_download_returns_correct_paths | Paths follow `{output_dir}/{db_id}/{db_id}.sqlite` | ["concert_singer"] | Path matches expected pattern | happy |

Run: pytest tests/test_f004_dataset.py::TestDownloadSpiderDatabases -v
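The path convention and skip-existing behavior tested above can be isolated into two small helpers (names are illustrative; treating a 0-byte file as missing covers the corrupt-file edge case in the checklist):

```python
from pathlib import Path

def database_path(output_dir: str, db_id: str) -> Path:
    """Return the expected location: {output_dir}/{db_id}/{db_id}.sqlite."""
    return Path(output_dir) / db_id / f"{db_id}.sqlite"

def needs_download(output_dir: str, db_id: str) -> bool:
    """True unless a non-empty database file already exists (idempotent re-run)."""
    path = database_path(output_dir, db_id)
    return not (path.exists() and path.stat().st_size > 0)
```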


load_spider_questions

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_load_returns_questions | Valid db_ids produce question dicts | ["concert_singer"] | Non-empty list of dicts with query and db_id fields | happy |
| test_load_multiple_dbs | Multiple db_ids return questions from all | ["concert_singer", "pets_1"] | Questions from both databases present | happy |
| test_load_includes_both_splits | Both train and validation splits loaded | ["concert_singer"] | Questions from both Spider splits present | happy |
| test_load_connection_failure | Network unavailable (mocked) | Any db_ids, no network | Raises ConnectionError | error |
| test_load_empty_list | Empty db_ids list | [] | Returns empty list | edge |

Run: pytest tests/test_f004_dataset.py::TestLoadSpiderQuestions -v


validate_dataset

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_validate_clean_dataset | Valid dataset passes all checks | Well-formed dataset | Empty error list | happy |
| test_validate_missing_field | Record missing required field | Record without gold_sql | Error list includes missing-field message | error |
| test_validate_duplicate_ids | Two records share same question_id | Duplicate "concert_singer_train_001" | Error list includes duplicate message | error |
| test_validate_gold_sql_fails | gold_sql that does not execute | Record with broken SQL | Error list includes SQL-execution message | error |
| test_validate_gold_answer_mismatch | gold_answer does not match re-execution | Record with wrong gold_answer | Error list includes mismatch message | error |
| test_validate_difficulty_distribution | Distribution check ~40/40/20 | Dataset with extreme skew (100% easy) | Warning or error about distribution | edge |
| test_validate_clean_splits | No question appears in both splits | Dataset with clean splits | No split errors | happy |
| test_validate_cross_split_leak | Same question in train and eval | Duplicate across splits | Error detected | error |

Run: pytest tests/test_f004_dataset.py::TestValidateDataset -v
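The error-list contract (accumulate strings, never raise) matters for validate-only mode. Below is a sketch covering only the missing-field and duplicate-id checks; the nine field names are assumptions pieced together from this spec, and the SQL re-execution, distribution, and cross-split checks are omitted:

```python
# Assumed names for the 9 required fields; the spec does not list them in one place.
REQUIRED_FIELDS = {
    "question_id", "db_id", "question_text", "gold_sql", "gold_answer",
    "answer_type", "difficulty", "split", "tables_involved",
}

def validate_dataset(records: list[dict]) -> list[str]:
    """Return a list of human-readable error strings (empty when valid)."""
    errors = []
    seen_ids = set()
    for record in records:
        for field in sorted(REQUIRED_FIELDS - record.keys()):
            errors.append(f"{record.get('question_id', '?')}: missing field {field}")
        qid = record.get("question_id")
        if qid in seen_ids:
            errors.append(f"duplicate question_id: {qid}")
        seen_ids.add(qid)
    return errors
```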


2. Integration Tests

Flow: Primary Curation Pipeline

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Read db_list.json for target database IDs | Returns list of 10 db_ids | len(db_ids) >= 10 |
| 2 | Download Spider SQLite databases | All db_ids have corresponding .sqlite files | All paths exist, files > 0 bytes |
| 3 | Load raw Spider questions from HuggingFace | Questions loaded for all target db_ids | len(questions) > 0; each has db_id and query |
| 4 | Compute gold_answer for each question | gold_answer populated, failed queries skipped | No None gold_answers in output; skip count logged |
| 5 | Classify answer_type for each question | answer_type is one of 5 valid values | All values in {"integer", "float", "string", "list", "table"} |
| 6 | Extract tables_involved from gold_sql | Each question has non-empty tables_involved | All lists non-empty; all entries are strings |
| 7 | Classify difficulty from tables_involved | difficulty is one of 3 valid values | All values in {"easy", "medium", "hard"} |
| 8 | Assign train/eval split | Each question has valid split | All values in {"train", "eval"} |
| 9 | Generate question_id | All IDs unique and match format | Regex match; len(set(ids)) == len(ids) |
| 10 | Validate full dataset | validate_dataset returns empty error list | len(errors) == 0 |
| 11 | Write output JSON files | questions_train.json and questions_eval.json exist | Files parseable as JSON; combined count >= 100 |

Run: pytest tests/test_f004_integration.py::TestCurationPipeline -v


Flow: Gold SQL Execution Failure (Alternative)

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Provide question with broken gold_sql | gold_sql raises sqlite3.Error | Exception caught, not propagated |
| 2 | Check warning logged | Log contains db_id and error details | Log output includes warning |
| 3 | Question excluded from final dataset | Output does not contain the broken question | question_id absent from output |
| 4 | Remaining questions processed | Pipeline continues without interruption | Other questions have valid gold_answer |

Run: pytest tests/test_f004_integration.py::TestGoldSqlFailure -v


Flow: Validate-Only Mode (Alternative)

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Invoke script with --validate flag | Does not download or regenerate data | No network calls made |
| 2 | Load existing questions_train.json and questions_eval.json | Files read successfully | No FileNotFoundError |
| 3 | Locate SQLite databases in data/databases/ | All referenced databases found | All db_paths valid |
| 4 | Run validate_dataset() | Returns list of errors (may be empty) | Return type is list[str] |
| 5 | Exit code reflects validation result | 0 if valid, 1 if invalid | Process exit code matches |

Run: pytest tests/test_f004_integration.py::TestValidateOnlyMode -v


3. API Tests

No API endpoints defined for F004. This feature is a standalone curation script.


4. E2E Tests

Scenario: Full Dataset Generation

Setup: Clean environment with no existing output files. Network access available. data/questions/db_list.json contains 10 target database IDs.

Actions:

  1. Run the curation script end-to-end (no flags)
  2. Wait for completion

Expected:

  • data/questions/questions_train.json exists and contains valid JSON array
  • data/questions/questions_eval.json exists and contains valid JSON array
  • Combined question count >= 100
  • All questions pass validate_dataset() with zero errors
  • At least 8 distinct database_name values represented
  • Train/eval split approximately 70/30 (+/- 10%)
  • All three difficulty levels present
  • All five answer_type values present (or at least 3)

Run: python scripts/curate_dataset.py && python scripts/curate_dataset.py --validate


Scenario: Validate-Only on Pre-Generated Data

Setup: Output JSON files already exist from a prior run. SQLite databases present in data/databases/.

Actions:

  1. Run the curation script with --validate flag only

Expected:

  • No new files created or modified
  • Validation output printed to stdout
  • Exit code 0 if data is valid

Run: python scripts/curate_dataset.py --validate


Scenario: Idempotent Re-Run

Setup: Output JSON files already exist from a prior run.

Actions:

  1. Run the curation script again (full mode)
  2. Compare output files

Expected:

  • Output files are regenerated
  • Same question count (deterministic for same input)
  • Database files not re-downloaded (skip existing)

Run: python scripts/curate_dataset.py


5. Edge Cases Checklist

  • Null/None gold_answer values handled gracefully
  • Empty string gold_sql skipped or rejected
  • SQL with unicode characters in table/column names
  • Very large query results (1000+ rows) handled by compute_gold_answer
  • Database file that exists but is corrupt (0 bytes or invalid SQLite)
  • db_list.json missing or empty
  • db_list.json with duplicate db_ids
  • Question with gold_sql containing multiple statements (semicolons)
  • Question where gold_sql returns different results on re-execution (non-deterministic)
  • Tables_involved extraction with SQL using CTEs (WITH clause)
  • Tables_involved extraction with SQL using UNION across different tables
  • Extremely long gold_sql (> 1000 chars)
  • Database with no tables (empty schema)
  • Retry behavior on transient network failure during HuggingFace download
  • Retry behavior on transient network failure during database download
  • Concurrent access to same SQLite file (if parallelized)
  • Output JSON file encoding (UTF-8) with special characters in question_text

6. Evidence Requirements

| Category | Evidence Type | Example |
| --- | --- | --- |
| Unit tests | pytest output | X passed, Y skipped |
| Integration | pytest output | X passed |
| E2E | Script output + file inspection | Generated 105 questions; validation passed |
| Output files | JSON structure inspection | `jq length questions_train.json` returns count |
| Skip handling | Log output | Warning messages for skipped questions |
| Validation | Exit code | `echo $?` returns 0 |