Verification Specification
Feature: F004
Generated from: specs/F004-VERIFICATION_INPUT.json
Generated: 2026-03-24
1. Unit Tests
EnrichedQuestionRecord (Type)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_record_all_fields_present | All 9 required fields populated | Full valid record dict | All fields accessible, no missing keys | happy |
| test_record_question_id_format | question_id matches {db_id}_{split}_{index:03d} | "concert_singer_train_007" | Passes regex `^[a-z_]+_(train\|eval)_\d{3}$` | happy |
| test_record_question_id_invalid | Rejects malformed question_id | "bad-id" | Validation error or detectable as invalid | error |
| test_record_answer_type_enum | answer_type is one of allowed values | "integer", "float", "string", "list", "table" | Each accepted | happy |
| test_record_answer_type_invalid | Rejects unknown answer_type | "boolean" | Rejected or flagged | error |
| test_record_difficulty_enum | difficulty is one of allowed values | "easy", "medium", "hard" | Each accepted | happy |
| test_record_difficulty_invalid | Rejects unknown difficulty | "extreme" | Rejected or flagged | error |
| test_record_split_enum | split is one of allowed values | "train", "eval" | Each accepted | happy |
| test_record_split_invalid | Rejects unknown split | "test" | Rejected or flagged | error |
| test_record_tables_involved_nonempty | tables_involved has at least one entry | ["students"] | Accepted | happy |
| test_record_tables_involved_empty | Empty tables_involved is rejected | [] | Rejected or flagged by validation | edge |
| test_record_gold_sql_nonempty | gold_sql is a non-empty string | "SELECT COUNT(*) FROM students" | Accepted | happy |
| test_record_gold_sql_empty | Empty gold_sql is rejected | "" | Rejected or flagged | edge |
| test_record_gold_answer_types | gold_answer can hold int, float, str, list, list-of-lists | 42, 3.14, "Alice", [1,2], [[1,"a"]] | Each stored and retrievable | happy |
Run: pytest tests/test_f004_dataset.py::TestEnrichedQuestionRecord -v
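The nine required fields are not enumerated in this section; the field set below is a plausible reconstruction from the tests and from names used elsewhere in this spec (question_text, database_name). A minimal sketch of the record type and the validation rules the table exercises:

```python
import re
from dataclasses import dataclass, field
from typing import Any

QUESTION_ID_RE = re.compile(r"^[a-z_]+_(train|eval)_\d{3}$")
ANSWER_TYPES = {"integer", "float", "string", "list", "table"}
DIFFICULTIES = {"easy", "medium", "hard"}
SPLITS = {"train", "eval"}

@dataclass
class EnrichedQuestionRecord:
    # Field names are illustrative; the canonical set comes from
    # specs/F004-VERIFICATION_INPUT.json.
    question_id: str
    database_name: str
    question_text: str
    gold_sql: str
    gold_answer: Any
    answer_type: str
    difficulty: str
    split: str
    tables_involved: list[str] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of validation errors (empty if the record is valid)."""
        errors = []
        if not QUESTION_ID_RE.match(self.question_id):
            errors.append(f"bad question_id: {self.question_id!r}")
        if self.answer_type not in ANSWER_TYPES:
            errors.append(f"bad answer_type: {self.answer_type!r}")
        if self.difficulty not in DIFFICULTIES:
            errors.append(f"bad difficulty: {self.difficulty!r}")
        if self.split not in SPLITS:
            errors.append(f"bad split: {self.split!r}")
        if not self.tables_involved:
            errors.append("tables_involved is empty")
        if not self.gold_sql:
            errors.append("gold_sql is empty")
        return errors
```

A record built from the happy-path inputs above should validate cleanly, while the malformed inputs each produce one error string.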
classify_answer_type
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_classify_integer | Single integer answer | 42 | "integer" | happy |
| test_classify_float | Single float answer | 3.14 | "float" | happy |
| test_classify_string | Single string answer | "Alice" | "string" | happy |
| test_classify_list | Flat list (single column, multiple rows) | [1, 2, 3] | "list" | happy |
| test_classify_table | List of tuples/lists (multi-column result) | [(1, "a"), (2, "b")] | "table" | happy |
| test_classify_none | None/null answer | None | Defined behavior (error or specific type) | edge |
| test_classify_empty_list | Empty list | [] | "list" or defined behavior | edge |
| test_classify_single_row_tuple | Single-element tuple | (42,) | "integer" (unwrapped) or "list" | edge |
| test_classify_nested_single | Single-row multi-column | [(1, "a")] | "table" | edge |
| test_classify_boolean | Boolean answer | True | Defined fallback behavior | edge |
Run: pytest tests/test_f004_dataset.py::TestClassifyAnswerType -v
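One way the table's cases could be satisfied is sketched below. The edge-case policy (raising on None and booleans, classifying a single-element tuple as "list") is an assumption; the spec only requires that the behavior be defined:

```python
def classify_answer_type(answer):
    """Map a query result to one of the five answer_type values.

    Policy for None, booleans, and single-element tuples is an
    assumption here, not mandated by the spec.
    """
    if answer is None:
        raise ValueError("cannot classify a null answer")
    if isinstance(answer, bool):          # check before int: bool subclasses int
        raise ValueError("boolean answers are not supported")
    if isinstance(answer, int):
        return "integer"
    if isinstance(answer, float):
        return "float"
    if isinstance(answer, str):
        return "string"
    if isinstance(answer, (list, tuple)):
        if any(isinstance(item, (list, tuple)) for item in answer):
            return "table"                # nested rows => multi-column result
        return "list"                     # flat sequence => single-column result
    raise TypeError(f"unclassifiable answer: {type(answer).__name__}")
```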
extract_tables_involved
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_extract_single_table | Simple SELECT from one table | "SELECT * FROM students" | ["students"] | happy |
| test_extract_join | JOIN with two tables | "SELECT s.name FROM students s JOIN courses c ON s.id = c.student_id" | ["courses", "students"] (sorted) | happy |
| test_extract_subquery | Subquery referencing different table | "SELECT * FROM students WHERE id IN (SELECT student_id FROM enrollments)" | ["enrollments", "students"] (sorted) | happy |
| test_extract_deduplication | Same table referenced multiple times | "SELECT a.x, b.y FROM t1 a JOIN t1 b ON a.id = b.id" | ["t1"] (deduplicated) | happy |
| test_extract_case_insensitive | Mixed-case SQL keywords | "select * FROM Students" | ["Students"] or ["students"] (consistent) | edge |
| test_extract_with_alias | Table alias should resolve to table name | "SELECT s.name FROM students AS s" | ["students"] | edge |
| test_extract_multiple_joins | Three or more tables joined | "SELECT * FROM a JOIN b ON a.id=b.id JOIN c ON b.id=c.id" | ["a", "b", "c"] (sorted) | happy |
| test_extract_empty_sql | Empty SQL string | "" | [] or error | edge |
| test_extract_no_from | SQL without FROM clause | "SELECT 1+1" | [] | edge |
Run: pytest tests/test_f004_dataset.py::TestExtractTablesInvolved -v
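A regex-based sketch that covers the tabulated cases is shown below. It is deliberately minimal: it would miss CTEs, quoted identifiers, and comma-separated FROM lists, for which a real implementation would use a SQL parser:

```python
import re

def extract_tables_involved(sql: str) -> list[str]:
    """Extract table names following FROM/JOIN, sorted and deduplicated.

    Lowercasing is one way to satisfy the "consistent casing" edge case.
    """
    names = re.findall(
        r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", sql, re.IGNORECASE
    )
    return sorted({name.lower() for name in names})
```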
classify_difficulty
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_difficulty_easy_1_table | 1 table involved | ["students"] | "easy" | happy |
| test_difficulty_easy_2_tables | 2 tables involved | ["students", "courses"] | "easy" | happy |
| test_difficulty_medium_3_tables | 3 tables involved | ["a", "b", "c"] | "medium" | happy |
| test_difficulty_hard_4_tables | 4 tables involved | ["a", "b", "c", "d"] | "hard" | happy |
| test_difficulty_hard_many_tables | 6+ tables involved | ["a", "b", "c", "d", "e", "f"] | "hard" | happy |
| test_difficulty_empty_tables | 0 tables (edge case) | [] | Defined behavior (error or "easy") | edge |
Run: pytest tests/test_f004_dataset.py::TestClassifyDifficulty -v
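The thresholds above pin down the classifier almost completely; only the empty-list behavior is a choice. Raising is the option assumed here:

```python
def classify_difficulty(tables_involved: list[str]) -> str:
    """Bucket difficulty by join breadth: 1-2 tables easy, 3 medium, 4+ hard.

    Raising on an empty list is one of the two behaviors the spec permits.
    """
    n = len(tables_involved)
    if n == 0:
        raise ValueError("tables_involved must be non-empty")
    if n <= 2:
        return "easy"
    if n == 3:
        return "medium"
    return "hard"
```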
compute_gold_answer
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_compute_valid_select | Valid SELECT on real SQLite DB | "SELECT COUNT(*) FROM singer", valid db_path | Integer result | happy |
| test_compute_multirow | Query returning multiple rows | "SELECT * FROM singer LIMIT 3", valid db_path | List/table result | happy |
| test_compute_invalid_sql | Syntactically invalid SQL | "SELCT * FORM x", valid db_path | Raises sqlite3.Error | error |
| test_compute_missing_table | SQL references non-existent table | "SELECT * FROM nonexistent", valid db_path | Raises sqlite3.Error | error |
| test_compute_missing_db | Database file does not exist | "SELECT 1", /tmp/nonexistent.sqlite | Raises sqlite3.Error or FileNotFoundError | error |
| test_compute_empty_result | Query returns no rows | "SELECT * FROM singer WHERE 1=0", valid db_path | Empty result (e.g., []) | edge |
| test_compute_null_result | Query returning NULL | "SELECT NULL", valid db_path | None or defined null handling | edge |
Run: pytest tests/test_f004_dataset.py::TestComputeGoldAnswer -v
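A sketch consistent with these cases follows. The result-unwrapping policy (scalar for a 1x1 result, flat list for a single column, list of row tuples otherwise) is an assumption chosen to line up with the answer_type taxonomy above:

```python
import sqlite3
from pathlib import Path

def compute_gold_answer(gold_sql: str, db_path: str):
    """Execute gold_sql against a SQLite file and unwrap the result.

    Invalid SQL or a missing table raises sqlite3.Error from execute();
    a missing file raises FileNotFoundError explicitly, since sqlite3
    would otherwise create an empty database.
    """
    if not Path(db_path).exists():
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(gold_sql).fetchall()
    finally:
        conn.close()
    if not rows:
        return []
    if len(rows) == 1 and len(rows[0]) == 1:
        return rows[0][0]                   # 1x1 result -> scalar
    if all(len(row) == 1 for row in rows):
        return [row[0] for row in rows]     # single column -> flat list
    return rows                             # multi-column -> table (list of tuples)
```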
assign_splits
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_assign_train_from_spider_train | Spider train split maps to train | Questions with spider_split="train" | split="train" | happy |
| test_assign_eval_from_spider_validation | Spider validation split maps to eval | Questions with spider_split="validation" | split="eval" | happy |
| test_assign_preserves_all_questions | No questions are dropped | 10 input questions | 10 output questions | happy |
| test_assign_mixed_splits | Mix of train and validation | 7 train + 3 validation | 7 train + 3 eval | happy |
| test_assign_all_train | All questions from train split | 5 train questions | All split="train" | edge |
| test_assign_all_eval | All questions from validation split | 5 validation questions | All split="eval" | edge |
Run: pytest tests/test_f004_dataset.py::TestAssignSplits -v
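The mapping these tests describe is a pure rename, which a small sketch makes concrete (the spider_split field name is taken from the test inputs above):

```python
SPIDER_SPLIT_MAP = {"train": "train", "validation": "eval"}

def assign_splits(questions: list[dict]) -> list[dict]:
    """Map Spider's split names onto this dataset's train/eval vocabulary.

    Every question is preserved; only the split field is added/rewritten.
    """
    return [
        {**q, "split": SPIDER_SPLIT_MAP[q["spider_split"]]}
        for q in questions
    ]
```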
download_spider_databases
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_download_creates_files | Download for known db_ids produces SQLite files | ["concert_singer"], temp dir | Dict mapping db_id to valid Path, file exists | happy |
| test_download_skips_existing | Existing database is not re-downloaded | Pre-existing file, same db_id | File unchanged, no download attempt | happy |
| test_download_unknown_db | Unknown db_id | ["nonexistent_db_xyz"] | Raises FileNotFoundError | error |
| test_download_empty_list | Empty db_ids list | [] | Returns empty dict | edge |
| test_download_returns_correct_paths | Paths follow {output_dir}/{db_id}/{db_id}.sqlite | ["concert_singer"] | Path matches expected pattern | happy |
Run: pytest tests/test_f004_dataset.py::TestDownloadSpiderDatabases -v
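The path layout and skip-existing behavior can be sketched independently of the transport. Here `fetch` is a hypothetical injected callback (the actual download mechanism is not specified in this document):

```python
from pathlib import Path
from typing import Callable

def download_spider_databases(
    db_ids: list[str],
    output_dir: Path,
    fetch: Callable[[str, Path], None],
) -> dict[str, Path]:
    """Resolve each db_id to {output_dir}/{db_id}/{db_id}.sqlite,
    downloading via `fetch` only when the file is missing.

    Raises FileNotFoundError when no database can be produced for a db_id.
    """
    paths: dict[str, Path] = {}
    for db_id in db_ids:
        target = output_dir / db_id / f"{db_id}.sqlite"
        if not target.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            fetch(db_id, target)
        if not target.exists():
            raise FileNotFoundError(f"no database available for {db_id!r}")
        paths[db_id] = target
    return paths
```

Injecting `fetch` keeps the skip-existing logic unit-testable without network access, which matches how the table's tests are framed.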
load_spider_questions
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_load_returns_questions | Valid db_ids produce question dicts | ["concert_singer"] | Non-empty list of dicts with query and db_id fields | happy |
| test_load_multiple_dbs | Multiple db_ids returns questions from all | ["concert_singer", "pets_1"] | Questions from both databases present | happy |
| test_load_includes_both_splits | Both train and validation splits loaded | ["concert_singer"] | Questions with both spider splits present | happy |
| test_load_connection_failure | Network unavailable (mocked) | Any db_ids, no network | Raises ConnectionError |error |
| test_load_empty_list | Empty db_ids list | [] | Returns empty list | edge |
Run: pytest tests/test_f004_dataset.py::TestLoadSpiderQuestions -v
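A sketch of the filtering logic follows. The split loader is injected so the logic is testable without the network; in the real script it would presumably wrap the HuggingFace dataset download, and the row field names (db_id, query) are taken from the table above:

```python
from typing import Callable, Iterable

def load_spider_questions(
    db_ids: list[str],
    load_split: Callable[[str], Iterable[dict]],
) -> list[dict]:
    """Collect Spider questions for the target databases from both splits.

    `load_split` is a hypothetical loader returning row dicts for a
    Spider split name; each kept row is tagged with its spider_split.
    """
    wanted = set(db_ids)
    questions = []
    for spider_split in ("train", "validation"):
        for row in load_split(spider_split):
            if row["db_id"] in wanted:
                questions.append({**row, "spider_split": spider_split})
    return questions
```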
validate_dataset
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_validate_clean_dataset | Valid dataset passes all checks | Well-formed dataset | Empty error list | happy |
| test_validate_missing_field | Record missing required field | Record without gold_sql | Error list includes missing-field message | error |
| test_validate_duplicate_ids | Two records share same question_id | Duplicate "concert_singer_train_001" | Error list includes duplicate message | error |
| test_validate_gold_sql_fails | gold_sql that does not execute | Record with broken SQL | Error list includes SQL execution message | error |
| test_validate_gold_answer_mismatch | gold_answer does not match re-execution | Record with wrong gold_answer | Error list includes mismatch message | error |
| test_validate_difficulty_distribution | Distribution check ~40/40/20 | Dataset with extreme skew (100% easy) | Warning or error about distribution | edge |
| test_validate_clean_splits | No question appears in both splits | Dataset with clean splits | No split errors | happy |
| test_validate_cross_split_leak | Same question in train and eval | Duplicate across splits | Error detected | error |
Run: pytest tests/test_f004_dataset.py::TestValidateDataset -v
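The validator accumulates error strings rather than failing fast, so one pass can report every problem. A partial sketch covering the structural checks (the SQL re-execution, distribution, and cross-split-leak checks would be added in the same accumulate-and-return style):

```python
from collections import Counter

REQUIRED_FIELDS = ("question_id", "gold_sql", "gold_answer", "answer_type",
                   "difficulty", "split", "tables_involved")

def validate_dataset(records: list[dict]) -> list[str]:
    """Return human-readable error strings; an empty list means valid."""
    errors: list[str] = []
    for rec in records:
        for field in REQUIRED_FIELDS:
            if field not in rec:
                qid = rec.get("question_id", "?")
                errors.append(f"{qid}: missing field {field}")
    counts = Counter(rec.get("question_id") for rec in records)
    for qid, n in counts.items():
        if n > 1:
            errors.append(f"duplicate question_id: {qid}")
    return errors
```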
2. Integration Tests
Flow: Primary Curation Pipeline
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Read db_list.json for target database IDs | Returns list of 10 db_ids | len(db_ids) >= 10 |
| 2 | Download Spider SQLite databases | All db_ids have corresponding .sqlite files | All paths exist, files > 0 bytes |
| 3 | Load raw Spider questions from HuggingFace | Questions loaded for all target db_ids | len(questions) > 0, each has db_id and query |
| 4 | Compute gold_answer for each question | gold_answer populated, failed queries skipped | No None gold_answers in output; skip count logged |
| 5 | Classify answer_type for each question | answer_type is one of 5 valid values | All values in {"integer","float","string","list","table"} |
| 6 | Extract tables_involved from gold_sql | Each question has non-empty tables_involved | All lists non-empty, all entries are strings |
| 7 | Classify difficulty from tables_involved | difficulty is one of 3 valid values | All values in {"easy","medium","hard"} |
| 8 | Assign train/eval split | Each question has valid split | All values in {"train","eval"} |
| 9 | Generate question_id | All IDs unique and match format | Regex match, len(set(ids)) == len(ids) |
| 10 | Validate full dataset | validate_dataset returns empty error list | len(errors) == 0 |
| 11 | Write output JSON files | questions_train.json and questions_eval.json exist | Files parseable as JSON, combined count >= 100 |
Run: pytest tests/test_f004_integration.py::TestCurationPipeline -v
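Step 9's ID generation and uniqueness check are small enough to pin down with a sketch (a minimal illustration; function names are assumptions):

```python
import re

QUESTION_ID_RE = re.compile(r"^[a-z_]+_(train|eval)_\d{3}$")

def generate_question_id(db_id: str, split: str, index: int) -> str:
    """Build IDs in the {db_id}_{split}_{index:03d} format from step 9."""
    return f"{db_id}_{split}_{index:03d}"

def all_ids_unique(ids: list[str]) -> bool:
    """Step 9 verification: len(set(ids)) == len(ids)."""
    return len(set(ids)) == len(ids)
```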
Flow: Gold SQL Execution Failure (Alternative)
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Provide question with broken gold_sql | gold_sql raises sqlite3.Error | Exception caught, not propagated |
| 2 | Check warning logged | Log contains db_id and error details | Log output includes warning |
| 3 | Question excluded from final dataset | Output does not contain the broken question | question_id absent from output |
| 4 | Remaining questions processed | Pipeline continues without interruption | Other questions have valid gold_answer |
Run: pytest tests/test_f004_integration.py::TestGoldSqlFailure -v
Flow: Validate-Only Mode (Alternative)
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Invoke script with --validate flag | Does not download or regenerate data | No network calls made |
| 2 | Load existing questions_train.json and questions_eval.json | Files read successfully | No FileNotFoundError |
| 3 | Locate SQLite databases in data/databases/ | All referenced databases found | All db_paths valid |
| 4 | Run validate_dataset() | Returns list of errors (may be empty) | Return type is list[str] |
| 5 | Exit code reflects validation result | 0 if valid, 1 if invalid | Process exit code matches |
Run: pytest tests/test_f004_integration.py::TestValidateOnlyMode -v
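The exit-code contract in steps 4-5 can be sketched as follows. The directory layout and file names come from this spec; the injected `validate` callable stands in for validate_dataset so the path is testable in isolation:

```python
import json
from pathlib import Path
from typing import Callable

def run_validate_only(
    questions_dir: Path,
    validate: Callable[[list], list[str]],
) -> int:
    """--validate code path: load existing outputs, run the validator,
    print any errors, and return the process exit code (0 valid, 1 not).
    """
    records: list = []
    for name in ("questions_train.json", "questions_eval.json"):
        records.extend(json.loads((questions_dir / name).read_text()))
    errors = validate(records)
    for err in errors:
        print(err)
    return 1 if errors else 0
```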
3. API Tests
No API endpoints defined for F004. This feature is a standalone curation script.
4. E2E Tests
Scenario: Full Dataset Generation
Setup: Clean environment with no existing output files. Network access available. data/questions/db_list.json contains 10 target database IDs.
Actions:
- Run the curation script end-to-end (no flags)
- Wait for completion
Expected:
- data/questions/questions_train.json exists and contains a valid JSON array
- data/questions/questions_eval.json exists and contains a valid JSON array
- Combined question count >= 100
- All questions pass validate_dataset() with zero errors
- At least 8 distinct database_name values represented
- Train/eval split approximately 70/30 (+/- 10%)
- All three difficulty levels present
- All five answer_type values present (or at least 3)
Run: python scripts/curate_dataset.py && python scripts/curate_dataset.py --validate
Scenario: Validate-Only on Pre-Generated Data
Setup: Output JSON files already exist from a prior run. SQLite databases present in data/databases/.
Actions:
- Run the curation script with the --validate flag only
Expected:
- No new files created or modified
- Validation output printed to stdout
- Exit code 0 if data is valid
Run: python scripts/curate_dataset.py --validate
Scenario: Idempotent Re-Run
Setup: Output JSON files already exist from a prior run.
Actions:
- Run the curation script again (full mode)
- Compare output files
Expected:
- Output files are regenerated
- Same question count (deterministic for same input)
- Database files not re-downloaded (skip existing)
Run: python scripts/curate_dataset.py
5. Edge Cases Checklist
- Null/None gold_answer values handled gracefully
- Empty string gold_sql skipped or rejected
- SQL with unicode characters in table/column names
- Very large query results (1000+ rows) handled by compute_gold_answer
- Database file that exists but is corrupt (0 bytes or invalid SQLite)
- db_list.json missing or empty
- db_list.json with duplicate db_ids
- Question with gold_sql containing multiple statements (semicolons)
- Question where gold_sql returns different results on re-execution (non-deterministic)
- Tables_involved extraction with SQL using CTEs (WITH clause)
- Tables_involved extraction with SQL using UNION across different tables
- Extremely long gold_sql (> 1000 chars)
- Database with no tables (empty schema)
- Retry behavior on transient network failure during HuggingFace download
- Retry behavior on transient network failure during database download
- Concurrent access to same SQLite file (if parallelized)
- Output JSON file encoding (UTF-8) with special characters in question_text
6. Evidence Requirements
| Category | Evidence Type | Example |
|---|---|---|
| Unit tests | pytest output | X passed, Y skipped |
| Integration | pytest output | X passed |
| E2E | Script output + file inspection | Generated 105 questions, Validation passed |
| Output files | JSON structure inspection | jq length questions_train.json returns count |
| Skip handling | Log output | Warning messages for skipped questions |
| Validation | Exit code | echo $? returns 0 |