Verification Specification
Feature: F001 Generated from: specs/F001-VERIFICATION_INPUT.json Generated: 2026-03-24
1. Unit Tests
1.1 SQLAction Type
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_sqlaction_valid_describe | Create action with DESCRIBE type | SQLAction(action_type="DESCRIBE", argument="employees") |
Fields set correctly | happy |
| test_sqlaction_valid_sample | Create action with SAMPLE type | SQLAction(action_type="SAMPLE", argument="orders") |
Fields set correctly | happy |
| test_sqlaction_valid_query | Create action with QUERY type | SQLAction(action_type="QUERY", argument="SELECT 1") |
Fields set correctly | happy |
| test_sqlaction_valid_answer | Create action with ANSWER type | SQLAction(action_type="ANSWER", argument="42") |
Fields set correctly | happy |
| test_sqlaction_empty_argument | Argument is empty string | SQLAction(action_type="QUERY", argument="") |
Accepted at type level (validation at step) | edge |
| test_sqlaction_whitespace_argument | Argument is only whitespace | SQLAction(action_type="QUERY", argument=" ") |
Accepted at type level | edge |
| test_sqlaction_serialization | Round-trip JSON serialization | Create, serialize, deserialize | Identical fields | happy |
1.2 SQLObservation Type
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_obs_all_fields_present | All required fields populated | Full observation construction | All fields accessible with correct types | happy |
| test_obs_done_false_initial | Initial observation has done=False | After reset | done == False |
happy |
| test_obs_reward_none_nonterminal | Non-terminal obs has reward=None | After non-ANSWER step | reward is None |
happy |
| test_obs_reward_set_terminal | Terminal obs has numeric reward | After ANSWER step | reward in {0.0, 1.0} |
happy |
| test_obs_step_count_type | step_count is int | Any observation | isinstance(step_count, int) |
happy |
| test_obs_budget_remaining_type | budget_remaining is int | Any observation | isinstance(budget_remaining, int) |
happy |
| test_obs_action_history_list | action_history is a list of strings | After several steps | isinstance(action_history, list) and all elements are str |
happy |
| test_obs_error_empty_on_success | error is empty string on success | After successful action | error == "" |
happy |
| test_obs_schema_info_nonempty | schema_info is non-empty after reset | After reset | len(schema_info) > 0 |
happy |
1.3 QuestionRecord Type
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_qr_all_fields | All fields populated from valid JSON | Parsed question JSON | All 8 fields present and correct types | happy |
| test_qr_difficulty_values | difficulty in allowed set | Various records | difficulty in {"easy", "medium", "hard"} |
happy |
| test_qr_answer_type_values | answer_type in allowed set | Various records | answer_type in {"integer", "float", "string", "list"} |
happy |
| test_qr_tables_involved_nonempty | tables_involved has at least one entry | Valid question | len(tables_involved) >= 1 |
happy |
1.4 EpisodeContext Type
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_ctx_initial_step_count | step_count starts at 0 | After reset | step_count == 0 |
happy |
| test_ctx_initial_budget | budget starts at configured value (default 15) | After reset | budget == 15 |
happy |
| test_ctx_described_tables_empty | described_tables starts empty | After reset | len(described_tables) == 0 |
happy |
| test_ctx_action_log_empty | action_log starts empty | After reset | len(action_log) == 0 |
happy |
| test_ctx_done_false_initial | done starts False | After reset | done == False |
happy |
| test_ctx_gold_answer_computed | gold_answer is set after reset | After reset | gold_answer is not None |
happy |
1.5 SQLEnvironment.init
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_init_valid | Init with valid paths loads questions | Valid questions_path, db_dir | len(questions) > 0 |
happy |
| test_init_missing_questions_file | questions_path does not exist | "/nonexistent/q.json" |
FileNotFoundError |
error |
| test_init_missing_db_dir | db_dir does not exist | "/nonexistent/dbs/" |
FileNotFoundError or ValueError |
error |
| test_init_invalid_json | questions file is not valid JSON | Path to file with "{bad" |
ValueError |
error |
| test_init_empty_questions | questions file is empty array [] |
Path to file with "[]" |
ValueError (no questions) |
error |
| test_init_custom_budget | Custom step_budget is stored | step_budget=10 |
Environment uses 10-step budget | happy |
| test_init_default_budget | Default step_budget is 15 | Omit step_budget | Budget is 15 | happy |
1.6 SQLEnvironment.reset
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_reset_returns_observation | Returns SQLObservation | reset() |
isinstance(result, SQLObservation) |
happy |
| test_reset_obs_has_question | Observation contains question text | reset() |
len(obs.question) > 0 |
happy |
| test_reset_obs_has_table_names | schema_info contains table names | reset() |
schema_info mentions at least one table | happy |
| test_reset_obs_no_columns | schema_info does NOT reveal columns initially | reset() |
No column names in schema_info | happy |
| test_reset_obs_done_false | done is False | reset() |
obs.done == False |
happy |
| test_reset_obs_reward_none | reward is None | reset() |
obs.reward is None |
happy |
| test_reset_obs_step_count_zero | step_count is 0 | reset() |
obs.step_count == 0 |
happy |
| test_reset_obs_budget_full | budget_remaining equals configured budget | reset() |
obs.budget_remaining == 15 |
happy |
| test_reset_obs_empty_history | action_history is empty | reset() |
obs.action_history == [] |
happy |
| test_reset_obs_empty_error | error is empty string | reset() |
obs.error == "" |
happy |
| test_reset_seed_determinism | Same seed yields same question | reset(seed=42) twice |
Same question both times | happy |
| test_reset_different_seeds | Different seeds can yield different questions | reset(seed=1) vs reset(seed=999) |
Not necessarily same question (probabilistic) | happy |
| test_reset_no_seed_random | Without seed, question is randomly selected | reset() multiple times |
At least one different question (probabilistic) | happy |
| test_reset_episode_id_set | Custom episode_id is reflected | reset(episode_id="ep-123") |
Context has episode_id="ep-123" | happy |
| test_reset_missing_db_file | Question references nonexistent DB | Question with bad db_name | FileNotFoundError |
error |
| test_reset_clears_previous_episode | Calling reset mid-episode starts fresh | reset, step, reset | Second reset has step_count=0, empty history | happy |
1.7 SQLEnvironment.step
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_step_returns_observation | Returns SQLObservation | Any valid action | isinstance(result, SQLObservation) |
happy |
| test_step_never_raises | step never raises exceptions | Various invalid inputs | Returns obs with error field, no exception | happy |
| test_step_increments_step_count | step_count increases by 1 | DESCRIBE action | obs.step_count == 1 |
happy |
| test_step_decrements_budget | budget_remaining decreases by 1 | DESCRIBE/SAMPLE/QUERY | obs.budget_remaining == 14 |
happy |
| test_step_answer_no_budget_decrement | ANSWER does not decrement budget | ANSWER action | budget unchanged from before | happy |
| test_step_appends_action_history | action_history grows each step | Two DESCRIBE actions | len(obs.action_history) == 2 |
happy |
| test_step_invalid_action_type | Unknown action_type returns error | action_type="UNKNOWN" |
obs.error contains "Unknown action type" and lists valid types |
error |
| test_step_empty_argument | Empty argument returns error | argument="" |
obs.error contains "cannot be empty" |
error |
| test_step_whitespace_argument | Whitespace-only argument returns error | argument=" " |
obs.error contains "cannot be empty" |
error |
1.8 SQLEnvironment._execute_sql
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_exec_valid_select | Executes a valid SELECT | "SELECT 1" |
Returns list with results | happy |
| test_exec_non_select_rejected | Non-SELECT is rejected | "DROP TABLE x" |
ValueError raised |
error |
| test_exec_insert_rejected | INSERT is rejected | "INSERT INTO t VALUES(1)" |
ValueError raised |
error |
| test_exec_update_rejected | UPDATE is rejected | "UPDATE t SET x=1" |
ValueError raised |
error |
| test_exec_delete_rejected | DELETE is rejected | "DELETE FROM t" |
ValueError raised |
error |
| test_exec_create_rejected | CREATE is rejected | "CREATE TABLE t(x INT)" |
ValueError raised |
error |
| test_exec_semicolon_multi | Multiple statements rejected | "SELECT 1; DROP TABLE t" |
Error or only first executed safely | error |
| test_exec_syntax_error | Malformed SQL | "SELCET * FORM t" |
sqlite3.OperationalError |
error |
| test_exec_timeout | Query exceeding timeout is interrupted | Long-running query (e.g., cartesian join) | Error raised within ~5 seconds | error |
| test_exec_result_truncation | Results truncated to 20 rows | Query returning 100 rows | len(result) <= 20 |
edge |
| test_exec_empty_result | Query returning no rows | "SELECT * FROM t WHERE 1=0" |
Empty list [] |
edge |
| test_exec_read_only | Connection is truly read-only | Attempt write via raw SQL | Error raised | error |
| test_exec_case_insensitive_select | select (lowercase) is accepted |
"select 1" |
Returns results | edge |
| test_exec_select_with_leading_whitespace | Leading whitespace before SELECT | " SELECT 1" |
Returns results | edge |
| test_exec_select_with_comment | SQL comment before SELECT | "-- comment\nSELECT 1" |
Handled correctly (accepted or rejected consistently) | edge |
1.9 SQLEnvironment._handle_describe
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_describe_valid_table | Known table returns schema info | Existing table name | Result contains column names, types, row count | happy |
| test_describe_unknown_table | Unknown table returns error with list | "nonexistent_table" |
Error containing "not found" and listing available tables | error |
| test_describe_case_sensitivity | Table name matching case behavior | Mixed case table name | Consistent behavior (either matches or provides helpful error) | edge |
| test_describe_updates_described_set | Described table is tracked | DESCRIBE a table | Table appears in described_tables set | happy |
| test_describe_repeated_table | Describing same table twice works | DESCRIBE same table twice | No error, returns same info | edge |
1.10 SQLEnvironment._handle_sample
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_sample_valid_table | Known table returns sample rows | Existing table name | Formatted rows returned | happy |
| test_sample_unknown_table | Unknown table returns error | "nonexistent_table" |
Error with available table list | error |
| test_sample_default_limit | Default limit is 5 rows | Omit limit parameter | At most 5 rows returned | happy |
| test_sample_empty_table | Table with no rows | Empty table | Returns empty or "no rows" result | edge |
1.11 SQLEnvironment._handle_query
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_query_valid_select | Valid SELECT returns formatted results | "SELECT * FROM t LIMIT 3" |
Formatted result string | happy |
| test_query_non_select | Non-SELECT returns error | "DROP TABLE t" |
Error about SELECT-only | error |
| test_query_syntax_error | Bad SQL returns error | "SELCET *" |
Error with sqlite3 message | error |
| test_query_timeout | Slow query returns timeout error | Expensive query | Error mentioning "timed out" and "5.0 seconds" | error |
| test_query_truncation_indicator | >20 rows shows truncation notice | Query returning many rows | Result indicates truncation occurred | edge |
| test_query_exactly_20_rows | Exactly 20 rows shows no truncation | Query returning 20 rows | No truncation indicator | edge |
1.12 SQLEnvironment._handle_answer
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_answer_correct | Correct answer yields reward 1.0 | Gold answer value | (True, 1.0) |
happy |
| test_answer_incorrect | Incorrect answer yields reward 0.0 | Wrong value | (False, 0.0) |
happy |
| test_answer_case_insensitive | Case-insensitive comparison | "PARIS" vs gold "paris" |
(True, 1.0) |
edge |
| test_answer_whitespace_stripped | Leading/trailing whitespace stripped | " 42 " vs gold "42" |
(True, 1.0) |
edge |
| test_answer_sets_done | Episode is marked done after answer | Any answer | done == True |
happy |
| test_answer_empty_string | Empty answer handled | "" |
(False, 0.0) |
edge |
1.13 SQLEnvironment._load_questions
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_load_valid_json | Valid questions JSON parsed | Path to valid file | list[QuestionRecord] with correct count |
happy |
| test_load_file_not_found | Missing file raises error | "/nonexistent.json" |
FileNotFoundError |
error |
| test_load_invalid_json | Malformed JSON raises error | Path to "{bad" |
ValueError |
error |
| test_load_missing_fields | Record missing required field | JSON with missing gold_sql |
ValueError or KeyError |
error |
1.14 SQLEnvironment._open_db
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_open_valid_db | Opens existing database | Valid db_name | sqlite3.Connection returned |
happy |
| test_open_missing_db | Missing database raises error | "nonexistent_db" |
FileNotFoundError |
error |
| test_open_read_only | Connection is read-only | Valid db_name | Write operations fail | happy |
1.15 SQLEnvironment._build_observation
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_build_obs_reflects_context | Observation matches episode context | Various context states | Fields match context values | happy |
| test_build_obs_schema_progressive | schema_info includes described columns | After DESCRIBE action | schema_info contains column details for described tables | happy |
| test_build_obs_schema_initial | Initial schema_info has only table names | Before any DESCRIBE | Only table names, no columns | happy |
Run: pytest tests/unit/test_sql_environment.py -v
2. Integration Tests
Flow: Full Episode Lifecycle (Reset-Explore-Answer)
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | reset(seed=42) |
Observation with question, table names, step=0, budget=15 | Assert all initial observation fields |
| 2 | step(DESCRIBE, "employees") |
Column info in result, step=1, budget=14 | Assert result contains column names, step incremented |
| 3 | step(SAMPLE, "employees") |
Sample rows in result, step=2, budget=13 | Assert formatted rows in result |
| 4 | step(QUERY, "SELECT COUNT(*) FROM employees") |
Count result, step=3, budget=12 | Assert numeric result in output |
| 5 | step(ANSWER, "<correct_value>") |
done=True, reward=1.0 | Assert terminal state with positive reward |
| 6 | Verify no more steps accepted | Post-episode step attempt handled gracefully | Assert error or episode-over signal |
Flow: Budget Exhaustion
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | reset() with budget=3 |
Observation with budget=3 | Assert budget_remaining==3 |
| 2 | step(DESCRIBE, "t1") |
budget=2 | Assert budget_remaining==2 |
| 3 | step(DESCRIBE, "t2") |
budget=1 | Assert budget_remaining==1 |
| 4 | step(DESCRIBE, "t3") |
budget=0, done=True, reward=0.0 | Assert terminal state, zero reward |
Flow: Error Recovery Within Episode
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | reset() |
Normal initial observation | Assert done==False |
| 2 | step(QUERY, "DROP TABLE x") |
Error in obs.error, done=False | Assert error message, episode continues |
| 3 | step(QUERY, "SELCET * FORM t") |
Error in obs.error, done=False | Assert SQL error message, episode continues |
| 4 | step(DESCRIBE, "nonexistent") |
Error in obs.error, done=False | Assert "not found" error, episode continues |
| 5 | step(QUERY, "SELECT 1") |
Success result, no error | Assert error=="" and result has data |
Flow: Progressive Schema Discovery
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | reset() |
schema_info has table names only | Assert table names present, no column details |
| 2 | step(DESCRIBE, "table_a") |
schema_info now includes table_a columns | Assert column names for table_a appear |
| 3 | step(DESCRIBE, "table_b") |
schema_info includes both table_a and table_b columns | Assert both tables' columns present |
Flow: Seed Determinism
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | reset(seed=42) |
Observation A | Record question and db |
| 2 | reset(seed=42) |
Observation B identical to A | Assert same question_text and schema_info |
Run: pytest tests/integration/test_episode_lifecycle.py -v
3. API Tests
POST /reset
| Test | Request | Status | Response | Category |
|---|---|---|---|---|
| test_reset_no_params | {} |
200 | SQLObservation with done=False, budget=15 | happy |
| test_reset_with_seed | {"seed": 42} |
200 | Deterministic SQLObservation | happy |
| test_reset_with_episode_id | {"episode_id": "ep-1"} |
200 | SQLObservation (episode_id reflected in context) | happy |
| test_reset_null_seed | {"seed": null} |
200 | Random question selected | happy |
| test_reset_server_error_missing_db | (DB file removed) | 500 | Error response indicating database not found | error |
| test_reset_response_schema | {} |
200 | Response contains all SQLObservation fields | happy |
POST /step
| Test | Request | Status | Response | Category |
|---|---|---|---|---|
| test_step_describe | {"action_type": "DESCRIBE", "argument": "employees"} |
200 | SQLObservation with schema result | happy |
| test_step_sample | {"action_type": "SAMPLE", "argument": "employees"} |
200 | SQLObservation with sample rows | happy |
| test_step_query | {"action_type": "QUERY", "argument": "SELECT 1"} |
200 | SQLObservation with query result | happy |
| test_step_answer | {"action_type": "ANSWER", "argument": "42"} |
200 | SQLObservation with done=True | happy |
| test_step_missing_action_type | {"argument": "x"} |
422 | Validation error | error |
| test_step_missing_argument | {"action_type": "QUERY"} |
422 | Validation error | error |
| test_step_empty_body | {} |
422 | Validation error | error |
| test_step_invalid_action_type | {"action_type": "HACK", "argument": "x"} |
200 | SQLObservation with error field set | error |
| test_step_without_reset | Call step before reset | 200 or 500 | Graceful error (not a crash) | error |
Run: pytest tests/api/test_endpoints.py -v
4. E2E Tests
Scenario: Agent Solves Easy Question
Setup: Environment initialized with Spider questions and databases. Select an easy question with a known answer. Actions:
- POST /reset with seed that selects an easy question
- POST /step with DESCRIBE on relevant table
- POST /step with SAMPLE on relevant table
- POST /step with QUERY using correct SQL
- POST /step with ANSWER providing correct value Expected: reward=1.0, done=True, total steps <= 15
Scenario: Agent Exhausts Budget
Setup: Environment initialized with budget=3. Actions:
- POST /reset
- POST /step with DESCRIBE x3 Expected: After 3rd step: done=True, reward=0.0, budget_remaining=0
Scenario: Agent Submits Wrong Answer
Setup: Environment initialized normally. Actions:
- POST /reset
- POST /step with ANSWER providing deliberately wrong value Expected: done=True, reward=0.0
Scenario: Agent Recovers From Errors
Setup: Environment initialized normally. Actions:
- POST /reset
- POST /step with QUERY "DROP TABLE x" (rejected)
- POST /step with DESCRIBE "nonexistent_table" (error)
- POST /step with valid QUERY
- POST /step with correct ANSWER Expected: Error steps counted against budget, valid steps succeed, episode completes with reward=1.0
Run: pytest tests/e2e/test_full_episodes.py -v
5. Edge Cases Checklist
- Null/None seed and episode_id in reset
- Empty string argument for all action types
- Whitespace-only argument for all action types
- Very long SQL query (10KB+)
- SQL with unicode characters in string literals
- SQL with special characters (backticks, double quotes, brackets)
- Table name with special characters or spaces
- ANSWER with numeric value as string vs integer
- ANSWER with leading zeros ("042" vs "42")
- ANSWER with trailing decimal ("42.0" vs "42")
- Budget of 1 (single step before exhaustion)
- Budget of 0 (should episode immediately end?)
- Calling step after episode is done
- Calling reset multiple times without stepping
- Concurrent episodes (if supported)
- Database with zero tables
- Database with very large tables (performance)
- Question whose gold_sql returns empty result
- SELECT with subqueries, CTEs, UNION
- SQL injection attempts via argument field (
"; DROP TABLE--) - Very long table name in DESCRIBE/SAMPLE
6. Evidence Requirements
| Category | Evidence Type | Example |
|---|---|---|
| Unit tests | pytest output | 45 passed |
| Integration | pytest output | 5 passed |
| API tests | pytest output or httpx test client | 9 passed |
| E2E | pytest output with real Spider DB | 4 passed |
| SQL sandbox | Demonstrate write rejection | ValueError: Only SELECT queries are allowed |
| Timeout | Demonstrate query interruption | Query timed out after 5.0 seconds |
| Budget | Demonstrate forced termination | done=True, reward=0.0 at budget=0 |
| Answer comparison | Demonstrate case-insensitive match | "PARIS" == "paris" -> reward=1.0 |