# Verification Specification **Feature:** F001 **Generated from:** specs/F001-VERIFICATION_INPUT.json **Generated:** 2026-03-24 --- ## 1. Unit Tests ### 1.1 SQLAction Type | Test | Description | Input | Expected | Category | |------|-------------|-------|----------|----------| | test_sqlaction_valid_describe | Create action with DESCRIBE type | `SQLAction(action_type="DESCRIBE", argument="employees")` | Fields set correctly | happy | | test_sqlaction_valid_sample | Create action with SAMPLE type | `SQLAction(action_type="SAMPLE", argument="orders")` | Fields set correctly | happy | | test_sqlaction_valid_query | Create action with QUERY type | `SQLAction(action_type="QUERY", argument="SELECT 1")` | Fields set correctly | happy | | test_sqlaction_valid_answer | Create action with ANSWER type | `SQLAction(action_type="ANSWER", argument="42")` | Fields set correctly | happy | | test_sqlaction_empty_argument | Argument is empty string | `SQLAction(action_type="QUERY", argument="")` | Accepted at type level (validation at step) | edge | | test_sqlaction_whitespace_argument | Argument is only whitespace | `SQLAction(action_type="QUERY", argument=" ")` | Accepted at type level | edge | | test_sqlaction_serialization | Round-trip JSON serialization | Create, serialize, deserialize | Identical fields | happy | ### 1.2 SQLObservation Type | Test | Description | Input | Expected | Category | |------|-------------|-------|----------|----------| | test_obs_all_fields_present | All required fields populated | Full observation construction | All fields accessible with correct types | happy | | test_obs_done_false_initial | Initial observation has done=False | After reset | `done == False` | happy | | test_obs_reward_none_nonterminal | Non-terminal obs has reward=None | After non-ANSWER step | `reward is None` | happy | | test_obs_reward_set_terminal | Terminal obs has numeric reward | After ANSWER step | `reward in {0.0, 1.0}` | happy | | test_obs_step_count_type | step_count is int | Any observation | `isinstance(step_count, int)` | happy | | test_obs_budget_remaining_type | budget_remaining is int | Any observation | `isinstance(budget_remaining, int)` | happy | | test_obs_action_history_list | action_history is a list of strings | After several steps | `isinstance(action_history, list)` and all elements are `str` | happy | | test_obs_error_empty_on_success | error is empty string on success | After successful action | `error == ""` | happy | | test_obs_schema_info_nonempty | schema_info is non-empty after reset | After reset | `len(schema_info) > 0` | happy | ### 1.3 QuestionRecord Type | Test | Description | Input | Expected | Category | |------|-------------|-------|----------|----------| | test_qr_all_fields | All fields populated from valid JSON | Parsed question JSON | All 8 fields present and correct types | happy | | test_qr_difficulty_values | difficulty in allowed set | Various records | `difficulty in {"easy", "medium", "hard"}` | happy | | test_qr_answer_type_values | answer_type in allowed set | Various records | `answer_type in {"integer", "float", "string", "list"}` | happy | | test_qr_tables_involved_nonempty | tables_involved has at least one entry | Valid question | `len(tables_involved) >= 1` | happy | ### 1.4 EpisodeContext Type | Test | Description | Input | Expected | Category | |------|-------------|-------|----------|----------| | test_ctx_initial_step_count | step_count starts at 0 | After reset | `step_count == 0` | happy | | test_ctx_initial_budget | budget starts at configured value (default 15) | After reset | `budget == 15` | happy | | test_ctx_described_tables_empty | described_tables starts empty | After reset | `len(described_tables) == 0` | happy | | test_ctx_action_log_empty | action_log starts empty | After reset | `len(action_log) == 0` | happy | | test_ctx_done_false_initial | done starts False | After reset | `done == False` | happy | | test_ctx_gold_answer_computed | gold_answer is set after reset | After reset | `gold_answer is not None` | happy | ### 1.5 SQLEnvironment.__init__ | Test | Description | Input | Expected | Category | |------|-------------|-------|----------|----------| | test_init_valid | Init with valid paths loads questions | Valid questions_path, db_dir | `len(questions) > 0` | happy | | test_init_missing_questions_file | questions_path does not exist | `"/nonexistent/q.json"` | `FileNotFoundError` | error | | test_init_missing_db_dir | db_dir does not exist | `"/nonexistent/dbs/"` | `FileNotFoundError` or `ValueError` | error | | test_init_invalid_json | questions file is not valid JSON | Path to file with `"{bad"` | `ValueError` | error | | test_init_empty_questions | questions file is empty array `[]` | Path to file with `"[]"` | `ValueError` (no questions) | error | | test_init_custom_budget | Custom step_budget is stored | `step_budget=10` | Environment uses 10-step budget | happy | | test_init_default_budget | Default step_budget is 15 | Omit step_budget | Budget is 15 | happy | ### 1.6 SQLEnvironment.reset | Test | Description | Input | Expected | Category | |------|-------------|-------|----------|----------| | test_reset_returns_observation | Returns SQLObservation | `reset()` | `isinstance(result, SQLObservation)` | happy | | test_reset_obs_has_question | Observation contains question text | `reset()` | `len(obs.question) > 0` | happy | | test_reset_obs_has_table_names | schema_info contains table names | `reset()` | schema_info mentions at least one table | happy | | test_reset_obs_no_columns | schema_info does NOT reveal columns initially | `reset()` | No column names in schema_info | happy | | test_reset_obs_done_false | done is False | `reset()` | `obs.done == False` | happy | | test_reset_obs_reward_none | reward is None | `reset()` | `obs.reward is None` | happy | | test_reset_obs_step_count_zero | step_count is 0 | `reset()` | `obs.step_count == 0` | happy | | test_reset_obs_budget_full | budget_remaining equals configured budget | `reset()` | `obs.budget_remaining == 15` | happy | | test_reset_obs_empty_history | action_history is empty | `reset()` | `obs.action_history == []` | happy | | test_reset_obs_empty_error | error is empty string | `reset()` | `obs.error == ""` | happy | | test_reset_seed_determinism | Same seed yields same question | `reset(seed=42)` twice | Same question both times | happy | | test_reset_different_seeds | Different seeds can yield different questions | `reset(seed=1)` vs `reset(seed=999)` | Not necessarily same question (probabilistic) | happy | | test_reset_no_seed_random | Without seed, question is randomly selected | `reset()` multiple times | At least one different question (probabilistic) | happy | | test_reset_episode_id_set | Custom episode_id is reflected | `reset(episode_id="ep-123")` | Context has episode_id="ep-123" | happy | | test_reset_missing_db_file | Question references nonexistent DB | Question with bad db_name | `FileNotFoundError` | error | | test_reset_clears_previous_episode | Calling reset mid-episode starts fresh | reset, step, reset | Second reset has step_count=0, empty history | happy | ### 1.7 SQLEnvironment.step | Test | Description | Input | Expected | Category | |------|-------------|-------|----------|----------| | test_step_returns_observation | Returns SQLObservation | Any valid action | `isinstance(result, SQLObservation)` | happy | | test_step_never_raises | step never raises exceptions | Various invalid inputs | Returns obs with error field, no exception | happy | | test_step_increments_step_count | step_count increases by 1 | DESCRIBE action | `obs.step_count == 1` | happy | | test_step_decrements_budget | budget_remaining decreases by 1 | DESCRIBE/SAMPLE/QUERY | `obs.budget_remaining == 14` | happy | | test_step_answer_no_budget_decrement | ANSWER does not decrement budget | ANSWER action | budget unchanged from before | happy | | test_step_appends_action_history | action_history grows each step | Two DESCRIBE actions | `len(obs.action_history) == 2` | happy | | test_step_invalid_action_type | Unknown action_type returns error | `action_type="UNKNOWN"` | `obs.error` contains "Unknown action type" and lists valid types | error | | test_step_empty_argument | Empty argument returns error | `argument=""` | `obs.error` contains "cannot be empty" | error | | test_step_whitespace_argument | Whitespace-only argument returns error | `argument=" "` | `obs.error` contains "cannot be empty" | error | ### 1.8 SQLEnvironment._execute_sql | Test | Description | Input | Expected | Category | |------|-------------|-------|----------|----------| | test_exec_valid_select | Executes a valid SELECT | `"SELECT 1"` | Returns list with results | happy | | test_exec_non_select_rejected | Non-SELECT is rejected | `"DROP TABLE x"` | `ValueError` raised | error | | test_exec_insert_rejected | INSERT is rejected | `"INSERT INTO t VALUES(1)"` | `ValueError` raised | error | | test_exec_update_rejected | UPDATE is rejected | `"UPDATE t SET x=1"` | `ValueError` raised | error | | test_exec_delete_rejected | DELETE is rejected | `"DELETE FROM t"` | `ValueError` raised | error | | test_exec_create_rejected | CREATE is rejected | `"CREATE TABLE t(x INT)"` | `ValueError` raised | error | | test_exec_semicolon_multi | Multiple statements rejected | `"SELECT 1; DROP TABLE t"` | Error or only first executed safely | error | | test_exec_syntax_error | Malformed SQL | `"SELCET * FORM t"` | `sqlite3.OperationalError` | error | | test_exec_timeout | Query exceeding timeout is interrupted | Long-running query (e.g., cartesian join) | Error raised within ~5 seconds | error | | test_exec_result_truncation | Results truncated to 20 rows | Query returning 100 rows | `len(result) <= 20` | edge | | test_exec_empty_result | Query returning no rows | `"SELECT * FROM t WHERE 1=0"` | Empty list `[]` | edge | | test_exec_read_only | Connection is truly read-only | Attempt write via raw SQL | Error raised | error | | test_exec_case_insensitive_select | `select` (lowercase) is accepted | `"select 1"` | Returns results | edge | | test_exec_select_with_leading_whitespace | Leading whitespace before SELECT | `" SELECT 1"` | Returns results | edge | | test_exec_select_with_comment | SQL comment before SELECT | `"-- comment\nSELECT 1"` | Handled correctly (accepted or rejected consistently) | edge | ### 1.9 SQLEnvironment._handle_describe | Test | Description | Input | Expected | Category | |------|-------------|-------|----------|----------| | test_describe_valid_table | Known table returns schema info | Existing table name | Result contains column names, types, row count | happy | | test_describe_unknown_table | Unknown table returns error with list | `"nonexistent_table"` | Error containing "not found" and listing available tables | error | | test_describe_case_sensitivity | Table name matching case behavior | Mixed case table name | Consistent behavior (either matches or provides helpful error) | edge | | test_describe_updates_described_set | Described table is tracked | DESCRIBE a table | Table appears in described_tables set | happy | | test_describe_repeated_table | Describing same table twice works | DESCRIBE same table twice | No error, returns same info | edge | ### 1.10 SQLEnvironment._handle_sample | Test | Description | Input | Expected | Category | |------|-------------|-------|----------|----------| | test_sample_valid_table | Known table returns sample rows | Existing table name | Formatted rows returned | happy | | test_sample_unknown_table | Unknown table returns error | `"nonexistent_table"` | Error with available table list | error | | test_sample_default_limit | Default limit is 5 rows | Omit limit parameter | At most 5 rows returned | happy | | test_sample_empty_table | Table with no rows | Empty table | Returns empty or "no rows" result | edge | ### 1.11 SQLEnvironment._handle_query | Test | Description | Input | Expected | Category | |------|-------------|-------|----------|----------| | test_query_valid_select | Valid SELECT returns formatted results | `"SELECT * FROM t LIMIT 3"` | Formatted result string | happy | | test_query_non_select | Non-SELECT returns error | `"DROP TABLE t"` | Error about SELECT-only | error | | test_query_syntax_error | Bad SQL returns error | `"SELCET *"` | Error with sqlite3 message | error | | test_query_timeout | Slow query returns timeout error | Expensive query | Error mentioning "timed out" and "5.0 seconds" | error | | test_query_truncation_indicator | >20 rows shows truncation notice | Query returning many rows | Result indicates truncation occurred | edge | | test_query_exactly_20_rows | Exactly 20 rows shows no truncation | Query returning 20 rows | No truncation indicator | edge | ### 1.12 SQLEnvironment._handle_answer | Test | Description | Input | Expected | Category | |------|-------------|-------|----------|----------| | test_answer_correct | Correct answer yields reward 1.0 | Gold answer value | `(True, 1.0)` | happy | | test_answer_incorrect | Incorrect answer yields reward 0.0 | Wrong value | `(False, 0.0)` | happy | | test_answer_case_insensitive | Case-insensitive comparison | `"PARIS"` vs gold `"paris"` | `(True, 1.0)` | edge | | test_answer_whitespace_stripped | Leading/trailing whitespace stripped | `" 42 "` vs gold `"42"` | `(True, 1.0)` | edge | | test_answer_sets_done | Episode is marked done after answer | Any answer | `done == True` | happy | | test_answer_empty_string | Empty answer handled | `""` | `(False, 0.0)` | edge | ### 1.13 SQLEnvironment._load_questions | Test | Description | Input | Expected | Category | |------|-------------|-------|----------|----------| | test_load_valid_json | Valid questions JSON parsed | Path to valid file | `list[QuestionRecord]` with correct count | happy | | test_load_file_not_found | Missing file raises error | `"/nonexistent.json"` | `FileNotFoundError` | error | | test_load_invalid_json | Malformed JSON raises error | Path to `"{bad"` | `ValueError` | error | | test_load_missing_fields | Record missing required field | JSON with missing `gold_sql` | `ValueError` or `KeyError` | error | ### 1.14 SQLEnvironment._open_db | Test | Description | Input | Expected | Category | |------|-------------|-------|----------|----------| | test_open_valid_db | Opens existing database | Valid db_name | `sqlite3.Connection` returned | happy | | test_open_missing_db | Missing database raises error | `"nonexistent_db"` | `FileNotFoundError` | error | | test_open_read_only | Connection is read-only | Valid db_name | Write operations fail | happy | ### 1.15 SQLEnvironment._build_observation | Test | Description | Input | Expected | Category | |------|-------------|-------|----------|----------| | test_build_obs_reflects_context | Observation matches episode context | Various context states | Fields match context values | happy | | test_build_obs_schema_progressive | schema_info includes described columns | After DESCRIBE action | schema_info contains column details for described tables | happy | | test_build_obs_schema_initial | Initial schema_info has only table names | Before any DESCRIBE | Only table names, no columns | happy | **Run:** `pytest tests/unit/test_sql_environment.py -v` --- ## 2. Integration Tests ### Flow: Full Episode Lifecycle (Reset-Explore-Answer) | Step | Action | Expected | Verification | |------|--------|----------|--------------| | 1 | `reset(seed=42)` | Observation with question, table names, step=0, budget=15 | Assert all initial observation fields | | 2 | `step(DESCRIBE, "employees")` | Column info in result, step=1, budget=14 | Assert result contains column names, step incremented | | 3 | `step(SAMPLE, "employees")` | Sample rows in result, step=2, budget=13 | Assert formatted rows in result | | 4 | `step(QUERY, "SELECT COUNT(*) FROM employees")` | Count result, step=3, budget=12 | Assert numeric result in output | | 5 | `step(ANSWER, "")` | done=True, reward=1.0 | Assert terminal state with positive reward | | 6 | Verify no more steps accepted | Post-episode step attempt handled gracefully | Assert error or episode-over signal | ### Flow: Budget Exhaustion | Step | Action | Expected | Verification | |------|--------|----------|--------------| | 1 | `reset()` with budget=3 | Observation with budget=3 | Assert budget_remaining==3 | | 2 | `step(DESCRIBE, "t1")` | budget=2 | Assert budget_remaining==2 | | 3 | `step(DESCRIBE, "t2")` | budget=1 | Assert budget_remaining==1 | | 4 | `step(DESCRIBE, "t3")` | budget=0, done=True, reward=0.0 | Assert terminal state, zero reward | ### Flow: Error Recovery Within Episode | Step | Action | Expected | Verification | |------|--------|----------|--------------| | 1 | `reset()` | Normal initial observation | Assert done==False | | 2 | `step(QUERY, "DROP TABLE x")` | Error in obs.error, done=False | Assert error message, episode continues | | 3 | `step(QUERY, "SELCET * FORM t")` | Error in obs.error, done=False | Assert SQL error message, episode continues | | 4 | `step(DESCRIBE, "nonexistent")` | Error in obs.error, done=False | Assert "not found" error, episode continues | | 5 | `step(QUERY, "SELECT 1")` | Success result, no error | Assert error=="" and result has data | ### Flow: Progressive Schema Discovery | Step | Action | Expected | Verification | |------|--------|----------|--------------| | 1 | `reset()` | schema_info has table names only | Assert table names present, no column details | | 2 | `step(DESCRIBE, "table_a")` | schema_info now includes table_a columns | Assert column names for table_a appear | | 3 | `step(DESCRIBE, "table_b")` | schema_info includes both table_a and table_b columns | Assert both tables' columns present | ### Flow: Seed Determinism | Step | Action | Expected | Verification | |------|--------|----------|--------------| | 1 | `reset(seed=42)` | Observation A | Record question and db | | 2 | `reset(seed=42)` | Observation B identical to A | Assert same question_text and schema_info | **Run:** `pytest tests/integration/test_episode_lifecycle.py -v` --- ## 3. API Tests ### POST /reset | Test | Request | Status | Response | Category | |------|---------|--------|----------|----------| | test_reset_no_params | `{}` | 200 | SQLObservation with done=False, budget=15 | happy | | test_reset_with_seed | `{"seed": 42}` | 200 | Deterministic SQLObservation | happy | | test_reset_with_episode_id | `{"episode_id": "ep-1"}` | 200 | SQLObservation (episode_id reflected in context) | happy | | test_reset_null_seed | `{"seed": null}` | 200 | Random question selected | happy | | test_reset_server_error_missing_db | (DB file removed) | 500 | Error response indicating database not found | error | | test_reset_response_schema | `{}` | 200 | Response contains all SQLObservation fields | happy | ### POST /step | Test | Request | Status | Response | Category | |------|---------|--------|----------|----------| | test_step_describe | `{"action_type": "DESCRIBE", "argument": "employees"}` | 200 | SQLObservation with schema result | happy | | test_step_sample | `{"action_type": "SAMPLE", "argument": "employees"}` | 200 | SQLObservation with sample rows | happy | | test_step_query | `{"action_type": "QUERY", "argument": "SELECT 1"}` | 200 | SQLObservation with query result | happy | | test_step_answer | `{"action_type": "ANSWER", "argument": "42"}` | 200 | SQLObservation with done=True | happy | | test_step_missing_action_type | `{"argument": "x"}` | 422 | Validation error | error | | test_step_missing_argument | `{"action_type": "QUERY"}` | 422 | Validation error | error | | test_step_empty_body | `{}` | 422 | Validation error | error | | test_step_invalid_action_type | `{"action_type": "HACK", "argument": "x"}` | 200 | SQLObservation with error field set | error | | test_step_without_reset | Call step before reset | 200 or 500 | Graceful error (not a crash) | error | **Run:** `pytest tests/api/test_endpoints.py -v` --- ## 4. E2E Tests ### Scenario: Agent Solves Easy Question **Setup:** Environment initialized with Spider questions and databases. Select an easy question with a known answer. **Actions:** 1. POST /reset with seed that selects an easy question 2. POST /step with DESCRIBE on relevant table 3. POST /step with SAMPLE on relevant table 4. POST /step with QUERY using correct SQL 5. POST /step with ANSWER providing correct value **Expected:** reward=1.0, done=True, total steps <= 15 ### Scenario: Agent Exhausts Budget **Setup:** Environment initialized with budget=3. **Actions:** 1. POST /reset 2. POST /step with DESCRIBE x3 **Expected:** After 3rd step: done=True, reward=0.0, budget_remaining=0 ### Scenario: Agent Submits Wrong Answer **Setup:** Environment initialized normally. **Actions:** 1. POST /reset 2. POST /step with ANSWER providing deliberately wrong value **Expected:** done=True, reward=0.0 ### Scenario: Agent Recovers From Errors **Setup:** Environment initialized normally. **Actions:** 1. POST /reset 2. POST /step with QUERY "DROP TABLE x" (rejected) 3. POST /step with DESCRIBE "nonexistent_table" (error) 4. POST /step with valid QUERY 5. POST /step with correct ANSWER **Expected:** Error steps counted against budget, valid steps succeed, episode completes with reward=1.0 **Run:** `pytest tests/e2e/test_full_episodes.py -v` --- ## 5. Edge Cases Checklist - [ ] Null/None seed and episode_id in reset - [ ] Empty string argument for all action types - [ ] Whitespace-only argument for all action types - [ ] Very long SQL query (10KB+) - [ ] SQL with unicode characters in string literals - [ ] SQL with special characters (backticks, double quotes, brackets) - [ ] Table name with special characters or spaces - [ ] ANSWER with numeric value as string vs integer - [ ] ANSWER with leading zeros ("042" vs "42") - [ ] ANSWER with trailing decimal ("42.0" vs "42") - [ ] Budget of 1 (single step before exhaustion) - [ ] Budget of 0 (should episode immediately end?) - [ ] Calling step after episode is done - [ ] Calling reset multiple times without stepping - [ ] Concurrent episodes (if supported) - [ ] Database with zero tables - [ ] Database with very large tables (performance) - [ ] Question whose gold_sql returns empty result - [ ] SELECT with subqueries, CTEs, UNION - [ ] SQL injection attempts via argument field (`"; DROP TABLE--`) - [ ] Very long table name in DESCRIBE/SAMPLE --- ## 6. Evidence Requirements | Category | Evidence Type | Example | |----------|---------------|---------| | Unit tests | pytest output | `45 passed` | | Integration | pytest output | `5 passed` | | API tests | pytest output or httpx test client | `9 passed` | | E2E | pytest output with real Spider DB | `4 passed` | | SQL sandbox | Demonstrate write rejection | `ValueError: Only SELECT queries are allowed` | | Timeout | Demonstrate query interruption | `Query timed out after 5.0 seconds` | | Budget | Demonstrate forced termination | `done=True, reward=0.0 at budget=0` | | Answer comparison | Demonstrate case-insensitive match | `"PARIS" == "paris" -> reward=1.0` |