# Verification Specification

**Feature:** F001
**Generated from:** specs/F001-VERIFICATION_INPUT.json
**Generated:** 2026-03-24

---

## 1. Unit Tests

### 1.1 SQLAction Type

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_sqlaction_valid_describe | Create action with DESCRIBE type | `SQLAction(action_type="DESCRIBE", argument="employees")` | Fields set correctly | happy |
| test_sqlaction_valid_sample | Create action with SAMPLE type | `SQLAction(action_type="SAMPLE", argument="orders")` | Fields set correctly | happy |
| test_sqlaction_valid_query | Create action with QUERY type | `SQLAction(action_type="QUERY", argument="SELECT 1")` | Fields set correctly | happy |
| test_sqlaction_valid_answer | Create action with ANSWER type | `SQLAction(action_type="ANSWER", argument="42")` | Fields set correctly | happy |
| test_sqlaction_empty_argument | Argument is empty string | `SQLAction(action_type="QUERY", argument="")` | Accepted at type level (validation at step) | edge |
| test_sqlaction_whitespace_argument | Argument is only whitespace | `SQLAction(action_type="QUERY", argument="   ")` | Accepted at type level | edge |
| test_sqlaction_serialization | Round-trip JSON serialization | Create, serialize, deserialize | Identical fields | happy |

### 1.2 SQLObservation Type

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_obs_all_fields_present | All required fields populated | Full observation construction | All fields accessible with correct types | happy |
| test_obs_done_false_initial | Initial observation has done=False | After reset | `done == False` | happy |
| test_obs_reward_none_nonterminal | Non-terminal obs has reward=None | After non-ANSWER step | `reward is None` | happy |
| test_obs_reward_set_terminal | Terminal obs has numeric reward | After ANSWER step | `reward in {0.0, 1.0}` | happy |
| test_obs_step_count_type | step_count is int | Any observation | `isinstance(step_count, int)` | happy |
| test_obs_budget_remaining_type | budget_remaining is int | Any observation | `isinstance(budget_remaining, int)` | happy |
| test_obs_action_history_list | action_history is a list of strings | After several steps | `isinstance(action_history, list)` and all elements are `str` | happy |
| test_obs_error_empty_on_success | error is empty string on success | After successful action | `error == ""` | happy |
| test_obs_schema_info_nonempty | schema_info is non-empty after reset | After reset | `len(schema_info) > 0` | happy |

### 1.3 QuestionRecord Type

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_qr_all_fields | All fields populated from valid JSON | Parsed question JSON | All 8 fields present and correct types | happy |
| test_qr_difficulty_values | difficulty in allowed set | Various records | `difficulty in {"easy", "medium", "hard"}` | happy |
| test_qr_answer_type_values | answer_type in allowed set | Various records | `answer_type in {"integer", "float", "string", "list"}` | happy |
| test_qr_tables_involved_nonempty | tables_involved has at least one entry | Valid question | `len(tables_involved) >= 1` | happy |

### 1.4 EpisodeContext Type

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_ctx_initial_step_count | step_count starts at 0 | After reset | `step_count == 0` | happy |
| test_ctx_initial_budget | budget starts at configured value (default 15) | After reset | `budget == 15` | happy |
| test_ctx_described_tables_empty | described_tables starts empty | After reset | `len(described_tables) == 0` | happy |
| test_ctx_action_log_empty | action_log starts empty | After reset | `len(action_log) == 0` | happy |
| test_ctx_done_false_initial | done starts False | After reset | `done == False` | happy |
| test_ctx_gold_answer_computed | gold_answer is set after reset | After reset | `gold_answer is not None` | happy |

### 1.5 SQLEnvironment.__init__

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_init_valid | Init with valid paths loads questions | Valid questions_path, db_dir | `len(questions) > 0` | happy |
| test_init_missing_questions_file | questions_path does not exist | `"/nonexistent/q.json"` | `FileNotFoundError` | error |
| test_init_missing_db_dir | db_dir does not exist | `"/nonexistent/dbs/"` | `FileNotFoundError` or `ValueError` | error |
| test_init_invalid_json | questions file is not valid JSON | Path to file with `"{bad"` | `ValueError` | error |
| test_init_empty_questions | questions file is empty array `[]` | Path to file with `"[]"` | `ValueError` (no questions) | error |
| test_init_custom_budget | Custom step_budget is stored | `step_budget=10` | Environment uses 10-step budget | happy |
| test_init_default_budget | Default step_budget is 15 | Omit step_budget | Budget is 15 | happy |

### 1.6 SQLEnvironment.reset

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_reset_returns_observation | Returns SQLObservation | `reset()` | `isinstance(result, SQLObservation)` | happy |
| test_reset_obs_has_question | Observation contains question text | `reset()` | `len(obs.question) > 0` | happy |
| test_reset_obs_has_table_names | schema_info contains table names | `reset()` | schema_info mentions at least one table | happy |
| test_reset_obs_no_columns | schema_info does NOT reveal columns initially | `reset()` | No column names in schema_info | happy |
| test_reset_obs_done_false | done is False | `reset()` | `obs.done == False` | happy |
| test_reset_obs_reward_none | reward is None | `reset()` | `obs.reward is None` | happy |
| test_reset_obs_step_count_zero | step_count is 0 | `reset()` | `obs.step_count == 0` | happy |
| test_reset_obs_budget_full | budget_remaining equals configured budget | `reset()` | `obs.budget_remaining == 15` | happy |
| test_reset_obs_empty_history | action_history is empty | `reset()` | `obs.action_history == []` | happy |
| test_reset_obs_empty_error | error is empty string | `reset()` | `obs.error == ""` | happy |
| test_reset_seed_determinism | Same seed yields same question | `reset(seed=42)` twice | Same question both times | happy |
| test_reset_different_seeds | Different seeds can yield different questions | `reset(seed=1)` vs `reset(seed=999)` | Not necessarily same question (probabilistic) | happy |
| test_reset_no_seed_random | Without seed, question is randomly selected | `reset()` multiple times | At least one different question (probabilistic) | happy |
| test_reset_episode_id_set | Custom episode_id is reflected | `reset(episode_id="ep-123")` | Context has episode_id="ep-123" | happy |
| test_reset_missing_db_file | Question references nonexistent DB | Question with bad db_name | `FileNotFoundError` | error |
| test_reset_clears_previous_episode | Calling reset mid-episode starts fresh | reset, step, reset | Second reset has step_count=0, empty history | happy |

### 1.7 SQLEnvironment.step

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_step_returns_observation | Returns SQLObservation | Any valid action | `isinstance(result, SQLObservation)` | happy |
| test_step_never_raises | step never raises exceptions | Various invalid inputs | Returns obs with error field, no exception | happy |
| test_step_increments_step_count | step_count increases by 1 | DESCRIBE action | `obs.step_count == 1` | happy |
| test_step_decrements_budget | budget_remaining decreases by 1 | DESCRIBE/SAMPLE/QUERY | `obs.budget_remaining == 14` | happy |
| test_step_answer_no_budget_decrement | ANSWER does not decrement budget | ANSWER action | budget unchanged from before | happy |
| test_step_appends_action_history | action_history grows each step | Two DESCRIBE actions | `len(obs.action_history) == 2` | happy |
| test_step_invalid_action_type | Unknown action_type returns error | `action_type="UNKNOWN"` | `obs.error` contains "Unknown action type" and lists valid types | error |
| test_step_empty_argument | Empty argument returns error | `argument=""` | `obs.error` contains "cannot be empty" | error |
| test_step_whitespace_argument | Whitespace-only argument returns error | `argument="   "` | `obs.error` contains "cannot be empty" | error |

### 1.8 SQLEnvironment._execute_sql

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_exec_valid_select | Executes a valid SELECT | `"SELECT 1"` | Returns list with results | happy |
| test_exec_non_select_rejected | Non-SELECT is rejected | `"DROP TABLE x"` | `ValueError` raised | error |
| test_exec_insert_rejected | INSERT is rejected | `"INSERT INTO t VALUES(1)"` | `ValueError` raised | error |
| test_exec_update_rejected | UPDATE is rejected | `"UPDATE t SET x=1"` | `ValueError` raised | error |
| test_exec_delete_rejected | DELETE is rejected | `"DELETE FROM t"` | `ValueError` raised | error |
| test_exec_create_rejected | CREATE is rejected | `"CREATE TABLE t(x INT)"` | `ValueError` raised | error |
| test_exec_semicolon_multi | Multiple statements rejected | `"SELECT 1; DROP TABLE t"` | Error or only first executed safely | error |
| test_exec_syntax_error | Malformed SQL | `"SELCET * FORM t"` | `sqlite3.OperationalError` | error |
| test_exec_timeout | Query exceeding timeout is interrupted | Long-running query (e.g., cartesian join) | Error raised within ~5 seconds | error |
| test_exec_result_truncation | Results truncated to 20 rows | Query returning 100 rows | `len(result) <= 20` | edge |
| test_exec_empty_result | Query returning no rows | `"SELECT * FROM t WHERE 1=0"` | Empty list `[]` | edge |
| test_exec_read_only | Connection is truly read-only | Attempt write via raw SQL | Error raised | error |
| test_exec_case_insensitive_select | `select` (lowercase) is accepted | `"select 1"` | Returns results | edge |
| test_exec_select_with_leading_whitespace | Leading whitespace before SELECT | `"  SELECT 1"` | Returns results | edge |
| test_exec_select_with_comment | SQL comment before SELECT | `"-- comment\nSELECT 1"` | Handled correctly (accepted or rejected consistently) | edge |

### 1.9 SQLEnvironment._handle_describe

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_describe_valid_table | Known table returns schema info | Existing table name | Result contains column names, types, row count | happy |
| test_describe_unknown_table | Unknown table returns error with list | `"nonexistent_table"` | Error containing "not found" and listing available tables | error |
| test_describe_case_sensitivity | Table name matching case behavior | Mixed case table name | Consistent behavior (either matches or provides helpful error) | edge |
| test_describe_updates_described_set | Described table is tracked | DESCRIBE a table | Table appears in described_tables set | happy |
| test_describe_repeated_table | Describing same table twice works | DESCRIBE same table twice | No error, returns same info | edge |

### 1.10 SQLEnvironment._handle_sample

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_sample_valid_table | Known table returns sample rows | Existing table name | Formatted rows returned | happy |
| test_sample_unknown_table | Unknown table returns error | `"nonexistent_table"` | Error with available table list | error |
| test_sample_default_limit | Default limit is 5 rows | Omit limit parameter | At most 5 rows returned | happy |
| test_sample_empty_table | Table with no rows | Empty table | Returns empty or "no rows" result | edge |

### 1.11 SQLEnvironment._handle_query

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_query_valid_select | Valid SELECT returns formatted results | `"SELECT * FROM t LIMIT 3"` | Formatted result string | happy |
| test_query_non_select | Non-SELECT returns error | `"DROP TABLE t"` | Error about SELECT-only | error |
| test_query_syntax_error | Bad SQL returns error | `"SELCET *"` | Error with sqlite3 message | error |
| test_query_timeout | Slow query returns timeout error | Expensive query | Error mentioning "timed out" and "5.0 seconds" | error |
| test_query_truncation_indicator | >20 rows shows truncation notice | Query returning many rows | Result indicates truncation occurred | edge |
| test_query_exactly_20_rows | Exactly 20 rows shows no truncation | Query returning 20 rows | No truncation indicator | edge |

### 1.12 SQLEnvironment._handle_answer

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_answer_correct | Correct answer yields reward 1.0 | Gold answer value | `(True, 1.0)` | happy |
| test_answer_incorrect | Incorrect answer yields reward 0.0 | Wrong value | `(False, 0.0)` | happy |
| test_answer_case_insensitive | Case-insensitive comparison | `"PARIS"` vs gold `"paris"` | `(True, 1.0)` | edge |
| test_answer_whitespace_stripped | Leading/trailing whitespace stripped | `"  42  "` vs gold `"42"` | `(True, 1.0)` | edge |
| test_answer_sets_done | Episode is marked done after answer | Any answer | `done == True` | happy |
| test_answer_empty_string | Empty answer handled | `""` | `(False, 0.0)` | edge |

### 1.13 SQLEnvironment._load_questions

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_load_valid_json | Valid questions JSON parsed | Path to valid file | `list[QuestionRecord]` with correct count | happy |
| test_load_file_not_found | Missing file raises error | `"/nonexistent.json"` | `FileNotFoundError` | error |
| test_load_invalid_json | Malformed JSON raises error | Path to `"{bad"` | `ValueError` | error |
| test_load_missing_fields | Record missing required field | JSON with missing `gold_sql` | `ValueError` or `KeyError` | error |

### 1.14 SQLEnvironment._open_db

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_open_valid_db | Opens existing database | Valid db_name | `sqlite3.Connection` returned | happy |
| test_open_missing_db | Missing database raises error | `"nonexistent_db"` | `FileNotFoundError` | error |
| test_open_read_only | Connection is read-only | Valid db_name | Write operations fail | happy |

### 1.15 SQLEnvironment._build_observation

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_build_obs_reflects_context | Observation matches episode context | Various context states | Fields match context values | happy |
| test_build_obs_schema_progressive | schema_info includes described columns | After DESCRIBE action | schema_info contains column details for described tables | happy |
| test_build_obs_schema_initial | Initial schema_info has only table names | Before any DESCRIBE | Only table names, no columns | happy |

**Run:** `pytest tests/unit/test_sql_environment.py -v`

---

## 2. Integration Tests

### Flow: Full Episode Lifecycle (Reset-Explore-Answer)

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | `reset(seed=42)` | Observation with question, table names, step=0, budget=15 | Assert all initial observation fields |
| 2 | `step(DESCRIBE, "employees")` | Column info in result, step=1, budget=14 | Assert result contains column names, step incremented |
| 3 | `step(SAMPLE, "employees")` | Sample rows in result, step=2, budget=13 | Assert formatted rows in result |
| 4 | `step(QUERY, "SELECT COUNT(*) FROM employees")` | Count result, step=3, budget=12 | Assert numeric result in output |
| 5 | `step(ANSWER, "<correct_value>")` | done=True, reward=1.0 | Assert terminal state with positive reward |
| 6 | Verify no more steps accepted | Post-episode step attempt handled gracefully | Assert error or episode-over signal |

### Flow: Budget Exhaustion

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | `reset()` with budget=3 | Observation with budget=3 | Assert budget_remaining==3 |
| 2 | `step(DESCRIBE, "t1")` | budget=2 | Assert budget_remaining==2 |
| 3 | `step(DESCRIBE, "t2")` | budget=1 | Assert budget_remaining==1 |
| 4 | `step(DESCRIBE, "t3")` | budget=0, done=True, reward=0.0 | Assert terminal state, zero reward |

### Flow: Error Recovery Within Episode

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | `reset()` | Normal initial observation | Assert done==False |
| 2 | `step(QUERY, "DROP TABLE x")` | Error in obs.error, done=False | Assert error message, episode continues |
| 3 | `step(QUERY, "SELCET * FORM t")` | Error in obs.error, done=False | Assert SQL error message, episode continues |
| 4 | `step(DESCRIBE, "nonexistent")` | Error in obs.error, done=False | Assert "not found" error, episode continues |
| 5 | `step(QUERY, "SELECT 1")` | Success result, no error | Assert error=="" and result has data |

### Flow: Progressive Schema Discovery

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | `reset()` | schema_info has table names only | Assert table names present, no column details |
| 2 | `step(DESCRIBE, "table_a")` | schema_info now includes table_a columns | Assert column names for table_a appear |
| 3 | `step(DESCRIBE, "table_b")` | schema_info includes both table_a and table_b columns | Assert both tables' columns present |

### Flow: Seed Determinism

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | `reset(seed=42)` | Observation A | Record question and db |
| 2 | `reset(seed=42)` | Observation B identical to A | Assert same question_text and schema_info |

**Run:** `pytest tests/integration/test_episode_lifecycle.py -v`

---

## 3. API Tests

### POST /reset

| Test | Request | Status | Response | Category |
|------|---------|--------|----------|----------|
| test_reset_no_params | `{}` | 200 | SQLObservation with done=False, budget=15 | happy |
| test_reset_with_seed | `{"seed": 42}` | 200 | Deterministic SQLObservation | happy |
| test_reset_with_episode_id | `{"episode_id": "ep-1"}` | 200 | SQLObservation (episode_id reflected in context) | happy |
| test_reset_null_seed | `{"seed": null}` | 200 | Random question selected | happy |
| test_reset_server_error_missing_db | (DB file removed) | 500 | Error response indicating database not found | error |
| test_reset_response_schema | `{}` | 200 | Response contains all SQLObservation fields | happy |

### POST /step

| Test | Request | Status | Response | Category |
|------|---------|--------|----------|----------|
| test_step_describe | `{"action_type": "DESCRIBE", "argument": "employees"}` | 200 | SQLObservation with schema result | happy |
| test_step_sample | `{"action_type": "SAMPLE", "argument": "employees"}` | 200 | SQLObservation with sample rows | happy |
| test_step_query | `{"action_type": "QUERY", "argument": "SELECT 1"}` | 200 | SQLObservation with query result | happy |
| test_step_answer | `{"action_type": "ANSWER", "argument": "42"}` | 200 | SQLObservation with done=True | happy |
| test_step_missing_action_type | `{"argument": "x"}` | 422 | Validation error | error |
| test_step_missing_argument | `{"action_type": "QUERY"}` | 422 | Validation error | error |
| test_step_empty_body | `{}` | 422 | Validation error | error |
| test_step_invalid_action_type | `{"action_type": "HACK", "argument": "x"}` | 200 | SQLObservation with error field set | error |
| test_step_without_reset | Call step before reset | 200 or 500 | Graceful error (not a crash) | error |

**Run:** `pytest tests/api/test_endpoints.py -v`

---

## 4. E2E Tests

### Scenario: Agent Solves Easy Question

**Setup:** Environment initialized with Spider questions and databases. Select an easy question with a known answer.
**Actions:**
1. POST /reset with seed that selects an easy question
2. POST /step with DESCRIBE on relevant table
3. POST /step with SAMPLE on relevant table
4. POST /step with QUERY using correct SQL
5. POST /step with ANSWER providing correct value
**Expected:** reward=1.0, done=True, total steps <= 15

### Scenario: Agent Exhausts Budget

**Setup:** Environment initialized with budget=3.
**Actions:**
1. POST /reset
2. POST /step with DESCRIBE x3
**Expected:** After 3rd step: done=True, reward=0.0, budget_remaining=0

### Scenario: Agent Submits Wrong Answer

**Setup:** Environment initialized normally.
**Actions:**
1. POST /reset
2. POST /step with ANSWER providing deliberately wrong value
**Expected:** done=True, reward=0.0

### Scenario: Agent Recovers From Errors

**Setup:** Environment initialized normally.
**Actions:**
1. POST /reset
2. POST /step with QUERY "DROP TABLE x" (rejected)
3. POST /step with DESCRIBE "nonexistent_table" (error)
4. POST /step with valid QUERY
5. POST /step with correct ANSWER
**Expected:** Error steps counted against budget, valid steps succeed, episode completes with reward=1.0

**Run:** `pytest tests/e2e/test_full_episodes.py -v`

---

## 5. Edge Cases Checklist

- [ ] Null/None seed and episode_id in reset
- [ ] Empty string argument for all action types
- [ ] Whitespace-only argument for all action types
- [ ] Very long SQL query (10KB+)
- [ ] SQL with unicode characters in string literals
- [ ] SQL with special characters (backticks, double quotes, brackets)
- [ ] Table name with special characters or spaces
- [ ] ANSWER with numeric value as string vs integer
- [ ] ANSWER with leading zeros ("042" vs "42")
- [ ] ANSWER with trailing decimal ("42.0" vs "42")
- [ ] Budget of 1 (single step before exhaustion)
- [ ] Budget of 0 (should episode immediately end?)
- [ ] Calling step after episode is done
- [ ] Calling reset multiple times without stepping
- [ ] Concurrent episodes (if supported)
- [ ] Database with zero tables
- [ ] Database with very large tables (performance)
- [ ] Question whose gold_sql returns empty result
- [ ] SELECT with subqueries, CTEs, UNION
- [ ] SQL injection attempts via argument field (`"; DROP TABLE--`)
- [ ] Very long table name in DESCRIBE/SAMPLE

---

## 6. Evidence Requirements

| Category | Evidence Type | Example |
|----------|---------------|---------|
| Unit tests | pytest output | `45 passed` |
| Integration | pytest output | `5 passed` |
| API tests | pytest output or httpx test client | `9 passed` |
| E2E | pytest output with real Spider DB | `4 passed` |
| SQL sandbox | Demonstrate write rejection | `ValueError: Only SELECT queries are allowed` |
| Timeout | Demonstrate query interruption | `Query timed out after 5.0 seconds` |
| Budget | Demonstrate forced termination | `done=True, reward=0.0 at budget=0` |
| Answer comparison | Demonstrate case-insensitive match | `"PARIS" == "paris" -> reward=1.0` |