# Verification Specification

**Feature:** F003
**Generated from:** specs/F003-VERIFICATION_INPUT.json
**Generated:** 2026-03-27

---

## 1. Unit Tests

### EpisodeContext (Type Extension)

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_episode_context_has_gold_rows | New field exists and defaults | `EpisodeContext(...)` | `gold_rows` is `[]` | happy |
| test_episode_context_has_query_hashes | New field exists and defaults | `EpisodeContext(...)` | `query_hashes` is `set()` | happy |
| test_episode_context_has_best_progress | New field exists and defaults | `EpisodeContext(...)` | `best_progress` is `0.0` | happy |
| test_episode_context_has_cumulative_step_reward | New field exists and defaults | `EpisodeContext(...)` | `cumulative_step_reward` is `0.0` | happy |
| test_episode_context_has_cumulative_new_info_reward | New field exists and defaults | `EpisodeContext(...)` | `cumulative_new_info_reward` is `0.0` | happy |
| test_episode_context_gold_rows_accepts_tuples | Field stores tuple list | `gold_rows=[(1, "a"), (2, "b")]` | Stored correctly | happy |

**Run:** `uv run pytest tests/unit/test_reward.py -v -k "EpisodeContext"`

---

### _cardinality_score

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_cardinality_exact_match | Same row count | `pred=[(1,),(2,)], gold=[(3,),(4,)]` | `1.0` | happy |
| test_cardinality_zero_pred | Empty prediction | `pred=[], gold=[(1,)]` | `0.0` | edge |
| test_cardinality_zero_gold | Empty gold | `pred=[(1,)], gold=[]` | `0.0` | edge |
| test_cardinality_both_empty | Both empty | `pred=[], gold=[]` | `1.0` (0/max(0,0,1)=0, 1-0=1) | edge |
| test_cardinality_pred_larger | More pred rows | `pred=[(i,) for i in range(10)], gold=[(1,)]` | `0.1` (1-9/10) | boundary |
| test_cardinality_gold_larger | More gold rows | `pred=[(1,)], gold=[(i,) for i in range(4)]` | `0.25` (1-3/4) | boundary |
| test_cardinality_returns_float_in_range | Any input | Various | Result in `[0.0, 1.0]` | invariant |

**Run:** `uv run pytest tests/unit/test_reward.py -v -k "cardinality"`

---

### _value_overlap_score

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_value_overlap_identical | Same rows | `pred=[(1,"a")], gold=[(1,"a")]` | `1.0` | happy |
| test_value_overlap_disjoint | No shared values | `pred=[(1,"x")], gold=[(2,"y")]` | `0.0` | edge |
| test_value_overlap_partial | Some overlap | `pred=[(1,"a"),(2,"b")], gold=[(1,"a"),(3,"c")]` | Jaccard of `{"1","a","2","b"}` vs `{"1","a","3","c"}` = 2/6 ~ 0.333 | happy |
| test_value_overlap_empty_pred | No pred rows | `pred=[], gold=[(1,)]` | `0.0` | edge |
| test_value_overlap_empty_gold | No gold rows | `pred=[(1,)], gold=[]` | `0.0` | edge |
| test_value_overlap_both_empty | Both empty | `pred=[], gold=[]` | `0.0` (empty Jaccard) or `1.0` (convention); the test must pin whichever the implementation chooses | edge |
| test_value_overlap_stringifies_values | Mixed types | `pred=[(1, 2.5, None)], gold=[(1, 2.5, None)]` | `1.0` (all stringify to same) | edge |
| test_value_overlap_returns_float_in_range | Any input | Various | Result in `[0.0, 1.0]` | invariant |

**Run:** `uv run pytest tests/unit/test_reward.py -v -k "value_overlap"`

---

### _numeric_range_score

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_numeric_range_identical | Same numbers | `pred=[(10,)], gold=[(10,)]` | `1.0` | happy |
| test_numeric_range_no_numerics_in_gold | Only strings in gold | `pred=[("a",)], gold=[("b",)]` | `1.0` (spec: returns 1.0 if no numerics in gold) | edge |
| test_numeric_range_close_values | Near match | `pred=[(11,)], gold=[(10,)]` | `1/(1+log(1+1)) ~ 0.59` | happy |
| test_numeric_range_far_values | Very different | `pred=[(1000000,)], gold=[(1,)]` | Near 0.0 | boundary |
| test_numeric_range_zero_distance | Exact match numerics | `pred=[(0,)], gold=[(0,)]` | `1.0` (1/(1+log(1+0))=1) | edge |
| test_numeric_range_negative_numbers | Negative values | `pred=[(-5,)], gold=[(5,)]` | Uses absolute difference `\|(-5)-5\|=10` | edge |
| test_numeric_range_mixed_types | Some numeric some not | `pred=[(10,"a")], gold=[(10,"b")]` | Score based only on numeric columns | edge |
| test_numeric_range_empty_pred | No pred rows | `pred=[], gold=[(1,)]` | Handled gracefully; expected `0.0` | edge |
| test_numeric_range_returns_float_in_range | Any input | Various | Result in `[0.0, 1.0]` | invariant |

**Run:** `uv run pytest tests/unit/test_reward.py -v -k "numeric_range"`

---

### _bin_progress

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_bin_progress_zero | Score 0.0 | `0.0` | `0.0` (below 0.125) | boundary |
| test_bin_progress_low | Score 0.124 | `0.124` | `0.0` | boundary |
| test_bin_progress_boundary_0125 | Score exactly 0.125 | `0.125` | `0.25` | boundary |
| test_bin_progress_mid_low | Score 0.3 | `0.3` | `0.25` (between 0.125 and 0.375) | happy |
| test_bin_progress_boundary_0375 | Score exactly 0.375 | `0.375` | `0.5` | boundary |
| test_bin_progress_mid | Score 0.5 | `0.5` | `0.5` (between 0.375 and 0.625) | happy |
| test_bin_progress_boundary_0625 | Score exactly 0.625 | `0.625` | `0.75` | boundary |
| test_bin_progress_mid_high | Score 0.7 | `0.7` | `0.75` | happy |
| test_bin_progress_boundary_0875 | Score exactly 0.875 | `0.875` | `1.0` | boundary |
| test_bin_progress_one | Score 1.0 | `1.0` | `1.0` | boundary |

**Run:** `uv run pytest tests/unit/test_reward.py -v -k "bin_progress"`

---

### _layer1_operational

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_layer1_successful_query | exec_ok + step_cost | `action_type="QUERY", rows=[(1,)], error=None, new sql` | `+0.02 - 0.005 = +0.015` (plus possible new_info) | happy |
| test_layer1_successful_describe | exec_ok + step_cost | `action_type="DESCRIBE", rows=..., error=None` | `+0.02 - 0.005 = +0.015` | happy |
| test_layer1_successful_sample | exec_ok + step_cost | `action_type="SAMPLE", rows=..., error=None` | `+0.02 - 0.005 = +0.015` | happy |
| test_layer1_error_query | step_cost only | `error="some error", rows=None` | `-0.005` | error |
| test_layer1_new_info_reward | First unique SQL | `new sql hash, rows not None` | Includes `+0.01` new_info | happy |
| test_layer1_new_info_capped | Cap at 0.10 | Execute 11+ unique queries | `cumulative_new_info_reward` does not exceed `0.10` | boundary |
| test_layer1_repeat_penalty | Same SQL twice | Submit same SQL hash twice | Second call includes `-0.01` repeat | error |
| test_layer1_repeat_no_exec_ok | Repeated query skips exec_ok | Same SQL hash as before | No `+0.02` bonus | edge |
| test_layer1_step_cost_always_applied | Step cost on every call | Any action | Always includes `-0.005` | invariant |

**Run:** `uv run pytest tests/unit/test_reward.py -v -k "layer1"`

---

### _layer2_progress

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_layer2_perfect_match | All sub-metrics = 1.0 | `rows == gold_rows` (exact match) | Binned 1.0, improvement from 0 = 1.0, scaled by 0.15 = `0.15` | happy |
| test_layer2_no_improvement | Same binned score as best | Second identical query | `0.0` (no improvement over best_progress) | edge |
| test_layer2_improvement_only | New bin > best | First query close, second closer | Reward = `(new_bin - best_progress) * 0.15` | happy |
| test_layer2_empty_gold_rows | Gold is empty | `ctx.gold_rows = []` | `0.0` | edge |
| test_layer2_weighted_average | Check weight formula | Known sub-metric values | `0.25*card + 0.50*overlap + 0.25*numeric` | happy |
| test_layer2_updates_best_progress | Mutates ctx | Query improves progress | `ctx.best_progress` updated to new bin | happy |
| test_layer2_does_not_downgrade_best | Worse query after good | Good query then bad query | `ctx.best_progress` stays at higher value | edge |

**Run:** `uv run pytest tests/unit/test_reward.py -v -k "layer2"`

---

### compute_step_reward

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_compute_reward_query_success | Layer 1 + Layer 2 combined | QUERY with valid rows, gold_rows set | Sum of L1 + L2, clamped | happy |
| test_compute_reward_query_error | Layer 1 only, no Layer 2 | QUERY with error | `-0.005` (step_cost only) | error |
| test_compute_reward_describe | Layer 1 only, no Layer 2 | DESCRIBE action | L1 signal only | happy |
| test_compute_reward_sample | Layer 1 only, no Layer 2 | SAMPLE action | L1 signal only | happy |
| test_compute_reward_clamp_upper | Cumulative capped at +0.5 | Many successful improving queries | Cumulative never exceeds `+0.5` | boundary |
| test_compute_reward_clamp_lower | Cumulative floored at -0.2 | Many errors in a row | Cumulative never goes below `-0.2` | boundary |
| test_compute_reward_clamp_returns_delta | Step reward reflects clamp | Cumulative at 0.49, next step would add 0.05 | Returns `0.01` (clamped to 0.5) | boundary |
| test_compute_reward_mutates_ctx | Updates tracking fields | Any call | `ctx.cumulative_step_reward` updated | happy |
| test_compute_reward_layer2_skipped_for_describe | No progress calc for non-QUERY | DESCRIBE with rows | Layer 2 not called | happy |
| test_compute_reward_layer2_skipped_when_rows_none | No progress calc on error | QUERY, rows=None | Layer 2 not called | edge |
| test_compute_reward_layer2_skipped_empty_gold | No progress with empty gold | QUERY, gold_rows=[] | Layer 2 returns 0.0 | edge |

**Run:** `uv run pytest tests/unit/test_reward.py -v -k "compute_step_reward"`

---

## 2. Integration Tests

### Flow: Primary Reward Computation Through step()

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | `env.reset(seed=42)` | Episode created, `gold_rows` populated from gold SQL | `ctx.gold_rows` is non-empty list of tuples |
| 2 | `env.step(DESCRIBE employees)` | Step reward from Layer 1 only | `observation.reward` is None (non-terminal), but internal reward tracked |
| 3 | `env.step(QUERY "SELECT COUNT(*) FROM employees")` | Layer 1 + Layer 2 computed | Progress score reflects cardinality/value/numeric comparison to gold |
| 4 | `env.step(QUERY same_sql_again)` | Repeat penalty applied | Lower reward than step 3 |
| 5 | `env.step(ANSWER correct_value)` | Terminal reward = 1.0 | `observation.done=True, observation.reward=1.0` |

**Run:** `uv run pytest tests/integration/test_reward_flow.py -v`

---

### Flow: SQL Error Handling

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | `env.reset(seed=42)` | Episode active | Episode context initialized |
| 2 | `env.step(QUERY "SELECT nonexistent FROM employees")` | Error caught, step_cost only | Reward is `-0.005`, Layer 2 not computed |
| 3 | `env.step(QUERY valid_query)` | Normal reward resumes | Layer 1 + Layer 2 computed normally |

**Run:** `uv run pytest tests/integration/test_reward_flow.py -v -k "error"`

---

### Flow: Empty Gold Rows

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Reset with question whose gold SQL returns empty | `ctx.gold_rows == []` | gold_rows stored as empty list |
| 2 | `env.step(QUERY any_query)` | Layer 1 operates, Layer 2 returns 0.0 | Reward is Layer 1 signal only |

**Run:** `uv run pytest tests/integration/test_reward_flow.py -v -k "empty_gold"`

---

### Flow: Repeated Query Detection

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | `env.reset(seed=42)` | Fresh episode | `ctx.query_hashes` is empty |
| 2 | `env.step(QUERY "SELECT 1")` | Hash added, no repeat penalty | `ctx.query_hashes` has 1 entry |
| 3 | `env.step(QUERY "SELECT 1")` | Same hash detected, repeat penalty | Reward includes `-0.01`, no exec_ok |
| 4 | `env.step(QUERY "SELECT 2")` | New hash, no repeat penalty | Normal reward, `ctx.query_hashes` has 2 entries |

**Run:** `uv run pytest tests/integration/test_reward_flow.py -v -k "repeat"`

---

## 3. API Tests

No API endpoints defined for F003. The reward system is internal server-side logic.

---

## 4. E2E Tests

### Scenario: Random Exploration Yields ~0.1 Cumulative Reward

**Setup:** Environment reset with a known question.
**Actions:** Execute 10 random DESCRIBE/SAMPLE/QUERY actions (no targeted queries).
**Expected:** Cumulative step reward is approximately 0.1 (within [0.0, 0.2]).

**Run:** `uv run pytest tests/e2e/test_reward_scenarios.py -v -k "random_exploration"`

---

### Scenario: Targeted Queries Yield ~0.3 Cumulative Reward

**Setup:** Environment reset with a known question.
**Actions:** Execute targeted queries that progressively approach the gold answer.
**Expected:** Cumulative step reward is approximately 0.3 (within [0.2, 0.5]).

**Run:** `uv run pytest tests/e2e/test_reward_scenarios.py -v -k "targeted_queries"`

---

### Scenario: Correct Answer Yields ~1.3 Total Reward

**Setup:** Environment reset with a known question.
**Actions:** Execute targeted queries, then ANSWER correctly.
**Expected:** Total reward (cumulative step + terminal 1.0) is approximately 1.3 (within [1.0, 1.5]).

**Run:** `uv run pytest tests/e2e/test_reward_scenarios.py -v -k "correct_answer"`

---

## 5. Edge Cases Checklist

- [ ] Null/None rows passed to compute_step_reward (SQL error case)
- [ ] Empty result rows from a valid query (e.g., `SELECT * FROM t WHERE 1=0`)
- [ ] Single-row gold vs multi-row prediction
- [ ] Multi-row gold vs single-row prediction
- [ ] Gold rows with only non-numeric values (numeric_range returns 1.0)
- [ ] Gold rows with mixed numeric and string columns
- [ ] Very large numeric values (boundary for log-distance formula)
- [ ] Negative numeric values in gold or prediction
- [ ] Float vs integer comparison in numeric range (e.g., `10` vs `10.0`)
- [ ] None/NULL values in result tuples (stringification for value_overlap)
- [ ] SQL strings that differ only by whitespace (hash should differ or normalize)
- [ ] Cumulative new_info exactly at cap (0.10) -- next unique query gets 0
- [ ] Cumulative step reward exactly at clamp boundary (-0.2 or +0.5)
- [ ] Layer 2 called with pred_rows and gold_rows of different column counts
- [ ] _bin_progress with values outside [0, 1] (e.g., negative or > 1.0 from rounding)
- [ ] Concurrent episodes (if supported) -- each has independent tracking fields

---

## 6. Evidence Requirements

| Category | Evidence Type | Example |
|----------|---------------|---------|
| Unit tests | pytest output | `uv run pytest tests/unit/test_reward.py -v` shows `X passed` |
| Integration | pytest output | `uv run pytest tests/integration/test_reward_flow.py -v` shows `X passed` |
| E2E | pytest output | `uv run pytest tests/e2e/test_reward_scenarios.py -v` shows `X passed` |
| Reward calibration | Logged values | Random exploration ~0.1, targeted ~0.3, correct ~1.3 |
| Existing tests | pytest output | `uv run pytest tests/test_smoke.py -v` still passes (no regressions) |
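The four progress helpers are pinned down only indirectly, through the expected values in the tables above. As a reading aid, here is a minimal Python sketch that reproduces those values. The function names come from the test tables (minus the leading underscore); the aggregation choices -- Jaccard over stringified cell values, and mean-of-numeric-cells distance for the log formula -- are assumptions inferred from the tabled outputs, not the confirmed implementation.

```python
import math

def cardinality_score(pred: list[tuple], gold: list[tuple]) -> float:
    """1 - |len(pred) - len(gold)| / max(len(pred), len(gold), 1)."""
    return 1.0 - abs(len(pred) - len(gold)) / max(len(pred), len(gold), 1)

def value_overlap_score(pred: list[tuple], gold: list[tuple]) -> float:
    """Jaccard similarity over stringified cell values (0.0 if either side is empty)."""
    pv = {str(v) for row in pred for v in row}
    gv = {str(v) for row in gold for v in row}
    if not pv or not gv:
        return 0.0
    return len(pv & gv) / len(pv | gv)

def numeric_range_score(pred: list[tuple], gold: list[tuple]) -> float:
    """Log-damped distance between numeric summaries; 1.0 when gold has no numerics.

    Comparing the means of all numeric cells is an assumption -- any per-column
    scheme that matches the tabled values would do equally well.
    """
    def nums(rows):
        return [v for row in rows for v in row
                if isinstance(v, (int, float)) and not isinstance(v, bool)]
    g = nums(gold)
    if not g:
        return 1.0  # spec: no numerics in gold
    p = nums(pred)
    if not p:
        return 0.0  # nothing numeric to compare against
    dist = abs(sum(p) / len(p) - sum(g) / len(g))
    return 1.0 / (1.0 + math.log1p(dist))  # log1p(d) == log(1 + d)

def bin_progress(score: float) -> float:
    """Snap a [0, 1] score to the nearest quarter; edges at 0.125/0.375/0.625/0.875 round up."""
    for edge, binned in ((0.125, 0.0), (0.375, 0.25), (0.625, 0.5), (0.875, 0.75)):
        if score < edge:
            return binned
    return 1.0

def progress_score(pred: list[tuple], gold: list[tuple]) -> float:
    """Weighted average from the _layer2_progress table."""
    return (0.25 * cardinality_score(pred, gold)
            + 0.50 * value_overlap_score(pred, gold)
            + 0.25 * numeric_range_score(pred, gold))
```

Spot-checking against the tables: `cardinality_score` needs no empty-input special case, because `1 - 1/max(1, 0, 1)` already yields `0.0` for one-sided emptiness and `1 - 0/1 = 1.0` when both sides are empty; `value_overlap_score([(1,"a"),(2,"b")], [(1,"a"),(3,"c")])` gives 2/6; and `numeric_range_score([(11,)], [(10,)])` gives `1/(1+log(2)) ≈ 0.59`.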
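The clamp cases in the compute_step_reward table (in particular test_compute_reward_clamp_returns_delta) imply that the reported step reward is the *post-clamp delta* of the cumulative total, not the raw sum of Layer-1 and Layer-2 terms. A short sketch of that arithmetic, using the constants from the _layer1_operational table; the helper name `clamped_step` is hypothetical:

```python
# Constants taken from the _layer1_operational and compute_step_reward tables.
EXEC_OK = 0.02       # successful QUERY/DESCRIBE/SAMPLE
STEP_COST = -0.005   # applied on every step
NEW_INFO = 0.01      # first occurrence of a SQL hash (episode cap: 0.10)
REPEAT = -0.01       # same SQL hash submitted again
CLAMP_LO, CLAMP_HI = -0.2, 0.5  # bounds on the cumulative step reward

def clamped_step(cumulative: float, raw_step: float) -> tuple[float, float]:
    """Return (reported step reward, new cumulative total).

    If the raw step would push the cumulative past a bound, only the
    remaining headroom is paid out as this step's reward.
    """
    new_total = min(CLAMP_HI, max(CLAMP_LO, cumulative + raw_step))
    return new_total - cumulative, new_total

# The clamp_returns_delta case: at 0.49 cumulative, a raw +0.05 pays out only +0.01.
delta, total = clamped_step(0.49, 0.05)
```

With these constants, a successful first-time query earns `EXEC_OK + STEP_COST = +0.015` (plus `NEW_INFO` while under the cap), matching the `+0.02 - 0.005 = +0.015` rows above.
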