# Verification Specification

Feature: F003
Generated from: specs/F003-VERIFICATION_INPUT.json
Generated: 2026-03-27
## 1. Unit Tests

### EpisodeContext (Type Extension)

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_episode_context_has_gold_rows | New field exists and defaults | EpisodeContext(...) | gold_rows is [] | happy |
| test_episode_context_has_query_hashes | New field exists and defaults | EpisodeContext(...) | query_hashes is set() | happy |
| test_episode_context_has_best_progress | New field exists and defaults | EpisodeContext(...) | best_progress is 0.0 | happy |
| test_episode_context_has_cumulative_step_reward | New field exists and defaults | EpisodeContext(...) | cumulative_step_reward is 0.0 | happy |
| test_episode_context_has_cumulative_new_info_reward | New field exists and defaults | EpisodeContext(...) | cumulative_new_info_reward is 0.0 | happy |
| test_episode_context_gold_rows_accepts_tuples | Field stores tuple list | gold_rows=[(1, "a"), (2, "b")] | Stored correctly | happy |
Run: uv run pytest tests/unit/test_reward.py -v -k "EpisodeContext"
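The field names and defaults in the table can be sketched as a dataclass extension; the dataclass form itself is an assumption (only the five F003 fields and their default values come from the tests):

```python
from dataclasses import dataclass, field


@dataclass
class EpisodeContext:
    """Hypothetical shape of the extended episode context (F003 fields only)."""
    gold_rows: list = field(default_factory=list)    # rows from executing the gold SQL
    query_hashes: set = field(default_factory=set)   # hashes of SQL seen this episode
    best_progress: float = 0.0                       # best binned Layer 2 score so far
    cumulative_step_reward: float = 0.0              # running sum of step rewards
    cumulative_new_info_reward: float = 0.0          # new-info bonus paid out so far
```

Mutable defaults must go through `field(default_factory=...)` so each episode gets its own list and set, which is what the default-value tests above exercise.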
### `_cardinality_score`

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_cardinality_exact_match | Same row count | pred=[(1,),(2,)], gold=[(3,),(4,)] | 1.0 | happy |
| test_cardinality_zero_pred | Empty prediction | pred=[], gold=[(1,)] | 0.0 | edge |
| test_cardinality_zero_gold | Empty gold | pred=[(1,)], gold=[] | 0.0 | edge |
| test_cardinality_both_empty | Both empty | pred=[], gold=[] | 1.0 (0/max(0,0,1)=0, 1-0=1) | edge |
| test_cardinality_pred_larger | More pred rows | pred=[(i,) for i in range(10)], gold=[(1,)] | 0.1 (1-9/10) | boundary |
| test_cardinality_gold_larger | More gold rows | pred=[(1,)], gold=[(i,) for i in range(4)] | 0.25 (1-3/4) | boundary |
| test_cardinality_returns_float_in_range | Any input | Various | Result in [0.0, 1.0] | invariant |
Run: uv run pytest tests/unit/test_reward.py -v -k "cardinality"
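The expected values above all follow from `1 - |len(pred) - len(gold)| / max(len(pred), len(gold), 1)`; a minimal sketch (the function name and signature are assumptions):

```python
def cardinality_score(pred_rows: list, gold_rows: list) -> float:
    """Row-count similarity: 1 - |p - g| / max(p, g, 1).

    The max(..., 1) guard makes the both-empty case score 1.0 instead of
    dividing by zero, matching test_cardinality_both_empty.
    """
    p, g = len(pred_rows), len(gold_rows)
    return 1.0 - abs(p - g) / max(p, g, 1)


print(cardinality_score([], []))  # 1.0
```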
### `_value_overlap_score`

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_value_overlap_identical | Same rows | pred=[(1,"a")], gold=[(1,"a")] | 1.0 | happy |
| test_value_overlap_disjoint | No shared values | pred=[(1,"x")], gold=[(2,"y")] | 0.0 | edge |
| test_value_overlap_partial | Some overlap | pred=[(1,"a"),(2,"b")], gold=[(1,"a"),(3,"c")] | Jaccard of {"1","a","2","b"} vs {"1","a","3","c"} = 2/6 ≈ 0.333 | happy |
| test_value_overlap_empty_pred | No pred rows | pred=[], gold=[(1,)] | 0.0 | edge |
| test_value_overlap_empty_gold | No gold rows | pred=[(1,)], gold=[] | 0.0 | edge |
| test_value_overlap_both_empty | Both empty | pred=[], gold=[] | 0.0 (empty Jaccard) or 1.0 (convention) | edge |
| test_value_overlap_stringifies_values | Mixed types | pred=[(1, 2.5, None)], gold=[(1, 2.5, None)] | 1.0 (all stringify to same) | edge |
| test_value_overlap_returns_float_in_range | Any input | Various | Result in [0.0, 1.0] | invariant |
Run: uv run pytest tests/unit/test_reward.py -v -k "value_overlap"
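The expected values are consistent with Jaccard similarity over the sets of stringified cell values. A sketch; the both-empty case here picks the 0.0 convention, which is one of the two options the table allows:

```python
def value_overlap_score(pred_rows: list, gold_rows: list) -> float:
    """Jaccard similarity of stringified cell values across all rows."""
    pred_vals = {str(v) for row in pred_rows for v in row}
    gold_vals = {str(v) for row in gold_rows for v in row}
    if not pred_vals or not gold_vals:
        return 0.0  # assumed convention: empty side scores 0.0
    return len(pred_vals & gold_vals) / len(pred_vals | gold_vals)
```

Stringifying before set-building is what makes `(1, 2.5, None)` compare equal to itself across connections that might return differently typed cells.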
### `_numeric_range_score`

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_numeric_range_identical | Same numbers | pred=[(10,)], gold=[(10,)] | 1.0 | happy |
| test_numeric_range_no_numerics_in_gold | Only strings in gold | pred=[("a",)], gold=[("b",)] | 1.0 (spec: returns 1.0 if no numerics in gold) | edge |
| test_numeric_range_close_values | Near match | pred=[(11,)], gold=[(10,)] | 1/(1+log(1+1)) ≈ 0.59 | happy |
| test_numeric_range_far_values | Very different | pred=[(1000000,)], gold=[(1,)] | Near 0.0 | boundary |
| test_numeric_range_zero_distance | Exact match numerics | pred=[(0,)], gold=[(0,)] | 1.0 (1/(1+log(1+0))=1) | edge |
| test_numeric_range_negative_numbers | Negative values | pred=[(-5,)], gold=[(5,)] | Uses absolute difference \|(-5)-5\| = 10 | edge |
| test_numeric_range_mixed_types | Some numeric some not | pred=[(10,"a")], gold=[(10,"b")] | Score based only on numeric columns | edge |
| test_numeric_range_empty_pred | No pred rows | pred=[], gold=[(1,)] | Handled gracefully, likely 0.0 | edge |
| test_numeric_range_returns_float_in_range | Any input | Various | Result in [0.0, 1.0] | invariant |
Run: uv run pytest tests/unit/test_reward.py -v -k "numeric_range"
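A sketch of the log-distance scoring the table implies. Only the `1/(1+log(1+|diff|))` form and the no-numerics-in-gold rule come from the spec; pairing flattened numeric cells in order and averaging them is an assumption about the aggregation strategy:

```python
import math


def numeric_range_score(pred_rows: list, gold_rows: list) -> float:
    """Log-distance score 1 / (1 + log(1 + |pred - gold|)) over numeric cells."""
    def numerics(rows):
        # bool is a subclass of int, so exclude it explicitly
        return [v for row in rows for v in row
                if isinstance(v, (int, float)) and not isinstance(v, bool)]

    gold_nums = numerics(gold_rows)
    if not gold_nums:
        return 1.0  # spec: no numerics in gold -> 1.0
    pred_nums = numerics(pred_rows)
    if not pred_nums:
        return 0.0  # nothing to compare against; assumed convention
    scores = [1.0 / (1.0 + math.log(1.0 + abs(p - g)))
              for p, g in zip(pred_nums, gold_nums)]
    return sum(scores) / len(scores)
```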
### `_bin_progress`

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_bin_progress_zero | Score 0.0 | 0.0 | 0.0 (below 0.125) | boundary |
| test_bin_progress_low | Score 0.124 | 0.124 | 0.0 | boundary |
| test_bin_progress_boundary_0125 | Score exactly 0.125 | 0.125 | 0.25 | boundary |
| test_bin_progress_mid_low | Score 0.3 | 0.3 | 0.25 (between 0.125 and 0.375) | happy |
| test_bin_progress_boundary_0375 | Score exactly 0.375 | 0.375 | 0.5 | boundary |
| test_bin_progress_mid | Score 0.5 | 0.5 | 0.5 (between 0.375 and 0.625) | happy |
| test_bin_progress_boundary_0625 | Score exactly 0.625 | 0.625 | 0.75 | boundary |
| test_bin_progress_mid_high | Score 0.7 | 0.7 | 0.75 | happy |
| test_bin_progress_boundary_0875 | Score exactly 0.875 | 0.875 | 1.0 | boundary |
| test_bin_progress_one | Score 1.0 | 1.0 | 1.0 | boundary |
Run: uv run pytest tests/unit/test_reward.py -v -k "bin_progress"
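The boundary tests pin the binning down exactly: snap to the nearest quarter, with exact midpoints (0.125, 0.375, 0.625, 0.875) rounding up. Explicit thresholds avoid the banker's-rounding surprise of `round(score * 4) / 4`:

```python
def bin_progress(score: float) -> float:
    """Snap a raw progress score in [0, 1] to {0.0, 0.25, 0.5, 0.75, 1.0}."""
    if score < 0.125:
        return 0.0
    if score < 0.375:
        return 0.25
    if score < 0.625:
        return 0.5
    if score < 0.875:
        return 0.75
    return 1.0
```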
### `_layer1_operational`

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_layer1_successful_query | exec_ok + step_cost | action_type="QUERY", rows=[(1,)], error=None, new SQL | +0.02 - 0.005 = +0.015 (plus possible new_info) | happy |
| test_layer1_successful_describe | exec_ok + step_cost | action_type="DESCRIBE", rows=..., error=None | +0.02 - 0.005 = +0.015 | happy |
| test_layer1_successful_sample | exec_ok + step_cost | action_type="SAMPLE", rows=..., error=None | +0.02 - 0.005 = +0.015 | happy |
| test_layer1_error_query | step_cost only | error="some error", rows=None | -0.005 | error |
| test_layer1_new_info_reward | First unique SQL | new SQL hash, rows not None | Includes +0.01 new_info | happy |
| test_layer1_new_info_capped | Cap at 0.10 | Execute 11+ unique queries | cumulative_new_info_reward does not exceed 0.10 | boundary |
| test_layer1_repeat_penalty | Same SQL twice | Submit same SQL hash twice | Second call includes -0.01 repeat | error |
| test_layer1_repeat_no_exec_ok | Repeated query skips exec_ok | Same SQL hash as before | No +0.02 bonus | edge |
| test_layer1_step_cost_always_applied | Step cost on every call | Any action | Always includes -0.005 | invariant |
Run: uv run pytest tests/unit/test_reward.py -v -k "layer1"
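The Layer 1 constants and ordering above can be sketched as follows. The SHA-256 hashing, the `Ctx` stand-in, and adding errored queries' hashes to the seen-set are assumptions; the constants and the repeat-skips-exec_ok rule come from the tests:

```python
import hashlib

EXEC_OK = 0.02         # bonus for a successful, first-seen execution
STEP_COST = 0.005      # charged on every call
NEW_INFO = 0.01        # per-unique-query information bonus
NEW_INFO_CAP = 0.10    # per-episode cap on new-info bonuses
REPEAT_PENALTY = 0.01  # charged when the same SQL hash is seen again


class Ctx:
    """Minimal stand-in for the real EpisodeContext tracking fields."""
    def __init__(self):
        self.query_hashes = set()
        self.cumulative_new_info_reward = 0.0


def layer1_operational(ctx: Ctx, sql: str, rows, error) -> float:
    reward = -STEP_COST  # step cost is always applied
    h = hashlib.sha256(sql.encode()).hexdigest()
    if h in ctx.query_hashes:
        return reward - REPEAT_PENALTY  # repeat: penalty, no exec_ok, no new_info
    ctx.query_hashes.add(h)
    if error is None and rows is not None:
        reward += EXEC_OK
        bonus = min(NEW_INFO, NEW_INFO_CAP - ctx.cumulative_new_info_reward)
        if bonus > 0:
            ctx.cumulative_new_info_reward += bonus
            reward += bonus  # capped new-info bonus
    return reward
```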
### `_layer2_progress`

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_layer2_perfect_match | All sub-metrics = 1.0 | rows == gold_rows (exact match) | Binned 1.0, improvement from 0 = 1.0, scaled by 0.15 = 0.15 | happy |
| test_layer2_no_improvement | Same binned score as best | Second identical query | 0.0 (no improvement over best_progress) | edge |
| test_layer2_improvement_only | New bin > best | First query close, second closer | Reward = (new_bin - best_progress) * 0.15 | happy |
| test_layer2_empty_gold_rows | Gold is empty | ctx.gold_rows = [] | 0.0 | edge |
| test_layer2_weighted_average | Check weight formula | Known sub-metric values | 0.25*card + 0.50*overlap + 0.25*numeric | happy |
| test_layer2_updates_best_progress | Mutates ctx | Query improves progress | ctx.best_progress updated to new bin | happy |
| test_layer2_does_not_downgrade_best | Worse query after good | Good query then bad query | ctx.best_progress stays at higher value | edge |
Run: uv run pytest tests/unit/test_reward.py -v -k "layer2"
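The improvement-only payout can be sketched as below. The weights, the 0.15 scale, and the binning come from the tables; the `Ctx` stand-in and passing pre-computed sub-metric scores (rather than raw rows) are simplifications to keep the sketch self-contained:

```python
class Ctx:
    """Minimal stand-in for the EpisodeContext fields Layer 2 touches."""
    def __init__(self, gold_rows):
        self.gold_rows = gold_rows
        self.best_progress = 0.0


def _bin(score: float) -> float:
    # nearest-quarter binning, midpoints rounding up
    for upper, binned in ((0.125, 0.0), (0.375, 0.25), (0.625, 0.5), (0.875, 0.75)):
        if score < upper:
            return binned
    return 1.0


def layer2_progress(ctx: Ctx, card: float, overlap: float, numeric: float) -> float:
    if not ctx.gold_rows:
        return 0.0  # no gold rows: no progress signal
    binned = _bin(0.25 * card + 0.50 * overlap + 0.25 * numeric)
    if binned <= ctx.best_progress:
        return 0.0  # reward only improvements; never downgrade best_progress
    reward = (binned - ctx.best_progress) * 0.15
    ctx.best_progress = binned
    return reward
```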
### `compute_step_reward`

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_compute_reward_query_success | Layer 1 + Layer 2 combined | QUERY with valid rows, gold_rows set | Sum of L1 + L2, clamped | happy |
| test_compute_reward_query_error | Layer 1 only, no Layer 2 | QUERY with error | -0.005 (step_cost only) | error |
| test_compute_reward_describe | Layer 1 only, no Layer 2 | DESCRIBE action | L1 signal only | happy |
| test_compute_reward_sample | Layer 1 only, no Layer 2 | SAMPLE action | L1 signal only | happy |
| test_compute_reward_clamp_upper | Cumulative capped at +0.5 | Many successful improving queries | Cumulative never exceeds +0.5 | boundary |
| test_compute_reward_clamp_lower | Cumulative floored at -0.2 | Many errors in a row | Cumulative never goes below -0.2 | boundary |
| test_compute_reward_clamp_returns_delta | Step reward reflects clamp | Cumulative at 0.49, next step would add 0.05 | Returns 0.01 (clamped to 0.5) | boundary |
| test_compute_reward_mutates_ctx | Updates tracking fields | Any call | ctx.cumulative_step_reward updated | happy |
| test_compute_reward_layer2_skipped_for_describe | No progress calc for non-QUERY | DESCRIBE with rows | Layer 2 not called | happy |
| test_compute_reward_layer2_skipped_when_rows_none | No progress calc on error | QUERY, rows=None | Layer 2 not called | edge |
| test_compute_reward_layer2_skipped_empty_gold | No progress with empty gold | QUERY, gold_rows=[] | Layer 2 returns 0.0 | edge |
Run: uv run pytest tests/unit/test_reward.py -v -k "compute_step_reward"
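The clamp-returns-delta behaviour can be isolated in a small sketch: clamp the *cumulative* total to [-0.2, +0.5] and return the difference actually granted this step. That is what makes a step at cumulative 0.49 yield only +0.01. The helper name and `Ctx` stand-in are assumptions:

```python
CLAMP_LO, CLAMP_HI = -0.2, 0.5


class Ctx:
    """Minimal stand-in for the cumulative tracking field used here."""
    def __init__(self):
        self.cumulative_step_reward = 0.0


def clamp_step(ctx: Ctx, raw_step: float) -> float:
    """Apply raw_step to the cumulative total, clamp, return the granted delta."""
    new_total = min(CLAMP_HI, max(CLAMP_LO, ctx.cumulative_step_reward + raw_step))
    granted = new_total - ctx.cumulative_step_reward
    ctx.cumulative_step_reward = new_total
    return granted
```

Clamping the total rather than each step keeps the per-step reward an honest delta: once the cap is hit, further steps grant 0.0 instead of silently overshooting.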
## 2. Integration Tests

### Flow: Primary Reward Computation Through step()

| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | env.reset(seed=42) | Episode created, gold_rows populated from gold SQL | ctx.gold_rows is a non-empty list of tuples |
| 2 | env.step(DESCRIBE employees) | Step reward from Layer 1 only | observation.reward is None (non-terminal), but internal reward is tracked |
| 3 | env.step(QUERY "SELECT COUNT(*) FROM employees") | Layer 1 + Layer 2 computed | Progress score reflects cardinality/value/numeric comparison to gold |
| 4 | env.step(QUERY same_sql_again) | Repeat penalty applied | Lower reward than step 3 |
| 5 | env.step(ANSWER correct_value) | Terminal reward = 1.0 | observation.done=True, observation.reward=1.0 |
Run: uv run pytest tests/integration/test_reward_flow.py -v
### Flow: SQL Error Handling

| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | env.reset(seed=42) | Episode active | Episode context initialized |
| 2 | env.step(QUERY "SELECT nonexistent FROM employees") | Error caught, step_cost only | Reward is -0.005, Layer 2 not computed |
| 3 | env.step(QUERY valid_query) | Normal reward resumes | Layer 1 + Layer 2 computed normally |
Run: uv run pytest tests/integration/test_reward_flow.py -v -k "error"
### Flow: Empty Gold Rows

| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Reset with question whose gold SQL returns empty | ctx.gold_rows == [] | gold_rows stored as empty list |
| 2 | env.step(QUERY any_query) | Layer 1 operates, Layer 2 returns 0.0 | Reward is Layer 1 signal only |
Run: uv run pytest tests/integration/test_reward_flow.py -v -k "empty_gold"
### Flow: Repeated Query Detection

| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | env.reset(seed=42) | Fresh episode | ctx.query_hashes is empty |
| 2 | env.step(QUERY "SELECT 1") | Hash added, no repeat penalty | ctx.query_hashes has 1 entry |
| 3 | env.step(QUERY "SELECT 1") | Same hash detected, repeat penalty | Reward includes -0.01, no exec_ok |
| 4 | env.step(QUERY "SELECT 2") | New hash, no repeat penalty | Normal reward, ctx.query_hashes has 2 entries |
Run: uv run pytest tests/integration/test_reward_flow.py -v -k "repeat"
## 3. API Tests
No API endpoints defined for F003. The reward system is internal server-side logic.
## 4. E2E Tests

### Scenario: Random Exploration Yields ~0.1 Cumulative Reward

Setup: Environment reset with a known question.
Actions: Execute 10 random DESCRIBE/SAMPLE/QUERY actions (no targeted queries).
Expected: Cumulative step reward is approximately 0.1 (within [0.0, 0.2]).
Run: uv run pytest tests/e2e/test_reward_scenarios.py -v -k "random_exploration"
### Scenario: Targeted Queries Yield ~0.3 Cumulative Reward

Setup: Environment reset with a known question.
Actions: Execute targeted queries that progressively approach the gold answer.
Expected: Cumulative step reward is approximately 0.3 (within [0.2, 0.5]).
Run: uv run pytest tests/e2e/test_reward_scenarios.py -v -k "targeted_queries"
### Scenario: Correct Answer Yields ~1.3 Total Reward

Setup: Environment reset with a known question.
Actions: Execute targeted queries, then ANSWER correctly.
Expected: Total reward (cumulative step + terminal 1.0) is approximately 1.3 (within [1.0, 1.5]).
Run: uv run pytest tests/e2e/test_reward_scenarios.py -v -k "correct_answer"
## 5. Edge Cases Checklist
- Null/None rows passed to compute_step_reward (SQL error case)
- Empty result rows from a valid query (e.g., `SELECT * FROM t WHERE 1=0`)
- Single-row gold vs multi-row prediction
- Multi-row gold vs single-row prediction
- Gold rows with only non-numeric values (numeric_range returns 1.0)
- Gold rows with mixed numeric and string columns
- Very large numeric values (boundary for log-distance formula)
- Negative numeric values in gold or prediction
- Float vs integer comparison in numeric range (e.g., `10` vs `10.0`)
- None/NULL values in result tuples (stringification for value_overlap)
- SQL strings that differ only by whitespace (hash should differ or normalize)
- Cumulative new_info exactly at cap (0.10) -- next unique query gets 0
- Cumulative step reward exactly at clamp boundary (-0.2 or +0.5)
- Layer 2 called with pred_rows and gold_rows of different column counts
- _bin_progress with values outside [0, 1] (e.g., negative or > 1.0 from rounding)
- Concurrent episodes (if supported) -- each has independent tracking fields
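For the whitespace item above, one workable convention is to normalize before hashing; a sketch (normalize-then-hash is one of the two options the checklist allows, and the lowercasing step is a further assumption that also conflates case-sensitive string literals):

```python
import hashlib
import re


def sql_hash(sql: str) -> str:
    """Hash SQL after collapsing whitespace and lowercasing, so queries that
    differ only in spacing or keyword case map to the same hash."""
    normalized = re.sub(r"\s+", " ", sql.strip()).lower()
    return hashlib.sha256(normalized.encode()).hexdigest()
```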
## 6. Evidence Requirements
| Category | Evidence Type | Example |
|---|---|---|
| Unit tests | pytest output | uv run pytest tests/unit/test_reward.py -v shows X passed |
| Integration | pytest output | uv run pytest tests/integration/test_reward_flow.py -v shows X passed |
| E2E | pytest output | uv run pytest tests/e2e/test_reward_scenarios.py -v shows X passed |
| Reward calibration | Logged values | Random exploration ~0.1, targeted ~0.3, correct ~1.3 |
| Existing tests | pytest output | uv run pytest tests/test_smoke.py -v still passes (no regressions) |