Verification Specification
Feature: F002
Generated from: specs/F002-VERIFICATION_INPUT.json
Generated: 2026-03-27
1. Unit Tests
verify_answer (dispatcher)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_verify_integer_exact_match | Dispatches to integer comparer for exact match | predicted="42", gold="42", answer_type="integer" | True | happy |
| test_verify_float_within_tolerance | Dispatches to float comparer within 1% | predicted="3.14", gold="3.15", answer_type="float" | True | happy |
| test_verify_string_case_insensitive | Dispatches to string comparer ignoring case | predicted="Alice", gold="alice", answer_type="string" | True | happy |
| test_verify_list_order_insensitive | Dispatches to list comparer ignoring order | predicted="a, b", gold="b, a", answer_type="list" | True | happy |
| test_verify_none_type_falls_back_to_string | Falls back to string comparison when answer_type is None | predicted="hello", gold="hello", answer_type=None | True | fallback |
| test_verify_unknown_type_falls_back_to_string | Falls back to string comparison for unrecognized type | predicted="foo", gold="foo", answer_type="table" | True | fallback |
| test_verify_empty_predicted_returns_false | Empty string after strip returns False immediately | predicted=" ", gold="42", answer_type="integer" | False | edge |
| test_verify_none_predicted_returns_false | Handles None-like empty input | predicted="", gold="42", answer_type=None | False | edge |
Run: uv run pytest tests/test_verifier.py -v -k "test_verify"
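The dispatch behavior these tests pin down can be sketched as follows. This is an illustrative sketch, not the actual implementation: only the string comparer is stubbed in (the numeric and list comparers are specified in the sections below), and the early-exit on empty input mirrors the two edge rows above.

```python
def _compare_string(predicted: str, gold: str) -> bool:
    # Stub of the fallback comparer: case-insensitive, whitespace-normalized.
    return " ".join(predicted.split()).lower() == " ".join(gold.split()).lower()

# Only "string" is wired up in this sketch; the real table would also map
# "integer", "float", and "list" to their comparers.
COMPARERS = {"string": _compare_string}

def verify_answer(predicted, gold, answer_type=None, gold_rows=None):
    """Dispatch to a type-specific comparer; None/unknown types fall back to string."""
    if predicted is None or not predicted.strip():
        return False  # empty or whitespace-only input fails before any comparison
    comparer = COMPARERS.get(answer_type, _compare_string)
    return comparer(predicted, gold)
```

Note the fallback is deliberate: a missing or unrecognized `answer_type` degrades to the most permissive-but-safe comparison rather than raising.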
_compare_integer
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_int_exact_match | Both sides are integers | predicted="25", gold="25" | True | happy |
| test_int_from_float_string | Coerces "25.0" via int(float(x)) | predicted="25.0", gold="25" | True | happy |
| test_int_mismatch | Different integers | predicted="24", gold="25" | False | happy |
| test_int_negative_values | Negative integers match | predicted="-3", gold="-3" | True | happy |
| test_int_negative_mismatch | Negative vs positive | predicted="-3", gold="3" | False | happy |
| test_int_zero | Zero matches zero | predicted="0", gold="0" | True | edge |
| test_int_large_value | Large integers | predicted="999999999", gold="999999999" | True | edge |
| test_int_non_numeric_returns_false | Non-numeric predicted returns False | predicted="abc", gold="25" | False | error |
| test_int_non_numeric_gold_returns_false | Non-numeric gold returns False | predicted="25", gold="abc" | False | error |
| test_int_empty_string_returns_false | Empty string returns False | predicted="", gold="25" | False | edge |
| test_int_whitespace_only_returns_false | Whitespace-only returns False | predicted=" ", gold="25" | False | edge |
| test_int_float_truncation | "25.9" coerced to 25 matches gold "25" | predicted="25.9", gold="25" | True | edge |
Run: uv run pytest tests/test_verifier.py -v -k "_compare_integer"
_compare_float
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_float_exact_match | Identical float strings | predicted="3.14", gold="3.14" | True | happy |
| test_float_within_1pct_tolerance | Difference within 1% | predicted="100.5", gold="100.0" | True | happy |
| test_float_outside_1pct_tolerance | Difference exceeds 1% | predicted="102.0", gold="100.0" | False | happy |
| test_float_boundary_exactly_1pct | Exactly at 1% boundary | predicted="101.0", gold="100.0" | True | edge |
| test_float_just_over_1pct | Just past 1% boundary | predicted="101.01", gold="100.0" | False | edge |
| test_float_gold_zero_uses_absolute_tolerance | Gold is 0, uses 1e-9 absolute | predicted="0.0000000001", gold="0" | True | edge |
| test_float_gold_zero_fails_large_diff | Gold is 0, predicted too far | predicted="0.001", gold="0" | False | edge |
| test_float_negative_values | Negative floats within tolerance | predicted="-99.5", gold="-100.0" | True | happy |
| test_float_non_numeric_returns_false | Non-numeric predicted | predicted="abc", gold="3.14" | False | error |
| test_float_non_numeric_gold_returns_false | Non-numeric gold | predicted="3.14", gold="abc" | False | error |
| test_float_integer_strings | Integer strings as floats | predicted="42", gold="42" | True | edge |
| test_float_very_small_values | Very small but non-zero | predicted="0.0001", gold="0.0001" | True | edge |
Run: uv run pytest tests/test_verifier.py -v -k "_compare_float"
_compare_string
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_string_exact_match | Identical strings | predicted="Alice", gold="Alice" | True | happy |
| test_string_case_insensitive | Different casing | predicted="ALICE", gold="alice" | True | happy |
| test_string_whitespace_normalized | Leading/trailing/extra whitespace | predicted=" Alice Bob ", gold="Alice Bob" | True | happy |
| test_string_mismatch | Different strings | predicted="Alice", gold="Bob" | False | happy |
| test_string_empty_both | Both empty | predicted="", gold="" | True | edge |
| test_string_unicode | Unicode characters | predicted="cafe\u0301", gold="cafe\u0301" | True | edge |
| test_string_special_characters | Special characters match | predicted="O'Brien", gold="O'Brien" | True | edge |
| test_string_numeric_as_string | Numbers compared as strings | predicted="42", gold="42" | True | edge |
Run: uv run pytest tests/test_verifier.py -v -k "_compare_string"
_compare_list
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_list_same_order | Identical lists | predicted="a, b, c", gold="a, b, c" | True | happy |
| test_list_different_order | Reordered elements | predicted="c, a, b", gold="a, b, c" | True | happy |
| test_list_mismatch | Different elements | predicted="a, b, d", gold="a, b, c" | False | happy |
| test_list_extra_element | Predicted has extra | predicted="a, b, c, d", gold="a, b, c" | False | happy |
| test_list_missing_element | Predicted is missing one | predicted="a, b", gold="a, b, c" | False | happy |
| test_list_duplicates_matter | Duplicates in one side | predicted="a, a, b", gold="a, b" | Defined by impl | edge |
| test_list_with_gold_rows | Uses gold_rows when provided | predicted="a, b", gold="...", gold_rows=[("a",), ("b",)] | True | happy |
| test_list_gold_rows_none_fallback | Falls back to string parsing when gold_rows is None | predicted="a, b", gold="a, b", gold_rows=None | True | fallback |
| test_list_empty | Both sides empty | predicted="", gold="" | Defined by impl | edge |
| test_list_single_element | Single element lists | predicted="only", gold="only" | True | edge |
| test_list_whitespace_in_elements | Elements with whitespace | predicted=" a , b ", gold="a, b" | True | edge |
| test_list_case_sensitivity | Case handling in list elements | predicted="Alice, Bob", gold="alice, bob" | Defined by impl | edge |
Run: uv run pytest tests/test_verifier.py -v -k "_compare_list"
EpisodeContext.gold_rows field
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_episode_context_gold_rows_default | gold_rows defaults to None | EpisodeContext(...) | gold_rows is None | happy |
| test_episode_context_gold_rows_set | gold_rows can be set to list of tuples | EpisodeContext(..., gold_rows=[(1,), (2,)]) | gold_rows == [(1,), (2,)] | happy |
| test_episode_context_gold_rows_empty_list | gold_rows can be empty list | EpisodeContext(..., gold_rows=[]) | gold_rows == [] | edge |
Run: uv run pytest tests/test_verifier.py -v -k "episode_context"
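The field contract above can be sketched with a dataclass. Only gold_rows and its default are specified by this document; the other fields shown here are hypothetical placeholders for whatever EpisodeContext actually carries.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class EpisodeContext:
    # Hypothetical surrounding fields; only gold_rows is specified by F002.
    question: str = ""
    gold_answer: str = ""
    answer_type: Optional[str] = None
    # None (the default) signals a legacy episode with no structured gold result.
    gold_rows: Optional[List[Tuple]] = None
```

Defaulting to None rather than [] preserves the distinction the list comparer relies on: None means "fall back to string parsing", while [] is a real (empty) gold result.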
2. Integration Tests
Flow: Primary answer verification through step()
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Agent sends ANSWER action with value string | step() dispatches to _handle_answer | env.step(SQLAction(action_type="ANSWER", argument=value)) |
| 2 | _handle_answer calls verify_answer with predicted, gold, answer_type, gold_rows | verify_answer receives all four arguments | Correct reward returned in observation |
| 3 | verify_answer dispatches to type-specific comparer | Correct comparer chosen based on answer_type | observation.reward == 1.0 for correct answers |
| 4 | Boolean result maps to reward | True -> 1.0, False -> 0.0 | observation.done is True |
Flow: Integer answer through full environment
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Reset environment with question that has answer_type="integer" | Episode created with integer question | observation.done is False |
| 2 | Submit ANSWER with correct integer (possibly as float string) | verify_answer coerces and matches | observation.reward == 1.0 |
Flow: Float answer through full environment
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Reset with question that has answer_type="float" | Episode created with float question | observation.done is False |
| 2 | Submit ANSWER within 1% tolerance | verify_answer accepts within tolerance | observation.reward == 1.0 |
Flow: String answer through full environment
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Reset with question that has answer_type="string" | Episode created with string question | observation.done is False |
| 2 | Submit ANSWER with different casing/whitespace | verify_answer normalizes and matches | observation.reward == 1.0 |
Flow: List answer through full environment
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Reset with question that has answer_type="list" | Episode created with list question, gold_rows populated | observation.done is False |
| 2 | Submit ANSWER with reordered list | verify_answer compares as sets | observation.reward == 1.0 |
Flow: Fallback for missing answer_type
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Reset with question that has answer_type=None or missing | Episode created without explicit type | observation.done is False |
| 2 | Submit ANSWER matching gold exactly (modulo case/whitespace) | Falls back to string comparison | observation.reward == 1.0 |
Flow: Type coercion failure
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Reset with question that has answer_type="integer" | Episode created with integer question | observation.done is False |
| 2 | Submit ANSWER with non-numeric string | _compare_integer catches ValueError, returns False | observation.reward == 0.0 |
Run: uv run pytest tests/test_verifier_integration.py -v
3. API Tests
No API endpoints are defined for F002. Answer verification is an internal server-side function called within the step() handler. API-level testing is covered by the integration tests above (testing through the step() interface).
4. E2E Tests
Scenario: Correct integer answer accepted
Setup: Environment initialized with a question whose gold answer is "25" and answer_type is "integer". Actions: Agent submits ANSWER "25". Expected: observation.done is True, observation.reward is 1.0.
Scenario: Correct float answer accepted within tolerance
Setup: Environment initialized with a question whose gold answer is "3.14159" and answer_type is "float". Actions: Agent submits ANSWER "3.14". Expected: observation.done is True, observation.reward is 1.0 (within 1% tolerance).
Scenario: Correct string answer accepted case-insensitively
Setup: Environment initialized with a question whose gold answer is "Engineering" and answer_type is "string". Actions: Agent submits ANSWER "engineering". Expected: observation.done is True, observation.reward is 1.0.
Scenario: Correct list answer accepted order-insensitively
Setup: Environment initialized with a question whose gold answer is "alice, bob, charlie" and answer_type is "list". Actions: Agent submits ANSWER "charlie, alice, bob". Expected: observation.done is True, observation.reward is 1.0.
Scenario: Wrong answer rejected
Setup: Environment initialized with any question. Actions: Agent submits ANSWER with clearly wrong value. Expected: observation.done is True, observation.reward is 0.0.
Scenario: Backward compatibility -- no answer_type field
Setup: Environment initialized with a legacy question record that has no answer_type (or answer_type is None). Actions: Agent submits ANSWER matching gold answer exactly. Expected: observation.done is True, observation.reward is 1.0 (string fallback used).
Run: uv run pytest tests/test_smoke.py tests/test_verifier_integration.py -v
5. Edge Cases Checklist
- Empty string predicted (after strip) returns False immediately
- Whitespace-only predicted returns False
- Non-numeric string for integer comparison returns False (ValueError caught)
- Non-numeric string for float comparison returns False (ValueError caught)
- Gold value of "0" for float comparison uses absolute tolerance 1e-9
- Float boundary at exactly 1% tolerance (should pass)
- Float just over 1% tolerance (should fail)
- Integer coercion via int(float(x)) handles "25.0" -> 25
- Integer coercion truncates "25.9" -> 25
- List with gold_rows=None falls back to string parsing
- List with gold_rows provided uses structured comparison
- answer_type=None dispatches to string comparison
- Unknown answer_type (e.g., "table", "unknown") dispatches to string comparison
- Very large integer values (MAX_INT range)
- Unicode characters in string comparison
- Special characters in string comparison (quotes, apostrophes)
- Negative numbers for integer and float comparisons
- List with duplicate elements
- Single-element list
- Mixed whitespace in list elements
6. Evidence Requirements
| Category | Evidence Type | Example |
|---|---|---|
| Unit tests | pytest output | uv run pytest tests/test_verifier.py -v -- X passed |
| Integration | pytest output | uv run pytest tests/test_verifier_integration.py -v -- X passed |
| E2E | pytest output via smoke tests | uv run pytest tests/test_smoke.py -v -- answer tests pass |
| Backward compat | pytest output | Existing test_answer_ends_episode_without_budget_decrement still passes |