
# Verification Specification

Feature: F002
Generated from: specs/F002-VERIFICATION_INPUT.json
Generated: 2026-03-27


## 1. Unit Tests

### `verify_answer` (dispatcher)

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_verify_integer_exact_match | Dispatches to integer comparer for exact match | predicted="42", gold="42", answer_type="integer" | True | happy |
| test_verify_float_within_tolerance | Dispatches to float comparer within 1% | predicted="3.14", gold="3.15", answer_type="float" | True | happy |
| test_verify_string_case_insensitive | Dispatches to string comparer ignoring case | predicted="Alice", gold="alice", answer_type="string" | True | happy |
| test_verify_list_order_insensitive | Dispatches to list comparer ignoring order | predicted="a, b", gold="b, a", answer_type="list" | True | happy |
| test_verify_none_type_falls_back_to_string | Falls back to string comparison when answer_type is None | predicted="hello", gold="hello", answer_type=None | True | fallback |
| test_verify_unknown_type_falls_back_to_string | Falls back to string comparison for unrecognized type | predicted="foo", gold="foo", answer_type="table" | True | fallback |
| test_verify_empty_predicted_returns_false | Empty string after strip returns False immediately | predicted=" ", gold="42", answer_type="integer" | False | edge |
| test_verify_none_predicted_returns_false | Handles None-like empty input | predicted="", gold="42", answer_type=None | False | edge |

Run: `uv run pytest tests/test_verifier.py -v -k "test_verify"`
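The dispatch and fallback behavior in the table above can be sketched as follows. This is a hedged reconstruction from the test rows, not the authoritative implementation; the comparer bodies are simplified stand-ins (each comparer has its own unit-test section), and only the dispatch and fallback paths are the point here.

```python
# Hypothetical sketch of verify_answer dispatch, reconstructed from the
# test table above. Comparer internals are simplified stand-ins.

def _compare_integer(p: str, g: str) -> bool:
    try:
        return int(float(p)) == int(float(g))
    except ValueError:
        return False

def _compare_float(p: str, g: str) -> bool:
    try:
        pf, gf = float(p), float(g)
    except ValueError:
        return False
    return abs(pf) <= 1e-9 if gf == 0.0 else abs(pf - gf) <= 0.01 * abs(gf)

def _compare_string(p: str, g: str) -> bool:
    return " ".join(p.split()).lower() == " ".join(g.split()).lower()

def _compare_list(p: str, g: str, gold_rows=None) -> bool:
    def items(s: str) -> set:
        return {x.strip().lower() for x in s.split(",") if x.strip()}
    if gold_rows is not None:
        gold = {str(r[0]).strip().lower() for r in gold_rows}
    else:
        gold = items(g)
    return items(p) == gold

def verify_answer(predicted, gold, answer_type=None, gold_rows=None) -> bool:
    if not predicted or not predicted.strip():
        return False  # empty/whitespace-only predicted fails immediately
    if answer_type == "list":
        return _compare_list(predicted, gold, gold_rows)
    dispatch = {"integer": _compare_integer, "float": _compare_float}
    # answer_type of None or any unrecognized value falls back to string
    return dispatch.get(answer_type, _compare_string)(predicted, gold)
```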


### `_compare_integer`

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_int_exact_match | Both sides are integers | predicted="25", gold="25" | True | happy |
| test_int_from_float_string | Coerces "25.0" via int(float(x)) | predicted="25.0", gold="25" | True | happy |
| test_int_mismatch | Different integers | predicted="24", gold="25" | False | happy |
| test_int_negative_values | Negative integers match | predicted="-3", gold="-3" | True | happy |
| test_int_negative_mismatch | Negative vs positive | predicted="-3", gold="3" | False | happy |
| test_int_zero | Zero matches zero | predicted="0", gold="0" | True | edge |
| test_int_large_value | Large integers | predicted="999999999", gold="999999999" | True | edge |
| test_int_non_numeric_returns_false | Non-numeric predicted returns False | predicted="abc", gold="25" | False | error |
| test_int_non_numeric_gold_returns_false | Non-numeric gold returns False | predicted="25", gold="abc" | False | error |
| test_int_empty_string_returns_false | Empty string returns False | predicted="", gold="25" | False | edge |
| test_int_whitespace_only_returns_false | Whitespace-only returns False | predicted=" ", gold="25" | False | edge |
| test_int_float_truncation | "25.9" coerced to 25 matches gold "25" | predicted="25.9", gold="25" | True | edge |

Run: `uv run pytest tests/test_verifier.py -v -k "_compare_integer"`


### `_compare_float`

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_float_exact_match | Identical float strings | predicted="3.14", gold="3.14" | True | happy |
| test_float_within_1pct_tolerance | Difference within 1% | predicted="100.5", gold="100.0" | True | happy |
| test_float_outside_1pct_tolerance | Difference exceeds 1% | predicted="102.0", gold="100.0" | False | happy |
| test_float_boundary_exactly_1pct | Exactly at 1% boundary | predicted="101.0", gold="100.0" | True | edge |
| test_float_just_over_1pct | Just past 1% boundary | predicted="101.01", gold="100.0" | False | edge |
| test_float_gold_zero_uses_absolute_tolerance | Gold is 0, uses 1e-9 absolute | predicted="0.0000000001", gold="0" | True | edge |
| test_float_gold_zero_fails_large_diff | Gold is 0, predicted too far | predicted="0.001", gold="0" | False | edge |
| test_float_negative_values | Negative floats within tolerance | predicted="-99.5", gold="-100.0" | True | happy |
| test_float_non_numeric_returns_false | Non-numeric predicted | predicted="abc", gold="3.14" | False | error |
| test_float_non_numeric_gold_returns_false | Non-numeric gold | predicted="3.14", gold="abc" | False | error |
| test_float_integer_strings | Integer strings as floats | predicted="42", gold="42" | True | edge |
| test_float_very_small_values | Very small but non-zero | predicted="0.0001", gold="0.0001" | True | edge |

Run: `uv run pytest tests/test_verifier.py -v -k "_compare_float"`


### `_compare_string`

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_string_exact_match | Identical strings | predicted="Alice", gold="Alice" | True | happy |
| test_string_case_insensitive | Different casing | predicted="ALICE", gold="alice" | True | happy |
| test_string_whitespace_normalized | Leading/trailing/extra whitespace | predicted=" Alice Bob ", gold="Alice Bob" | True | happy |
| test_string_mismatch | Different strings | predicted="Alice", gold="Bob" | False | happy |
| test_string_empty_both | Both empty | predicted="", gold="" | True | edge |
| test_string_unicode | Unicode characters | predicted="cafe\u0301", gold="cafe\u0301" | True | edge |
| test_string_special_characters | Special characters match | predicted="O'Brien", gold="O'Brien" | True | edge |
| test_string_numeric_as_string | Numbers compared as strings | predicted="42", gold="42" | True | edge |

Run: `uv run pytest tests/test_verifier.py -v -k "_compare_string"`


### `_compare_list`

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_list_same_order | Identical lists | predicted="a, b, c", gold="a, b, c" | True | happy |
| test_list_different_order | Reordered elements | predicted="c, a, b", gold="a, b, c" | True | happy |
| test_list_mismatch | Different elements | predicted="a, b, d", gold="a, b, c" | False | happy |
| test_list_extra_element | Predicted has extra | predicted="a, b, c, d", gold="a, b, c" | False | happy |
| test_list_missing_element | Predicted is missing one | predicted="a, b", gold="a, b, c" | False | happy |
| test_list_duplicates_matter | Duplicates in one side | predicted="a, a, b", gold="a, b" | Defined by impl | edge |
| test_list_with_gold_rows | Uses gold_rows when provided | predicted="a, b", gold="...", gold_rows=[("a",), ("b",)] | True | happy |
| test_list_gold_rows_none_fallback | Falls back to string parsing when gold_rows is None | predicted="a, b", gold="a, b", gold_rows=None | True | fallback |
| test_list_empty | Both sides empty | predicted="", gold="" | Defined by impl | edge |
| test_list_single_element | Single element lists | predicted="only", gold="only" | True | edge |
| test_list_whitespace_in_elements | Elements with whitespace | predicted=" a , b ", gold="a, b" | True | edge |
| test_list_case_sensitivity | Case handling in list elements | predicted="Alice, Bob", gold="alice, bob" | Defined by impl | edge |

Run: `uv run pytest tests/test_verifier.py -v -k "_compare_list"`


### `EpisodeContext.gold_rows` field

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_episode_context_gold_rows_default | gold_rows defaults to None | EpisodeContext(...) | gold_rows is None | happy |
| test_episode_context_gold_rows_set | gold_rows can be set to list of tuples | EpisodeContext(..., gold_rows=[(1,), (2,)]) | gold_rows == [(1,), (2,)] | happy |
| test_episode_context_gold_rows_empty_list | gold_rows can be empty list | EpisodeContext(..., gold_rows=[]) | gold_rows == [] | edge |

Run: `uv run pytest tests/test_verifier.py -v -k "episode_context"`
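A minimal sketch of the field under test. The real `EpisodeContext` has other fields (elided as `...` in the table above) that are omitted here; only the `gold_rows` default and typing are illustrated.

```python
# Sketch of the gold_rows field only; other EpisodeContext fields omitted.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class EpisodeContext:
    # Raw result rows backing list-type answers; None for legacy episodes.
    gold_rows: Optional[List[Tuple]] = None
```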


## 2. Integration Tests

### Flow: Primary answer verification through step()

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Agent sends ANSWER action with value string | step() dispatches to _handle_answer | env.step(SQLAction(action_type="ANSWER", argument=value)) |
| 2 | _handle_answer calls verify_answer with predicted, gold, answer_type, gold_rows | verify_answer receives all four arguments | Correct reward returned in observation |
| 3 | verify_answer dispatches to type-specific comparer | Correct comparer chosen based on answer_type | observation.reward == 1.0 for correct answers |
| 4 | Boolean result maps to reward | True -> 1.0, False -> 0.0 | observation.done is True |

### Flow: Integer answer through full environment

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Reset environment with question that has answer_type="integer" | Episode created with integer question | observation.done is False |
| 2 | Submit ANSWER with correct integer (possibly as float string) | verify_answer coerces and matches | observation.reward == 1.0 |

### Flow: Float answer through full environment

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Reset with question that has answer_type="float" | Episode created with float question | observation.done is False |
| 2 | Submit ANSWER within 1% tolerance | verify_answer accepts within tolerance | observation.reward == 1.0 |

### Flow: String answer through full environment

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Reset with question that has answer_type="string" | Episode created with string question | observation.done is False |
| 2 | Submit ANSWER with different casing/whitespace | verify_answer normalizes and matches | observation.reward == 1.0 |

### Flow: List answer through full environment

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Reset with question that has answer_type="list" | Episode created with list question, gold_rows populated | observation.done is False |
| 2 | Submit ANSWER with reordered list | verify_answer compares as sets | observation.reward == 1.0 |

### Flow: Fallback for missing answer_type

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Reset with question that has answer_type=None or missing | Episode created without explicit type | observation.done is False |
| 2 | Submit ANSWER matching gold exactly (modulo case/whitespace) | Falls back to string comparison | observation.reward == 1.0 |

### Flow: Type coercion failure

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Reset with question that has answer_type="integer" | Episode created with integer question | observation.done is False |
| 2 | Submit ANSWER with non-numeric string | _compare_integer catches ValueError, returns False | observation.reward == 0.0 |

Run: `uv run pytest tests/test_verifier_integration.py -v`
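The reward mapping in the primary flow can be illustrated with a self-contained sketch. `handle_answer` below is a hypothetical stand-in for the `_handle_answer` step described above, and `verify_answer` is reduced to the string fallback only; the point is the boolean-to-reward mapping (True -> 1.0, False -> 0.0) and the episode always terminating.

```python
# Hypothetical sketch of the _handle_answer flow: verify, map the boolean
# to a reward, and mark the episode done. Names are illustrative.

def verify_answer(predicted: str, gold: str, answer_type=None, gold_rows=None) -> bool:
    # Simplified string-only stand-in for the full dispatcher.
    return predicted.strip().lower() == gold.strip().lower()

def handle_answer(predicted: str, gold: str, answer_type=None, gold_rows=None) -> dict:
    correct = verify_answer(predicted, gold, answer_type, gold_rows)
    return {"reward": 1.0 if correct else 0.0, "done": True}
```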


## 3. API Tests

No API endpoints are defined for F002. Answer verification is an internal server-side function called within the step() handler; API-level coverage is provided by the integration tests above, which exercise the step() interface.


## 4. E2E Tests

### Scenario: Correct integer answer accepted

- Setup: Environment initialized with a question whose gold answer is "25" and answer_type is "integer".
- Actions: Agent submits ANSWER "25".
- Expected: observation.done is True, observation.reward is 1.0.

### Scenario: Correct float answer accepted within tolerance

- Setup: Environment initialized with a question whose gold answer is "3.14159" and answer_type is "float".
- Actions: Agent submits ANSWER "3.14".
- Expected: observation.done is True, observation.reward is 1.0 (within 1% tolerance).

### Scenario: Correct string answer accepted case-insensitively

- Setup: Environment initialized with a question whose gold answer is "Engineering" and answer_type is "string".
- Actions: Agent submits ANSWER "engineering".
- Expected: observation.done is True, observation.reward is 1.0.

### Scenario: Correct list answer accepted order-insensitively

- Setup: Environment initialized with a question whose gold answer is "alice, bob, charlie" and answer_type is "list".
- Actions: Agent submits ANSWER "charlie, alice, bob".
- Expected: observation.done is True, observation.reward is 1.0.

### Scenario: Wrong answer rejected

- Setup: Environment initialized with any question.
- Actions: Agent submits ANSWER with clearly wrong value.
- Expected: observation.done is True, observation.reward is 0.0.

### Scenario: Backward compatibility -- no answer_type field

- Setup: Environment initialized with a legacy question record that has no answer_type (or answer_type is None).
- Actions: Agent submits ANSWER matching gold answer exactly.
- Expected: observation.done is True, observation.reward is 1.0 (string fallback used).

Run: `uv run pytest tests/test_smoke.py tests/test_verifier_integration.py -v`


## 5. Edge Cases Checklist

- Empty string predicted (after strip) returns False immediately
- Whitespace-only predicted returns False
- Non-numeric string for integer comparison returns False (ValueError caught)
- Non-numeric string for float comparison returns False (ValueError caught)
- Gold value of "0" for float comparison uses absolute tolerance 1e-9
- Float boundary at exactly 1% tolerance (should pass)
- Float just over 1% tolerance (should fail)
- Integer coercion via `int(float(x))` handles "25.0" -> 25
- Integer coercion truncates "25.9" -> 25
- List with gold_rows=None falls back to string parsing
- List with gold_rows provided uses structured comparison
- answer_type=None dispatches to string comparison
- Unknown answer_type (e.g., "table", "unknown") dispatches to string comparison
- Very large integer values (MAX_INT range)
- Unicode characters in string comparison
- Special characters in string comparison (quotes, apostrophes)
- Negative numbers for integer and float comparisons
- List with duplicate elements
- Single-element list
- Mixed whitespace in list elements

## 6. Evidence Requirements

| Category | Evidence Type | Example |
| --- | --- | --- |
| Unit tests | pytest output | `uv run pytest tests/test_verifier.py -v` -- X passed |
| Integration | pytest output | `uv run pytest tests/test_verifier_integration.py -v` -- X passed |
| E2E | pytest output via smoke tests | `uv run pytest tests/test_smoke.py -v` -- answer tests pass |
| Backward compat | pytest output | Existing test_answer_ends_episode_without_budget_decrement still passes |