
# Verification Specification

Feature: F002
Generated from: specs/F002-VERIFICATION_INPUT.json
Generated: 2026-03-27


## 1. Unit Tests

### `verify_answer` (dispatcher)

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_verify_integer_exact_match | Dispatches to integer comparer for exact match | predicted="42", gold="42", answer_type="integer" | True | happy |
| test_verify_float_within_tolerance | Dispatches to float comparer within 1% | predicted="3.14", gold="3.15", answer_type="float" | True | happy |
| test_verify_string_case_insensitive | Dispatches to string comparer ignoring case | predicted="Alice", gold="alice", answer_type="string" | True | happy |
| test_verify_list_order_insensitive | Dispatches to list comparer ignoring order | predicted="a, b", gold="b, a", answer_type="list" | True | happy |
| test_verify_none_type_falls_back_to_string | Falls back to string comparison when answer_type is None | predicted="hello", gold="hello", answer_type=None | True | fallback |
| test_verify_unknown_type_falls_back_to_string | Falls back to string comparison for unrecognized type | predicted="foo", gold="foo", answer_type="table" | True | fallback |
| test_verify_empty_predicted_returns_false | Empty string after strip returns False immediately | predicted=" ", gold="42", answer_type="integer" | False | edge |
| test_verify_none_predicted_returns_false | Handles None-like empty input | predicted="", gold="42", answer_type=None | False | edge |

Run: `uv run pytest tests/test_verifier.py -v -k "test_verify"`
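The dispatch and fallback behavior in the table above can be sketched as follows. This is a hedged reconstruction from the test rows, not the authoritative implementation; the comparer bodies are simplified stand-ins (each comparer has its own unit-test section), and only the dispatch and fallback paths are the point here.

```python
# Hypothetical sketch of verify_answer dispatch, reconstructed from the
# test table above. Comparer internals are simplified stand-ins.

def _compare_integer(p: str, g: str) -> bool:
    try:
        return int(float(p)) == int(float(g))
    except ValueError:
        return False

def _compare_float(p: str, g: str) -> bool:
    try:
        pf, gf = float(p), float(g)
    except ValueError:
        return False
    return abs(pf) <= 1e-9 if gf == 0.0 else abs(pf - gf) <= 0.01 * abs(gf)

def _compare_string(p: str, g: str) -> bool:
    return " ".join(p.split()).lower() == " ".join(g.split()).lower()

def _compare_list(p: str, g: str, gold_rows=None) -> bool:
    def items(s: str) -> set:
        return {x.strip().lower() for x in s.split(",") if x.strip()}
    if gold_rows is not None:
        gold = {str(r[0]).strip().lower() for r in gold_rows}
    else:
        gold = items(g)
    return items(p) == gold

def verify_answer(predicted, gold, answer_type=None, gold_rows=None) -> bool:
    if not predicted or not predicted.strip():
        return False  # empty/whitespace-only predicted fails immediately
    if answer_type == "list":
        return _compare_list(predicted, gold, gold_rows)
    dispatch = {"integer": _compare_integer, "float": _compare_float}
    # answer_type of None or any unrecognized value falls back to string
    return dispatch.get(answer_type, _compare_string)(predicted, gold)
```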


### `_compare_integer`

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_int_exact_match | Both sides are integers | predicted="25", gold="25" | True | happy |
| test_int_from_float_string | Coerces "25.0" via int(float(x)) | predicted="25.0", gold="25" | True | happy |
| test_int_mismatch | Different integers | predicted="24", gold="25" | False | happy |
| test_int_negative_values | Negative integers match | predicted="-3", gold="-3" | True | happy |
| test_int_negative_mismatch | Negative vs positive | predicted="-3", gold="3" | False | happy |
| test_int_zero | Zero matches zero | predicted="0", gold="0" | True | edge |
| test_int_large_value | Large integers | predicted="999999999", gold="999999999" | True | edge |
| test_int_non_numeric_returns_false | Non-numeric predicted returns False | predicted="abc", gold="25" | False | error |
| test_int_non_numeric_gold_returns_false | Non-numeric gold returns False | predicted="25", gold="abc" | False | error |
| test_int_empty_string_returns_false | Empty string returns False | predicted="", gold="25" | False | edge |
| test_int_whitespace_only_returns_false | Whitespace-only returns False | predicted=" ", gold="25" | False | edge |
| test_int_float_truncation | "25.9" coerced to 25 matches gold "25" | predicted="25.9", gold="25" | True | edge |

Run: `uv run pytest tests/test_verifier.py -v -k "_compare_integer"`


### `_compare_float`

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_float_exact_match | Identical float strings | predicted="3.14", gold="3.14" | True | happy |
| test_float_within_1pct_tolerance | Difference within 1% | predicted="100.5", gold="100.0" | True | happy |
| test_float_outside_1pct_tolerance | Difference exceeds 1% | predicted="102.0", gold="100.0" | False | happy |
| test_float_boundary_exactly_1pct | Exactly at 1% boundary | predicted="101.0", gold="100.0" | True | edge |
| test_float_just_over_1pct | Just past 1% boundary | predicted="101.01", gold="100.0" | False | edge |
| test_float_gold_zero_uses_absolute_tolerance | Gold is 0, uses 1e-9 absolute | predicted="0.0000000001", gold="0" | True | edge |
| test_float_gold_zero_fails_large_diff | Gold is 0, predicted too far | predicted="0.001", gold="0" | False | edge |
| test_float_negative_values | Negative floats within tolerance | predicted="-99.5", gold="-100.0" | True | happy |
| test_float_non_numeric_returns_false | Non-numeric predicted | predicted="abc", gold="3.14" | False | error |
| test_float_non_numeric_gold_returns_false | Non-numeric gold | predicted="3.14", gold="abc" | False | error |
| test_float_integer_strings | Integer strings as floats | predicted="42", gold="42" | True | edge |
| test_float_very_small_values | Very small but non-zero | predicted="0.0001", gold="0.0001" | True | edge |

Run: `uv run pytest tests/test_verifier.py -v -k "_compare_float"`


### `_compare_string`

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_string_exact_match | Identical strings | predicted="Alice", gold="Alice" | True | happy |
| test_string_case_insensitive | Different casing | predicted="ALICE", gold="alice" | True | happy |
| test_string_whitespace_normalized | Leading/trailing/extra whitespace | predicted=" Alice Bob ", gold="Alice Bob" | True | happy |
| test_string_mismatch | Different strings | predicted="Alice", gold="Bob" | False | happy |
| test_string_empty_both | Both empty | predicted="", gold="" | True | edge |
| test_string_unicode | Unicode characters | predicted="cafe\u0301", gold="cafe\u0301" | True | edge |
| test_string_special_characters | Special characters match | predicted="O'Brien", gold="O'Brien" | True | edge |
| test_string_numeric_as_string | Numbers compared as strings | predicted="42", gold="42" | True | edge |

Run: `uv run pytest tests/test_verifier.py -v -k "_compare_string"`


### `_compare_list`

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_list_same_order | Identical lists | predicted="a, b, c", gold="a, b, c" | True | happy |
| test_list_different_order | Reordered elements | predicted="c, a, b", gold="a, b, c" | True | happy |
| test_list_mismatch | Different elements | predicted="a, b, d", gold="a, b, c" | False | happy |
| test_list_extra_element | Predicted has extra | predicted="a, b, c, d", gold="a, b, c" | False | happy |
| test_list_missing_element | Predicted is missing one | predicted="a, b", gold="a, b, c" | False | happy |
| test_list_duplicates_matter | Duplicates in one side | predicted="a, a, b", gold="a, b" | Defined by impl | edge |
| test_list_with_gold_rows | Uses gold_rows when provided | predicted="a, b", gold="...", gold_rows=[("a",), ("b",)] | True | happy |
| test_list_gold_rows_none_fallback | Falls back to string parsing when gold_rows is None | predicted="a, b", gold="a, b", gold_rows=None | True | fallback |
| test_list_empty | Both sides empty | predicted="", gold="" | Defined by impl | edge |
| test_list_single_element | Single element lists | predicted="only", gold="only" | True | edge |
| test_list_whitespace_in_elements | Elements with whitespace | predicted=" a , b ", gold="a, b" | True | edge |
| test_list_case_sensitivity | Case handling in list elements | predicted="Alice, Bob", gold="alice, bob" | Defined by impl | edge |

Run: `uv run pytest tests/test_verifier.py -v -k "_compare_list"`


### `EpisodeContext.gold_rows` field

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_episode_context_gold_rows_default | gold_rows defaults to None | EpisodeContext(...) | gold_rows is None | happy |
| test_episode_context_gold_rows_set | gold_rows can be set to list of tuples | EpisodeContext(..., gold_rows=[(1,), (2,)]) | gold_rows == [(1,), (2,)] | happy |
| test_episode_context_gold_rows_empty_list | gold_rows can be empty list | EpisodeContext(..., gold_rows=[]) | gold_rows == [] | edge |

Run: `uv run pytest tests/test_verifier.py -v -k "episode_context"`
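A minimal sketch of the field under test. The real `EpisodeContext` has other fields (elided as `...` in the table above) that are omitted here; only the `gold_rows` default and typing are illustrated.

```python
# Sketch of the gold_rows field only; other EpisodeContext fields omitted.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class EpisodeContext:
    # Raw result rows backing list-type answers; None for legacy episodes.
    gold_rows: Optional[List[Tuple]] = None
```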


## 2. Integration Tests

### Flow: Primary answer verification through step()

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Agent sends ANSWER action with value string | step() dispatches to _handle_answer | env.step(SQLAction(action_type="ANSWER", argument=value)) |
| 2 | _handle_answer calls verify_answer with predicted, gold, answer_type, gold_rows | verify_answer receives all four arguments | Correct reward returned in observation |
| 3 | verify_answer dispatches to type-specific comparer | Correct comparer chosen based on answer_type | observation.reward == 1.0 for correct answers |
| 4 | Boolean result maps to reward | True -> 1.0, False -> 0.0 | observation.done is True |

### Flow: Integer answer through full environment

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Reset environment with question that has answer_type="integer" | Episode created with integer question | observation.done is False |
| 2 | Submit ANSWER with correct integer (possibly as float string) | verify_answer coerces and matches | observation.reward == 1.0 |

### Flow: Float answer through full environment

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Reset with question that has answer_type="float" | Episode created with float question | observation.done is False |
| 2 | Submit ANSWER within 1% tolerance | verify_answer accepts within tolerance | observation.reward == 1.0 |

### Flow: String answer through full environment

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Reset with question that has answer_type="string" | Episode created with string question | observation.done is False |
| 2 | Submit ANSWER with different casing/whitespace | verify_answer normalizes and matches | observation.reward == 1.0 |

### Flow: List answer through full environment

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Reset with question that has answer_type="list" | Episode created with list question, gold_rows populated | observation.done is False |
| 2 | Submit ANSWER with reordered list | verify_answer compares as sets | observation.reward == 1.0 |

### Flow: Fallback for missing answer_type

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Reset with question that has answer_type=None or missing | Episode created without explicit type | observation.done is False |
| 2 | Submit ANSWER matching gold exactly (modulo case/whitespace) | Falls back to string comparison | observation.reward == 1.0 |

### Flow: Type coercion failure

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Reset with question that has answer_type="integer" | Episode created with integer question | observation.done is False |
| 2 | Submit ANSWER with non-numeric string | _compare_integer catches ValueError, returns False | observation.reward == 0.0 |

Run: `uv run pytest tests/test_verifier_integration.py -v`
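The reward mapping in the primary flow can be illustrated with a self-contained sketch. `handle_answer` below is a hypothetical stand-in for the `_handle_answer` step described above, and `verify_answer` is reduced to the string fallback only; the point is the boolean-to-reward mapping (True -> 1.0, False -> 0.0) and the episode always terminating.

```python
# Hypothetical sketch of the _handle_answer flow: verify, map the boolean
# to a reward, and mark the episode done. Names are illustrative.

def verify_answer(predicted: str, gold: str, answer_type=None, gold_rows=None) -> bool:
    # Simplified string-only stand-in for the full dispatcher.
    return predicted.strip().lower() == gold.strip().lower()

def handle_answer(predicted: str, gold: str, answer_type=None, gold_rows=None) -> dict:
    correct = verify_answer(predicted, gold, answer_type, gold_rows)
    return {"reward": 1.0 if correct else 0.0, "done": True}
```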


## 3. API Tests

No API endpoints are defined for F002. Answer verification is an internal server-side function called within the step() handler; API-level coverage is provided by the integration tests above, which exercise the step() interface.


## 4. E2E Tests

### Scenario: Correct integer answer accepted

- Setup: Environment initialized with a question whose gold answer is "25" and answer_type is "integer".
- Actions: Agent submits ANSWER "25".
- Expected: observation.done is True, observation.reward is 1.0.

### Scenario: Correct float answer accepted within tolerance

- Setup: Environment initialized with a question whose gold answer is "3.14159" and answer_type is "float".
- Actions: Agent submits ANSWER "3.14".
- Expected: observation.done is True, observation.reward is 1.0 (within 1% tolerance).

### Scenario: Correct string answer accepted case-insensitively

- Setup: Environment initialized with a question whose gold answer is "Engineering" and answer_type is "string".
- Actions: Agent submits ANSWER "engineering".
- Expected: observation.done is True, observation.reward is 1.0.

### Scenario: Correct list answer accepted order-insensitively

- Setup: Environment initialized with a question whose gold answer is "alice, bob, charlie" and answer_type is "list".
- Actions: Agent submits ANSWER "charlie, alice, bob".
- Expected: observation.done is True, observation.reward is 1.0.

### Scenario: Wrong answer rejected

- Setup: Environment initialized with any question.
- Actions: Agent submits ANSWER with clearly wrong value.
- Expected: observation.done is True, observation.reward is 0.0.

### Scenario: Backward compatibility -- no answer_type field

- Setup: Environment initialized with a legacy question record that has no answer_type (or answer_type is None).
- Actions: Agent submits ANSWER matching gold answer exactly.
- Expected: observation.done is True, observation.reward is 1.0 (string fallback used).

Run: `uv run pytest tests/test_smoke.py tests/test_verifier_integration.py -v`


## 5. Edge Cases Checklist

- Empty string predicted (after strip) returns False immediately
- Whitespace-only predicted returns False
- Non-numeric string for integer comparison returns False (ValueError caught)
- Non-numeric string for float comparison returns False (ValueError caught)
- Gold value of "0" for float comparison uses absolute tolerance 1e-9
- Float boundary at exactly 1% tolerance (should pass)
- Float just over 1% tolerance (should fail)
- Integer coercion via `int(float(x))` handles "25.0" -> 25
- Integer coercion truncates "25.9" -> 25
- List with gold_rows=None falls back to string parsing
- List with gold_rows provided uses structured comparison
- answer_type=None dispatches to string comparison
- Unknown answer_type (e.g., "table", "unknown") dispatches to string comparison
- Very large integer values (MAX_INT range)
- Unicode characters in string comparison
- Special characters in string comparison (quotes, apostrophes)
- Negative numbers for integer and float comparisons
- List with duplicate elements
- Single-element list
- Mixed whitespace in list elements

## 6. Evidence Requirements

| Category | Evidence Type | Example |
| --- | --- | --- |
| Unit tests | pytest output | `uv run pytest tests/test_verifier.py -v` -- X passed |
| Integration | pytest output | `uv run pytest tests/test_verifier_integration.py -v` -- X passed |
| E2E | pytest output via smoke tests | `uv run pytest tests/test_smoke.py -v` -- answer tests pass |
| Backward compat | pytest output | Existing test_answer_ends_episode_without_budget_decrement still passes |