# Feature Demo: F002 — Answer Verification

> **Generated:** 2026-03-27T22:37:50Z
> **Context source:** spec + discovery only (implementation not read)
> **Feature entry:** [FEATURES.json #F002](FEATURES.json)

---

## What This Feature Does

When an agent submits an `ANSWER`, this feature makes the final pass/fail decision robust to common formatting and representation differences. From a user perspective, it reduces false negatives where the agent is semantically correct but uses a different format (for example, numeric formatting differences, casing differences, or reordered list values).

The intended experience is clear and predictable scoring: tolerant float matching, order-insensitive list matching, and unambiguous terminal reward outcomes, with fewer frustrating “technically wrong but practically right” rejections.

---

## What Is Already Proven

### Verified in This Demo Run

- Happy-path typed verification scenarios pass for the integer, float, string, and list dispatch paths.
- The full integration flow through environment `step()` passes for integer/float/string/list answers and for fallback behavior.
- Edge and error behavior is exercised locally: an empty predicted answer fails, float tolerance boundary checks pass/fail correctly, and integer coercion failure returns zero reward.
- Existing smoke coverage for answer episode termination still passes.

### Previously Verified Evidence

- `specs/FEATURES.json` (`F002.verification_evidence`) records a verifier-approved run: `uv run pytest tests/ -v` with **65/65 passed** at `2026-03-27T22:33:12Z`.
- `specs/F002-IMPLEMENTATION_SPEC.md` Section 7 records completed step evidence, including the full-suite pass and the integration pass.

---

## What Still Needs User Verification

- Run one manual episode in your target runtime (your exact dataset/runtime environment) and submit a known-correct `ANSWER` with alternate formatting (for example, `42.0` vs `42`) to confirm behavior in your end-to-end setup.
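To make the tolerant-matching behavior above concrete, here is a minimal sketch of type-dispatched answer verification. Everything in it — the name `verify_answer`, the normalization rules, and the tolerances — is an illustrative assumption inferred from the test names in this demo, not the project's actual implementation.

```python
# Hypothetical sketch of format-tolerant answer verification.
# Function name, normalization rules, and tolerances are assumptions,
# not the project's real API.
import math


def verify_answer(predicted: str, gold, answer_type: str) -> bool:
    """Dispatch on the expected answer type and compare tolerantly."""
    predicted = predicted.strip()
    if not predicted:
        return False  # blank answers fail deterministically
    try:
        if answer_type == "integer":
            # "42.0" and "42" both count as the integer 42
            value = float(predicted)
            return value.is_integer() and int(value) == int(gold)
        if answer_type == "float":
            # tolerant numeric match (1% relative tolerance is an assumption)
            return math.isclose(float(predicted), float(gold),
                                rel_tol=0.01, abs_tol=1e-6)
        if answer_type == "list":
            # order-insensitive, case-insensitive comparison
            items = sorted(s.strip().lower() for s in predicted.split(","))
            return items == sorted(str(g).lower() for g in gold)
        # string (and fallback): case-insensitive exact match
        return predicted.lower() == str(gold).lower()
    except ValueError:
        return False  # coercion failure -> no reward


print(verify_answer("42.0", 42, "integer"))       # True
print(verify_answer("b, A", ["a", "B"], "list"))  # True
```

This mirrors the outcomes the demo runs below exercise: blank input rejected, `42.0` accepted for `42`, and invalid numeric strings failing cleanly instead of raising.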
---

## Quickstart / Verification Steps

> Run these commands to see the feature in action:

```bash
uv run pytest tests/test_verifier_integration.py -v
uv run pytest tests/test_verifier.py -v -k "test_verify_integer_exact_match or test_verify_float_within_tolerance or test_verify_string_case_insensitive or test_verify_list_order_insensitive"
```

Prerequisite: dependencies installed via `uv sync`.

---

## Live Local Proof

### Validate typed ANSWER handling through the environment flow

This runs the integration scenarios that exercise answer verification via the real environment step flow.

```bash
uv run pytest tests/test_verifier_integration.py -v
```

```
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 6 items

tests/test_verifier_integration.py::test_integer_answer_flow PASSED      [ 16%]
tests/test_verifier_integration.py::test_float_answer_flow PASSED        [ 33%]
tests/test_verifier_integration.py::test_string_answer_flow PASSED       [ 50%]
tests/test_verifier_integration.py::test_list_answer_flow PASSED         [ 66%]
tests/test_verifier_integration.py::test_fallback_when_answer_type_missing PASSED [ 83%]
tests/test_verifier_integration.py::test_type_coercion_failure_returns_zero_reward PASSED [100%]

============================== 6 passed in 7.92s ===============================
```

What to notice: the flow covers all core answer types, plus fallback and failure behavior, in one environment-facing test surface.

### Confirm happy-path matching behavior for core answer types

This run checks representative dispatcher-level happy cases.
```bash
uv run pytest tests/test_verifier.py -v -k "test_verify_integer_exact_match or test_verify_float_within_tolerance or test_verify_string_case_insensitive or test_verify_list_order_insensitive"
```

```
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 34 items / 30 deselected / 4 selected

tests/test_verifier.py::test_verify_integer_exact_match PASSED           [ 25%]
tests/test_verifier.py::test_verify_float_within_tolerance PASSED        [ 50%]
tests/test_verifier.py::test_verify_string_case_insensitive PASSED       [ 75%]
tests/test_verifier.py::test_verify_list_order_insensitive PASSED        [100%]

======================= 4 passed, 30 deselected in 7.87s =======================
```

What to notice: each answer type has at least one direct pass case that aligns with the feature's success criteria.

---

## Existing Evidence

- Prior full regression evidence (not re-run in this demo): `uv run pytest tests/ -v` => **65 passed** (`specs/FEATURES.json`, F002 verification evidence).

---

## Manual Verification Checklist

1. Start from a clean shell in the project root and run `uv sync`.
2. Execute `uv run pytest tests/test_verifier_integration.py -v` and confirm all 6 integration tests pass.
3. Execute the happy-path dispatcher command from the Quickstart and confirm the 4 selected tests pass.
4. Optionally run `uv run pytest tests/ -v` to confirm no regressions outside F002.
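The edge-case runs that follow exercise a specific float-tolerance boundary. The comparison below is a hedged reconstruction inferred purely from the test names (`..._boundary_exactly_1pct`, `..._gold_zero_uses_absolute_tolerance`); the helper name, thresholds, and zero handling are assumptions, and the real logic lives in the implementation.

```python
# Hypothetical reconstruction of a 1%-relative float comparison,
# inferred from test names; not the project's actual helper.

def compare_float(predicted: float, gold: float,
                  rel_tol: float = 0.01, abs_tol: float = 1e-6) -> bool:
    if gold == 0.0:
        # Relative tolerance is undefined at zero, so fall back to absolute.
        return abs(predicted) <= abs_tol
    return abs(predicted - gold) / abs(gold) <= rel_tol


print(compare_float(101.0, 100.0))  # exactly 1% off  -> True
print(compare_float(101.1, 100.0))  # just over 1%    -> False
print(compare_float(5e-7, 0.0))     # near-zero gold  -> True
```

Under this reading, the boundary test passes because exactly 1% deviation is accepted (`<=`), while anything beyond it is strictly rejected.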
---

## Edge Cases Exercised

### Empty predicted answer is rejected

```bash
uv run pytest tests/test_verifier.py -v -k "test_verify_empty_predicted_returns_false"
```

```
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 34 items / 33 deselected / 1 selected

tests/test_verifier.py::test_verify_empty_predicted_returns_false PASSED [100%]

======================= 1 passed, 33 deselected in 7.83s =======================
```

This matters because blank answers should fail deterministically rather than being ambiguously normalized.

### Float tolerance boundary and non-numeric rejection

```bash
uv run pytest tests/test_verifier.py -v -k "_compare_float"
```

```
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 34 items / 26 deselected / 8 selected

tests/test_verifier.py::test_compare_float_exact_match PASSED            [ 12%]
tests/test_verifier.py::test_compare_float_within_1pct_tolerance PASSED  [ 25%]
tests/test_verifier.py::test_compare_float_outside_1pct_tolerance PASSED [ 37%]
tests/test_verifier.py::test_compare_float_boundary_exactly_1pct PASSED  [ 50%]
tests/test_verifier.py::test_compare_float_just_over_1pct PASSED         [ 62%]
tests/test_verifier.py::test_compare_float_gold_zero_uses_absolute_tolerance PASSED [ 75%]
tests/test_verifier.py::test_compare_float_gold_zero_fails_large_diff PASSED [ 87%]
tests/test_verifier.py::test_compare_float_non_numeric_returns_false PASSED [100%]

======================= 8 passed, 26 deselected in 7.10s =======================
```

This matters because it validates both tolerant matching and strict rejection when values exceed the tolerance or are invalid.

### Type coercion failure returns zero reward in the integration flow

```bash
uv run pytest tests/test_verifier_integration.py -v -k "type_coercion_failure"
```

```
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 6 items / 5 deselected / 1 selected

tests/test_verifier_integration.py::test_type_coercion_failure_returns_zero_reward PASSED [100%]

======================= 1 passed, 5 deselected in 6.95s ========================
```

This matters because invalid numeric answers fail cleanly without crashing the answer flow.

---

## Test Evidence (Optional)

> Supplementary proof that the feature works correctly across all scenarios.
> The Live Local Proof section above shows how to use the feature; this section shows it was tested.
| Test Suite | Tests | Status |
|---|---|---|
| `tests/test_verifier_integration.py` | 6 | All passed |
| `tests/test_verifier.py` selected happy-path dispatcher tests | 4 selected | All passed |
| `tests/test_verifier.py` selected float edge/error tests | 8 selected | All passed |
| `tests/test_smoke.py` selected ANSWER compatibility test | 1 selected | All passed |

Representative command:

```bash
uv run pytest tests/test_smoke.py -v -k "answer"
```

```
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 25 items / 24 deselected / 1 selected

tests/test_smoke.py::TestEnvironment::test_answer_ends_episode_without_budget_decrement PASSED [100%]

======================= 1 passed, 24 deselected in 7.49s =======================
```

---

## Feature Links

- Implementation spec: `specs/F002-IMPLEMENTATION_SPEC.md`
- Verification spec: `specs/F002-VERIFICATION_SPEC.md`

---

*Demo generated by `feature-demo` agent. Re-run with `/feature-demo F002` to refresh.*