Implementation Specification
Change: F002 -- Answer Verification (multi-type comparison)
Date: 2026-03-27
Research Summary: specs/F002-RESEARCH_SUMMARY.md
Verification Spec: See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
Behavior Delta: Archived into specs/behavior/sql-environment.md
Plan Status:
- Draft
- Approved for Implementation
- Implementation Complete
- Verification Passed
Core Intent (Immutable)
DO NOT MODIFY THIS SECTION DURING REFINEMENT. Changes to Core Intent mean you're describing a different feature. If refinement reveals the need to change this section, create a new feature instead.
User Problem: When an agent submits ANSWER, the environment correctly determines if the answer matches the gold answer regardless of type (42 vs 42.0, 'Engineering' vs 'engineering', unordered lists).
Success Criteria:
- Float comparison with tolerance handles rounding gracefully (95000.1 matches 95000)
- List comparison ignores order: ['A','B'] matches ['B','A']
- Clear pass/fail with no ambiguity
Avoid:
- Correct answer rejected due to trivial formatting difference
- Type coercion failures (agent says '42', gold is integer 42)
Out of Scope:
- Table comparison (multi-column row overlap) -- deferred to post-MVP
- Partial credit scoring -- binary pass/fail only at this layer
- Changes to reward signal structure (F003 scope)
0. Slicing & Scope Budget (Anti-Waterfall)
This spec must be executable in small, mergeable increments.
Scope Budget
- Target: 2 slices
- Hard max: <= 10 steps total
- Each step must end in: implement -> verify -> merge
Slice Definition
A slice is a vertical increment that delivers user-visible value or a safe internal capability.
Each slice must have:
- Clear outcome
- Minimal interface change
- Merge criteria
Note: Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).
Status Icons
Step Status:
- ⬜ Not Started
- 🔄 In Progress
- ✅ Completed
- ❌ Blocked/Failed
Result Outcome:
- ✅ Fully Successful (all tests passed, no issues)
- ⚠️ Completed with Issues (needs follow-up)
- ❌ Failed/Blocked
1. Implementation Overview
Summary
Implement `verify_answer()` in `server/verifier.py` with type-aware comparison dispatching across four answer types (integer, float, string, list). Wire it into `_handle_answer()` in `server/sql_environment.py`, replacing the naive string comparison. Add a `gold_rows` field to `EpisodeContext` so the verifier receives raw data for accurate list comparison. Fall back to string comparison when `answer_type` is missing.
Scope
In Scope:
- `verify_answer()` public function with 4 type comparers
- Private helpers: `_normalize_value`, `_compare_integer`, `_compare_float`, `_compare_string`, `_compare_list`
- `gold_rows` field on `EpisodeContext`
- Integration into `_handle_answer()`
- Unit tests for all comparers and edge cases
Out of Scope:
- Table comparison (multi-column)
- Partial credit / dense reward (F003)
- Changes to question data schema (answer_type already exists)
- External dependencies (pure Python only)
1a. Execution Status
- Progress: 4/4 steps complete
- Current Step: None (all implementation steps complete)
- Last Updated: 2026-03-27T22:33:12Z
- Latest Result: Fully Successful (all tests passed, no issues)
- Blockers: None
1b. Risk Assessment
Risk Tier: Low
High-Risk Indicators Present: (none apply)
- Touches authentication or authorization logic
- Handles payment processing or financial data
- Manages secrets, API keys, or credentials
- Processes untrusted user input (file uploads, external APIs)
- Modifies privilege/permission systems
Security Review Required: No
Justification: Pure logic module that compares two values. No user input beyond agent's ANSWER string (already sanitized by action parsing). No I/O, no network, no secrets.
2. Change Manifest
Files to Create
| File | Purpose |
|---|---|
| `tests/test_verifier.py` | Unit tests for all comparison types and edge cases |
Files to Modify
| File | Changes |
|---|---|
| `server/verifier.py` | Replace stub with full `verify_answer()` + private helpers |
| `models.py` | Add `gold_rows: list[tuple] \| None` field to `EpisodeContext` |
| `server/sql_environment.py` | Wire `verify_answer()` into `_handle_answer()`, populate `gold_rows` |
Files to Delete
None.
3. Interface Specifications
Modified Types
```python
# Location: models.py
# CHANGE: Add gold_rows field to EpisodeContext

@dataclass
class EpisodeContext:
    """Per-episode server-side state (never sent to agent)."""

    episode_id: str
    db_connection: sqlite3.Connection
    question_record: QuestionRecord
    step_count: int = 0
    budget: int = 15
    described_tables: set[str] = dataclass_field(default_factory=set)
    action_log: list[str] = dataclass_field(default_factory=list)
    done: bool = False
    gold_answer: str | None = None
    gold_rows: list[tuple] | None = None  # NEW: raw SQL result rows for verifier
```
New Functions
```python
# Location: server/verifier.py

def verify_answer(
    predicted: str,
    gold: str,
    answer_type: str | None = None,
    gold_rows: list[tuple] | None = None,
) -> bool:
    """
    Compare agent's submitted answer against the gold answer.

    Dispatches to type-specific comparers based on answer_type.
    Falls back to string comparison when answer_type is None or unknown.

    Args:
        predicted: The agent's submitted answer string.
        gold: The gold answer as a formatted string.
        answer_type: One of "integer", "float", "string", "list", or None.
        gold_rows: Raw SQL result rows (list of tuples) for accurate list comparison.

    Returns:
        True if the answer is correct, False otherwise.
    """
```

```python
# Location: server/verifier.py (private helpers)

def _normalize_value(value: str) -> str:
    """Strip whitespace and lowercase a value for comparison."""

def _compare_integer(predicted: str, gold: str) -> bool:
    """
    Compare as integers after coercing both sides.
    Handles: "42" vs 42, "42.0" vs 42.
    Returns False on ValueError (non-numeric input).
    """

def _compare_float(predicted: str, gold: str, tolerance: float = 0.01) -> bool:
    """
    Compare as floats with relative tolerance (default 1%).
    Uses: abs(pred - gold) <= tolerance * abs(gold) when gold != 0.
    For gold == 0: uses absolute tolerance of 1e-9.
    Returns False on ValueError.
    """

def _compare_string(predicted: str, gold: str) -> bool:
    """Case-insensitive, whitespace-normalized string comparison."""

def _compare_list(
    predicted: str,
    gold: str,
    gold_rows: list[tuple] | None = None,
) -> bool:
    """
    Order-insensitive set comparison.
    If gold_rows is provided, converts both sides to sets of normalized strings.
    Otherwise parses the formatted string (split on ' | ' and newlines).
    """
```
Modified Functions
```python
# Location: server/sql_environment.py
# CHANGE: Replace naive comparison with verify_answer() call

def _handle_answer(self, value: str) -> tuple[bool, float]:
    """Compare submitted answer against episode gold answer using type-aware verifier."""
    if self._episode is None:
        raise RuntimeError("No active episode. Call reset() before step().")
    is_correct = verify_answer(
        predicted=value,
        gold=self._episode.gold_answer or "",
        answer_type=self._episode.question_record.answer_type,
        gold_rows=self._episode.gold_rows,
    )
    self._episode.done = True
    return is_correct, 1.0 if is_correct else 0.0
```
4. Data Flow
Primary Flow
1. Agent sends ANSWER action with value string
- Input: action.argument (str)
2. step() dispatches to _handle_answer(value)
- Input: value (str)
3. _handle_answer() calls verify_answer(predicted, gold, answer_type, gold_rows)
- predicted: value (agent's answer)
- gold: self._episode.gold_answer (formatted string)
- answer_type: self._episode.question_record.answer_type
- gold_rows: self._episode.gold_rows (raw tuples or None)
4. verify_answer() dispatches by answer_type:
- "integer" -> _compare_integer(predicted, gold)
- "float" -> _compare_float(predicted, gold)
- "string" -> _compare_string(predicted, gold)
- "list" -> _compare_list(predicted, gold, gold_rows)
- None/unknown -> _compare_string(predicted, gold)
5. Returns bool -> _handle_answer returns (bool, float reward)
Alternative Flows
When answer_type is None or unknown:
1. verify_answer receives answer_type=None
2. Falls back to _compare_string(predicted, gold)
3. Returns bool (case-insensitive normalized comparison)
When predicted or gold is empty/None:
1. verify_answer receives empty string or None-coerced value
2. Returns False immediately (no valid answer to compare)
When type coercion fails (e.g., "abc" as integer):
1. _compare_integer or _compare_float catches ValueError
2. Falls back to returning False
5. Error Handling
Error Types
| Error | When | Behavior |
|---|---|---|
| `ValueError` (caught internally) | Predicted value cannot be coerced to int/float | Return False (not correct) |
| `RuntimeError` | `_handle_answer` called with no active episode | Raised to caller (existing behavior) |
Error Handling Strategy
```python
# Pattern: catch coercion errors, return False (answer is wrong, not a crash)
def _compare_integer(predicted: str, gold: str) -> bool:
    try:
        return int(float(predicted)) == int(float(gold))
    except (ValueError, TypeError):
        return False
```
Retry Strategy
| Operation | Retry? | Strategy |
|---|---|---|
| `verify_answer()` | No | Deterministic comparison, no transient failures |
6. Slice Plan (What we will ship, in order)
Slice S1 -- Core Verifier Module
Value: verify_answer() exists as a tested, standalone module with all 4 type comparers
User-visible change: No (not yet wired in)
Interfaces introduced/changed: verify_answer(), _normalize_value(), _compare_integer(), _compare_float(), _compare_string(), _compare_list()
Rollback safety: Additive only -- new file, no existing code changed
Slice S2 -- Integration and Wiring
Value: _handle_answer() uses type-aware verification; agents get correct results for float/list/integer answers
User-visible change: Yes -- agent answers previously rejected (e.g., "42" vs integer 42) now accepted
Interfaces introduced/changed: EpisodeContext.gold_rows, modified _handle_answer()
Rollback safety: Revert to naive string compare by removing import and restoring 3 lines
7. Implementation Steps
VERIFICATION NOTE: Test criteria for each step are defined in VERIFICATION_SPEC.md. The verification-planner (separate agent) generated independent test criteria. Run the tests specified there after implementing each step.
Step 1.1: Implement verify_answer module
Slice: S1
Goal: Create the complete verify_answer() function with all 4 type-specific comparers in server/verifier.py.
Files:
- `server/verifier.py` - modify - Replace stub with full implementation
Interface Changes:
- New public function: `verify_answer(predicted, gold, answer_type, gold_rows) -> bool`
- New private helpers: `_normalize_value`, `_compare_integer`, `_compare_float`, `_compare_string`, `_compare_list`
Implementation Details:
- Replace the docstring-only stub in `server/verifier.py` with the full module.
- `verify_answer()` uses match/case on `answer_type` to dispatch.
- `_normalize_value(value)`: `value.strip().lower()`.
- `_compare_integer(pred, gold)`: coerce both via `int(float(x))`, exact match. Catch ValueError -> False.
- `_compare_float(pred, gold, tolerance=0.01)`: relative tolerance `abs(p - g) <= tol * abs(g)`. For g==0, absolute tolerance 1e-9. Catch ValueError -> False.
- `_compare_string(pred, gold)`: `_normalize_value(pred) == _normalize_value(gold)`.
- `_compare_list(pred, gold, gold_rows)`: If `gold_rows` is provided, build gold set from `{str(cell) for row in gold_rows for cell in row}`. Parse predicted by splitting on `,` and `\n`. Normalize both sides, compare as sets. If no `gold_rows`, parse gold string by splitting on `|` and `\n`.
- Guard: if `predicted` is empty after strip, return False immediately.
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-27T22:18:15Z
Changes Made:
- `server/verifier.py` - replaced stub content with `verify_answer()` and helper comparers for integer, float, string, and list handling.
Result:
- Outcome: Fully Successful
- Evidence Captured: `uv run --extra dev pytest tests/ -v` -> 25 passed in 81.43s
- Tests run: `uv run --extra dev pytest tests/ -v`
- Notes:
  - Implemented `verify_answer()` dispatch with fallback to normalized string comparison for unknown or missing answer types.
  - Added deterministic helper behavior: integer coercion via `int(float(x))`, float relative tolerance (1%), and list set comparison.
  - Used `uv run --extra dev` because the local environment did not yet include pytest from dev extras.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Add `tests/test_verifier.py` coverage for dispatcher paths, comparer edge cases, and fallback logic from `specs/F002-VERIFICATION_SPEC.md`.
Step 1.2: Unit tests for verifier
Slice: S1
Goal: Create comprehensive unit tests covering all 4 answer types, edge cases, and the fallback path.
Files:
- `tests/test_verifier.py` - create - Unit tests for `verify_answer` and all comparers
Interface Changes: None (test-only)
Implementation Details:
- Test `_compare_integer`: "42" vs "42", "42.0" vs "42", "abc" vs "42" (False), "" vs "42" (False).
- Test `_compare_float`: "95000.1" vs "95000" (True, within 1%), "100" vs "200" (False), "0" vs "0" (True), "abc" vs "1.0" (False).
- Test `_compare_string`: "Engineering" vs "engineering" (True), " hello " vs "hello" (True), "a" vs "b" (False).
- Test `_compare_list`: "A, B" vs "B, A" (True), "A" vs "A, B" (False), test with `gold_rows` provided.
- Test `verify_answer` dispatch: each type routes correctly, None/unknown falls back to string.
- Test edge cases: empty predicted (False), None gold coerced to "" (False).
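The float edge cases above can be written as a table-driven check. The comparer body below is a local illustrative copy mirroring the tolerance rule from Section 3 (the real one belongs in `server/verifier.py`), and the loop stands in for what would be `pytest.mark.parametrize` entries in the actual suite:

```python
def _compare_float(predicted: str, gold: str, tolerance: float = 0.01) -> bool:
    """Illustrative copy: 1% relative tolerance, absolute check when gold == 0."""
    try:
        p, g = float(predicted), float(gold)
    except (ValueError, TypeError):
        return False  # coercion failure means the answer is wrong, not a crash
    if g == 0:
        return abs(p) <= 1e-9
    return abs(p - g) <= tolerance * abs(g)

# (predicted, gold, expected) cases taken directly from the step description.
CASES = [
    ("95000.1", "95000", True),   # within 1% relative tolerance
    ("100", "200", False),        # far outside tolerance
    ("0", "0", True),             # gold == 0 path (absolute tolerance)
    ("abc", "1.0", False),        # non-numeric input
]

def run_cases() -> list[bool]:
    """One pass/fail flag per case: does the comparer match the expectation?"""
    return [_compare_float(p, g) == expected for p, g, expected in CASES]
```

In the real test file each tuple would become one parametrized test case, so a failing tolerance boundary is reported individually.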
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-27T22:21:30Z
Changes Made:
- `tests/test_verifier.py` - created comprehensive unit coverage for verifier dispatch and helper comparers across integer, float, string, and list cases.
Result:
- Outcome: Fully Successful
- Evidence Captured: `uv run pytest tests/test_verifier.py -v` -> 31 passed in 6.19s
- Tests run: `uv run pytest tests/test_verifier.py -v`
- Notes:
  - Added dispatcher tests for all answer types plus fallback and empty-predicted guards.
  - Added comparer edge-case tests (int truncation, float tolerance boundaries, list parsing with/without `gold_rows`).
  - Kept coverage aligned to existing verifier behavior (normalized whitespace/case comparison).
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Add `gold_rows` to `EpisodeContext` in `models.py` and persist raw gold query rows during `reset()` in `server/sql_environment.py`.
Step 2.1: Add gold_rows to EpisodeContext and populate during reset
Slice: S2
Goal: Add gold_rows field to EpisodeContext and populate it when an episode is reset (alongside gold_answer).
Files:
- `models.py` - modify - Add `gold_rows: list[tuple] | None = None` to `EpisodeContext`
- `server/sql_environment.py` - modify - Populate `gold_rows` during episode reset where `gold_answer` is set
Interface Changes:
- `EpisodeContext.gold_rows: list[tuple] | None = None` (new field)
Implementation Details:
- Add `gold_rows: list[tuple] | None = None` to the `EpisodeContext` dataclass after `gold_answer`.
- In `sql_environment.py`, find where `gold_answer` is populated during `reset()`. At the same location, store the raw rows in `gold_rows` before they are formatted.
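The reset-time population can be sketched with an in-memory SQLite database. `compute_gold` and the `departments` schema are illustrative stand-ins, not the project's real names; the point is that the raw rows are captured once, before formatting, so no extra SQL execution path is introduced:

```python
import sqlite3

def compute_gold(conn: sqlite3.Connection, gold_sql: str) -> tuple[str, list[tuple]]:
    """Execute the gold query once, keeping both the raw rows (for the list
    comparer) and a formatted string (for string comparison and display)."""
    rows = conn.execute(gold_sql).fetchall()
    formatted = " | ".join(str(cell) for row in rows for cell in row)
    return formatted, rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE departments (name TEXT)")
conn.executemany("INSERT INTO departments VALUES (?)", [("Engineering",), ("Sales",)])

# Both values would be stored on EpisodeContext during reset():
# gold_answer as before, gold_rows as the new field.
gold_answer, gold_rows = compute_gold(conn, "SELECT name FROM departments ORDER BY name")
```

Keeping the tuples alongside the formatted string is what lets `_compare_list` avoid re-parsing a `' | '`-joined string.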
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-27T22:24:54Z
Changes Made:
- `models.py` - added `gold_rows: list[tuple] | None = None` to `EpisodeContext`.
- `server/sql_environment.py` - persisted raw gold query rows into `EpisodeContext.gold_rows` during `reset()`.
- `tests/test_verifier.py` - added `EpisodeContext.gold_rows` unit tests (default `None`, populated list, empty list).
Result:
- Outcome: Fully Successful
- Evidence Captured: `uv run pytest tests/test_verifier.py -v` -> 34 passed in 6.18s
- Tests run: `uv run pytest tests/test_verifier.py -v`
- Notes:
  - Stored structured `gold_rows` at reset time, where the gold SQL is already executed, so no extra SQL execution path was introduced.
  - Added direct dataclass tests for `EpisodeContext.gold_rows` to satisfy verification criteria for the new interface field.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Replace `_handle_answer()` naive normalized string equality with `verify_answer(predicted, gold, answer_type, gold_rows)` and keep the terminal reward mapping unchanged.
Step 2.2: Wire verify_answer into _handle_answer
Slice: S2
Goal: Replace naive string comparison in _handle_answer() with verify_answer() call.
Files:
- `server/sql_environment.py` - modify - Import and call `verify_answer()` in `_handle_answer()`
Interface Changes:
- Modified function: `_handle_answer()` now delegates to `verify_answer()`
Implementation Details:
- Add import: `from server.verifier import verify_answer` at top of `sql_environment.py`.
- Replace the body of `_handle_answer()`:
  - Remove: `submitted = value.strip().lower()` / `expected = ...` / `is_correct = submitted == expected`
  - Add: `is_correct = verify_answer(predicted=value, gold=self._episode.gold_answer or "", answer_type=self._episode.question_record.answer_type, gold_rows=self._episode.gold_rows)`
  - Keep: `self._episode.done = True` and `return is_correct, 1.0 if is_correct else 0.0`
- Run existing smoke tests to confirm no regressions.
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
- Existing 25 smoke tests still pass
Status: Completed
Completed: 2026-03-27T22:33:12Z
Changes Made:
- `server/sql_environment.py` - imported `verify_answer` and replaced `_handle_answer()` naive normalized-string equality with `verify_answer(predicted, gold, answer_type, gold_rows)`.
- `tests/test_verifier_integration.py` - added integration coverage for integer/float/string/list answer flows, fallback behavior for missing `answer_type`, and the numeric coercion failure path.
Result:
- Outcome: Fully Successful
- Evidence Captured:
  - `uv run pytest tests/test_verifier.py -v` -> 34 passed in 6.64s
  - `uv run pytest tests/test_smoke.py -v` -> 25 passed in 6.53s
  - `uv run pytest tests/test_verifier_integration.py -v` -> 6 passed in 6.65s
  - `uv run pytest tests/ -v` -> 65 passed in 6.62s
- Tests run: `uv run pytest tests/test_verifier.py -v`; `uv run pytest tests/test_smoke.py -v`; `uv run pytest tests/test_verifier_integration.py -v`; `uv run pytest tests/ -v`
- Notes:
  - `_handle_answer()` now uses a single verifier dispatch path, keeping answer comparison logic centralized in `server/verifier.py`.
  - Added integration tests because `VERIFICATION_SPEC.md` expected `tests/test_verifier_integration.py` evidence.
  - The behavior delta was archived into `specs/behavior/sql-environment.md` and the delta file was removed.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Implementation complete. Proceed with the commit/PR workflow (`/commit-push-pr`) for F002.
8. Rollout Considerations
Feature Flags
- Required: No
- Flag name: N/A
Migration
- Data migration needed: No
- Migration strategy: N/A
Rollback Plan
Revert _handle_answer() to inline string comparison (3 lines). The verify_answer() module and gold_rows field are additive and harmless if unused.
9. Execution Tracking
All execution state is tracked within this document:
- Section 1a: Overall progress summary
- Section 7: Per-step completion details, test results, and handoff context
- FEATURES.json: Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
- Git history: Full audit trail of changes to this file
The implementing agent updates this document after each step and keeps the matching FEATURES.json entry in sync during implementation/finalization. Humans can monitor progress by:
- Checking Section 1a for summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's `progress` and `status` fields in `FEATURES.json`
- Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history
9a. Slice Completion Protocol
After all steps in a slice pass verification:
Run verifier subagent for spec compliance
- Validates against VERIFICATION_SPEC.md criteria
- Ensures no TODOs or incomplete work in slice
Run compound-engineer subagent to extract learnings
- Mandatory invocation after every slice completion
- Updates CLAUDE.md Learnings section (if durable patterns found)
- May exit with "no update needed" (valid for routine work)
Commit the slice changes
- Follow commit message format in CLAUDE.md
- Each slice gets its own atomic commit
Continue to next slice (if more slices remain)
- Or proceed to final verification if all slices complete
Note: PR creation happens only after ALL slices are complete. Use /commit-push-pr manually when ready.
10. User Value Summary
Status: Generated
What Users Can Now Do
Users can now submit answers across integer, float, string, and list questions and get correct pass/fail outcomes even when answers differ in formatting, case, numeric representation, or list ordering.
How to Access/Test
Run `uv run pytest tests/test_verifier.py tests/test_verifier_integration.py -v`, or run `uv run pytest tests/ -v` for full regression coverage including end-to-end ANSWER handling through `SQLEnvironment.step()`.
Demo
- Command: `uv run pytest tests/test_verifier_integration.py -v`
Release Notes Snippet
Added type-aware answer verification so ANSWER correctness now supports numeric coercion, float tolerance, case-insensitive strings, and order-insensitive list matching.
11. PR Contract (Auto-Generated by autocode-next-step)
Status: Generated
Summary
- Implemented type-aware answer verification in environment answer handling by routing `_handle_answer()` through `verify_answer()`.
- Added integration coverage for typed answer paths and fallback behavior (`tests/test_verifier_integration.py`).
- Archived the F002 behavior delta into `specs/behavior/sql-environment.md` and captured durable learnings in `docs/learnings/F002-*.md`.
Validation
- `uv run pytest tests/test_verifier.py -v` -> 34 passed
- `uv run pytest tests/test_smoke.py -v` -> 25 passed
- `uv run pytest tests/test_verifier_integration.py -v` -> 6 passed
- `uv run pytest tests/ -v` -> 65 passed
Scope and Risk
- Risk tier: Low
- Security-sensitive changes: None
- Scope creep: None (added integration tests to satisfy verification spec evidence requirements)
Ready Action
All steps completed. Run `/commit-push-pr`.
PR Created
https://github.com/hjerpe/sql-env/pull/7
Stop Conditions (When to Split This Spec)
Stop and create a new IMPLEMENTATION_SPEC if:
- A step requires touching more than 3 files in unrelated areas
- You need to introduce multiple new abstractions "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices
When splitting, ensure the current slice ends in a merged, stable state.
Human Checkpoint
Before handing to AI agent:
- Interface specifications are complete
- Data flow is accurate
- Error handling is specified
- Implementation order makes sense
- VERIFICATION_SPEC.md has been generated
Questions:
- Should float tolerance be configurable per-question or fixed at 1%?
- Any additional answer_type values beyond the four specified?
Handoff Notes
For the implementing AI agent:
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Key decisions:
- gold_rows passed raw to verifier (not just formatted string)
- Fallback to string comparison when answer_type is None/unknown
- No external dependencies -- pure Python only
- match/case dispatch, not class hierarchy
Specification completed: 2026-03-27 Approved by: [NAME/ROLE] Verification spec: VERIFICATION_SPEC.md Verification input: F002-VERIFICATION_INPUT.json Target agent: Claude Code