sql_env/specs/F002-IMPLEMENTATION_SPEC.md

Implementation Specification

  • Change: F002 -- Answer Verification (multi-type comparison)
  • Date: 2026-03-27
  • Research Summary: specs/F002-RESEARCH_SUMMARY.md
  • Verification Spec: VERIFICATION_SPEC.md (generated by autocode-verification-planner)
  • Behavior Delta: Archived into specs/behavior/sql-environment.md

Plan Status:

  • Draft
  • Approved for Implementation
  • Implementation Complete
  • Verification Passed

Core Intent (Immutable)

DO NOT MODIFY THIS SECTION DURING REFINEMENT. Changes to Core Intent mean you are describing a different feature; if refinement reveals the need to change this section, create a new feature instead.

User Problem: When an agent submits ANSWER, the environment must correctly determine whether the answer matches the gold answer regardless of type (42 vs 42.0, 'Engineering' vs 'engineering', unordered lists).

Success Criteria:

  • Float comparison with tolerance handles rounding gracefully (95000.1 matches 95000)
  • List comparison ignores order: ['A','B'] matches ['B','A']
  • Clear pass/fail with no ambiguity

Avoid:

  • Correct answer rejected due to trivial formatting difference
  • Type coercion failures (agent says '42', gold is integer 42)

Out of Scope:

  • Table comparison (multi-column row overlap) -- deferred to post-MVP
  • Partial credit scoring -- binary pass/fail only at this layer
  • Changes to reward signal structure (F003 scope)

0. Slicing & Scope Budget (Anti-Waterfall)

This spec must be executable in small, mergeable increments.

Scope Budget

  • Target: 2 slices
  • Hard max: <= 10 steps total
  • Each step must end in: implement -> verify -> merge

Slice Definition

A slice is a vertical increment that delivers user-visible value or a safe internal capability.

Each slice must have:

  • Clear outcome
  • Minimal interface change
  • Merge criteria

Note: Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).

Status Icons

Step Status:

  • ⬜ Not Started
  • 🔄 In Progress
  • ✅ Completed
  • ❌ Blocked/Failed

Result Outcome:

  • ✅ Fully Successful (all tests passed, no issues)
  • ⚠️ Completed with Issues (needs follow-up)
  • ❌ Failed/Blocked

1. Implementation Overview

Summary

Implement verify_answer() in server/verifier.py with type-aware comparison dispatching across four answer types (integer, float, string, list). Wire it into _handle_answer() in server/sql_environment.py, replacing the naive string comparison. Add a gold_rows field to EpisodeContext so the verifier receives raw data for accurate list comparison. Fall back to string comparison when answer_type is missing.

Scope

In Scope:

  • verify_answer() public function with 4 type comparers
  • Private helpers: _normalize_value, _compare_integer, _compare_float, _compare_string, _compare_list
  • gold_rows field on EpisodeContext
  • Integration into _handle_answer()
  • Unit tests for all comparers and edge cases

Out of Scope:

  • Table comparison (multi-column)
  • Partial credit / dense reward (F003)
  • Changes to question data schema (answer_type already exists)
  • External dependencies (pure Python only)

1a. Execution Status

  • Progress: 4/4 steps complete
  • Current Step: None (all implementation steps complete)
  • Last Updated: 2026-03-27T22:33:12Z
  • Latest Result: Fully Successful (all tests passed, no issues)
  • Blockers: None


1b. Risk Assessment

Risk Tier: Low

High-Risk Indicators Present: (none apply)

  • Touches authentication or authorization logic
  • Handles payment processing or financial data
  • Manages secrets, API keys, or credentials
  • Processes untrusted user input (file uploads, external APIs)
  • Modifies privilege/permission systems

Security Review Required: No

Justification: Pure logic module that compares two values. No user input beyond agent's ANSWER string (already sanitized by action parsing). No I/O, no network, no secrets.


2. Change Manifest

Files to Create

| File | Purpose |
| --- | --- |
| tests/test_verifier.py | Unit tests for all comparison types and edge cases |

Files to Modify

| File | Changes |
| --- | --- |
| server/verifier.py | Replace stub with full verify_answer() + private helpers |
| models.py | Add `gold_rows: list[tuple] \| None = None` to EpisodeContext |
| server/sql_environment.py | Wire verify_answer() into _handle_answer(), populate gold_rows |

Files to Delete

None.


3. Interface Specifications

Modified Types

```python
# Location: models.py
# CHANGE: Add gold_rows field to EpisodeContext

@dataclass
class EpisodeContext:
    """Per-episode server-side state (never sent to agent)."""

    episode_id: str
    db_connection: sqlite3.Connection
    question_record: QuestionRecord
    step_count: int = 0
    budget: int = 15
    described_tables: set[str] = dataclass_field(default_factory=set)
    action_log: list[str] = dataclass_field(default_factory=list)
    done: bool = False
    gold_answer: str | None = None
    gold_rows: list[tuple] | None = None  # NEW: raw SQL result rows for verifier
```

New Functions

```python
# Location: server/verifier.py

def verify_answer(
    predicted: str,
    gold: str,
    answer_type: str | None = None,
    gold_rows: list[tuple] | None = None,
) -> bool:
    """
    Compare agent's submitted answer against the gold answer.

    Dispatches to type-specific comparers based on answer_type.
    Falls back to string comparison when answer_type is None or unknown.

    Args:
        predicted: The agent's submitted answer string.
        gold: The gold answer as a formatted string.
        answer_type: One of "integer", "float", "string", "list", or None.
        gold_rows: Raw SQL result rows (list of tuples) for accurate list comparison.

    Returns:
        True if the answer is correct, False otherwise.
    """

# Location: server/verifier.py (private helpers)

def _normalize_value(value: str) -> str:
    """Strip whitespace and lowercase a value for comparison."""

def _compare_integer(predicted: str, gold: str) -> bool:
    """
    Compare as integers after coercing both sides.

    Handles: "42" vs 42, "42.0" vs 42.
    Returns False on ValueError (non-numeric input).
    """

def _compare_float(predicted: str, gold: str, tolerance: float = 0.01) -> bool:
    """
    Compare as floats with relative tolerance (default 1%).

    Uses: abs(pred - gold) <= tolerance * abs(gold) when gold != 0.
    For gold == 0: uses absolute tolerance of 1e-9.
    Returns False on ValueError.
    """

def _compare_string(predicted: str, gold: str) -> bool:
    """Case-insensitive, whitespace-normalized string comparison."""

def _compare_list(
    predicted: str,
    gold: str,
    gold_rows: list[tuple] | None = None,
) -> bool:
    """
    Order-insensitive set comparison.

    If gold_rows is provided, converts both sides to sets of normalized strings.
    Otherwise parses the formatted string (split on ' | ' and newlines).
    """
```

Modified Functions

```python
# Location: server/sql_environment.py
# CHANGE: Replace naive comparison with verify_answer() call

def _handle_answer(self, value: str) -> tuple[bool, float]:
    """Compare submitted answer against episode gold answer using type-aware verifier."""
    if self._episode is None:
        raise RuntimeError("No active episode. Call reset() before step().")

    is_correct = verify_answer(
        predicted=value,
        gold=self._episode.gold_answer or "",
        answer_type=self._episode.question_record.answer_type,
        gold_rows=self._episode.gold_rows,
    )
    self._episode.done = True
    return is_correct, 1.0 if is_correct else 0.0
```

4. Data Flow

Primary Flow

1. Agent sends ANSWER action with value string
   - Input: action.argument (str)

2. step() dispatches to _handle_answer(value)
   - Input: value (str)

3. _handle_answer() calls verify_answer(predicted, gold, answer_type, gold_rows)
   - predicted: value (agent's answer)
   - gold: self._episode.gold_answer (formatted string)
   - answer_type: self._episode.question_record.answer_type
   - gold_rows: self._episode.gold_rows (raw tuples or None)

4. verify_answer() dispatches by answer_type:
   - "integer" -> _compare_integer(predicted, gold)
   - "float"   -> _compare_float(predicted, gold)
   - "string"  -> _compare_string(predicted, gold)
   - "list"    -> _compare_list(predicted, gold, gold_rows)
   - None/unknown -> _compare_string(predicted, gold)

5. Returns bool -> _handle_answer returns (bool, float reward)

Alternative Flows

When answer_type is None or unknown:

1. verify_answer receives answer_type=None
2. Falls back to _compare_string(predicted, gold)
3. Returns bool (case-insensitive normalized comparison)

When predicted or gold is empty/None:

1. verify_answer receives empty string or None-coerced value
2. Returns False immediately (no valid answer to compare)

When type coercion fails (e.g., "abc" as integer):

1. _compare_integer or _compare_float catches ValueError
2. Falls back to returning False

5. Error Handling

Error Types

| Error | When | Behavior |
| --- | --- | --- |
| ValueError (caught internally) | Predicted value cannot be coerced to int/float | Return False (not correct) |
| RuntimeError | _handle_answer called with no active episode | Raised to caller (existing behavior) |

Error Handling Strategy

```python
# Pattern: catch coercion errors, return False (answer is wrong, not a crash)
def _compare_integer(predicted: str, gold: str) -> bool:
    try:
        return int(float(predicted)) == int(float(gold))
    except (ValueError, TypeError):
        return False
```

Retry Strategy

| Operation | Retry? | Strategy |
| --- | --- | --- |
| verify_answer() | No | Deterministic comparison, no transient failures |

6. Slice Plan (What we will ship, in order)

Slice S1 -- Core Verifier Module

  • Value: verify_answer() exists as a tested, standalone module with all 4 type comparers
  • User-visible change: No (not yet wired in)
  • Interfaces introduced/changed: verify_answer(), _normalize_value(), _compare_integer(), _compare_float(), _compare_string(), _compare_list()
  • Rollback safety: Additive only -- new file, no existing code changed

Slice S2 -- Integration and Wiring

  • Value: _handle_answer() uses type-aware verification; agents get correct results for float/list/integer answers
  • User-visible change: Yes -- agent answers previously rejected (e.g., "42" vs integer 42) now accepted
  • Interfaces introduced/changed: EpisodeContext.gold_rows, modified _handle_answer()
  • Rollback safety: Revert to naive string compare by removing the import and restoring 3 lines


7. Implementation Steps

VERIFICATION NOTE: Test criteria for each step are defined in VERIFICATION_SPEC.md. The verification-planner (separate agent) generated independent test criteria. Run the tests specified there after implementing each step.

Step 1.1: Implement verify_answer module

Slice: S1
Goal: Create the complete verify_answer() function with all 4 type-specific comparers in server/verifier.py.

Files:

  • server/verifier.py - modify - Replace stub with full implementation

Interface Changes:

  • New public function: verify_answer(predicted, gold, answer_type, gold_rows) -> bool
  • New private helpers: _normalize_value, _compare_integer, _compare_float, _compare_string, _compare_list

Implementation Details:

  1. Replace the docstring-only stub in server/verifier.py with the full module.
  2. verify_answer() uses match/case on answer_type to dispatch.
  3. _normalize_value(value): value.strip().lower().
  4. _compare_integer(pred, gold): coerce both via int(float(x)), exact match. Catch ValueError -> False.
  5. _compare_float(pred, gold, tolerance=0.01): relative tolerance abs(p - g) <= tol * abs(g). For g==0, absolute tolerance 1e-9. Catch ValueError -> False.
  6. _compare_string(pred, gold): _normalize_value(pred) == _normalize_value(gold).
  7. _compare_list(pred, gold, gold_rows): If gold_rows is provided, build gold set from {str(cell) for row in gold_rows for cell in row}. Parse predicted by splitting on , and \n. Normalize both sides, compare as sets. If no gold_rows, parse gold string by splitting on | and \n.
  8. Guard: if predicted is empty after strip, return False immediately.

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: Completed

Completed: 2026-03-27T22:18:15Z
Changes Made:

  • server/verifier.py - replaced stub content with verify_answer() and helper comparers for integer, float, string, and list handling.

Result:

  • Outcome: Fully Successful
  • Evidence Captured:
    uv run --extra dev pytest tests/ -v
    ======================== 25 passed in 81.43s =========================
    
  • Tests run: uv run --extra dev pytest tests/ -v
  • Notes:
    • Implemented verify_answer() dispatch with fallback to normalized string comparison for unknown or missing answer types.
    • Added deterministic helper behavior: integer coercion via int(float(x)), float relative tolerance (1%), and list set comparison.
    • Used uv run --extra dev because local environment did not yet include pytest from dev extras.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Add tests/test_verifier.py coverage for dispatcher paths, comparer edge cases, and fallback logic from specs/F002-VERIFICATION_SPEC.md.

Step 1.2: Unit tests for verifier

Slice: S1
Goal: Create comprehensive unit tests covering all 4 answer types, edge cases, and the fallback path.

Files:

  • tests/test_verifier.py - create - Unit tests for verify_answer and all comparers

Interface Changes: None (test-only)

Implementation Details:

  1. Test _compare_integer: "42" vs "42", "42.0" vs "42", "abc" vs "42" (False), "" vs "42" (False).
  2. Test _compare_float: "95000.1" vs "95000" (True, within 1%), "100" vs "200" (False), "0" vs "0" (True), "abc" vs "1.0" (False).
  3. Test _compare_string: "Engineering" vs "engineering" (True), " hello " vs "hello" (True), "a" vs "b" (False).
  4. Test _compare_list: "A, B" vs "B, A" (True), "A" vs "A, B" (False), test with gold_rows provided.
  5. Test verify_answer dispatch: each type routes correctly, None/unknown falls back to string.
  6. Test edge cases: empty predicted (False), None gold coerced to "" (False).

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: Completed

Completed: 2026-03-27T22:21:30Z
Changes Made:

  • tests/test_verifier.py - created comprehensive unit coverage for verifier dispatch and helper comparers across integer, float, string, and list cases.

Result:

  • Outcome: Fully Successful
  • Evidence Captured:
    uv run pytest tests/test_verifier.py -v
    ============================== 31 passed in 6.19s ==============================
    
  • Tests run: uv run pytest tests/test_verifier.py -v
  • Notes:
    • Added dispatcher tests for all answer types plus fallback and empty-predicted guards.
    • Added comparer edge-case tests (int truncation, float tolerance boundaries, list parsing with/without gold_rows).
    • Kept coverage aligned to existing verifier behavior (normalized whitespace/case comparison).
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Add gold_rows to EpisodeContext in models.py and persist raw gold query rows during reset() in server/sql_environment.py.

Step 2.1: Add gold_rows to EpisodeContext and populate during reset

Slice: S2
Goal: Add gold_rows field to EpisodeContext and populate it when an episode is reset (alongside gold_answer).

Files:

  • models.py - modify - Add gold_rows: list[tuple] | None = None to EpisodeContext
  • server/sql_environment.py - modify - Populate gold_rows during episode reset where gold_answer is set

Interface Changes:

  • EpisodeContext.gold_rows: list[tuple] | None = None (new field)

Implementation Details:

  1. Add gold_rows: list[tuple] | None = None to EpisodeContext dataclass after gold_answer.
  2. In sql_environment.py, find where gold_answer is populated during reset(). At the same location, store the raw rows in gold_rows before they are formatted.
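The reset-time change amounts to keeping the raw rows next to the formatted string. A self-contained sketch with a hypothetical schema and gold query (the real reset() already executes the question's gold SQL, so no new query path is introduced):

```python
import sqlite3

# Hypothetical table and gold query, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Alice", "Engineering"), ("Bob", "Sales")],
)

gold_sql = "SELECT DISTINCT dept FROM employees ORDER BY dept"
rows = conn.execute(gold_sql).fetchall()

# Store both representations on the episode context:
gold_rows = rows                                    # raw tuples for _compare_list
gold_answer = " | ".join(str(r[0]) for r in rows)   # formatted string, as before
```

Because both values come from the same fetchall() call, the formatted gold_answer and the structured gold_rows can never drift apart.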

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: Completed

Completed: 2026-03-27T22:24:54Z
Changes Made:

  • models.py - added gold_rows: list[tuple] | None = None to EpisodeContext.
  • server/sql_environment.py - persisted raw gold query rows into EpisodeContext.gold_rows during reset().
  • tests/test_verifier.py - added EpisodeContext.gold_rows unit tests (default None, populated list, empty list).

Result:

  • Outcome: Fully Successful
  • Evidence Captured:
    uv run pytest tests/test_verifier.py -v
    ============================== 34 passed in 6.18s ==============================
    
  • Tests run: uv run pytest tests/test_verifier.py -v
  • Notes:
    • Stored structured gold_rows at reset-time where gold SQL is already executed, so no extra SQL execution path was introduced.
    • Added direct dataclass tests for EpisodeContext.gold_rows to satisfy verification criteria for the new interface field.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Replace _handle_answer() naive normalized string equality with verify_answer(predicted, gold, answer_type, gold_rows) and keep terminal reward mapping unchanged.

Step 2.2: Wire verify_answer into _handle_answer

Slice: S2
Goal: Replace naive string comparison in _handle_answer() with verify_answer() call.

Files:

  • server/sql_environment.py - modify - Import and call verify_answer() in _handle_answer()

Interface Changes:

  • Modified function: _handle_answer() now delegates to verify_answer()

Implementation Details:

  1. Add import: from server.verifier import verify_answer at top of sql_environment.py.
  2. Replace the body of _handle_answer():
    • Remove: submitted = value.strip().lower() / expected = ... / is_correct = submitted == expected
    • Add: is_correct = verify_answer(predicted=value, gold=self._episode.gold_answer or "", answer_type=self._episode.question_record.answer_type, gold_rows=self._episode.gold_rows)
  3. Keep: self._episode.done = True and return is_correct, 1.0 if is_correct else 0.0
  4. Run existing smoke tests to confirm no regressions.

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)
  • Existing 25 smoke tests still pass

Status: Completed

Completed: 2026-03-27T22:33:12Z
Changes Made:

  • server/sql_environment.py - imported verify_answer and replaced _handle_answer() naive normalized-string equality with verify_answer(predicted, gold, answer_type, gold_rows).
  • tests/test_verifier_integration.py - added integration coverage for integer/float/string/list answer flows, fallback behavior for missing answer_type, and numeric coercion failure path.

Result:

  • Outcome: Fully Successful
  • Evidence Captured:
    uv run pytest tests/test_verifier.py -v
    ============================== 34 passed in 6.64s ==============================
    
    uv run pytest tests/test_smoke.py -v
    ============================== 25 passed in 6.53s ==============================
    
    uv run pytest tests/test_verifier_integration.py -v
    ============================== 6 passed in 6.65s ==============================
    
    uv run pytest tests/ -v
    ============================== 65 passed in 6.62s ==============================
    
  • Tests run: uv run pytest tests/test_verifier.py -v; uv run pytest tests/test_smoke.py -v; uv run pytest tests/test_verifier_integration.py -v; uv run pytest tests/ -v
  • Notes:
    • _handle_answer() now uses a single verifier dispatch path, keeping answer comparison logic centralized in server/verifier.py.
    • Added integration tests because VERIFICATION_SPEC.md expected tests/test_verifier_integration.py evidence.
    • Behavior delta was archived into specs/behavior/sql-environment.md and the delta file was removed.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Implementation complete. Proceed with commit/PR workflow (/commit-push-pr) for F002.

8. Rollout Considerations

Feature Flags

  • Required: No
  • Flag name: N/A

Migration

  • Data migration needed: No
  • Migration strategy: N/A

Rollback Plan

Revert _handle_answer() to inline string comparison (3 lines). The verify_answer() module and gold_rows field are additive and harmless if unused.


9. Execution Tracking

All execution state is tracked within this document:

  • Section 1a: Overall progress summary
  • Section 7: Per-step completion details, test results, and handoff context
  • FEATURES.json: Feature-level status/progress metadata used by /autocode-next-step and opencode-ctx ralph run
  • Git history: Full audit trail of changes to this file

The implementing agent updates this document after each step and keeps the matching FEATURES.json entry in sync during implementation/finalization. Humans can monitor progress by:

  • Checking Section 1a for summary
  • Reviewing Section 7 for detailed step status
  • Inspecting the feature's progress and status fields in FEATURES.json
  • Running git log --oneline IMPLEMENTATION_SPEC.md for change history

9a. Slice Completion Protocol

After all steps in a slice pass verification:

  1. Run verifier subagent for spec compliance

    • Validates against VERIFICATION_SPEC.md criteria
    • Ensures no TODOs or incomplete work in slice
  2. Run compound-engineer subagent to extract learnings

    • Mandatory invocation after every slice completion
    • Updates CLAUDE.md Learnings section (if durable patterns found)
    • May exit with "no update needed" (valid for routine work)
  3. Commit the slice changes

    • Follow commit message format in CLAUDE.md
    • Each slice gets its own atomic commit
  4. Continue to next slice (if more slices remain)

    • Or proceed to final verification if all slices complete

Note: PR creation happens only after ALL slices are complete. Use /commit-push-pr manually when ready.


10. User Value Summary

Status: Generated

What Users Can Now Do

Users can now submit answers across integer, float, string, and list questions and get correct pass/fail outcomes even when answers differ in formatting, case, numeric representation, or list ordering.

How to Access/Test

Run uv run pytest tests/test_verifier.py tests/test_verifier_integration.py -v, or run uv run pytest tests/ -v for full regression coverage including end-to-end ANSWER handling through SQLEnvironment.step().

Demo

  • Command: uv run pytest tests/test_verifier_integration.py -v

Release Notes Snippet

Added type-aware answer verification so ANSWER correctness now supports numeric coercion, float tolerance, case-insensitive strings, and order-insensitive list matching.


11. PR Contract (Auto-Generated by autocode-next-step)

Status: Generated

Summary

  • Implemented type-aware answer verification in environment answer handling by routing _handle_answer() through verify_answer().
  • Added integration coverage for typed answer paths and fallback behavior (tests/test_verifier_integration.py).
  • Archived F002 behavior delta into specs/behavior/sql-environment.md and captured durable learnings in docs/learnings/F002-*.md.

Validation

  • uv run pytest tests/test_verifier.py -v -> 34 passed
  • uv run pytest tests/test_smoke.py -v -> 25 passed
  • uv run pytest tests/test_verifier_integration.py -v -> 6 passed
  • uv run pytest tests/ -v -> 65 passed

Scope and Risk

  • Risk tier: Low
  • Security-sensitive changes: None
  • Scope creep: None (added integration tests to satisfy verification spec evidence requirements)

Ready Action

All steps completed. Run /commit-push-pr.

PR Created

https://github.com/hjerpe/sql-env/pull/7


Stop Conditions (When to Split This Spec)

Stop and create a new IMPLEMENTATION_SPEC if:

  • A step requires touching more than 3 files in unrelated areas
  • You need to introduce multiple new abstractions "just in case"
  • Verification cannot be made targeted and concrete
  • You discover new unknowns that change the plan materially
  • The next slice cannot be merged safely without finishing later slices

When splitting, ensure the current slice ends in a merged, stable state.


Human Checkpoint

Before handing to AI agent:

  • Interface specifications are complete
  • Data flow is accurate
  • Error handling is specified
  • Implementation order makes sense
  • VERIFICATION_SPEC.md has been generated

Questions:

  1. Should float tolerance be configurable per-question or fixed at 1%?
  2. Any additional answer_type values beyond the four specified?

Handoff Notes

For the implementing AI agent:

Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Key decisions:
  - gold_rows passed raw to verifier (not just formatted string)
  - Fallback to string comparison when answer_type is None/unknown
  - No external dependencies -- pure Python only
  - match/case dispatch, not class hierarchy

Specification completed: 2026-03-27
Approved by: [NAME/ROLE]
Verification spec: VERIFICATION_SPEC.md
Verification input: F002-VERIFICATION_INPUT.json
Target agent: Claude Code