sql_env/specs/F002-IMPLEMENTATION_SPEC.md

Implementation Specification

  • Change: F002 -- Answer Verification (multi-type comparison)
  • Date: 2026-03-27
  • Research Summary: specs/F002-RESEARCH_SUMMARY.md
  • Verification Spec: VERIFICATION_SPEC.md (generated by autocode-verification-planner)
  • Behavior Delta: Archived into specs/behavior/sql-environment.md

Plan Status:

  • Draft
  • Approved for Implementation
  • Implementation Complete
  • Verification Passed

Core Intent (Immutable)

DO NOT MODIFY THIS SECTION DURING REFINEMENT. Changes to Core Intent mean you are describing a different feature; if refinement reveals the need to change this section, create a new feature instead.

User Problem: When an agent submits ANSWER, the environment must correctly determine whether the answer matches the gold answer regardless of type (42 vs 42.0, 'Engineering' vs 'engineering', unordered lists).

Success Criteria:

  • Float comparison with tolerance handles rounding gracefully (95000.1 matches 95000)
  • List comparison ignores order: ['A','B'] matches ['B','A']
  • Clear pass/fail with no ambiguity

Avoid:

  • Correct answer rejected due to trivial formatting difference
  • Type coercion failures (agent says '42', gold is integer 42)

Out of Scope:

  • Table comparison (multi-column row overlap) -- deferred to post-MVP
  • Partial credit scoring -- binary pass/fail only at this layer
  • Changes to reward signal structure (F003 scope)

0. Slicing & Scope Budget (Anti-Waterfall)

This spec must be executable in small, mergeable increments.

Scope Budget

  • Target: 2 slices
  • Hard max: <= 10 steps total
  • Each step must end in: implement -> verify -> merge

Slice Definition

A slice is a vertical increment that delivers user-visible value or a safe internal capability.

Each slice must have:

  • Clear outcome
  • Minimal interface change
  • Merge criteria

Note: Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).

Status Icons

Step Status:

  • ⬜ Not Started
  • 🔄 In Progress
  • ✅ Completed
  • ❌ Blocked/Failed

Result Outcome:

  • ✅ Fully Successful (all tests passed, no issues)
  • ⚠️ Completed with Issues (needs follow-up)
  • ❌ Failed/Blocked

1. Implementation Overview

Summary

Implement verify_answer() in server/verifier.py with type-aware comparison dispatching across four answer types (integer, float, string, list). Wire it into _handle_answer() in server/sql_environment.py, replacing the naive string comparison. Add a gold_rows field to EpisodeContext so the verifier receives raw data for accurate list comparison. Fall back to string comparison when answer_type is missing.

Scope

In Scope:

  • verify_answer() public function with 4 type comparers
  • Private helpers: _normalize_value, _compare_integer, _compare_float, _compare_string, _compare_list
  • gold_rows field on EpisodeContext
  • Integration into _handle_answer()
  • Unit tests for all comparers and edge cases

Out of Scope:

  • Table comparison (multi-column)
  • Partial credit / dense reward (F003)
  • Changes to question data schema (answer_type already exists)
  • External dependencies (pure Python only)

1a. Execution Status

  • Progress: 4/4 steps complete
  • Current Step: None (all implementation steps complete)
  • Last Updated: 2026-03-27T22:33:12Z
  • Latest Result: Fully Successful (all tests passed, no issues)
  • Blockers: None


1b. Risk Assessment

Risk Tier: Low

High-Risk Indicators Present: (none apply)

  • Touches authentication or authorization logic
  • Handles payment processing or financial data
  • Manages secrets, API keys, or credentials
  • Processes untrusted user input (file uploads, external APIs)
  • Modifies privilege/permission systems

Security Review Required: No

Justification: Pure logic module that compares two values. No user input beyond agent's ANSWER string (already sanitized by action parsing). No I/O, no network, no secrets.


2. Change Manifest

Files to Create

| File | Purpose |
| --- | --- |
| tests/test_verifier.py | Unit tests for all comparison types and edge cases |

Files to Modify

| File | Changes |
| --- | --- |
| server/verifier.py | Replace stub with full verify_answer() + private helpers |
| models.py | Add `gold_rows: list[tuple] \| None = None` to EpisodeContext |
| server/sql_environment.py | Wire verify_answer() into _handle_answer(), populate gold_rows |

Files to Delete

None.


3. Interface Specifications

Modified Types

```python
# Location: models.py
# CHANGE: Add gold_rows field to EpisodeContext

@dataclass
class EpisodeContext:
    """Per-episode server-side state (never sent to agent)."""

    episode_id: str
    db_connection: sqlite3.Connection
    question_record: QuestionRecord
    step_count: int = 0
    budget: int = 15
    described_tables: set[str] = dataclass_field(default_factory=set)
    action_log: list[str] = dataclass_field(default_factory=list)
    done: bool = False
    gold_answer: str | None = None
    gold_rows: list[tuple] | None = None  # NEW: raw SQL result rows for verifier
```

New Functions

```python
# Location: server/verifier.py

def verify_answer(
    predicted: str,
    gold: str,
    answer_type: str | None = None,
    gold_rows: list[tuple] | None = None,
) -> bool:
    """
    Compare agent's submitted answer against the gold answer.

    Dispatches to type-specific comparers based on answer_type.
    Falls back to string comparison when answer_type is None or unknown.

    Args:
        predicted: The agent's submitted answer string.
        gold: The gold answer as a formatted string.
        answer_type: One of "integer", "float", "string", "list", or None.
        gold_rows: Raw SQL result rows (list of tuples) for accurate list comparison.

    Returns:
        True if the answer is correct, False otherwise.
    """

# Location: server/verifier.py (private helpers)

def _normalize_value(value: str) -> str:
    """Strip whitespace and lowercase a value for comparison."""

def _compare_integer(predicted: str, gold: str) -> bool:
    """
    Compare as integers after coercing both sides.

    Handles: "42" vs 42, "42.0" vs 42.
    Returns False on ValueError (non-numeric input).
    """

def _compare_float(predicted: str, gold: str, tolerance: float = 0.01) -> bool:
    """
    Compare as floats with relative tolerance (default 1%).

    Uses: abs(pred - gold) <= tolerance * abs(gold) when gold != 0.
    For gold == 0: uses absolute tolerance of 1e-9.
    Returns False on ValueError.
    """

def _compare_string(predicted: str, gold: str) -> bool:
    """Case-insensitive, whitespace-normalized string comparison."""

def _compare_list(
    predicted: str,
    gold: str,
    gold_rows: list[tuple] | None = None,
) -> bool:
    """
    Order-insensitive set comparison.

    If gold_rows is provided, converts both sides to sets of normalized strings.
    Otherwise parses the formatted string (split on ' | ' and newlines).
    """
```

Modified Functions

```python
# Location: server/sql_environment.py
# CHANGE: Replace naive comparison with verify_answer() call

def _handle_answer(self, value: str) -> tuple[bool, float]:
    """Compare submitted answer against episode gold answer using type-aware verifier."""
    if self._episode is None:
        raise RuntimeError("No active episode. Call reset() before step().")

    is_correct = verify_answer(
        predicted=value,
        gold=self._episode.gold_answer or "",
        answer_type=self._episode.question_record.answer_type,
        gold_rows=self._episode.gold_rows,
    )
    self._episode.done = True
    return is_correct, 1.0 if is_correct else 0.0
```

4. Data Flow

Primary Flow

1. Agent sends ANSWER action with value string
   - Input: action.argument (str)

2. step() dispatches to _handle_answer(value)
   - Input: value (str)

3. _handle_answer() calls verify_answer(predicted, gold, answer_type, gold_rows)
   - predicted: value (agent's answer)
   - gold: self._episode.gold_answer (formatted string)
   - answer_type: self._episode.question_record.answer_type
   - gold_rows: self._episode.gold_rows (raw tuples or None)

4. verify_answer() dispatches by answer_type:
   - "integer" -> _compare_integer(predicted, gold)
   - "float"   -> _compare_float(predicted, gold)
   - "string"  -> _compare_string(predicted, gold)
   - "list"    -> _compare_list(predicted, gold, gold_rows)
   - None/unknown -> _compare_string(predicted, gold)

5. Returns bool -> _handle_answer returns (bool, float reward)

Alternative Flows

When answer_type is None or unknown:

1. verify_answer receives answer_type=None
2. Falls back to _compare_string(predicted, gold)
3. Returns bool (case-insensitive normalized comparison)

When predicted or gold is empty/None:

1. verify_answer receives empty string or None-coerced value
2. Returns False immediately (no valid answer to compare)

When type coercion fails (e.g., "abc" as integer):

1. _compare_integer or _compare_float catches ValueError
2. Falls back to returning False

5. Error Handling

Error Types

| Error | When | Behavior |
| --- | --- | --- |
| ValueError (caught internally) | Predicted value cannot be coerced to int/float | Return False (not correct) |
| RuntimeError | _handle_answer called with no active episode | Raised to caller (existing behavior) |

Error Handling Strategy

```python
# Pattern: catch coercion errors, return False (answer is wrong, not a crash)
def _compare_integer(predicted: str, gold: str) -> bool:
    try:
        return int(float(predicted)) == int(float(gold))
    except (ValueError, TypeError):
        return False
```

Retry Strategy

| Operation | Retry? | Strategy |
| --- | --- | --- |
| verify_answer() | No | Deterministic comparison, no transient failures |

6. Slice Plan (What we will ship, in order)

Slice S1 -- Core Verifier Module

  • Value: verify_answer() exists as a tested, standalone module with all 4 type comparers
  • User-visible change: No (not yet wired in)
  • Interfaces introduced/changed: verify_answer(), _normalize_value(), _compare_integer(), _compare_float(), _compare_string(), _compare_list()
  • Rollback safety: Additive only -- new file, no existing code changed

Slice S2 -- Integration and Wiring

  • Value: _handle_answer() uses type-aware verification; agents get correct results for float/list/integer answers
  • User-visible change: Yes -- agent answers previously rejected (e.g., "42" vs integer 42) now accepted
  • Interfaces introduced/changed: EpisodeContext.gold_rows, modified _handle_answer()
  • Rollback safety: Revert to naive string compare by removing the import and restoring 3 lines


7. Implementation Steps

VERIFICATION NOTE: Test criteria for each step are defined in VERIFICATION_SPEC.md. The verification-planner (separate agent) generated independent test criteria. Run the tests specified there after implementing each step.

Step 1.1: Implement verify_answer module

Slice: S1
Goal: Create the complete verify_answer() function with all 4 type-specific comparers in server/verifier.py.

Files:

  • server/verifier.py - modify - Replace stub with full implementation

Interface Changes:

  • New public function: verify_answer(predicted, gold, answer_type, gold_rows) -> bool
  • New private helpers: _normalize_value, _compare_integer, _compare_float, _compare_string, _compare_list

Implementation Details:

  1. Replace the docstring-only stub in server/verifier.py with the full module.
  2. verify_answer() uses match/case on answer_type to dispatch.
  3. _normalize_value(value): value.strip().lower().
  4. _compare_integer(pred, gold): coerce both via int(float(x)), exact match. Catch ValueError -> False.
  5. _compare_float(pred, gold, tolerance=0.01): relative tolerance abs(p - g) <= tol * abs(g). For g==0, absolute tolerance 1e-9. Catch ValueError -> False.
  6. _compare_string(pred, gold): _normalize_value(pred) == _normalize_value(gold).
  7. _compare_list(pred, gold, gold_rows): If gold_rows is provided, build gold set from {str(cell) for row in gold_rows for cell in row}. Parse predicted by splitting on , and \n. Normalize both sides, compare as sets. If no gold_rows, parse gold string by splitting on | and \n.
  8. Guard: if predicted is empty after strip, return False immediately.

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: Completed

Completed: 2026-03-27T22:18:15Z
Changes Made:

  • server/verifier.py - replaced stub content with verify_answer() and helper comparers for integer, float, string, and list handling.

Result:

  • Outcome: Fully Successful
  • Evidence Captured:
    uv run --extra dev pytest tests/ -v
    ======================== 25 passed in 81.43s =========================
    
  • Tests run: uv run --extra dev pytest tests/ -v
  • Notes:
    • Implemented verify_answer() dispatch with fallback to normalized string comparison for unknown or missing answer types.
    • Added deterministic helper behavior: integer coercion via int(float(x)), float relative tolerance (1%), and list set comparison.
    • Used uv run --extra dev because local environment did not yet include pytest from dev extras.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Add tests/test_verifier.py coverage for dispatcher paths, comparer edge cases, and fallback logic from specs/F002-VERIFICATION_SPEC.md.

Step 1.2: Unit tests for verifier

Slice: S1
Goal: Create comprehensive unit tests covering all 4 answer types, edge cases, and the fallback path.

Files:

  • tests/test_verifier.py - create - Unit tests for verify_answer and all comparers

Interface Changes: None (test-only)

Implementation Details:

  1. Test _compare_integer: "42" vs "42", "42.0" vs "42", "abc" vs "42" (False), "" vs "42" (False).
  2. Test _compare_float: "95000.1" vs "95000" (True, within 1%), "100" vs "200" (False), "0" vs "0" (True), "abc" vs "1.0" (False).
  3. Test _compare_string: "Engineering" vs "engineering" (True), " hello " vs "hello" (True), "a" vs "b" (False).
  4. Test _compare_list: "A, B" vs "B, A" (True), "A" vs "A, B" (False), test with gold_rows provided.
  5. Test verify_answer dispatch: each type routes correctly, None/unknown falls back to string.
  6. Test edge cases: empty predicted (False), None gold coerced to "" (False).

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: Completed

Completed: 2026-03-27T22:21:30Z
Changes Made:

  • tests/test_verifier.py - created comprehensive unit coverage for verifier dispatch and helper comparers across integer, float, string, and list cases.

Result:

  • Outcome: Fully Successful
  • Evidence Captured:
    uv run pytest tests/test_verifier.py -v
    ============================== 31 passed in 6.19s ==============================
    
  • Tests run: uv run pytest tests/test_verifier.py -v
  • Notes:
    • Added dispatcher tests for all answer types plus fallback and empty-predicted guards.
    • Added comparer edge-case tests (int truncation, float tolerance boundaries, list parsing with/without gold_rows).
    • Kept coverage aligned to existing verifier behavior (normalized whitespace/case comparison).
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Add gold_rows to EpisodeContext in models.py and persist raw gold query rows during reset() in server/sql_environment.py.

Step 2.1: Add gold_rows to EpisodeContext and populate during reset

Slice: S2
Goal: Add gold_rows field to EpisodeContext and populate it when an episode is reset (alongside gold_answer).

Files:

  • models.py - modify - Add gold_rows: list[tuple] | None = None to EpisodeContext
  • server/sql_environment.py - modify - Populate gold_rows during episode reset where gold_answer is set

Interface Changes:

  • EpisodeContext.gold_rows: list[tuple] | None = None (new field)

Implementation Details:

  1. Add gold_rows: list[tuple] | None = None to EpisodeContext dataclass after gold_answer.
  2. In sql_environment.py, find where gold_answer is populated during reset(). At the same location, store the raw rows in gold_rows before they are formatted.
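The reset-time change amounts to keeping the raw rows next to the formatted string. A self-contained sketch with a hypothetical schema and gold query (the real reset() already executes the question's gold SQL, so no new query path is introduced):

```python
import sqlite3

# Hypothetical table and gold query, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Alice", "Engineering"), ("Bob", "Sales")],
)

gold_sql = "SELECT DISTINCT dept FROM employees ORDER BY dept"
rows = conn.execute(gold_sql).fetchall()

# Store both representations on the episode context:
gold_rows = rows                                    # raw tuples for _compare_list
gold_answer = " | ".join(str(r[0]) for r in rows)   # formatted string, as before
```

Because both values come from the same fetchall() call, the formatted gold_answer and the structured gold_rows can never drift apart.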

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: Completed

Completed: 2026-03-27T22:24:54Z
Changes Made:

  • models.py - added gold_rows: list[tuple] | None = None to EpisodeContext.
  • server/sql_environment.py - persisted raw gold query rows into EpisodeContext.gold_rows during reset().
  • tests/test_verifier.py - added EpisodeContext.gold_rows unit tests (default None, populated list, empty list).

Result:

  • Outcome: Fully Successful
  • Evidence Captured:
    uv run pytest tests/test_verifier.py -v
    ============================== 34 passed in 6.18s ==============================
    
  • Tests run: uv run pytest tests/test_verifier.py -v
  • Notes:
    • Stored structured gold_rows at reset-time where gold SQL is already executed, so no extra SQL execution path was introduced.
    • Added direct dataclass tests for EpisodeContext.gold_rows to satisfy verification criteria for the new interface field.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Replace _handle_answer() naive normalized string equality with verify_answer(predicted, gold, answer_type, gold_rows) and keep terminal reward mapping unchanged.

Step 2.2: Wire verify_answer into _handle_answer

Slice: S2
Goal: Replace naive string comparison in _handle_answer() with verify_answer() call.

Files:

  • server/sql_environment.py - modify - Import and call verify_answer() in _handle_answer()

Interface Changes:

  • Modified function: _handle_answer() now delegates to verify_answer()

Implementation Details:

  1. Add import: from server.verifier import verify_answer at top of sql_environment.py.
  2. Replace the body of _handle_answer():
    • Remove: submitted = value.strip().lower() / expected = ... / is_correct = submitted == expected
    • Add: is_correct = verify_answer(predicted=value, gold=self._episode.gold_answer or "", answer_type=self._episode.question_record.answer_type, gold_rows=self._episode.gold_rows)
  3. Keep: self._episode.done = True and return is_correct, 1.0 if is_correct else 0.0
  4. Run existing smoke tests to confirm no regressions.

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)
  • Existing 25 smoke tests still pass

Status: Completed

Completed: 2026-03-27T22:33:12Z
Changes Made:

  • server/sql_environment.py - imported verify_answer and replaced _handle_answer() naive normalized-string equality with verify_answer(predicted, gold, answer_type, gold_rows).
  • tests/test_verifier_integration.py - added integration coverage for integer/float/string/list answer flows, fallback behavior for missing answer_type, and numeric coercion failure path.

Result:

  • Outcome: Fully Successful
  • Evidence Captured:
    uv run pytest tests/test_verifier.py -v
    ============================== 34 passed in 6.64s ==============================
    
    uv run pytest tests/test_smoke.py -v
    ============================== 25 passed in 6.53s ==============================
    
    uv run pytest tests/test_verifier_integration.py -v
    ============================== 6 passed in 6.65s ==============================
    
    uv run pytest tests/ -v
    ============================== 65 passed in 6.62s ==============================
    
  • Tests run: uv run pytest tests/test_verifier.py -v; uv run pytest tests/test_smoke.py -v; uv run pytest tests/test_verifier_integration.py -v; uv run pytest tests/ -v
  • Notes:
    • _handle_answer() now uses a single verifier dispatch path, keeping answer comparison logic centralized in server/verifier.py.
    • Added integration tests because VERIFICATION_SPEC.md expected tests/test_verifier_integration.py evidence.
    • Behavior delta was archived into specs/behavior/sql-environment.md and the delta file was removed.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Implementation complete. Proceed with commit/PR workflow (/commit-push-pr) for F002.

8. Rollout Considerations

Feature Flags

  • Required: No
  • Flag name: N/A

Migration

  • Data migration needed: No
  • Migration strategy: N/A

Rollback Plan

Revert _handle_answer() to inline string comparison (3 lines). The verify_answer() module and gold_rows field are additive and harmless if unused.


9. Execution Tracking

All execution state is tracked within this document:

  • Section 1a: Overall progress summary
  • Section 7: Per-step completion details, test results, and handoff context
  • FEATURES.json: Feature-level status/progress metadata used by /autocode-next-step and opencode-ctx ralph run
  • Git history: Full audit trail of changes to this file

The implementing agent updates this document after each step and keeps the matching FEATURES.json entry in sync during implementation/finalization. Humans can monitor progress by:

  • Checking Section 1a for summary
  • Reviewing Section 7 for detailed step status
  • Inspecting the feature's progress and status fields in FEATURES.json
  • Running git log --oneline IMPLEMENTATION_SPEC.md for change history

9a. Slice Completion Protocol

After all steps in a slice pass verification:

  1. Run verifier subagent for spec compliance

    • Validates against VERIFICATION_SPEC.md criteria
    • Ensures no TODOs or incomplete work in slice
  2. Run compound-engineer subagent to extract learnings

    • Mandatory invocation after every slice completion
    • Updates CLAUDE.md Learnings section (if durable patterns found)
    • May exit with "no update needed" (valid for routine work)
  3. Commit the slice changes

    • Follow commit message format in CLAUDE.md
    • Each slice gets its own atomic commit
  4. Continue to next slice (if more slices remain)

    • Or proceed to final verification if all slices complete

Note: PR creation happens only after ALL slices are complete. Use /commit-push-pr manually when ready.


10. User Value Summary

Status: Generated

What Users Can Now Do

Users can now submit answers across integer, float, string, and list questions and get correct pass/fail outcomes even when answers differ in formatting, case, numeric representation, or list ordering.

How to Access/Test

Run uv run pytest tests/test_verifier.py tests/test_verifier_integration.py -v, or run uv run pytest tests/ -v for full regression coverage including end-to-end ANSWER handling through SQLEnvironment.step().

Demo

  • Command: uv run pytest tests/test_verifier_integration.py -v

Release Notes Snippet

Added type-aware answer verification so ANSWER correctness now supports numeric coercion, float tolerance, case-insensitive strings, and order-insensitive list matching.


11. PR Contract (Auto-Generated by autocode-next-step)

Status: Generated

Summary

  • Implemented type-aware answer verification in environment answer handling by routing _handle_answer() through verify_answer().
  • Added integration coverage for typed answer paths and fallback behavior (tests/test_verifier_integration.py).
  • Archived F002 behavior delta into specs/behavior/sql-environment.md and captured durable learnings in docs/learnings/F002-*.md.

Validation

  • uv run pytest tests/test_verifier.py -v -> 34 passed
  • uv run pytest tests/test_smoke.py -v -> 25 passed
  • uv run pytest tests/test_verifier_integration.py -v -> 6 passed
  • uv run pytest tests/ -v -> 65 passed

Scope and Risk

  • Risk tier: Low
  • Security-sensitive changes: None
  • Scope creep: None (added integration tests to satisfy verification spec evidence requirements)

Ready Action

All steps completed. Run /commit-push-pr.

PR Created

https://github.com/hjerpe/sql-env/pull/7


Stop Conditions (When to Split This Spec)

Stop and create a new IMPLEMENTATION_SPEC if:

  • A step requires touching more than 3 files in unrelated areas
  • You need to introduce multiple new abstractions "just in case"
  • Verification cannot be made targeted and concrete
  • You discover new unknowns that change the plan materially
  • The next slice cannot be merged safely without finishing later slices

When splitting, ensure the current slice ends in a merged, stable state.


Human Checkpoint

Before handing to AI agent:

  • Interface specifications are complete
  • Data flow is accurate
  • Error handling is specified
  • Implementation order makes sense
  • VERIFICATION_SPEC.md has been generated

Questions:

  1. Should float tolerance be configurable per-question or fixed at 1%?
  2. Any additional answer_type values beyond the four specified?

Handoff Notes

For the implementing AI agent:

Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Key decisions:
  - gold_rows passed raw to verifier (not just formatted string)
  - Fallback to string comparison when answer_type is None/unknown
  - No external dependencies -- pure Python only
  - match/case dispatch, not class hierarchy

Specification completed: 2026-03-27
Approved by: [NAME/ROLE]
Verification spec: VERIFICATION_SPEC.md
Verification input: F002-VERIFICATION_INPUT.json
Target agent: Claude Code