sql_env / specs /F005-IMPLEMENTATION_SPEC.md

Implementation Specification

Change: F005 -- Green Agent Wrapper (automated evaluation)
Date: 2026-03-27
Research Summary: specs/F005-RESEARCH_SUMMARY.md
Verification Spec: See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
Behavior Spec: Archived to specs/behavior/evaluation.md

Plan Status:

  • Draft
  • Approved for Implementation
  • Implementation Complete
  • Verification Passed

Core Intent (Immutable)

DO NOT MODIFY THIS SECTION DURING REFINEMENT. Changes to Core Intent mean you're describing a different feature. If refinement reveals the need to change this section, create a new feature instead.

User Problem: Run automated evaluation: "How does policy X perform over 100 episodes?" Single command, structured output. Enables training comparison (random vs trained).

Success Criteria:

  • Single function call: evaluate(n_episodes=100) returns clean metrics dict
  • Built-in random policy for instant baseline comparison
  • Results include per-episode breakdown for analysis

Avoid:

  • Evaluation crashes partway through and loses all results
  • No progress indicator for long evaluation runs

Out of Scope:

  • Visualization / plotting of results
  • WebSocket / remote environment support (local SQLEnvironment only)
  • Elaborate policy class hierarchy
  • Training loop integration (F006 will consume this API)

0. Slicing & Scope Budget (Anti-Waterfall)

This spec must be executable in small, mergeable increments.

Scope Budget

  • Target: 2 slices
  • Hard max: <= 10 steps total
  • Each step must end in: implement -> verify -> merge

Slice Definition

A slice is a vertical increment that delivers user-visible value or a safe internal capability.

Each slice must have:

  • Clear outcome
  • Minimal interface change
  • Merge criteria

Note: Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).

Status Icons

Step Status:

  • ⬜ Not Started
  • 🔄 In Progress
  • ✅ Completed
  • ❌ Blocked/Failed

Result Outcome:

  • ✅ Fully Successful (all tests passed, no issues)
  • ⚠️ Completed with Issues (needs follow-up)
  • ❌ Failed/Blocked

1. Implementation Overview

Summary

Create an evaluation/ subpackage containing the automated evaluation wrapper for SQLEnv. The package provides: (1) a Policy protocol defining the interface for any policy, (2) an EpisodeResult dataclass for per-episode metrics, (3) an EvaluationResult dataclass for aggregate metrics, (4) a RandomPolicy class as a built-in baseline, and (5) an evaluate() function that runs N episodes, collects results incrementally (surviving partial failures), and returns structured metrics. The module is purely additive -- no existing code is modified.

Scope

In Scope:

  • evaluation/__init__.py -- public API re-exports
  • evaluation/green_agent.py -- Protocol, dataclasses, RandomPolicy, evaluate()
  • tests/test_evaluation.py -- unit + integration tests

Out of Scope:

  • Modifications to server/sql_environment.py or models.py
  • CLI entry point (future feature)
  • Remote / WebSocket evaluation
  • Plotting or visualization

1a. Execution Status

Progress: 4/4 steps complete
Current Step: All planned implementation steps are complete
Last Updated: 2026-03-28T00:04:03Z
Latest Result: Fully Successful (Step 2.2 complete)
Blockers: None


1b. Risk Assessment

Risk Tier: [x] Low | [ ] Medium | [ ] High

High-Risk Indicators Present: (check all that apply if tier is High)

  • Touches authentication or authorization logic
  • Handles payment processing or financial data
  • Manages secrets, API keys, or credentials
  • Processes untrusted user input (file uploads, external APIs)
  • Modifies privilege/permission systems

Security Review Required: [ ] Yes (if High) | [x] No

Justification: Pure additive feature. Client-side evaluation loop that reads from the existing environment API. No security, auth, or data mutation concerns.


2. Change Manifest

Files to Create

| File | Purpose |
| --- | --- |
| evaluation/__init__.py | Public API: re-exports Policy, RandomPolicy, EpisodeResult, EvaluationResult, evaluate |
| evaluation/green_agent.py | Core evaluation logic: Protocol, dataclasses, RandomPolicy, evaluate() |
| tests/test_evaluation.py | Unit tests for types + RandomPolicy, integration test with SQLEnvironment |

Files to Modify

None.

Files to Delete

None.


3. Interface Specifications

New Types

# Location: evaluation/green_agent.py

from dataclasses import dataclass
from typing import Protocol, runtime_checkable

# SQLObservation and SQLAction are imported from the existing models module

@runtime_checkable
class Policy(Protocol):
    """Interface for any evaluation policy.

    Any object with a select_action method matching this signature
    is a valid policy (structural subtyping / duck typing).
    """

    def select_action(self, observation: SQLObservation) -> SQLAction:
        """Choose an action given an observation."""
        ...


@dataclass(frozen=True)
class EpisodeResult:
    """Per-episode evaluation metrics."""

    episode_index: int          # 0-based episode number
    correct: bool               # Whether ANSWER action matched gold
    total_reward: float         # Cumulative reward for the episode
    steps: int                  # Number of steps taken
    error: str | None = None    # Error message if episode failed


@dataclass(frozen=True)
class EvaluationResult:
    """Aggregate evaluation metrics with per-episode breakdown."""

    success_rate: float                 # Fraction of correct episodes [0.0, 1.0]
    avg_reward: float                   # Mean total_reward across episodes
    avg_steps: float                    # Mean steps across episodes
    n_episodes: int                     # Number of episodes attempted
    n_completed: int                    # Episodes that ran to completion (no error)
    episodes: list[EpisodeResult]       # Per-episode breakdown

New Functions

# Location: evaluation/green_agent.py

from collections.abc import Callable

class RandomPolicy:
    """Built-in random baseline policy.

    Selects random action types and arguments. Deterministic given a seed.
    """

    def __init__(self, seed: int | None = None) -> None:
        """
        Args:
            seed: Random seed for reproducibility. None = non-deterministic.
        """

    def select_action(self, observation: SQLObservation) -> SQLAction:
        """Pick a random action based on current observation.

        Strategy:
        - If budget_remaining > 1: randomly choose DESCRIBE, SAMPLE, or QUERY
        - If budget_remaining == 1: always ANSWER with a random guess
        - DESCRIBE/SAMPLE: pick a random table from schema_info
        - QUERY: generate a simple SELECT * FROM <table> LIMIT 5
        - ANSWER: pick a random value from last result or "unknown"

        Args:
            observation: Current environment observation

        Returns:
            A random SQLAction
        """


def evaluate(
    env: SQLEnvironment,
    policy: Policy,
    n_episodes: int = 100,
    *,
    seed: int | None = None,
    progress_callback: Callable[[int, int], None] | None = None,
) -> EvaluationResult:
    """Run automated evaluation of a policy over multiple episodes.

    Collects results incrementally -- if an episode fails, it is recorded
    as an error and evaluation continues with the next episode.

    Args:
        env: The SQLEnvironment instance to evaluate against.
        policy: Any object satisfying the Policy protocol.
        n_episodes: Number of episodes to run (0 returns empty result).
        seed: Base seed for reproducibility. Episode i uses seed+i.
        progress_callback: Optional callback(current, total) for progress.

    Returns:
        EvaluationResult with aggregate metrics and per-episode breakdown.

    Raises:
        ValueError: If n_episodes < 0.
    """

4. Data Flow

Primary Flow

1. evaluate(env, policy, n_episodes=100, seed=42)
   - Input: environment, policy, episode count, optional seed

2. For each episode i in range(n_episodes):
   a. obs = env.reset(seed=seed + i if seed is not None else None)
   b. While not obs.done:
      - action = policy.select_action(obs)
      - obs = env.step(action)
      - Accumulate reward
   c. Record EpisodeResult(correct=..., total_reward=..., steps=...)
   d. Call progress_callback(i+1, n_episodes) if provided

3. Aggregate results (over completed episodes only; all 0.0 when n_completed == 0):
   - success_rate = sum(correct) / n_completed
   - avg_reward = mean(total_reward) across completed
   - avg_steps = mean(steps) across completed

4. Return EvaluationResult

Alternative Flows

When n_episodes=0:

1. Return EvaluationResult(success_rate=0.0, avg_reward=0.0,
     avg_steps=0.0, n_episodes=0, n_completed=0, episodes=[])

When episode raises exception:

1. Catch exception in the episode loop
2. Record EpisodeResult(correct=False, total_reward=0.0, steps=0,
     error=str(exception))
3. Continue to next episode

When env.reset() fails:

1. Catch exception
2. Record EpisodeResult with error, steps=0
3. Continue to next episode

5. Error Handling

Error Types

| Error | When | Handling |
| --- | --- | --- |
| ValueError | n_episodes < 0 | Raise immediately |
| Exception during env.reset() | DB not found, bad questions file | Catch, record as failed episode, continue |
| Exception during policy.select_action() | Policy bug | Catch, record as failed episode, continue |
| Exception during env.step() | Environment bug | Catch, record as failed episode, continue |

Error Handling Strategy

# Pattern: incremental collection with per-episode error isolation
for i in range(n_episodes):
    episode_seed = seed + i if seed is not None else None
    try:
        obs = env.reset(seed=episode_seed)
        total_reward = 0.0
        steps = 0
        while not obs.done:
            action = policy.select_action(obs)
            obs = env.step(action)
            total_reward += obs.reward or 0.0
            steps += 1
        episodes.append(EpisodeResult(
            episode_index=i,
            correct=_check_correct(obs),
            total_reward=total_reward,
            steps=steps,
        ))
    except Exception as exc:
        episodes.append(EpisodeResult(
            episode_index=i,
            correct=False,
            total_reward=0.0,
            steps=0,
            error=str(exc),
        ))

Retry Strategy

| Operation | Retry? | Strategy |
| --- | --- | --- |
| Episode evaluation | No | Record error, move to next episode |
| Environment reset | No | Record error, move to next episode |

6. Slice Plan (What we will ship, in order)

Slice S1 -- Types, Protocol, and RandomPolicy

Value: Establishes the evaluation interface and provides a usable random baseline
User-visible change: Yes -- users can instantiate RandomPolicy and call select_action
Interfaces introduced/changed: Policy protocol, EpisodeResult, EvaluationResult, RandomPolicy
Rollback safety: Purely additive -- new files only, no changes to existing code

Slice S2 -- evaluate() Function and Integration Test

Value: Users can run evaluate(env, random_policy, n_episodes=100) and get structured metrics
User-visible change: Yes -- the core capability is now available
Interfaces introduced/changed: evaluate() function
Rollback safety: Purely additive -- extends S1 files, no changes to existing code


7. Implementation Steps

VERIFICATION NOTE: Test criteria for each step are defined in VERIFICATION_SPEC.md. The verification-planner (separate agent) generated independent test criteria. Run the tests specified there after implementing each step.

Step 1.1: Types and Protocol

Slice: S1
Goal: Define the Policy protocol, EpisodeResult dataclass, and EvaluationResult dataclass.

Files:

  • evaluation/__init__.py - create - empty init with re-exports
  • evaluation/green_agent.py - create - Protocol + dataclasses (no functions yet)

Interface Changes:

  • New: Policy protocol with select_action(observation: SQLObservation) -> SQLAction
  • New: EpisodeResult frozen dataclass
  • New: EvaluationResult frozen dataclass

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: [x] Low | [ ] Medium | [ ] High

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: ✅ Completed

Completed: 2026-03-27T23:51:09Z
Changes Made:

  • Created evaluation/__init__.py with public re-exports for Policy, EpisodeResult, and EvaluationResult.
  • Created evaluation/green_agent.py with the Policy runtime-checkable protocol and frozen EpisodeResult/EvaluationResult dataclasses.

Result:

  • Outcome: ✅ Fully Successful
  • Evidence Captured:
    Command: uv run --with pytest pytest tests/ -v
    Result: 100 passed, 1 skipped
    Scope: full project regression run after adding new evaluation types
    
  • Tests run: uv run --with pytest pytest tests/ -v
  • Notes:
    • Dataclass and protocol scaffolding is additive and isolated to a new package.
    • pytest is not installed in the project environment yet, so verification used uv run --with pytest for this step.
    • Import fallback mirrors existing package-vs-standalone test collection behavior in the repo.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Types are defined and importable from evaluation

Step 1.2: RandomPolicy Implementation

Slice: S1
Goal: Implement the RandomPolicy class that selects random actions based on observation state.

Files:

  • evaluation/green_agent.py - modify - add RandomPolicy class

Interface Changes:

  • New: RandomPolicy.__init__(seed: int | None = None)
  • New: RandomPolicy.select_action(observation: SQLObservation) -> SQLAction

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: [x] Low | [ ] Medium | [ ] High

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: ✅ Completed

Completed: 2026-03-27T23:55:10Z
Changes Made:

  • Implemented RandomPolicy in evaluation/green_agent.py with seed-controlled randomness, budget-aware action selection, schema table parsing, and row-based answer candidate extraction.
  • Updated evaluation/__init__.py to re-export RandomPolicy from the public evaluation API.
  • Added tests/test_evaluation.py with focused RandomPolicy behavior tests (exploration vs answer mode, determinism, action type coverage, and answer extraction).

Result:

  • Outcome: ✅ Fully Successful
  • Evidence Captured:
    Command: uv run --with pytest pytest tests/test_evaluation.py -v
    Result: 6 passed
    Scope: RandomPolicy unit coverage for F005 Step 1.2
    
    Command: uv run --with pytest pytest tests/ -v
    Result: 106 passed, 1 skipped
    Scope: Full regression after RandomPolicy implementation
    
  • Tests run: uv run --with pytest pytest tests/test_evaluation.py -v; uv run --with pytest pytest tests/ -v
  • Notes:
    • RandomPolicy always explores with DESCRIBE/SAMPLE/QUERY while budget remains and forces ANSWER on the last step.
    • Schema parsing intentionally handles both the "- table" and "- table: columns..." observation formats.
    • Verification commands in the spec referenced tests/unit/...; this repo uses a flat tests/ layout, so tests were added in tests/test_evaluation.py.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • RandomPolicy is implemented and exported from the public evaluation API
  • Ready to implement evaluate() using per-episode loop and error isolation

Step 2.1: evaluate() Function

Slice: S2
Goal: Implement the core evaluate() function with incremental collection and error isolation.

Files:

  • evaluation/green_agent.py - modify - add evaluate() function
  • evaluation/__init__.py - modify - add evaluate to re-exports

Interface Changes:

  • New: evaluate(env, policy, n_episodes, *, seed, progress_callback) -> EvaluationResult

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: [x] Low | [ ] Medium | [ ] High

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: ✅ Completed

Completed: 2026-03-27T23:59:28Z
Changes Made:

  • Added evaluate() to evaluation/green_agent.py with per-episode reset/step loop, seed+i reset behavior, progress callback support, and per-episode error isolation.
  • Added evaluate to evaluation/__init__.py public exports.
  • Extended tests/test_evaluation.py with unit coverage for evaluate happy path, zero/negative episodes, seed propagation, exception handling, aggregate calculations, and progress callback behavior.

Result:

  • Outcome: ✅ Fully Successful
  • Evidence Captured:
    Command: uv run --with pytest pytest tests/test_evaluation.py -v
    Result: 14 passed
    Scope: RandomPolicy + evaluate() unit coverage for F005 Step 2.1
    
    Command: uv run --with pytest pytest tests/ -v
    Result: 114 passed, 1 skipped
    Scope: Full regression after evaluate() implementation
    
  • Tests run: uv run --with pytest pytest tests/test_evaluation.py -v; uv run --with pytest pytest tests/ -v
  • Notes:
    • evaluate() computes aggregates using completed episodes only (error is None), matching the error-isolation behavior in the spec data flow.
    • Progress callback is invoked once per attempted episode, including episodes that fail.
    • Repository environment still does not include pytest by default, so verification used uv run --with pytest.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • evaluate() is implemented, exported, and covered by focused unit tests
  • Next step should add/expand integration coverage with a real SQLEnvironment evaluation run

Step 2.2: Integration Test with SQLEnvironment

Slice: S2
Goal: Write an integration test that runs evaluate() with RandomPolicy against a real SQLEnvironment.

Files:

  • tests/test_evaluation.py - create - unit tests for types + RandomPolicy + evaluate(); integration test with real env

Interface Changes: None (test-only step).

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: [x] Low | [ ] Medium | [ ] High

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: ✅ Completed

Completed: 2026-03-28T00:04:03Z
Changes Made:

  • Added _build_sql_environment() test helper in tests/test_evaluation.py to spin up a real SQLite-backed SQLEnvironment with a deterministic question fixture.
  • Added test_evaluate_integration_with_sql_environment validating end-to-end evaluate() execution over 10 episodes with aggregate-metric consistency checks.
  • Added test_evaluate_integration_is_deterministic_with_seeds validating deterministic full-result equality when both policy and environment seeds are fixed.

Result:

  • Outcome: ✅ Fully Successful
  • Evidence Captured:
    Command: uv run --with pytest pytest tests/test_evaluation.py -v
    Result: 16 passed
    Scope: evaluation unit + integration coverage including real SQLEnvironment flow
    
    Command: uv run --with pytest pytest tests/ -v
    Result: 116 passed, 1 skipped
    Scope: full project regression after adding integration coverage
    
  • Tests run: uv run --with pytest pytest tests/test_evaluation.py -v; uv run --with pytest pytest tests/ -v
  • Notes:
    • Integration tests were implemented in tests/test_evaluation.py to match this repository's flat test layout.
    • Verifier gate approved finalization in MVP mode after test evidence review.
    • Reviewer auto-step was skipped by policy because risk tier is Low, tests passed, and no security-sensitive surfaces changed.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • All implementation steps are complete and verification gate passed.

8. Rollout Considerations

Feature Flags

  • Required: No
  • Flag name: N/A

Migration

  • Data migration needed: No
  • Migration strategy: N/A

Rollback Plan

Delete the evaluation/ directory. No other code references it.


9. Execution Tracking

All execution state is tracked within this document:

  • Section 1a: Overall progress summary
  • Section 7: Per-step completion details, test results, and handoff context
  • FEATURES.json: Feature-level status/progress metadata used by /autocode-next-step and opencode-ctx ralph run
  • Git history: Full audit trail of changes to this file

The implementing agent updates this document after each step and keeps the matching FEATURES.json entry in sync during implementation/finalization. Humans can monitor progress by:

  • Checking Section 1a for summary
  • Reviewing Section 7 for detailed step status
  • Inspecting the feature's progress and status fields in FEATURES.json
  • Running git log --oneline IMPLEMENTATION_SPEC.md for change history

9a. Slice Completion Protocol

After all steps in a slice pass verification:

  1. Run verifier subagent for spec compliance

    • Validates against VERIFICATION_SPEC.md criteria
    • Ensures no TODOs or incomplete work in slice
  2. Run compound-engineer subagent to extract learnings

    • Mandatory invocation after every slice completion
    • Updates CLAUDE.md Learnings section (if durable patterns found)
    • May exit with "no update needed" (valid for routine work)
  3. Commit the slice changes

    • Follow commit message format in CLAUDE.md
    • Each slice gets its own atomic commit
  4. Continue to next slice (if more slices remain)

    • Or proceed to final verification if all slices complete

Note: PR creation happens only after ALL slices are complete. Use /commit-push-pr manually when ready.


10. User Value Summary

Status: Generated

What Users Can Now Do

Run automated evaluation of any policy over N episodes with evaluate(env, policy, n_episodes=100) and get structured metrics including success rate, average reward, average steps, and per-episode breakdown.

How to Access/Test

from evaluation import evaluate, RandomPolicy
from server.sql_environment import SQLEnvironment

env = SQLEnvironment(questions_path="...", db_dir="...", tokenizer=tokenizer)  # tokenizer constructed elsewhere
policy = RandomPolicy(seed=42)
result = evaluate(env, policy, n_episodes=10, seed=42)
print(f"Success rate: {result.success_rate:.1%}")
print(f"Avg reward: {result.avg_reward:.3f}")

Demo

  • Command: uv run python -c "from evaluation import evaluate, RandomPolicy; ..."

Release Notes Snippet

Added automated evaluation wrapper with built-in random baseline policy for benchmarking agent performance.


11. PR Contract (Auto-Generated by autocode-next-step)

Status: Generated

PR Title

feat(evaluation): complete green agent wrapper integration and finalization

PR Summary

  • Add deterministic integration coverage for evaluate() against a real SQLEnvironment fixture.
  • Finalize F005 with full regression evidence, verifier approval, and archived behavior documentation.
  • Capture durable learnings under docs/learnings/ for evaluation patterns and deterministic testing.

Verification

  • uv run --with pytest pytest tests/test_evaluation.py -v
  • uv run --with pytest pytest tests/ -v

Follow-up

All steps completed. PR Created: https://github.com/hjerpe/sql-env/pull/10


Stop Conditions (When to Split This Spec)

Stop and create a new IMPLEMENTATION_SPEC if:

  • A step requires touching more than 3 files in unrelated areas
  • You need to introduce multiple new abstractions "just in case"
  • Verification cannot be made targeted and concrete
  • You discover new unknowns that change the plan materially
  • The next slice cannot be merged safely without finishing later slices

When splitting, ensure the current slice ends in a merged, stable state.


Human Checkpoint

Before handing to AI agent:

  • Interface specifications are complete
  • Data flow is accurate
  • Error handling is specified
  • Implementation order makes sense
  • VERIFICATION_SPEC.md has been generated

Questions:

  1. Any remaining concerns?
  2. Anything agent should know?

Handoff Notes

For the implementing AI agent:

Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly

Specification completed: 2026-03-27
Approved by: --
Verification spec: VERIFICATION_SPEC.md
Verification input: F005-VERIFICATION_INPUT.json
Target agent: Claude Code