# Implementation Specification
**Change:** F005 -- Green Agent Wrapper (automated evaluation)
**Date:** 2026-03-27
**Research Summary:** [specs/F005-RESEARCH_SUMMARY.md](F005-RESEARCH_SUMMARY.md)
**Verification Spec:** See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
**Behavior Spec:** Archived to [specs/behavior/evaluation.md](behavior/evaluation.md)
**Plan Status:**
- [x] Draft
- [x] Approved for Implementation
- [x] Implementation Complete
- [x] Verification Passed
---
## Core Intent (Immutable)
> **DO NOT MODIFY THIS SECTION DURING REFINEMENT**
> Changes to Core Intent mean you're describing a different feature.
> If refinement reveals the need to change this section, create a new feature instead.
**User Problem:**
Run automated evaluation: "How does policy X perform over 100 episodes?" Single command, structured output. Enables training comparison (random vs trained).
**Success Criteria:**
- Single function call: `evaluate(n_episodes=100)` returns clean metrics dict
- Built-in random policy for instant baseline comparison
- Results include per-episode breakdown for analysis
**Avoid:**
- Evaluation crashes partway through and loses all results
- No progress indicator for long evaluation runs
**Out of Scope:**
- Visualization / plotting of results
- WebSocket / remote environment support (local SQLEnvironment only)
- Elaborate policy class hierarchy
- Training loop integration (F006 will consume this API)
---
## 0. Slicing & Scope Budget (Anti-Waterfall)
This spec must be executable in **small, mergeable increments**.
### Scope Budget
- Target: **2 slices**
- Hard max: **<= 10 steps total**
- Each step must end in: **implement -> verify -> merge**
### Slice Definition
A slice is a vertical increment that delivers user-visible value or a safe internal capability.
**Each slice must have:**
- Clear outcome
- Minimal interface change
- Merge criteria
**Note:** Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).
## Status Icons
**Step Status:**
- ⬜ Not Started
- 🔄 In Progress
- ✅ Completed
- ❌ Blocked/Failed
**Result Outcome:**
- ✅ Fully Successful (all tests passed, no issues)
- ⚠️ Completed with Issues (needs follow-up)
- ❌ Failed/Blocked
---
## 1. Implementation Overview
### Summary
Create an `evaluation/` subpackage containing the automated evaluation wrapper for SQLEnv. The package provides: (1) a `Policy` protocol defining the interface for any policy, (2) an `EpisodeResult` dataclass for per-episode metrics, (3) an `EvaluationResult` dataclass for aggregate metrics, (4) a `RandomPolicy` class as a built-in baseline, and (5) an `evaluate()` function that runs N episodes, collects results incrementally (surviving partial failures), and returns structured metrics. The module is purely additive -- no existing code is modified.
### Scope
**In Scope:**
- `evaluation/__init__.py` -- public API re-exports
- `evaluation/green_agent.py` -- Protocol, dataclasses, RandomPolicy, evaluate()
- `tests/test_evaluation.py` -- unit + integration tests
**Out of Scope:**
- Modifications to `server/sql_environment.py` or `models.py`
- CLI entry point (future feature)
- Remote / WebSocket evaluation
- Plotting or visualization
---
## 1a. Execution Status
**Progress:** 4/4 steps complete
**Current Step:** All planned implementation steps are complete
**Last Updated:** 2026-03-28T00:04:03Z
**Latest Result:** Fully Successful (Step 2.2 complete)
**Blockers:** None
---
## 1b. Risk Assessment
**Risk Tier:** [x] Low | [ ] Medium | [ ] High
**High-Risk Indicators Present:** (check all that apply if tier is High)
- [ ] Touches authentication or authorization logic
- [ ] Handles payment processing or financial data
- [ ] Manages secrets, API keys, or credentials
- [ ] Processes untrusted user input (file uploads, external APIs)
- [ ] Modifies privilege/permission systems
**Security Review Required:** [ ] Yes (if High) | [x] No
**Justification:**
Pure additive feature. Client-side evaluation loop that reads from the existing environment API. No security, auth, or data mutation concerns.
---
## 2. Change Manifest
### Files to Create
| File | Purpose |
|------|---------|
| `evaluation/__init__.py` | Public API: re-exports Policy, RandomPolicy, EpisodeResult, EvaluationResult, evaluate |
| `evaluation/green_agent.py` | Core evaluation logic: Protocol, dataclasses, RandomPolicy, evaluate() |
| `tests/test_evaluation.py` | Unit tests for types + RandomPolicy, integration test with SQLEnvironment |
### Files to Modify
None.
### Files to Delete
None.
---
## 3. Interface Specifications
### New Types
```python
# Location: evaluation/green_agent.py
from dataclasses import dataclass
from typing import Protocol, runtime_checkable

# SQLObservation / SQLAction are the existing environment types (not modified here).

@runtime_checkable
class Policy(Protocol):
    """Interface for any evaluation policy.

    Any object with a select_action method matching this signature
    is a valid policy (structural subtyping / duck typing).
    """

    def select_action(self, observation: SQLObservation) -> SQLAction:
        """Choose an action given an observation."""
        ...

@dataclass(frozen=True)
class EpisodeResult:
    """Per-episode evaluation metrics."""

    episode_index: int        # 0-based episode number
    correct: bool             # Whether the ANSWER action matched the gold answer
    total_reward: float       # Cumulative reward for the episode
    steps: int                # Number of steps taken
    error: str | None = None  # Error message if the episode failed

@dataclass(frozen=True)
class EvaluationResult:
    """Aggregate evaluation metrics with per-episode breakdown."""

    success_rate: float            # Fraction of correct episodes in [0.0, 1.0]
    avg_reward: float              # Mean total_reward across completed episodes
    avg_steps: float               # Mean steps across completed episodes
    n_episodes: int                # Number of episodes attempted
    n_completed: int               # Episodes that ran to completion (no error)
    episodes: list[EpisodeResult]  # Per-episode breakdown
```
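Because `Policy` is a `runtime_checkable` Protocol, any object with a matching `select_action` method satisfies it structurally -- no inheritance required. The sketch below illustrates this with throwaway stand-ins for `SQLObservation` and `SQLAction` (the real types live in the environment models and have more fields):

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable

# Hypothetical stand-ins for the real environment types, for illustration only.
@dataclass
class SQLObservation:
    budget_remaining: int = 5

@dataclass
class SQLAction:
    action_type: str = "ANSWER"
    argument: str = ""

@runtime_checkable
class Policy(Protocol):
    def select_action(self, observation: SQLObservation) -> SQLAction: ...

# Any class with a matching select_action is a Policy -- no base class needed.
class AlwaysAnswerPolicy:
    def select_action(self, observation: SQLObservation) -> SQLAction:
        return SQLAction(action_type="ANSWER", argument="unknown")

print(isinstance(AlwaysAnswerPolicy(), Policy))  # True
```

This is what lets F006 plug in a trained policy later without importing anything from `evaluation` beyond the type names.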
### New Functions
```python
# Location: evaluation/green_agent.py
from typing import Callable

class RandomPolicy:
    """Built-in random baseline policy.

    Selects random action types and arguments. Deterministic given a seed.
    """

    def __init__(self, seed: int | None = None) -> None:
        """
        Args:
            seed: Random seed for reproducibility. None = non-deterministic.
        """

    def select_action(self, observation: SQLObservation) -> SQLAction:
        """Pick a random action based on the current observation.

        Strategy:
        - If budget_remaining > 1: randomly choose DESCRIBE, SAMPLE, or QUERY
        - If budget_remaining == 1: always ANSWER with a random guess
        - DESCRIBE/SAMPLE: pick a random table from schema_info
        - QUERY: generate a simple SELECT * FROM <table> LIMIT 5
        - ANSWER: pick a random value from the last result or "unknown"

        Args:
            observation: Current environment observation

        Returns:
            A random SQLAction
        """

def evaluate(
    env: SQLEnvironment,
    policy: Policy,
    n_episodes: int = 100,
    *,
    seed: int | None = None,
    progress_callback: Callable[[int, int], None] | None = None,
) -> EvaluationResult:
    """Run automated evaluation of a policy over multiple episodes.

    Collects results incrementally -- if an episode fails, it is recorded
    as an error and evaluation continues with the next episode.

    Args:
        env: The SQLEnvironment instance to evaluate against.
        policy: Any object satisfying the Policy protocol.
        n_episodes: Number of episodes to run (0 returns an empty result).
        seed: Base seed for reproducibility. Episode i resets with seed + i.
        progress_callback: Optional callback(current, total) for progress.

    Returns:
        EvaluationResult with aggregate metrics and per-episode breakdown.

    Raises:
        ValueError: If n_episodes < 0.
    """
```
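The `progress_callback(current, total)` contract is simple enough that a one-function stderr printer covers the "no progress indicator" concern from the Core Intent. This sketch assumes only the documented callback shape; the commented-out `evaluate(...)` call shows the intended wiring:

```python
import sys

def print_progress(current: int, total: int) -> None:
    # Matches the documented progress_callback(current, total) contract.
    pct = 100 * current // total
    sys.stderr.write(f"\repisode {current}/{total} ({pct}%)")
    if current == total:
        sys.stderr.write("\n")

# Intended wiring (evaluate, env, and policy as specified above):
# result = evaluate(env, policy, n_episodes=100, seed=42,
#                   progress_callback=print_progress)

# Stand-alone demonstration of the callback itself:
for i in range(1, 4):
    print_progress(i, 3)
```

The `\r` carriage return keeps the indicator on one line, so long runs do not flood the terminal.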
---
## 4. Data Flow
### Primary Flow
```
1. evaluate(env, policy, n_episodes=100, seed=42)
   - Input: environment, policy, episode count, optional seed
2. For each episode i in range(n_episodes):
   a. obs = env.reset(seed=seed + i if seed is not None else None)
   b. While not obs.done:
      - action = policy.select_action(obs)
      - obs = env.step(action)
      - Accumulate reward
   c. Record EpisodeResult(correct=..., total_reward=..., steps=...)
   d. Call progress_callback(i + 1, n_episodes) if provided
3. Aggregate results (over completed episodes; all zeros if none completed):
   - success_rate = sum(correct) / n_completed
   - avg_reward = mean(total_reward) across completed
   - avg_steps = mean(steps) across completed
4. Return EvaluationResult
```
### Alternative Flows
**When n_episodes=0:**
```
1. Return EvaluationResult(success_rate=0.0, avg_reward=0.0,
avg_steps=0.0, n_episodes=0, n_completed=0, episodes=[])
```
**When episode raises exception:**
```
1. Catch exception in the episode loop
2. Record EpisodeResult(correct=False, total_reward=0.0, steps=0,
error=str(exception))
3. Continue to next episode
```
**When env.reset() fails:**
```
1. Catch exception
2. Record EpisodeResult with error, steps=0
3. Continue to next episode
```
---
## 5. Error Handling
### Error Types
| Error | When | Handling |
|-------|------|----------|
| `ValueError` | `n_episodes < 0` | Raise immediately |
| `Exception` during `env.reset()` | DB not found, bad questions file | Catch, record as failed episode, continue |
| `Exception` during `policy.select_action()` | Policy bug | Catch, record as failed episode, continue |
| `Exception` during `env.step()` | Environment bug | Catch, record as failed episode, continue |
### Error Handling Strategy
```python
# Pattern: incremental collection with per-episode error isolation
episodes: list[EpisodeResult] = []
for i in range(n_episodes):
    episode_seed = seed + i if seed is not None else None
    try:
        obs = env.reset(seed=episode_seed)
        total_reward = 0.0
        steps = 0
        while not obs.done:
            action = policy.select_action(obs)
            obs = env.step(action)
            total_reward += obs.reward or 0.0
            steps += 1
        episodes.append(EpisodeResult(
            episode_index=i,
            correct=_check_correct(obs),
            total_reward=total_reward,
            steps=steps,
        ))
    except Exception as exc:
        episodes.append(EpisodeResult(
            episode_index=i,
            correct=False,
            total_reward=0.0,
            steps=0,
            error=str(exc),
        ))
```
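The isolation behavior can be exercised end to end with throwaway fakes. Everything below (`FlakyEnv`, the one-step observation, `OneShotPolicy`) is hypothetical test scaffolding, not the real `SQLEnvironment`; it only mirrors the `reset`/`step` contract the loop relies on:

```python
from dataclasses import dataclass

@dataclass
class Obs:
    done: bool = False
    reward: float = 0.0

class FlakyEnv:
    """Hypothetical env: reset() raises on the 2nd episode, otherwise one-step episodes."""
    def __init__(self):
        self.calls = 0
    def reset(self, seed=None):
        self.calls += 1
        if self.calls == 2:
            raise RuntimeError("db not found")
        return Obs()
    def step(self, action):
        return Obs(done=True, reward=1.0)

class OneShotPolicy:
    def select_action(self, obs):
        return "ANSWER"

env, policy = FlakyEnv(), OneShotPolicy()
results = []
for i in range(3):
    try:
        obs = env.reset(seed=i)
        total, steps = 0.0, 0
        while not obs.done:
            obs = env.step(policy.select_action(obs))
            total += obs.reward or 0.0
            steps += 1
        results.append({"i": i, "error": None, "reward": total, "steps": steps})
    except Exception as exc:
        results.append({"i": i, "error": str(exc), "reward": 0.0, "steps": 0})

print([r["error"] for r in results])  # [None, 'db not found', None]
```

The failing episode is recorded and the remaining episodes still run, which is exactly the "evaluation crashes partway through and loses all results" failure mode the Core Intent rules out.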
### Retry Strategy
| Operation | Retry? | Strategy |
|-----------|--------|----------|
| Episode evaluation | No | Record error, move to next episode |
| Environment reset | No | Record error, move to next episode |
---
## 6. Slice Plan (What we will ship, in order)
### Slice S1 -- Types, Protocol, and RandomPolicy
**Value:** Establishes the evaluation interface and provides a usable random baseline
**User-visible change:** Yes -- users can instantiate RandomPolicy and call select_action
**Interfaces introduced/changed:** Policy protocol, EpisodeResult, EvaluationResult, RandomPolicy
**Rollback safety:** Purely additive -- new files only, no changes to existing code
### Slice S2 -- evaluate() Function and Integration Test
**Value:** Users can run `evaluate(env, random_policy, n_episodes=100)` and get structured metrics
**User-visible change:** Yes -- the core capability is now available
**Interfaces introduced/changed:** evaluate() function
**Rollback safety:** Purely additive -- extends S1 files, no changes to existing code
---
## 7. Implementation Steps
> **VERIFICATION NOTE:** Test criteria for each step are defined in VERIFICATION_SPEC.md.
> The verification-planner (separate agent) generated independent test criteria.
> Run the tests specified there after implementing each step.
### Step 1.1: Types and Protocol
**Slice:** S1
**Goal:** Define the Policy protocol, EpisodeResult dataclass, and EvaluationResult dataclass.
**Files:**
- `evaluation/__init__.py` - create - package init with public re-exports
- `evaluation/green_agent.py` - create - Protocol + dataclasses (no functions yet)
**Interface Changes:**
- New: `Policy` protocol with `select_action(observation: SQLObservation) -> SQLAction`
- New: `EpisodeResult` frozen dataclass
- New: `EvaluationResult` frozen dataclass
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** ✅ Completed
**Completed:** 2026-03-27T23:51:09Z
**Changes Made:**
- Created `evaluation/__init__.py` with public re-exports for `Policy`, `EpisodeResult`, and `EvaluationResult`.
- Created `evaluation/green_agent.py` with the `Policy` runtime-checkable protocol and frozen `EpisodeResult`/`EvaluationResult` dataclasses.
**Result:**
- **Outcome:** ✅ Fully Successful
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/ -v
Result: 100 passed, 1 skipped
Scope: full project regression run after adding new evaluation types
```
- **Tests run:** `uv run --with pytest pytest tests/ -v`
- **Notes:**
- Dataclass and protocol scaffolding is additive and isolated to a new package.
- `pytest` is not installed in the project environment yet, so verification used `uv run --with pytest` for this step.
- Import fallback mirrors existing package-vs-standalone test collection behavior in the repo.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- Types are defined and importable from `evaluation`
---
### Step 1.2: RandomPolicy Implementation
**Slice:** S1
**Goal:** Implement the RandomPolicy class that selects random actions based on observation state.
**Files:**
- `evaluation/green_agent.py` - modify - add RandomPolicy class
**Interface Changes:**
- New: `RandomPolicy.__init__(seed: int | None = None)`
- New: `RandomPolicy.select_action(observation: SQLObservation) -> SQLAction`
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** ✅ Completed
**Completed:** 2026-03-27T23:55:10Z
**Changes Made:**
- Implemented `RandomPolicy` in `evaluation/green_agent.py` with seed-controlled randomness, budget-aware action selection, schema table parsing, and row-based answer candidate extraction.
- Updated `evaluation/__init__.py` to re-export `RandomPolicy` from the public evaluation API.
- Added `tests/test_evaluation.py` with focused RandomPolicy behavior tests (exploration vs answer mode, determinism, action type coverage, and answer extraction).
**Result:**
- **Outcome:** ✅ Fully Successful
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/test_evaluation.py -v
Result: 6 passed
Scope: RandomPolicy unit coverage for F005 Step 1.2
Command: uv run --with pytest pytest tests/ -v
Result: 106 passed, 1 skipped
Scope: Full regression after RandomPolicy implementation
```
- **Tests run:** `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- **Notes:**
- RandomPolicy always explores with DESCRIBE/SAMPLE/QUERY while budget remains and forces ANSWER on the last step.
- Schema parsing intentionally handles both `- table` and `- table: columns...` observation formats.
- Verification commands in the spec referenced `tests/unit/...`; this repo uses a flat `tests/` layout, so tests were added in `tests/test_evaluation.py`.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- RandomPolicy is implemented and exported from the public `evaluation` API
- Ready to implement `evaluate()` using per-episode loop and error isolation
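The schema-parsing note above (handling both `- table` and `- table: columns...` lines) can be sketched as a small helper. The observation text format here is an assumption taken from the step notes, and `parse_table_names` is a hypothetical name, not the actual implementation:

```python
def parse_table_names(schema_info: str) -> list[str]:
    """Extract table names from a schema listing with lines like
    '- users' or '- users: id, name, email' (format assumed from the step notes)."""
    tables = []
    for line in schema_info.splitlines():
        line = line.strip()
        if line.startswith("- "):
            # Keep only the table name; drop any ': columns...' suffix.
            tables.append(line[2:].split(":", 1)[0].strip())
    return tables

schema = """Tables:
- users: id, name, email
- orders
"""
print(parse_table_names(schema))  # ['users', 'orders']
```

RandomPolicy can then pick a table uniformly from this list for DESCRIBE, SAMPLE, and QUERY actions.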
---
### Step 2.1: evaluate() Function
**Slice:** S2
**Goal:** Implement the core evaluate() function with incremental collection and error isolation.
**Files:**
- `evaluation/green_agent.py` - modify - add evaluate() function
- `evaluation/__init__.py` - modify - add evaluate to re-exports
**Interface Changes:**
- New: `evaluate(env, policy, n_episodes, *, seed, progress_callback) -> EvaluationResult`
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** ✅ Completed
**Completed:** 2026-03-27T23:59:28Z
**Changes Made:**
- Added `evaluate()` to `evaluation/green_agent.py` with per-episode reset/step loop, seed+i reset behavior, progress callback support, and per-episode error isolation.
- Added `evaluate` to `evaluation/__init__.py` public exports.
- Extended `tests/test_evaluation.py` with unit coverage for evaluate happy path, zero/negative episodes, seed propagation, exception handling, aggregate calculations, and progress callback behavior.
**Result:**
- **Outcome:** ✅ Fully Successful
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/test_evaluation.py -v
Result: 14 passed
Scope: RandomPolicy + evaluate() unit coverage for F005 Step 2.1
Command: uv run --with pytest pytest tests/ -v
Result: 114 passed, 1 skipped
Scope: Full regression after evaluate() implementation
```
- **Tests run:** `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- **Notes:**
- evaluate() computes aggregates using completed episodes only (`error is None`), matching the error-isolation behavior in the spec data flow.
- Progress callback is invoked once per attempted episode, including episodes that fail.
- Repository environment still does not include pytest by default, so verification used `uv run --with pytest`.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- evaluate() is implemented, exported, and covered by focused unit tests
- Next step should add/expand integration coverage with a real `SQLEnvironment` evaluation run
---
### Step 2.2: Integration Test with SQLEnvironment
**Slice:** S2
**Goal:** Write integration test that runs evaluate() with RandomPolicy against a real SQLEnvironment.
**Files:**
- `tests/test_evaluation.py` - modify - add integration tests with a real env (unit tests for types, RandomPolicy, and evaluate() were added in earlier steps)
**Interface Changes:**
None (test-only step).
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** ✅ Completed
**Completed:** 2026-03-28T00:04:03Z
**Changes Made:**
- Added `_build_sql_environment()` test helper in `tests/test_evaluation.py` to spin up a real SQLite-backed `SQLEnvironment` with a deterministic question fixture.
- Added `test_evaluate_integration_with_sql_environment` validating end-to-end `evaluate()` execution over 10 episodes with aggregate-metric consistency checks.
- Added `test_evaluate_integration_is_deterministic_with_seeds` validating deterministic full-result equality when both policy and environment seeds are fixed.
**Result:**
- **Outcome:** ✅ Fully Successful
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/test_evaluation.py -v
Result: 16 passed
Scope: evaluation unit + integration coverage including real SQLEnvironment flow
Command: uv run --with pytest pytest tests/ -v
Result: 116 passed, 1 skipped
Scope: full project regression after adding integration coverage
```
- **Tests run:** `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- **Notes:**
- Integration tests were implemented in `tests/test_evaluation.py` to match this repository's flat test layout.
- Verifier gate approved finalization in MVP mode after test evidence review.
- Reviewer auto-step was skipped by policy because risk tier is Low, tests passed, and no security-sensitive surfaces changed.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- All implementation steps are complete and verification gate passed.
---
## 8. Rollout Considerations
### Feature Flags
- [x] Required: No
- [ ] Flag name: N/A
### Migration
- [x] Data migration needed: No
- [ ] Migration strategy: N/A
### Rollback Plan
Delete the `evaluation/` directory. No other code references it.
---
## 9. Execution Tracking
All execution state is tracked within this document:
- **Section 1a:** Overall progress summary
- **Section 7:** Per-step completion details, test results, and handoff context
- **FEATURES.json:** Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
- **Git history:** Full audit trail of changes to this file
The implementing agent updates this document after each step and keeps the matching `FEATURES.json` entry in sync during implementation/finalization. Humans can monitor progress by:
- Checking Section 1a for summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's `progress` and `status` fields in `FEATURES.json`
- Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history
---
## 9a. Slice Completion Protocol
After all steps in a slice pass verification:
1. **Run verifier subagent** for spec compliance
- Validates against VERIFICATION_SPEC.md criteria
- Ensures no TODOs or incomplete work in slice
2. **Run compound-engineer subagent** to extract learnings
- **Mandatory invocation** after every slice completion
- Updates CLAUDE.md Learnings section (if durable patterns found)
- May exit with "no update needed" (valid for routine work)
3. **Commit** the slice changes
- Follow commit message format in CLAUDE.md
- Each slice gets its own atomic commit
4. **Continue to next slice** (if more slices remain)
- Or proceed to final verification if all slices complete
**Note:** PR creation happens only after ALL slices are complete. Use `/commit-push-pr` manually when ready.
---
## 10. User Value Summary
**Status:** Generated
### What Users Can Now Do
Run automated evaluation of any policy over N episodes with `evaluate(env, policy, n_episodes=100)` and get structured metrics including success rate, average reward, average steps, and per-episode breakdown.
### How to Access/Test
```python
from evaluation import evaluate, RandomPolicy
from server.sql_environment import SQLEnvironment
env = SQLEnvironment(questions_path="...", db_dir="...", tokenizer=tokenizer)
policy = RandomPolicy(seed=42)
result = evaluate(env, policy, n_episodes=10, seed=42)
print(f"Success rate: {result.success_rate:.1%}")
print(f"Avg reward: {result.avg_reward:.3f}")
```
### Demo
- **Command:** `uv run python -c "from evaluation import evaluate, RandomPolicy; ..."`
### Release Notes Snippet
Added automated evaluation wrapper with built-in random baseline policy for benchmarking agent performance.
---
## 11. PR Contract (Auto-Generated by autocode-next-step)
**Status:** Generated
### PR Title
feat(evaluation): complete green agent wrapper integration and finalization
### PR Summary
- Add deterministic integration coverage for `evaluate()` against a real `SQLEnvironment` fixture.
- Finalize F005 with full regression evidence, verifier approval, and archived behavior documentation.
- Capture durable learnings under `docs/learnings/` for evaluation patterns and deterministic testing.
### Verification
- `uv run --with pytest pytest tests/test_evaluation.py -v`
- `uv run --with pytest pytest tests/ -v`
### Follow-up
All steps completed. PR Created: https://github.com/hjerpe/sql-env/pull/10
---
## Stop Conditions (When to Split This Spec)
Stop and create a new IMPLEMENTATION_SPEC if:
- A step requires touching more than **3 files** in unrelated areas
- You need to introduce **multiple new abstractions** "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices
When splitting, ensure the current slice ends in a merged, stable state.
---
## Human Checkpoint
**Before handing to AI agent:**
- [ ] Interface specifications are complete
- [ ] Data flow is accurate
- [ ] Error handling is specified
- [ ] Implementation order makes sense
- [ ] VERIFICATION_SPEC.md has been generated
**Questions:**
1. Any remaining concerns?
2. Anything agent should know?
---
## Handoff Notes
**For the implementing AI agent:**
```
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
```
---
*Specification completed: 2026-03-27*
*Approved by: --*
*Verification spec: VERIFICATION_SPEC.md*
*Verification input: [F005-VERIFICATION_INPUT.json](F005-VERIFICATION_INPUT.json)*
*Target agent: Claude Code*