Implementation Specification
Change: F005 -- Green Agent Wrapper (automated evaluation)
Date: 2026-03-27
Research Summary: specs/F005-RESEARCH_SUMMARY.md
Verification Spec: See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
Behavior Spec: Archived to specs/behavior/evaluation.md
Plan Status:
- Draft
- Approved for Implementation
- Implementation Complete
- Verification Passed
Core Intent (Immutable)
DO NOT MODIFY THIS SECTION DURING REFINEMENT. Changes to Core Intent mean you're describing a different feature. If refinement reveals the need to change this section, create a new feature instead.
User Problem: Run automated evaluation: "How does policy X perform over 100 episodes?" Single command, structured output. Enables training comparison (random vs trained).
Success Criteria:
- Single function call: `evaluate(n_episodes=100)` returns a clean metrics dict
- Built-in random policy for instant baseline comparison
- Results include per-episode breakdown for analysis
Avoid:
- Evaluation crashes partway through and loses all results
- No progress indicator for long evaluation runs
Out of Scope:
- Visualization / plotting of results
- WebSocket / remote environment support (local SQLEnvironment only)
- Elaborate policy class hierarchy
- Training loop integration (F006 will consume this API)
0. Slicing & Scope Budget (Anti-Waterfall)
This spec must be executable in small, mergeable increments.
Scope Budget
- Target: 2 slices
- Hard max: <= 10 steps total
- Each step must end in: implement -> verify -> merge
Slice Definition
A slice is a vertical increment that delivers user-visible value or a safe internal capability.
Each slice must have:
- Clear outcome
- Minimal interface change
- Merge criteria
Note: Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).
Status Icons
Step Status:
- ⬜ Not Started
- 🔄 In Progress
- ✅ Completed
- 🚫 Blocked/Failed
Result Outcome:
- ✅ Fully Successful (all tests passed, no issues)
- ⚠️ Completed with Issues (needs follow-up)
- 🚫 Failed/Blocked
1. Implementation Overview
Summary
Create an evaluation/ subpackage containing the automated evaluation wrapper for SQLEnv. The package provides: (1) a Policy protocol defining the interface for any policy, (2) an EpisodeResult dataclass for per-episode metrics, (3) an EvaluationResult dataclass for aggregate metrics, (4) a RandomPolicy class as a built-in baseline, and (5) an evaluate() function that runs N episodes, collects results incrementally (surviving partial failures), and returns structured metrics. The module is purely additive -- no existing code is modified.
Scope
In Scope:
- `evaluation/__init__.py` -- public API re-exports
- `evaluation/green_agent.py` -- Protocol, dataclasses, RandomPolicy, evaluate()
- `tests/test_evaluation.py` -- unit + integration tests
Out of Scope:
- Modifications to `server/sql_environment.py` or `models.py`
- CLI entry point (future feature)
- Remote / WebSocket evaluation
- Plotting or visualization
1a. Execution Status
Progress: 4/4 steps complete
Current Step: All planned implementation steps are complete
Last Updated: 2026-03-28T00:04:03Z
Latest Result: Fully Successful (Step 2.2 complete)
Blockers: None
1b. Risk Assessment
Risk Tier: [x] Low | [ ] Medium | [ ] High
High-Risk Indicators Present: (check all that apply if tier is High)
- Touches authentication or authorization logic
- Handles payment processing or financial data
- Manages secrets, API keys, or credentials
- Processes untrusted user input (file uploads, external APIs)
- Modifies privilege/permission systems
Security Review Required: [ ] Yes (if High) | [x] No
Justification: Pure additive feature. Client-side evaluation loop that reads from the existing environment API. No security, auth, or data mutation concerns.
2. Change Manifest
Files to Create
| File | Purpose |
|---|---|
| `evaluation/__init__.py` | Public API: re-exports Policy, RandomPolicy, EpisodeResult, EvaluationResult, evaluate |
| `evaluation/green_agent.py` | Core evaluation logic: Protocol, dataclasses, RandomPolicy, evaluate() |
| `tests/test_evaluation.py` | Unit tests for types + RandomPolicy, integration test with SQLEnvironment |
Files to Modify
None.
Files to Delete
None.
3. Interface Specifications
New Types
```python
# Location: evaluation/green_agent.py
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@runtime_checkable
class Policy(Protocol):
    """Interface for any evaluation policy.

    Any object with a select_action method matching this signature
    is a valid policy (structural subtyping / duck typing).
    """

    def select_action(self, observation: SQLObservation) -> SQLAction:
        """Choose an action given an observation."""
        ...


@dataclass(frozen=True)
class EpisodeResult:
    """Per-episode evaluation metrics."""

    episode_index: int        # 0-based episode number
    correct: bool             # Whether ANSWER action matched gold
    total_reward: float       # Cumulative reward for the episode
    steps: int                # Number of steps taken
    error: str | None = None  # Error message if episode failed


@dataclass(frozen=True)
class EvaluationResult:
    """Aggregate evaluation metrics with per-episode breakdown."""

    success_rate: float            # Fraction of correct episodes [0.0, 1.0]
    avg_reward: float              # Mean total_reward across episodes
    avg_steps: float               # Mean steps across episodes
    n_episodes: int                # Number of episodes attempted
    n_completed: int               # Episodes that ran to completion (no error)
    episodes: list[EpisodeResult]  # Per-episode breakdown
```
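Because `Policy` is a `runtime_checkable` protocol, any object with a `select_action` method passes an `isinstance` check without inheriting from anything. A minimal sketch of that structural check -- the stub observation/action types below are illustrative stand-ins, not the project's real `SQLObservation`/`SQLAction`, and note that `runtime_checkable` only verifies the method exists, not its signature:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class StubObservation:  # stand-in for SQLObservation
    prompt: str


@dataclass
class StubAction:       # stand-in for SQLAction
    kind: str
    value: str = ""


@runtime_checkable
class Policy(Protocol):
    def select_action(self, observation): ...


class AlwaysAnswerPolicy:
    """Custom policy -- satisfies Policy structurally, no inheritance."""

    def select_action(self, observation):
        return StubAction(kind="ANSWER", value="unknown")


policy = AlwaysAnswerPolicy()
assert isinstance(policy, Policy)  # structural check via runtime_checkable
```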
New Functions
```python
# Location: evaluation/green_agent.py

class RandomPolicy:
    """Built-in random baseline policy.

    Selects random action types and arguments. Deterministic given a seed.
    """

    def __init__(self, seed: int | None = None) -> None:
        """
        Args:
            seed: Random seed for reproducibility. None = non-deterministic.
        """

    def select_action(self, observation: SQLObservation) -> SQLAction:
        """Pick a random action based on current observation.

        Strategy:
        - If budget_remaining > 1: randomly choose DESCRIBE, SAMPLE, or QUERY
        - If budget_remaining == 1: always ANSWER with a random guess
        - DESCRIBE/SAMPLE: pick a random table from schema_info
        - QUERY: generate a simple SELECT * FROM <table> LIMIT 5
        - ANSWER: pick a random value from last result or "unknown"

        Args:
            observation: Current environment observation

        Returns:
            A random SQLAction
        """


def evaluate(
    env: SQLEnvironment,
    policy: Policy,
    n_episodes: int = 100,
    *,
    seed: int | None = None,
    progress_callback: Callable[[int, int], None] | None = None,
) -> EvaluationResult:
    """Run automated evaluation of a policy over multiple episodes.

    Collects results incrementally -- if an episode fails, it is recorded
    as an error and evaluation continues with the next episode.

    Args:
        env: The SQLEnvironment instance to evaluate against.
        policy: Any object satisfying the Policy protocol.
        n_episodes: Number of episodes to run (0 returns empty result).
        seed: Base seed for reproducibility. Episode i uses seed+i.
        progress_callback: Optional callback(current, total) for progress.

    Returns:
        EvaluationResult with aggregate metrics and per-episode breakdown.

    Raises:
        ValueError: If n_episodes < 0.
    """
```
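The budget rule in RandomPolicy's strategy can be sketched as follows. This is a hedged, self-contained illustration using stub types, since the real implementation depends on `SQLObservation`/`SQLAction` fields not shown here; the class and field names below are assumptions for the sketch only:

```python
import random
from dataclasses import dataclass, field


@dataclass
class StubObservation:  # stand-in for SQLObservation
    budget_remaining: int
    tables: list = field(default_factory=lambda: ["users", "orders"])


@dataclass
class StubAction:       # stand-in for SQLAction
    action_type: str
    argument: str = ""


class RandomPolicySketch:
    def __init__(self, seed=None):
        # Per-instance RNG so a fixed seed gives reproducible action streams.
        self._rng = random.Random(seed)

    def select_action(self, observation):
        if observation.budget_remaining <= 1:
            # Last step: always commit to an answer.
            return StubAction("ANSWER", "unknown")
        table = self._rng.choice(observation.tables)
        kind = self._rng.choice(["DESCRIBE", "SAMPLE", "QUERY"])
        if kind == "QUERY":
            return StubAction("QUERY", f"SELECT * FROM {table} LIMIT 5")
        return StubAction(kind, table)
```

Keeping the RNG on the instance (rather than the `random` module globals) is what makes two policies with the same seed produce identical action sequences.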
4. Data Flow
Primary Flow
1. evaluate(env, policy, n_episodes=100, seed=42)
- Input: environment, policy, episode count, optional seed
2. For each episode i in range(n_episodes):
a. obs = env.reset(seed=seed + i if seed is not None else None)
b. While not obs.done:
- action = policy.select_action(obs)
- obs = env.step(action)
- Accumulate reward
c. Record EpisodeResult(correct=..., total_reward=..., steps=...)
d. Call progress_callback(i+1, n_episodes) if provided
3. Aggregate results:
- success_rate = sum(correct) / n_completed
- avg_reward = mean(total_reward) across completed
- avg_steps = mean(steps) across completed
4. Return EvaluationResult
Alternative Flows
When n_episodes=0:
1. Return EvaluationResult(success_rate=0.0, avg_reward=0.0,
avg_steps=0.0, n_episodes=0, n_completed=0, episodes=[])
When episode raises exception:
1. Catch exception in the episode loop
2. Record EpisodeResult(correct=False, total_reward=0.0, steps=0,
error=str(exception))
3. Continue to next episode
When env.reset() fails:
1. Catch exception
2. Record EpisodeResult with error, steps=0
3. Continue to next episode
5. Error Handling
Error Types
| Error | When | Handling |
|---|---|---|
| `ValueError` | `n_episodes < 0` | Raise immediately |
| Exception during `env.reset()` | DB not found, bad questions file | Catch, record as failed episode, continue |
| Exception during `policy.select_action()` | Policy bug | Catch, record as failed episode, continue |
| Exception during `env.step()` | Environment bug | Catch, record as failed episode, continue |
Error Handling Strategy
```python
# Pattern: incremental collection with per-episode error isolation
for i in range(n_episodes):
    episode_seed = seed + i if seed is not None else None
    try:
        obs = env.reset(seed=episode_seed)
        total_reward = 0.0
        steps = 0
        while not obs.done:
            action = policy.select_action(obs)
            obs = env.step(action)
            total_reward += obs.reward or 0.0
            steps += 1
        episodes.append(EpisodeResult(
            episode_index=i,
            correct=_check_correct(obs),
            total_reward=total_reward,
            steps=steps,
        ))
    except Exception as exc:
        episodes.append(EpisodeResult(
            episode_index=i,
            correct=False,
            total_reward=0.0,
            steps=0,
            error=str(exc),
        ))
```
Retry Strategy
| Operation | Retry? | Strategy |
|---|---|---|
| Episode evaluation | No | Record error, move to next episode |
| Environment reset | No | Record error, move to next episode |
6. Slice Plan (What we will ship, in order)
Slice S1 -- Types, Protocol, and RandomPolicy
Value: Establishes the evaluation interface and provides a usable random baseline
User-visible change: Yes -- users can instantiate RandomPolicy and call select_action
Interfaces introduced/changed: Policy protocol, EpisodeResult, EvaluationResult, RandomPolicy
Rollback safety: Purely additive -- new files only, no changes to existing code
Slice S2 -- evaluate() Function and Integration Test
Value: Users can run evaluate(env, random_policy, n_episodes=100) and get structured metrics
User-visible change: Yes -- the core capability is now available
Interfaces introduced/changed: evaluate() function
Rollback safety: Purely additive -- extends S1 files, no changes to existing code
7. Implementation Steps
VERIFICATION NOTE: Test criteria for each step are defined in VERIFICATION_SPEC.md. The verification-planner (separate agent) generated independent test criteria. Run the tests specified there after implementing each step.
Step 1.1: Types and Protocol
Slice: S1
Goal: Define the Policy protocol, EpisodeResult dataclass, and EvaluationResult dataclass.
Files:
- `evaluation/__init__.py` - create - empty init with re-exports
- `evaluation/green_agent.py` - create - Protocol + dataclasses (no functions yet)
Interface Changes:
- New: `Policy` protocol with `select_action(observation: SQLObservation) -> SQLAction`
- New: `EpisodeResult` frozen dataclass
- New: `EvaluationResult` frozen dataclass
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: [x] Low | [ ] Medium | [ ] High
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: ✅ Completed
Completed: 2026-03-27T23:51:09Z
Changes Made:
- Created `evaluation/__init__.py` with public re-exports for `Policy`, `EpisodeResult`, and `EvaluationResult`.
- Created `evaluation/green_agent.py` with the `Policy` runtime-checkable protocol and frozen `EpisodeResult`/`EvaluationResult` dataclasses.
Result:
- Outcome: ✅
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/ -v`
  - Result: 100 passed, 1 skipped
  - Scope: full project regression run after adding new evaluation types
- Tests run: `uv run --with pytest pytest tests/ -v`
- Notes:
  - Dataclass and protocol scaffolding is additive and isolated to a new package.
  - `pytest` is not installed in the project environment yet, so verification used `uv run --with pytest` for this step.
  - Import fallback mirrors existing package-vs-standalone test collection behavior in the repo.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Types are defined and importable from `evaluation`
Step 1.2: RandomPolicy Implementation
Slice: S1
Goal: Implement the RandomPolicy class that selects random actions based on observation state.
Files:
- `evaluation/green_agent.py` - modify - add RandomPolicy class
Interface Changes:
- New: `RandomPolicy.__init__(seed: int | None = None)`
- New: `RandomPolicy.select_action(observation: SQLObservation) -> SQLAction`
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: [x] Low | [ ] Medium | [ ] High
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: ✅ Completed
Completed: 2026-03-27T23:55:10Z
Changes Made:
- Implemented `RandomPolicy` in `evaluation/green_agent.py` with seed-controlled randomness, budget-aware action selection, schema table parsing, and row-based answer candidate extraction.
- Updated `evaluation/__init__.py` to re-export `RandomPolicy` from the public evaluation API.
- Added `tests/test_evaluation.py` with focused RandomPolicy behavior tests (exploration vs answer mode, determinism, action type coverage, and answer extraction).
Result:
- Outcome: ✅
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/test_evaluation.py -v`
  - Result: 6 passed
  - Scope: RandomPolicy unit coverage for F005 Step 1.2
  - Command: `uv run --with pytest pytest tests/ -v`
  - Result: 106 passed, 1 skipped
  - Scope: Full regression after RandomPolicy implementation
- Tests run: `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- Notes:
  - RandomPolicy always explores with DESCRIBE/SAMPLE/QUERY while budget remains and forces ANSWER on the last step.
  - Schema parsing intentionally handles both `- table` and `- table: columns...` observation formats.
  - Verification commands in the spec referenced `tests/unit/...`; this repo uses a flat `tests/` layout, so tests were added in `tests/test_evaluation.py`.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- RandomPolicy is implemented and exported from the public `evaluation` API
- Ready to implement `evaluate()` using a per-episode loop and error isolation
Step 2.1: evaluate() Function
Slice: S2
Goal: Implement the core evaluate() function with incremental collection and error isolation.
Files:
- `evaluation/green_agent.py` - modify - add evaluate() function
- `evaluation/__init__.py` - modify - add evaluate to re-exports
Interface Changes:
- New: `evaluate(env, policy, n_episodes, *, seed, progress_callback) -> EvaluationResult`
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: [x] Low | [ ] Medium | [ ] High
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: ✅ Completed
Completed: 2026-03-27T23:59:28Z
Changes Made:
- Added `evaluate()` to `evaluation/green_agent.py` with a per-episode reset/step loop, seed+i reset behavior, progress callback support, and per-episode error isolation.
- Added `evaluate` to the `evaluation/__init__.py` public exports.
- Extended `tests/test_evaluation.py` with unit coverage for the evaluate happy path, zero/negative episodes, seed propagation, exception handling, aggregate calculations, and progress callback behavior.
Result:
- Outcome: ✅
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/test_evaluation.py -v`
  - Result: 14 passed
  - Scope: RandomPolicy + evaluate() unit coverage for F005 Step 2.1
  - Command: `uv run --with pytest pytest tests/ -v`
  - Result: 114 passed, 1 skipped
  - Scope: Full regression after evaluate() implementation
- Tests run: `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- Notes:
  - evaluate() computes aggregates using completed episodes only (`error is None`), matching the error-isolation behavior in the spec data flow.
  - Progress callback is invoked once per attempted episode, including episodes that fail.
  - Repository environment still does not include pytest by default, so verification used `uv run --with pytest`.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- evaluate() is implemented, exported, and covered by focused unit tests
- Next step should add/expand integration coverage with a real `SQLEnvironment` evaluation run
Step 2.2: Integration Test with SQLEnvironment
Slice: S2
Goal: Write an integration test that runs evaluate() with RandomPolicy against a real SQLEnvironment.
Files:
- `tests/test_evaluation.py` - modify - add integration tests with a real env (unit tests for types, RandomPolicy, and evaluate() were already added in Steps 1.2 and 2.1)
Interface Changes: None (test-only step).
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: [x] Low | [ ] Medium | [ ] High
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-28T00:04:03Z
Changes Made:
- Added a `_build_sql_environment()` test helper in `tests/test_evaluation.py` to spin up a real SQLite-backed `SQLEnvironment` with a deterministic question fixture.
- Added `test_evaluate_integration_with_sql_environment`, validating end-to-end `evaluate()` execution over 10 episodes with aggregate-metric consistency checks.
- Added `test_evaluate_integration_is_deterministic_with_seeds`, validating deterministic full-result equality when both policy and environment seeds are fixed.
Result:
- Outcome: Fully Successful
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/test_evaluation.py -v`
  - Result: 16 passed
  - Scope: evaluation unit + integration coverage including the real SQLEnvironment flow
  - Command: `uv run --with pytest pytest tests/ -v`
  - Result: 116 passed, 1 skipped
  - Scope: full project regression after adding integration coverage
- Tests run: `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- Notes:
  - Integration tests were implemented in `tests/test_evaluation.py` to match this repository's flat test layout.
  - Verifier gate approved finalization in MVP mode after test evidence review.
  - Reviewer auto-step was skipped by policy because the risk tier is Low, tests passed, and no security-sensitive surfaces changed.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- All implementation steps are complete and verification gate passed.
8. Rollout Considerations
Feature Flags
- Required: No
- Flag name: N/A
Migration
- Data migration needed: No
- Migration strategy: N/A
Rollback Plan
Delete the evaluation/ directory. No other code references it.
9. Execution Tracking
All execution state is tracked within this document:
- Section 1a: Overall progress summary
- Section 7: Per-step completion details, test results, and handoff context
- FEATURES.json: Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
- Git history: Full audit trail of changes to this file
The implementing agent updates this document after each step and keeps the matching FEATURES.json entry in sync during implementation/finalization. Humans can monitor progress by:
- Checking Section 1a for summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's `progress` and `status` fields in `FEATURES.json`
- Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history
9a. Slice Completion Protocol
After all steps in a slice pass verification:
1. Run verifier subagent for spec compliance
   - Validates against VERIFICATION_SPEC.md criteria
   - Ensures no TODOs or incomplete work in slice
2. Run compound-engineer subagent to extract learnings
   - Mandatory invocation after every slice completion
   - Updates CLAUDE.md Learnings section (if durable patterns found)
   - May exit with "no update needed" (valid for routine work)
3. Commit the slice changes
   - Follow commit message format in CLAUDE.md
   - Each slice gets its own atomic commit
4. Continue to next slice (if more slices remain)
   - Or proceed to final verification if all slices complete
Note: PR creation happens only after ALL slices are complete. Use /commit-push-pr manually when ready.
10. User Value Summary
Status: Generated
What Users Can Now Do
Run automated evaluation of any policy over N episodes with evaluate(env, policy, n_episodes=100) and get structured metrics including success rate, average reward, average steps, and per-episode breakdown.
How to Access/Test
```python
from evaluation import evaluate, RandomPolicy
from server.sql_environment import SQLEnvironment

env = SQLEnvironment(questions_path="...", db_dir="...", tokenizer=tokenizer)
policy = RandomPolicy(seed=42)
result = evaluate(env, policy, n_episodes=10, seed=42)

print(f"Success rate: {result.success_rate:.1%}")
print(f"Avg reward: {result.avg_reward:.3f}")
```
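For long runs, the `progress_callback` parameter of `evaluate()` can surface progress. The callback below is a hypothetical example, not part of the spec'd API -- any callable accepting `(current, total)` works:

```python
def format_progress(current: int, total: int) -> str:
    """Render a simple progress line; invoked once per attempted episode."""
    pct = 100 * current // total
    return f"Evaluating: {current}/{total} ({pct}%)"


def print_progress(current: int, total: int) -> None:
    # Carriage return keeps the progress line in place in a terminal.
    print("\r" + format_progress(current, total), end="", flush=True)

# Wiring (sketch):
#   evaluate(env, policy, n_episodes=100, progress_callback=print_progress)
```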
Demo
- Command: `uv run python -c "from evaluation import evaluate, RandomPolicy; ..."`
Release Notes Snippet
Added automated evaluation wrapper with built-in random baseline policy for benchmarking agent performance.
11. PR Contract (Auto-Generated by autocode-next-step)
Status: Generated
PR Title
feat(evaluation): complete green agent wrapper integration and finalization
PR Summary
- Add deterministic integration coverage for `evaluate()` against a real `SQLEnvironment` fixture.
- Finalize F005 with full regression evidence, verifier approval, and archived behavior documentation.
- Capture durable learnings under `docs/learnings/` for evaluation patterns and deterministic testing.
Verification
- `uv run --with pytest pytest tests/test_evaluation.py -v`
- `uv run --with pytest pytest tests/ -v`
Follow-up
All steps completed. PR Created: https://github.com/hjerpe/sql-env/pull/10
Stop Conditions (When to Split This Spec)
Stop and create a new IMPLEMENTATION_SPEC if:
- A step requires touching more than 3 files in unrelated areas
- You need to introduce multiple new abstractions "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices
When splitting, ensure the current slice ends in a merged, stable state.
Human Checkpoint
Before handing to AI agent:
- Interface specifications are complete
- Data flow is accurate
- Error handling is specified
- Implementation order makes sense
- VERIFICATION_SPEC.md has been generated
Questions:
- Any remaining concerns?
- Anything agent should know?
Handoff Notes
For the implementing AI agent:
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Specification completed: 2026-03-27
Approved by: --
Verification spec: VERIFICATION_SPEC.md
Verification input: F005-VERIFICATION_INPUT.json
Target agent: Claude Code