# Implementation Specification

**Change:** F005 -- Green Agent Wrapper (automated evaluation)
**Date:** 2026-03-27
**Research Summary:** [specs/F005-RESEARCH_SUMMARY.md](F005-RESEARCH_SUMMARY.md)
**Verification Spec:** See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
**Behavior Spec:** Archived to [specs/behavior/evaluation.md](behavior/evaluation.md)

**Plan Status:**

- [x] Draft
- [x] Approved for Implementation
- [x] Implementation Complete
- [x] Verification Passed

---

## Core Intent (Immutable)

> **DO NOT MODIFY THIS SECTION DURING REFINEMENT**
> Changes to Core Intent mean you're describing a different feature.
> If refinement reveals the need to change this section, create a new feature instead.

**User Problem:**
Run automated evaluation: "How does policy X perform over 100 episodes?" Single command, structured output. Enables training comparison (random vs trained).

**Success Criteria:**

- Single function call: `evaluate(n_episodes=100)` returns clean metrics dict
- Built-in random policy for instant baseline comparison
- Results include per-episode breakdown for analysis

**Avoid:**

- Evaluation crashes partway through and loses all results
- No progress indicator for long evaluation runs

**Out of Scope:**

- Visualization / plotting of results
- WebSocket / remote environment support (local SQLEnvironment only)
- Elaborate policy class hierarchy
- Training loop integration (F006 will consume this API)

---

## 0. Slicing & Scope Budget (Anti-Waterfall)

This spec must be executable in **small, mergeable increments**.

### Scope Budget

- Target: **2 slices**
- Hard max: **<= 10 steps total**
- Each step must end in: **implement -> verify -> merge**

### Slice Definition

A slice is a vertical increment that delivers user-visible value or a safe internal capability.

**Each slice must have:**

- Clear outcome
- Minimal interface change
- Merge criteria

**Note:** Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).
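
The success criteria above can be sketched end to end with stand-ins: a single call returns aggregate metrics plus a per-episode breakdown, and a failing episode is recorded rather than aborting the run. Everything here (`evaluate_sketch`, `random_baseline`, the exact metric names) is illustrative only; the real interfaces are specified in Section 3.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class EpisodeResult:
    """Per-episode record; the real dataclass is specified in Section 3."""

    episode_index: int
    correct: bool
    total_reward: float


def evaluate_sketch(run_episode, n_episodes: int = 100) -> dict:
    """Single call in, metrics out; results are collected incrementally
    so one crashing episode cannot lose the rest of the run."""
    episodes = []
    for i in range(n_episodes):
        try:
            correct, reward = run_episode(i)
        except Exception:
            correct, reward = False, 0.0  # record the failure, keep going
        episodes.append(EpisodeResult(i, correct, reward))
    n = len(episodes) or 1  # guard the n_episodes=0 case
    return {
        "success_rate": sum(e.correct for e in episodes) / n,
        "avg_reward": sum(e.total_reward for e in episodes) / n,
        "episodes": episodes,  # per-episode breakdown for analysis
    }


rng = random.Random(42)


def random_baseline(i: int) -> tuple[bool, float]:
    """Stand-in for a random-policy rollout; the real one drives SQLEnvironment."""
    correct = rng.random() < 0.1
    return correct, (1.0 if correct else -0.5)


result = evaluate_sketch(random_baseline, n_episodes=100)
print(f"success_rate={result['success_rate']:.2f}")
```

The try/except inside the loop is the "don't lose partial results" behavior named under **Avoid**: a crash in episode 37 still leaves 36 recorded episodes plus a recorded failure.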

## Status Icons

**Step Status:**

- ⏳ Not Started
- 🔄 In Progress
- ✅ Completed
- 🚫 Blocked/Failed

**Result Outcome:**

- ✅ Fully Successful (all tests passed, no issues)
- ⚠️ Completed with Issues (needs follow-up)
- 🚫 Failed/Blocked

---

## 1. Implementation Overview

### Summary

Create an `evaluation/` subpackage containing the automated evaluation wrapper for SQLEnv. The package provides: (1) a `Policy` protocol defining the interface for any policy, (2) an `EpisodeResult` dataclass for per-episode metrics, (3) an `EvaluationResult` dataclass for aggregate metrics, (4) a `RandomPolicy` class as a built-in baseline, and (5) an `evaluate()` function that runs N episodes, collects results incrementally (surviving partial failures), and returns structured metrics. The module is purely additive -- no existing code is modified.

### Scope

**In Scope:**

- `evaluation/__init__.py` -- public API re-exports
- `evaluation/green_agent.py` -- Protocol, dataclasses, RandomPolicy, evaluate()
- `tests/test_evaluation.py` -- unit + integration tests

**Out of Scope:**

- Modifications to `server/sql_environment.py` or `models.py`
- CLI entry point (future feature)
- Remote / WebSocket evaluation
- Plotting or visualization

---

## 1a. Execution Status

**Progress:** 4/4 steps complete
**Current Step:** All planned implementation steps are complete
**Last Updated:** 2026-03-28T00:04:03Z
**Latest Result:** Fully Successful (Step 2.2 complete)
**Blockers:** None

---

## 1b. Risk Assessment

**Risk Tier:** [x] Low | [ ] Medium | [ ] High

**High-Risk Indicators Present:** (check all that apply if tier is High)

- [ ] Touches authentication or authorization logic
- [ ] Handles payment processing or financial data
- [ ] Manages secrets, API keys, or credentials
- [ ] Processes untrusted user input (file uploads, external APIs)
- [ ] Modifies privilege/permission systems

**Security Review Required:** [ ] Yes (if High) | [x] No

**Justification:** Pure additive feature.
Client-side evaluation loop that reads from the existing environment API. No security, auth, or data mutation concerns.

---

## 2. Change Manifest

### Files to Create

| File | Purpose |
|------|---------|
| `evaluation/__init__.py` | Public API: re-exports Policy, RandomPolicy, EpisodeResult, EvaluationResult, evaluate |
| `evaluation/green_agent.py` | Core evaluation logic: Protocol, dataclasses, RandomPolicy, evaluate() |
| `tests/test_evaluation.py` | Unit tests for types + RandomPolicy, integration test with SQLEnvironment |

### Files to Modify

None.

### Files to Delete

None.

---

## 3. Interface Specifications

### New Types

```python
# Location: evaluation/green_agent.py
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@runtime_checkable
class Policy(Protocol):
    """Interface for any evaluation policy.

    Any object with a select_action method matching this signature is a
    valid policy (structural subtyping / duck typing).
    """

    def select_action(self, observation: SQLObservation) -> SQLAction:
        """Choose an action given an observation."""
        ...


@dataclass(frozen=True)
class EpisodeResult:
    """Per-episode evaluation metrics."""

    episode_index: int        # 0-based episode number
    correct: bool             # Whether ANSWER action matched gold
    total_reward: float       # Cumulative reward for the episode
    steps: int                # Number of steps taken
    error: str | None = None  # Error message if episode failed


@dataclass(frozen=True)
class EvaluationResult:
    """Aggregate evaluation metrics with per-episode breakdown."""

    success_rate: float            # Fraction of correct episodes [0.0, 1.0]
    avg_reward: float              # Mean total_reward across episodes
    avg_steps: float               # Mean steps across episodes
    n_episodes: int                # Number of episodes attempted
    n_completed: int               # Episodes that ran to completion (no error)
    episodes: list[EpisodeResult]  # Per-episode breakdown
```

### New Functions

```python
# Location: evaluation/green_agent.py
class RandomPolicy:
    """Built-in random baseline policy.

    Selects random action types and arguments. Deterministic given a seed.
    """

    def __init__(self, seed: int | None = None) -> None:
        """
        Args:
            seed: Random seed for reproducibility. None = non-deterministic.
        """

    def select_action(self, observation: SQLObservation) -> SQLAction:
        """Pick a random action based on current observation.

        Strategy:
        - If budget_remaining > 1: randomly choose DESCRIBE, SAMPLE, or QUERY
        - If budget_remaining == 1: always ANSWER with a random guess
        - DESCRIBE/SAMPLE: pick a random table from schema_info
        - QUERY: generate a simple SELECT * FROM <table> LIMIT 5
        - ANSWER: pick a random value from last result or "unknown"

        Args:
            observation: Current environment observation

        Returns:
            A random SQLAction
        """


def evaluate(
    env: SQLEnvironment,
    policy: Policy,
    n_episodes: int = 100,
    *,
    seed: int | None = None,
    progress_callback: Callable[[int, int], None] | None = None,
) -> EvaluationResult:
    """Run automated evaluation of a policy over multiple episodes.

    Collects results incrementally -- if an episode fails, it is recorded
    as an error and evaluation continues with the next episode.

    Args:
        env: The SQLEnvironment instance to evaluate against.
        policy: Any object satisfying the Policy protocol.
        n_episodes: Number of episodes to run (0 returns empty result).
        seed: Base seed for reproducibility. Episode i uses seed+i.
        progress_callback: Optional callback(current, total) for progress.

    Returns:
        EvaluationResult with aggregate metrics and per-episode breakdown.

    Raises:
        ValueError: If n_episodes < 0.
    """
```

---

## 4. Data Flow

### Primary Flow

```
1. evaluate(env, policy, n_episodes=100, seed=42)
   - Input: environment, policy, episode count, optional seed
2. For each episode i in range(n_episodes):
   a. obs = env.reset(seed=seed+i if seed else None)
   b. While not obs.done:
      - action = policy.select_action(obs)
      - obs = env.step(action)
      - Accumulate reward
   c. Record EpisodeResult(correct=..., total_reward=..., steps=...)
   d. Call progress_callback(i+1, n_episodes) if provided
3.
   Aggregate results:
   - success_rate = sum(correct) / n_completed
   - avg_reward = mean(total_reward) across completed
   - avg_steps = mean(steps) across completed
4. Return EvaluationResult
```

### Alternative Flows

**When n_episodes=0:**

```
1. Return EvaluationResult(success_rate=0.0, avg_reward=0.0, avg_steps=0.0,
   n_episodes=0, n_completed=0, episodes=[])
```

**When episode raises exception:**

```
1. Catch exception in the episode loop
2. Record EpisodeResult(correct=False, total_reward=0.0, steps=0, error=str(exception))
3. Continue to next episode
```

**When env.reset() fails:**

```
1. Catch exception
2. Record EpisodeResult with error, steps=0
3. Continue to next episode
```

---

## 5. Error Handling

### Error Types

| Error | When | Handling |
|-------|------|----------|
| `ValueError` | `n_episodes < 0` | Raise immediately |
| `Exception` during `env.reset()` | DB not found, bad questions file | Catch, record as failed episode, continue |
| `Exception` during `policy.select_action()` | Policy bug | Catch, record as failed episode, continue |
| `Exception` during `env.step()` | Environment bug | Catch, record as failed episode, continue |

### Error Handling Strategy

```python
# Pattern: incremental collection with per-episode error isolation
for i in range(n_episodes):
    try:
        obs = env.reset(seed=episode_seed)
        total_reward = 0.0
        steps = 0
        while not obs.done:
            action = policy.select_action(obs)
            obs = env.step(action)
            total_reward += obs.reward or 0.0
            steps += 1
        episodes.append(EpisodeResult(
            episode_index=i,
            correct=_check_correct(obs),
            total_reward=total_reward,
            steps=steps,
        ))
    except Exception as exc:
        episodes.append(EpisodeResult(
            episode_index=i,
            correct=False,
            total_reward=0.0,
            steps=0,
            error=str(exc),
        ))
```

### Retry Strategy

| Operation | Retry? | Strategy |
|-----------|--------|----------|
| Episode evaluation | No | Record error, move to next episode |
| Environment reset | No | Record error, move to next episode |

---

## 6. Slice Plan (What we will ship, in order)

### Slice S1 -- Types, Protocol, and RandomPolicy

**Value:** Establishes the evaluation interface and provides a usable random baseline
**User-visible change:** Yes -- users can instantiate RandomPolicy and call select_action
**Interfaces introduced/changed:** Policy protocol, EpisodeResult, EvaluationResult, RandomPolicy
**Rollback safety:** Purely additive -- new files only, no changes to existing code

### Slice S2 -- evaluate() Function and Integration Test

**Value:** Users can run `evaluate(env, random_policy, n_episodes=100)` and get structured metrics
**User-visible change:** Yes -- the core capability is now available
**Interfaces introduced/changed:** evaluate() function
**Rollback safety:** Purely additive -- extends S1 files, no changes to existing code

---

## 7. Implementation Steps

> **VERIFICATION NOTE:** Test criteria for each step are defined in VERIFICATION_SPEC.md.
> The verification-planner (separate agent) generated independent test criteria.
> Run the tests specified there after implementing each step.

### Step 1.1: Types and Protocol

**Slice:** S1
**Goal:** Define the Policy protocol, EpisodeResult dataclass, and EvaluationResult dataclass.

**Files:**

- `evaluation/__init__.py` - create - empty init with re-exports
- `evaluation/green_agent.py` - create - Protocol + dataclasses (no functions yet)

**Interface Changes:**

- New: `Policy` protocol with `select_action(observation: SQLObservation) -> SQLAction`
- New: `EpisodeResult` frozen dataclass
- New: `EvaluationResult` frozen dataclass

**Verification:**

> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High

**Merge Criteria:**

- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** ✅
Completed

**Completed:** 2026-03-27T23:51:09Z

**Changes Made:**

- Created `evaluation/__init__.py` with public re-exports for `Policy`, `EpisodeResult`, and `EvaluationResult`.
- Created `evaluation/green_agent.py` with the `Policy` runtime-checkable protocol and frozen `EpisodeResult`/`EvaluationResult` dataclasses.

**Result:**

- **Outcome:** ✅ Fully Successful
- **Evidence Captured:**

  ```
  Command: uv run --with pytest pytest tests/ -v
  Result: 100 passed, 1 skipped
  Scope: full project regression run after adding new evaluation types
  ```

- **Tests run:** `uv run --with pytest pytest tests/ -v`
- **Notes:**
  - Dataclass and protocol scaffolding is additive and isolated to a new package.
  - `pytest` is not installed in the project environment yet, so verification used `uv run --with pytest` for this step.
  - Import fallback mirrors existing package-vs-standalone test collection behavior in the repo.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** ⬜ N/A

**Context for Next Step:**

- Types are defined and importable from `evaluation`

---

### Step 1.2: RandomPolicy Implementation

**Slice:** S1
**Goal:** Implement the RandomPolicy class that selects random actions based on observation state.

**Files:**

- `evaluation/green_agent.py` - modify - add RandomPolicy class

**Interface Changes:**

- New: `RandomPolicy.__init__(seed: int | None = None)`
- New: `RandomPolicy.select_action(observation: SQLObservation) -> SQLAction`

**Verification:**

> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High

**Merge Criteria:**

- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** ✅
Completed

**Completed:** 2026-03-27T23:55:10Z

**Changes Made:**

- Implemented `RandomPolicy` in `evaluation/green_agent.py` with seed-controlled randomness, budget-aware action selection, schema table parsing, and row-based answer candidate extraction.
- Updated `evaluation/__init__.py` to re-export `RandomPolicy` from the public evaluation API.
- Added `tests/test_evaluation.py` with focused RandomPolicy behavior tests (exploration vs answer mode, determinism, action type coverage, and answer extraction).

**Result:**

- **Outcome:** ✅ Fully Successful
- **Evidence Captured:**

  ```
  Command: uv run --with pytest pytest tests/test_evaluation.py -v
  Result: 6 passed
  Scope: RandomPolicy unit coverage for F005 Step 1.2

  Command: uv run --with pytest pytest tests/ -v
  Result: 106 passed, 1 skipped
  Scope: Full regression after RandomPolicy implementation
  ```

- **Tests run:** `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- **Notes:**
  - RandomPolicy always explores with DESCRIBE/SAMPLE/QUERY while budget remains and forces ANSWER on the last step.
  - Schema parsing intentionally handles both `- table` and `- table: columns...` observation formats.
  - Verification commands in the spec referenced `tests/unit/...`; this repo uses a flat `tests/` layout, so tests were added in `tests/test_evaluation.py`.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** ⬜ N/A

**Context for Next Step:**

- RandomPolicy is implemented and exported from the public `evaluation` API
- Ready to implement `evaluate()` using per-episode loop and error isolation

---

### Step 2.1: evaluate() Function

**Slice:** S2
**Goal:** Implement the core evaluate() function with incremental collection and error isolation.

**Files:**

- `evaluation/green_agent.py` - modify - add evaluate() function
- `evaluation/__init__.py` - modify - add evaluate to re-exports

**Interface Changes:**

- New: `evaluate(env, policy, n_episodes, *, seed, progress_callback) -> EvaluationResult`

**Verification:**

> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High

**Merge Criteria:**

- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** ✅ Completed

**Completed:** 2026-03-27T23:59:28Z

**Changes Made:**

- Added `evaluate()` to `evaluation/green_agent.py` with per-episode reset/step loop, seed+i reset behavior, progress callback support, and per-episode error isolation.
- Added `evaluate` to `evaluation/__init__.py` public exports.
- Extended `tests/test_evaluation.py` with unit coverage for evaluate happy path, zero/negative episodes, seed propagation, exception handling, aggregate calculations, and progress callback behavior.

**Result:**

- **Outcome:** ✅ Fully Successful
- **Evidence Captured:**

  ```
  Command: uv run --with pytest pytest tests/test_evaluation.py -v
  Result: 14 passed
  Scope: RandomPolicy + evaluate() unit coverage for F005 Step 2.1

  Command: uv run --with pytest pytest tests/ -v
  Result: 114 passed, 1 skipped
  Scope: Full regression after evaluate() implementation
  ```

- **Tests run:** `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- **Notes:**
  - evaluate() computes aggregates using completed episodes only (`error is None`), matching the error-isolation behavior in the spec data flow.
  - Progress callback is invoked once per attempted episode, including episodes that fail.
  - Repository environment still does not include pytest by default, so verification used `uv run --with pytest`.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** ⬜ N/A

**Context for Next Step:**

- evaluate() is implemented, exported, and covered by focused unit tests
- Next step should add/expand integration coverage with a real `SQLEnvironment` evaluation run

---

### Step 2.2: Integration Test with SQLEnvironment

**Slice:** S2
**Goal:** Write an integration test that runs evaluate() with RandomPolicy against a real SQLEnvironment.

**Files:**

- `tests/test_evaluation.py` - create - unit tests for types + RandomPolicy + evaluate(); integration test with real env

**Interface Changes:**

None (test-only step).

**Verification:**

> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High

**Merge Criteria:**

- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** ✅ Completed

**Completed:** 2026-03-28T00:04:03Z

**Changes Made:**

- Added `_build_sql_environment()` test helper in `tests/test_evaluation.py` to spin up a real SQLite-backed `SQLEnvironment` with a deterministic question fixture.
- Added `test_evaluate_integration_with_sql_environment` validating end-to-end `evaluate()` execution over 10 episodes with aggregate-metric consistency checks.
- Added `test_evaluate_integration_is_deterministic_with_seeds` validating deterministic full-result equality when both policy and environment seeds are fixed.

**Result:**

- **Outcome:** ✅ Fully Successful
- **Evidence Captured:**

  ```
  Command: uv run --with pytest pytest tests/test_evaluation.py -v
  Result: 16 passed
  Scope: evaluation unit + integration coverage including real SQLEnvironment flow

  Command: uv run --with pytest pytest tests/ -v
  Result: 116 passed, 1 skipped
  Scope: full project regression after adding integration coverage
  ```

- **Tests run:** `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- **Notes:**
  - Integration tests were implemented in `tests/test_evaluation.py` to match this repository's flat test layout.
  - Verifier gate approved finalization in MVP mode after test evidence review.
  - Reviewer auto-step was skipped by policy because risk tier is Low, tests passed, and no security-sensitive surfaces changed.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**

- All implementation steps are complete and verification gate passed.

---

## 8. Rollout Considerations

### Feature Flags

- [x] Required: No
- [ ] Flag name: N/A

### Migration

- [x] Data migration needed: No
- [ ] Migration strategy: N/A

### Rollback Plan

Delete the `evaluation/` directory. No other code references it.

---

## 9. Execution Tracking

All execution state is tracked within this document:

- **Section 1a:** Overall progress summary
- **Section 7:** Per-step completion details, test results, and handoff context
- **FEATURES.json:** Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
- **Git history:** Full audit trail of changes to this file

The implementing agent updates this document after each step and keeps the matching `FEATURES.json` entry in sync during implementation/finalization.

Humans can monitor progress by:

- Checking Section 1a for summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's `progress` and `status` fields in `FEATURES.json`
- Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history

---

## 9a. Slice Completion Protocol

After all steps in a slice pass verification:

1. **Run verifier subagent** for spec compliance
   - Validates against VERIFICATION_SPEC.md criteria
   - Ensures no TODOs or incomplete work in slice
2. **Run compound-engineer subagent** to extract learnings
   - **Mandatory invocation** after every slice completion
   - Updates CLAUDE.md Learnings section (if durable patterns found)
   - May exit with "no update needed" (valid for routine work)
3. **Commit** the slice changes
   - Follow commit message format in CLAUDE.md
   - Each slice gets its own atomic commit
4. **Continue to next slice** (if more slices remain)
   - Or proceed to final verification if all slices complete

**Note:** PR creation happens only after ALL slices are complete. Use `/commit-push-pr` manually when ready.

---

## 10. User Value Summary

**Status:** Generated

### What Users Can Now Do

Run automated evaluation of any policy over N episodes with `evaluate(env, policy, n_episodes=100)` and get structured metrics including success rate, average reward, average steps, and per-episode breakdown.

### How to Access/Test

```python
from evaluation import evaluate, RandomPolicy
from server.sql_environment import SQLEnvironment

env = SQLEnvironment(questions_path="...", db_dir="...", tokenizer=tokenizer)
policy = RandomPolicy(seed=42)
result = evaluate(env, policy, n_episodes=10, seed=42)
print(f"Success rate: {result.success_rate:.1%}")
print(f"Avg reward: {result.avg_reward:.3f}")
```

### Demo

- **Command:** `uv run python -c "from evaluation import evaluate, RandomPolicy; ..."`

### Release Notes Snippet

Added automated evaluation wrapper with built-in random baseline policy for benchmarking agent performance.

---

## 11. PR Contract (Auto-Generated by autocode-next-step)

**Status:** Generated

### PR Title

feat(evaluation): complete green agent wrapper integration and finalization

### PR Summary

- Add deterministic integration coverage for `evaluate()` against a real `SQLEnvironment` fixture.
- Finalize F005 with full regression evidence, verifier approval, and archived behavior documentation.
- Capture durable learnings under `docs/learnings/` for evaluation patterns and deterministic testing.

### Verification

- `uv run --with pytest pytest tests/test_evaluation.py -v`
- `uv run --with pytest pytest tests/ -v`

### Follow-up

All steps completed. PR Created: https://github.com/hjerpe/sql-env/pull/10

---

## Stop Conditions (When to Split This Spec)

Stop and create a new IMPLEMENTATION_SPEC if:

- A step requires touching more than **3 files** in unrelated areas
- You need to introduce **multiple new abstractions** "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices

When splitting, ensure the current slice ends in a merged, stable state.

---

## Human Checkpoint

**Before handing to AI agent:**

- [ ] Interface specifications are complete
- [ ] Data flow is accurate
- [ ] Error handling is specified
- [ ] Implementation order makes sense
- [ ] VERIFICATION_SPEC.md has been generated

**Questions:**

1. Any remaining concerns?
2. Anything agent should know?
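
One property worth a quick sanity check during review is the duck-typed `Policy` protocol from Section 3: any object with a matching `select_action` method qualifies, with no inheritance required. A minimal self-contained sketch (the simplified observation type and the `ConstantPolicy`/`NotAPolicy` stubs are illustrative, not part of the spec):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Policy(Protocol):
    """Mirror of the Section 3 protocol, with the observation type simplified."""

    def select_action(self, observation: str) -> str:
        ...


class ConstantPolicy:
    """Never inherits from Policy -- structural subtyping makes it valid."""

    def select_action(self, observation: str) -> str:
        return "ANSWER: unknown"


class NotAPolicy:
    """Lacks select_action, so it fails the runtime protocol check."""


assert isinstance(ConstantPolicy(), Policy)      # method present -> valid policy
assert not isinstance(NotAPolicy(), Policy)      # method missing -> rejected
print(ConstantPolicy().select_action("obs"))
```

Note that `@runtime_checkable` only verifies method presence at `isinstance` time, not the signature; static type checkers provide the full structural check.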

---

## Handoff Notes

**For the implementing AI agent:**

```
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
```

---

*Specification completed: 2026-03-27*
*Approved by: --*
*Verification spec: VERIFICATION_SPEC.md*
*Verification input: [F005-VERIFICATION_INPUT.json](F005-VERIFICATION_INPUT.json)*
*Target agent: Claude Code*