Implementation Specification
Change: F005 -- Green Agent Wrapper (automated evaluation)
Date: 2026-03-27
Research Summary: specs/F005-RESEARCH_SUMMARY.md
Verification Spec: See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
Behavior Spec: Archived to specs/behavior/evaluation.md
Plan Status:
- Draft
- Approved for Implementation
- Implementation Complete
- Verification Passed
Core Intent (Immutable)
DO NOT MODIFY THIS SECTION DURING REFINEMENT. Changes to Core Intent mean you're describing a different feature. If refinement reveals the need to change this section, create a new feature instead.
User Problem: Run automated evaluation: "How does policy X perform over 100 episodes?" Single command, structured output. Enables training comparison (random vs trained).
Success Criteria:
- Single function call: `evaluate(n_episodes=100)` returns a clean metrics dict
- Built-in random policy for instant baseline comparison
- Results include per-episode breakdown for analysis
Avoid:
- Evaluation crashes partway through and loses all results
- No progress indicator for long evaluation runs
Out of Scope:
- Visualization / plotting of results
- WebSocket / remote environment support (local SQLEnvironment only)
- Elaborate policy class hierarchy
- Training loop integration (F006 will consume this API)
0. Slicing & Scope Budget (Anti-Waterfall)
This spec must be executable in small, mergeable increments.
Scope Budget
- Target: 2 slices
- Hard max: <= 10 steps total
- Each step must end in: implement -> verify -> merge
Slice Definition
A slice is a vertical increment that delivers user-visible value or a safe internal capability.
Each slice must have:
- Clear outcome
- Minimal interface change
- Merge criteria
Note: Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).
Status Icons
Step Status:
- ⬜ Not Started
- 🔄 In Progress
- ✅ Completed
- 🚫 Blocked/Failed
Result Outcome:
- ✅ Fully Successful (all tests passed, no issues)
- ⚠️ Completed with Issues (needs follow-up)
- 🚫 Failed/Blocked
1. Implementation Overview
Summary
Create an evaluation/ subpackage containing the automated evaluation wrapper for SQLEnv. The package provides: (1) a Policy protocol defining the interface for any policy, (2) an EpisodeResult dataclass for per-episode metrics, (3) an EvaluationResult dataclass for aggregate metrics, (4) a RandomPolicy class as a built-in baseline, and (5) an evaluate() function that runs N episodes, collects results incrementally (surviving partial failures), and returns structured metrics. The module is purely additive -- no existing code is modified.
Scope
In Scope:
- `evaluation/__init__.py` -- public API re-exports
- `evaluation/green_agent.py` -- Protocol, dataclasses, RandomPolicy, evaluate()
- `tests/test_evaluation.py` -- unit + integration tests
Out of Scope:
- Modifications to `server/sql_environment.py` or `models.py`
- CLI entry point (future feature)
- Remote / WebSocket evaluation
- Plotting or visualization
1a. Execution Status
Progress: 4/4 steps complete
Current Step: All planned implementation steps are complete
Last Updated: 2026-03-28T00:04:03Z
Latest Result: Fully Successful (Step 2.2 complete)
Blockers: None
1b. Risk Assessment
Risk Tier: [x] Low | [ ] Medium | [ ] High
High-Risk Indicators Present: (check all that apply if tier is High)
- Touches authentication or authorization logic
- Handles payment processing or financial data
- Manages secrets, API keys, or credentials
- Processes untrusted user input (file uploads, external APIs)
- Modifies privilege/permission systems
Security Review Required: [ ] Yes (if High) | [x] No
Justification: Pure additive feature. Client-side evaluation loop that reads from the existing environment API. No security, auth, or data mutation concerns.
2. Change Manifest
Files to Create
| File | Purpose |
|---|---|
| `evaluation/__init__.py` | Public API: re-exports Policy, RandomPolicy, EpisodeResult, EvaluationResult, evaluate |
| `evaluation/green_agent.py` | Core evaluation logic: Protocol, dataclasses, RandomPolicy, evaluate() |
| `tests/test_evaluation.py` | Unit tests for types + RandomPolicy, integration test with SQLEnvironment |
Files to Modify
None.
Files to Delete
None.
3. Interface Specifications
New Types
```python
# Location: evaluation/green_agent.py
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@runtime_checkable
class Policy(Protocol):
    """Interface for any evaluation policy.

    Any object with a select_action method matching this signature
    is a valid policy (structural subtyping / duck typing).
    """

    def select_action(self, observation: SQLObservation) -> SQLAction:
        """Choose an action given an observation."""
        ...


@dataclass(frozen=True)
class EpisodeResult:
    """Per-episode evaluation metrics."""

    episode_index: int        # 0-based episode number
    correct: bool             # Whether ANSWER action matched gold
    total_reward: float       # Cumulative reward for the episode
    steps: int                # Number of steps taken
    error: str | None = None  # Error message if episode failed


@dataclass(frozen=True)
class EvaluationResult:
    """Aggregate evaluation metrics with per-episode breakdown."""

    success_rate: float            # Fraction of correct episodes [0.0, 1.0]
    avg_reward: float              # Mean total_reward across episodes
    avg_steps: float               # Mean steps across episodes
    n_episodes: int                # Number of episodes attempted
    n_completed: int               # Episodes that ran to completion (no error)
    episodes: list[EpisodeResult]  # Per-episode breakdown
```
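Because `Policy` is a `runtime_checkable` protocol, any object with a `select_action` method passes an `isinstance` check without inheriting from anything. A minimal sketch of that structural check -- the stub observation/action types below are illustrative stand-ins, not the project's real `SQLObservation`/`SQLAction`, and note that `runtime_checkable` only verifies the method exists, not its signature:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class StubObservation:  # stand-in for SQLObservation
    prompt: str


@dataclass
class StubAction:       # stand-in for SQLAction
    kind: str
    value: str = ""


@runtime_checkable
class Policy(Protocol):
    def select_action(self, observation): ...


class AlwaysAnswerPolicy:
    """Custom policy -- satisfies Policy structurally, no inheritance."""

    def select_action(self, observation):
        return StubAction(kind="ANSWER", value="unknown")


policy = AlwaysAnswerPolicy()
assert isinstance(policy, Policy)  # structural check via runtime_checkable
```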
New Functions
```python
# Location: evaluation/green_agent.py

class RandomPolicy:
    """Built-in random baseline policy.

    Selects random action types and arguments. Deterministic given a seed.
    """

    def __init__(self, seed: int | None = None) -> None:
        """
        Args:
            seed: Random seed for reproducibility. None = non-deterministic.
        """

    def select_action(self, observation: SQLObservation) -> SQLAction:
        """Pick a random action based on current observation.

        Strategy:
        - If budget_remaining > 1: randomly choose DESCRIBE, SAMPLE, or QUERY
        - If budget_remaining == 1: always ANSWER with a random guess
        - DESCRIBE/SAMPLE: pick a random table from schema_info
        - QUERY: generate a simple SELECT * FROM <table> LIMIT 5
        - ANSWER: pick a random value from last result or "unknown"

        Args:
            observation: Current environment observation

        Returns:
            A random SQLAction
        """


def evaluate(
    env: SQLEnvironment,
    policy: Policy,
    n_episodes: int = 100,
    *,
    seed: int | None = None,
    progress_callback: Callable[[int, int], None] | None = None,
) -> EvaluationResult:
    """Run automated evaluation of a policy over multiple episodes.

    Collects results incrementally -- if an episode fails, it is recorded
    as an error and evaluation continues with the next episode.

    Args:
        env: The SQLEnvironment instance to evaluate against.
        policy: Any object satisfying the Policy protocol.
        n_episodes: Number of episodes to run (0 returns empty result).
        seed: Base seed for reproducibility. Episode i uses seed+i.
        progress_callback: Optional callback(current, total) for progress.

    Returns:
        EvaluationResult with aggregate metrics and per-episode breakdown.

    Raises:
        ValueError: If n_episodes < 0.
    """
```
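The budget rule in RandomPolicy's strategy can be sketched as follows. This is a hedged, self-contained illustration using stub types, since the real implementation depends on `SQLObservation`/`SQLAction` fields not shown here; the class and field names below are assumptions for the sketch only:

```python
import random
from dataclasses import dataclass, field


@dataclass
class StubObservation:  # stand-in for SQLObservation
    budget_remaining: int
    tables: list = field(default_factory=lambda: ["users", "orders"])


@dataclass
class StubAction:       # stand-in for SQLAction
    action_type: str
    argument: str = ""


class RandomPolicySketch:
    def __init__(self, seed=None):
        # Per-instance RNG so a fixed seed gives reproducible action streams.
        self._rng = random.Random(seed)

    def select_action(self, observation):
        if observation.budget_remaining <= 1:
            # Last step: always commit to an answer.
            return StubAction("ANSWER", "unknown")
        table = self._rng.choice(observation.tables)
        kind = self._rng.choice(["DESCRIBE", "SAMPLE", "QUERY"])
        if kind == "QUERY":
            return StubAction("QUERY", f"SELECT * FROM {table} LIMIT 5")
        return StubAction(kind, table)
```

Keeping the RNG on the instance (rather than the `random` module globals) is what makes two policies with the same seed produce identical action sequences.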
4. Data Flow
Primary Flow
1. evaluate(env, policy, n_episodes=100, seed=42)
- Input: environment, policy, episode count, optional seed
2. For each episode i in range(n_episodes):
a. obs = env.reset(seed=seed + i if seed is not None else None)
b. While not obs.done:
- action = policy.select_action(obs)
- obs = env.step(action)
- Accumulate reward
c. Record EpisodeResult(correct=..., total_reward=..., steps=...)
d. Call progress_callback(i+1, n_episodes) if provided
3. Aggregate results:
- success_rate = sum(correct) / n_completed
- avg_reward = mean(total_reward) across completed
- avg_steps = mean(steps) across completed
4. Return EvaluationResult
Alternative Flows
When n_episodes=0:
1. Return EvaluationResult(success_rate=0.0, avg_reward=0.0,
avg_steps=0.0, n_episodes=0, n_completed=0, episodes=[])
When episode raises exception:
1. Catch exception in the episode loop
2. Record EpisodeResult(correct=False, total_reward=0.0, steps=0,
error=str(exception))
3. Continue to next episode
When env.reset() fails:
1. Catch exception
2. Record EpisodeResult with error, steps=0
3. Continue to next episode
5. Error Handling
Error Types
| Error | When | Handling |
|---|---|---|
| `ValueError` | `n_episodes < 0` | Raise immediately |
| Exception during `env.reset()` | DB not found, bad questions file | Catch, record as failed episode, continue |
| Exception during `policy.select_action()` | Policy bug | Catch, record as failed episode, continue |
| Exception during `env.step()` | Environment bug | Catch, record as failed episode, continue |
Error Handling Strategy
```python
# Pattern: incremental collection with per-episode error isolation
for i in range(n_episodes):
    episode_seed = seed + i if seed is not None else None
    try:
        obs = env.reset(seed=episode_seed)
        total_reward = 0.0
        steps = 0
        while not obs.done:
            action = policy.select_action(obs)
            obs = env.step(action)
            total_reward += obs.reward or 0.0
            steps += 1
        episodes.append(EpisodeResult(
            episode_index=i,
            correct=_check_correct(obs),
            total_reward=total_reward,
            steps=steps,
        ))
    except Exception as exc:
        episodes.append(EpisodeResult(
            episode_index=i,
            correct=False,
            total_reward=0.0,
            steps=0,
            error=str(exc),
        ))
```
Retry Strategy
| Operation | Retry? | Strategy |
|---|---|---|
| Episode evaluation | No | Record error, move to next episode |
| Environment reset | No | Record error, move to next episode |
6. Slice Plan (What we will ship, in order)
Slice S1 -- Types, Protocol, and RandomPolicy
Value: Establishes the evaluation interface and provides a usable random baseline
User-visible change: Yes -- users can instantiate RandomPolicy and call select_action
Interfaces introduced/changed: Policy protocol, EpisodeResult, EvaluationResult, RandomPolicy
Rollback safety: Purely additive -- new files only, no changes to existing code
Slice S2 -- evaluate() Function and Integration Test
Value: Users can run evaluate(env, random_policy, n_episodes=100) and get structured metrics
User-visible change: Yes -- the core capability is now available
Interfaces introduced/changed: evaluate() function
Rollback safety: Purely additive -- extends S1 files, no changes to existing code
7. Implementation Steps
VERIFICATION NOTE: Test criteria for each step are defined in VERIFICATION_SPEC.md. The verification-planner (separate agent) generated independent test criteria. Run the tests specified there after implementing each step.
Step 1.1: Types and Protocol
Slice: S1
Goal: Define the Policy protocol, EpisodeResult dataclass, and EvaluationResult dataclass.
Files:
- `evaluation/__init__.py` - create - empty init with re-exports
- `evaluation/green_agent.py` - create - Protocol + dataclasses (no functions yet)
Interface Changes:
- New: `Policy` protocol with `select_action(observation: SQLObservation) -> SQLAction`
- New: `EpisodeResult` frozen dataclass
- New: `EvaluationResult` frozen dataclass
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: [x] Low | [ ] Medium | [ ] High
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: ✅ Completed
Completed: 2026-03-27T23:51:09Z
Changes Made:
- Created `evaluation/__init__.py` with public re-exports for `Policy`, `EpisodeResult`, and `EvaluationResult`.
- Created `evaluation/green_agent.py` with the `Policy` runtime-checkable protocol and frozen `EpisodeResult`/`EvaluationResult` dataclasses.
Result:
- Outcome: ✅
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/ -v`
  - Result: 100 passed, 1 skipped
  - Scope: full project regression run after adding new evaluation types
- Tests run: `uv run --with pytest pytest tests/ -v`
- Notes:
  - Dataclass and protocol scaffolding is additive and isolated to a new package.
  - `pytest` is not installed in the project environment yet, so verification used `uv run --with pytest` for this step.
  - Import fallback mirrors existing package-vs-standalone test collection behavior in the repo.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Types are defined and importable from `evaluation`
Step 1.2: RandomPolicy Implementation
Slice: S1
Goal: Implement the RandomPolicy class that selects random actions based on observation state.
Files:
- `evaluation/green_agent.py` - modify - add RandomPolicy class
Interface Changes:
- New: `RandomPolicy.__init__(seed: int | None = None)`
- New: `RandomPolicy.select_action(observation: SQLObservation) -> SQLAction`
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: [x] Low | [ ] Medium | [ ] High
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: ✅ Completed
Completed: 2026-03-27T23:55:10Z
Changes Made:
- Implemented `RandomPolicy` in `evaluation/green_agent.py` with seed-controlled randomness, budget-aware action selection, schema table parsing, and row-based answer candidate extraction.
- Updated `evaluation/__init__.py` to re-export `RandomPolicy` from the public evaluation API.
- Added `tests/test_evaluation.py` with focused RandomPolicy behavior tests (exploration vs answer mode, determinism, action type coverage, and answer extraction).
Result:
- Outcome: ✅
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/test_evaluation.py -v`
  - Result: 6 passed
  - Scope: RandomPolicy unit coverage for F005 Step 1.2
  - Command: `uv run --with pytest pytest tests/ -v`
  - Result: 106 passed, 1 skipped
  - Scope: Full regression after RandomPolicy implementation
- Tests run: `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- Notes:
  - RandomPolicy always explores with DESCRIBE/SAMPLE/QUERY while budget remains and forces ANSWER on the last step.
  - Schema parsing intentionally handles both `- table` and `- table: columns...` observation formats.
  - Verification commands in the spec referenced `tests/unit/...`; this repo uses a flat `tests/` layout, so tests were added in `tests/test_evaluation.py`.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- RandomPolicy is implemented and exported from the public `evaluation` API
- Ready to implement `evaluate()` using a per-episode loop and error isolation
Step 2.1: evaluate() Function
Slice: S2
Goal: Implement the core evaluate() function with incremental collection and error isolation.
Files:
- `evaluation/green_agent.py` - modify - add evaluate() function
- `evaluation/__init__.py` - modify - add evaluate to re-exports
Interface Changes:
- New: `evaluate(env, policy, n_episodes, *, seed, progress_callback) -> EvaluationResult`
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: [x] Low | [ ] Medium | [ ] High
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: ✅ Completed
Completed: 2026-03-27T23:59:28Z
Changes Made:
- Added `evaluate()` to `evaluation/green_agent.py` with a per-episode reset/step loop, seed+i reset behavior, progress callback support, and per-episode error isolation.
- Added `evaluate` to the `evaluation/__init__.py` public exports.
- Extended `tests/test_evaluation.py` with unit coverage for the evaluate happy path, zero/negative episodes, seed propagation, exception handling, aggregate calculations, and progress callback behavior.
Result:
- Outcome: ✅
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/test_evaluation.py -v`
  - Result: 14 passed
  - Scope: RandomPolicy + evaluate() unit coverage for F005 Step 2.1
  - Command: `uv run --with pytest pytest tests/ -v`
  - Result: 114 passed, 1 skipped
  - Scope: Full regression after evaluate() implementation
- Tests run: `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- Notes:
  - evaluate() computes aggregates using completed episodes only (`error is None`), matching the error-isolation behavior in the spec data flow.
  - Progress callback is invoked once per attempted episode, including episodes that fail.
  - Repository environment still does not include pytest by default, so verification used `uv run --with pytest`.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- evaluate() is implemented, exported, and covered by focused unit tests
- Next step should add/expand integration coverage with a real `SQLEnvironment` evaluation run
Step 2.2: Integration Test with SQLEnvironment
Slice: S2
Goal: Write an integration test that runs evaluate() with RandomPolicy against a real SQLEnvironment.
Files:
- `tests/test_evaluation.py` - modify - add integration tests with a real env (unit tests for types, RandomPolicy, and evaluate() were already added in Steps 1.2 and 2.1)
Interface Changes: None (test-only step).
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: [x] Low | [ ] Medium | [ ] High
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-28T00:04:03Z
Changes Made:
- Added a `_build_sql_environment()` test helper in `tests/test_evaluation.py` to spin up a real SQLite-backed `SQLEnvironment` with a deterministic question fixture.
- Added `test_evaluate_integration_with_sql_environment`, validating end-to-end `evaluate()` execution over 10 episodes with aggregate-metric consistency checks.
- Added `test_evaluate_integration_is_deterministic_with_seeds`, validating deterministic full-result equality when both policy and environment seeds are fixed.
Result:
- Outcome: Fully Successful
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/test_evaluation.py -v`
  - Result: 16 passed
  - Scope: evaluation unit + integration coverage including the real SQLEnvironment flow
  - Command: `uv run --with pytest pytest tests/ -v`
  - Result: 116 passed, 1 skipped
  - Scope: full project regression after adding integration coverage
- Tests run: `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- Notes:
  - Integration tests were implemented in `tests/test_evaluation.py` to match this repository's flat test layout.
  - Verifier gate approved finalization in MVP mode after test evidence review.
  - Reviewer auto-step was skipped by policy because the risk tier is Low, tests passed, and no security-sensitive surfaces changed.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- All implementation steps are complete and verification gate passed.
8. Rollout Considerations
Feature Flags
- Required: No
- Flag name: N/A
Migration
- Data migration needed: No
- Migration strategy: N/A
Rollback Plan
Delete the evaluation/ directory. No other code references it.
9. Execution Tracking
All execution state is tracked within this document:
- Section 1a: Overall progress summary
- Section 7: Per-step completion details, test results, and handoff context
- FEATURES.json: Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
- Git history: Full audit trail of changes to this file
The implementing agent updates this document after each step and keeps the matching FEATURES.json entry in sync during implementation/finalization. Humans can monitor progress by:
- Checking Section 1a for summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's `progress` and `status` fields in `FEATURES.json`
- Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history
9a. Slice Completion Protocol
After all steps in a slice pass verification:
1. Run verifier subagent for spec compliance
   - Validates against VERIFICATION_SPEC.md criteria
   - Ensures no TODOs or incomplete work in slice
2. Run compound-engineer subagent to extract learnings
   - Mandatory invocation after every slice completion
   - Updates CLAUDE.md Learnings section (if durable patterns found)
   - May exit with "no update needed" (valid for routine work)
3. Commit the slice changes
   - Follow commit message format in CLAUDE.md
   - Each slice gets its own atomic commit
4. Continue to next slice (if more slices remain)
   - Or proceed to final verification if all slices complete
Note: PR creation happens only after ALL slices are complete. Use /commit-push-pr manually when ready.
10. User Value Summary
Status: Generated
What Users Can Now Do
Run automated evaluation of any policy over N episodes with evaluate(env, policy, n_episodes=100) and get structured metrics including success rate, average reward, average steps, and per-episode breakdown.
How to Access/Test
```python
from evaluation import evaluate, RandomPolicy
from server.sql_environment import SQLEnvironment

env = SQLEnvironment(questions_path="...", db_dir="...", tokenizer=tokenizer)
policy = RandomPolicy(seed=42)
result = evaluate(env, policy, n_episodes=10, seed=42)

print(f"Success rate: {result.success_rate:.1%}")
print(f"Avg reward: {result.avg_reward:.3f}")
```
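For long runs, the `progress_callback` parameter of `evaluate()` can surface progress. The callback below is a hypothetical example, not part of the spec'd API -- any callable accepting `(current, total)` works:

```python
def format_progress(current: int, total: int) -> str:
    """Render a simple progress line; invoked once per attempted episode."""
    pct = 100 * current // total
    return f"Evaluating: {current}/{total} ({pct}%)"


def print_progress(current: int, total: int) -> None:
    # Carriage return keeps the progress line in place in a terminal.
    print("\r" + format_progress(current, total), end="", flush=True)

# Wiring (sketch):
#   evaluate(env, policy, n_episodes=100, progress_callback=print_progress)
```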
Demo
- Command: `uv run python -c "from evaluation import evaluate, RandomPolicy; ..."`
Release Notes Snippet
Added automated evaluation wrapper with built-in random baseline policy for benchmarking agent performance.
11. PR Contract (Auto-Generated by autocode-next-step)
Status: Generated
PR Title
feat(evaluation): complete green agent wrapper integration and finalization
PR Summary
- Add deterministic integration coverage for `evaluate()` against a real `SQLEnvironment` fixture.
- Finalize F005 with full regression evidence, verifier approval, and archived behavior documentation.
- Capture durable learnings under `docs/learnings/` for evaluation patterns and deterministic testing.
Verification
- `uv run --with pytest pytest tests/test_evaluation.py -v`
- `uv run --with pytest pytest tests/ -v`
Follow-up
All steps completed. PR Created: https://github.com/hjerpe/sql-env/pull/10
Stop Conditions (When to Split This Spec)
Stop and create a new IMPLEMENTATION_SPEC if:
- A step requires touching more than 3 files in unrelated areas
- You need to introduce multiple new abstractions "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices
When splitting, ensure the current slice ends in a merged, stable state.
Human Checkpoint
Before handing to AI agent:
- Interface specifications are complete
- Data flow is accurate
- Error handling is specified
- Implementation order makes sense
- VERIFICATION_SPEC.md has been generated
Questions:
- Any remaining concerns?
- Anything agent should know?
Handoff Notes
For the implementing AI agent:
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Specification completed: 2026-03-27
Approved by: --
Verification spec: VERIFICATION_SPEC.md
Verification input: F005-VERIFICATION_INPUT.json
Target agent: Claude Code