sql_env/specs/F005-DEMO.md

Feature Demo: F005 — Green Agent Wrapper

Generated: 2026-03-28T00:10:42Z
Context source: spec + discovery only (implementation not read)
Feature entry: FEATURES.json #F005


What This Feature Does

This feature lets you evaluate a policy over many episodes in one call and get structured results back, instead of manually stepping episodes and aggregating outcomes yourself. It is designed to answer practical questions like: “How does policy X perform over 100 episodes?”

From a user perspective, the key value is fast, repeatable comparison. You can use a built-in random baseline, run seeded evaluations for deterministic comparisons, and inspect both aggregate metrics and per-episode outcomes without losing the whole run if one episode fails.
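The shape of the result can be sketched as follows. The field names (n_episodes, n_completed, success_rate, avg_reward, avg_steps, episodes) are mirrored from the real API's output shown later in this demo; the dataclass layout and the aggregation helper are illustrative assumptions, not the project's actual definitions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EpisodeResult:
    # Hypothetical per-episode record; these field names are illustrative.
    success: bool
    reward: float
    steps: int

@dataclass
class EvaluationResult:
    # Aggregate field names mirror those printed by the real API in this demo.
    n_episodes: int
    n_completed: int
    success_rate: float
    avg_reward: float
    avg_steps: float
    episodes: List[EpisodeResult] = field(default_factory=list)

def aggregate(episodes: List[EpisodeResult], n_requested: int) -> EvaluationResult:
    """Toy aggregation showing how per-episode outcomes roll up into aggregates."""
    n = len(episodes)
    if n == 0:
        # Matches the documented zero-episode behavior: a clean zero-valued result.
        return EvaluationResult(n_requested, 0, 0.0, 0.0, 0.0, [])
    return EvaluationResult(
        n_episodes=n_requested,
        n_completed=n,
        success_rate=sum(e.success for e in episodes) / n,
        avg_reward=sum(e.reward for e in episodes) / n,
        avg_steps=sum(e.steps for e in episodes) / n,
        episodes=list(episodes),
    )
```

For example, two episodes with rewards 1.0 and 0.0 (one success) aggregate to success_rate 0.5, avg_reward 0.5.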


What Is Already Proven

Verified in This Demo Run

  • Public evaluation API imports successfully (RandomPolicy, evaluate, result types).
  • evaluate(..., n_episodes=0) returns a valid zero-valued result object.
  • Integration/determinism verification tests passed locally against real SQLEnvironment flow (2 passed).
  • Progress-callback verification passed locally (1 passed).
  • Full F005 evaluation test file passed locally (16 passed).

Previously Verified Evidence

  • specs/FEATURES.json records approved verification evidence for F005:
    • Command: uv run --with pytest pytest tests/test_evaluation.py -v
    • Result: 16 passed
    • Verifier result: approved
    • Timestamp: 2026-03-28T00:04:03Z
  • specs/F005-IMPLEMENTATION_SPEC.md Step 2.2 records full-project regression evidence (116 passed, 1 skipped) after integration coverage was added.

What Still Needs User Verification

None.


Quickstart / Verification Steps

Run these commands to see the feature in action:

uv sync
uv run python -c "from evaluation import evaluate; r=evaluate(None, None, n_episodes=0); print(r)"
uv run --with pytest pytest tests/test_evaluation.py -v

Prerequisite: run from project root with dependencies available via uv.


Live Local Proof

Load the evaluation API

This confirms the user-facing evaluation surface is available from the package.

uv run python -c "from evaluation import RandomPolicy, evaluate, EpisodeResult, EvaluationResult; print('evaluation_api_import_ok')"
evaluation_api_import_ok

Notice that all primary public symbols for F005 import cleanly.

Run evaluate() in zero-episode mode

This demonstrates a documented boundary behavior of the evaluation call.

uv run python -c "from evaluation import evaluate; r=evaluate(None, None, n_episodes=0); print(f'n_episodes={r.n_episodes} n_completed={r.n_completed} success_rate={r.success_rate} avg_reward={r.avg_reward} avg_steps={r.avg_steps} episodes={len(r.episodes)}')"
n_episodes=0 n_completed=0 success_rate=0.0 avg_reward=0.0 avg_steps=0.0 episodes=0

Notice that the function returns a clean structured result instead of failing on this edge input.

Verify real-environment integration and seeded determinism

This checks the core happy-path flow with real environment integration and repeatable seeded behavior.

uv run --with pytest pytest tests/test_evaluation.py -v -k "test_evaluate_integration_with_sql_environment or test_evaluate_integration_is_deterministic_with_seeds"
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpxjssag/bin/python
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F005-green-agent-wrapper
configfile: pyproject.toml
plugins: anyio-4.13.0
collecting ... collected 16 items / 14 deselected / 2 selected

tests/test_evaluation.py::test_evaluate_integration_with_sql_environment PASSED [ 50%]
tests/test_evaluation.py::test_evaluate_integration_is_deterministic_with_seeds PASSED [100%]

======================= 2 passed, 14 deselected in 4.29s =======================

Notice both integration behavior and seed determinism passed in this run.
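The determinism property verified above can be illustrated with a self-contained toy (the real API's seeding parameter name is not confirmed here; `random.Random` stands in for whatever RNG the environment and policy actually use):

```python
import random

def toy_rollout(seed: int, n_episodes: int):
    """Toy stand-in for a seeded evaluation: with the same seed,
    episode outcomes are identical from run to run."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_episodes)]

run_a = toy_rollout(seed=42, n_episodes=5)
run_b = toy_rollout(seed=42, n_episodes=5)
assert run_a == run_b  # same seed -> identical per-episode outcomes
```

This is what makes seeded evaluations comparable across policies: any difference in results is attributable to the policy, not to RNG drift.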


Existing Evidence

  • Verification spec reference: specs/F005-VERIFICATION_SPEC.md
  • Implementation-step evidence: specs/F005-IMPLEMENTATION_SPEC.md (Step 2.2)
  • Feature registry evidence: specs/FEATURES.json, features[id=F005].verification_evidence

Manual Verification Checklist

No additional manual verification required.


Edge Cases Exercised

Zero and negative episode counts

uv run --with pytest pytest tests/test_evaluation.py -v -k "test_evaluate_negative_episodes_raises or test_evaluate_zero_episodes_returns_zero_values"
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpBSdLqD/bin/python
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F005-green-agent-wrapper
configfile: pyproject.toml
plugins: anyio-4.13.0
collecting ... collected 16 items / 14 deselected / 2 selected

tests/test_evaluation.py::test_evaluate_zero_episodes_returns_zero_values PASSED [ 50%]
tests/test_evaluation.py::test_evaluate_negative_episodes_raises PASSED  [100%]

======================= 2 passed, 14 deselected in 4.02s =======================

This matters because F005 must handle both boundary (0) and invalid (-1) episode requests predictably.
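The guard being tested can be sketched as a minimal validator. The exception type (ValueError) and the function name are assumptions for illustration; the tests above only establish that zero returns a zero-valued result and a negative count raises.

```python
def validate_n_episodes(n_episodes: int) -> None:
    # Hypothetical guard; the real implementation's exception type is not
    # confirmed from the spec, so ValueError is assumed here.
    if n_episodes < 0:
        raise ValueError(f"n_episodes must be >= 0, got {n_episodes}")

validate_n_episodes(0)  # boundary: allowed, yields the zero-valued result
try:
    validate_n_episodes(-1)
except ValueError:
    pass  # invalid input is rejected rather than silently clamped
```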

Progress callback behavior during evaluation

uv run --with pytest pytest tests/test_evaluation.py -v -k "test_evaluate_progress_callback_receives_episode_progress"
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmp164LzQ/bin/python
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F005-green-agent-wrapper
configfile: pyproject.toml
plugins: anyio-4.13.0
collecting ... collected 16 items / 15 deselected / 1 selected

tests/test_evaluation.py::test_evaluate_progress_callback_receives_episode_progress PASSED [100%]

======================= 1 passed, 15 deselected in 3.78s =======================

This matters because progress visibility was an explicit anti-frustration requirement.
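The callback pattern being verified can be sketched as a toy evaluation loop. The parameter name and the (completed, total) callback signature are assumptions for illustration, not the project's confirmed API.

```python
from typing import Callable, Optional

def run_episodes(n_episodes: int,
                 on_progress: Optional[Callable[[int, int], None]] = None):
    """Toy loop: invoke the callback after each episode with (done, total).
    Name and signature are hypothetical, for illustration only."""
    for i in range(n_episodes):
        # ... one episode would be stepped and scored here ...
        if on_progress is not None:
            on_progress(i + 1, n_episodes)

seen = []
run_episodes(3, on_progress=lambda done, total: seen.append((done, total)))
# seen == [(1, 3), (2, 3), (3, 3)]
```

A per-episode callback like this is what lets a caller render a progress bar or log ETA during a long evaluation instead of waiting blind.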


Test Evidence (Optional)

Supplementary proof that the feature works correctly across all scenarios. The Live Demo section above shows usage surfaces; this section shows broader verification coverage.

Test Suite                                         Tests   Status
F005 evaluation tests (tests/test_evaluation.py)   16      All passed

Representative command:

uv run --with pytest pytest tests/test_evaluation.py -v

Representative output summary:

============================== 16 passed in 4.05s ==============================

Feature Links

  • Implementation spec: specs/F005-IMPLEMENTATION_SPEC.md
  • Verification spec: specs/F005-VERIFICATION_SPEC.md

Demo generated by feature-demo agent. Re-run with /feature-demo F005 to refresh.