# Research Summary

**Project:** SQLEnv
**Change:** F005: Green Agent Wrapper (automated evaluation)
**Date:** 2026-03-27
**Status:** Draft

---

## 1. Change Overview

### What We're Changing

Create an automated evaluation wrapper that runs N episodes with a given policy and reports metrics (success_rate, avg_reward, avg_steps). Includes a built-in random baseline policy. Follows the OpenEnv Green Agent pattern.

### Why We're Changing It

Required by competition evaluation criteria. Enables training comparison: "random policy gets 5% success, trained model gets 40%." Single command, structured output.

### Success Criteria

- Single function call: `evaluate(n_episodes=100)` returns clean metrics dict
- Built-in random policy for instant baseline comparison
- Results include per-episode breakdown for analysis
- Doesn't crash partway through and lose results

---

## 2. System Context

### Current Behavior

No evaluation wrapper exists. Manual testing only via `tests/test_smoke.py`.

### Architecture Context

```
evaluate(env, policy, n_episodes)
├── for each episode:
│   ├── env.reset()
│   ├── while not done: policy.select_action(obs) → env.step(action)
│   └── collect {correct, total_reward, steps}
└── aggregate → {success_rate, avg_reward, avg_steps, per_episode}
```

Client-side component: uses environment through public `reset()`/`step()` API.

### Entry Points

| Entry Point | Trigger | Current Flow |
|-------------|---------|--------------|
| `evaluate()` | Training script or CLI | **To be created** |
| `RandomPolicy.select_action()` | Called by evaluate loop | **To be created** |

### Data Flow

| Data | Source | Shape/Type | Destination |
|------|--------|------------|-------------|
| Observation | `env.reset()` / `env.step()` | `SQLObservation` | Policy |
| Action | Policy | `SQLAction` | `env.step()` |
| Episode results | Loop | `list[EpisodeResult]` | Aggregation |
| Metrics | Aggregation | `dict` | Caller |

---

## 3. Dependencies

### Code We Depend On

| Dependency | What We Use | Risk if Changed |
|------------|-------------|-----------------|
| `models.py:SQLAction, SQLObservation` | Action/observation types | Stable (F001 complete) |
| `sql_environment.py:SQLEnvironment` | `reset()`, `step()` API | Stable (F001 complete) |

### Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
|-----------|-----------------|---------------------|
| F006 (GRPO Training) | Baseline comparison + evaluation | Provides metrics API |
| F007 (HF Submission) | Demo results for blog | Produces numbers |

---

## 4. Risks & Edge Cases

### Identified Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Evaluation crashes partway | Medium | Loses results | Collect incrementally, return partial on error |
| No progress indicator | Medium | User thinks hung | Optional tqdm or callback |

### Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
|-----------|------------------|-------------------|
| n_episodes=0 | N/A | Return empty metrics |
| Policy exception mid-episode | N/A | Catch, record as failed, continue |
| Environment reset fails | N/A | Skip, log warning, continue |

### Invariants to Preserve

- [ ] Evaluation is read-only: doesn't modify environment between episodes
- [ ] Random policy is deterministic given a seed
- [ ] Metrics match manual calculation

---

## 4b. Code Shape & Design Target

### Target Shape

| Component | Purpose | Why This Boundary |
|-----------|---------|-------------------|
| `evaluate(env, policy, n_episodes, seed)` | Main entry | Single public function |
| `RandomPolicy` | Built-in random baseline | Needed for comparison |
| `Policy` (Protocol) | Type hint for custom policies | Duck typing |
| `EpisodeResult` (dataclass) | Per-episode metrics | Clean structure |

### Abstraction Level

- **Recommendation:** One module `green_agent.py` at project root. Function + dataclass + random policy class.

### Anti-Patterns to Avoid

- Don't create elaborate policy class hierarchy
- Don't couple to WebSocket transport; work with local env directly
- Don't add visualization/plotting (MVP)

---

## 5. Constraints

| Constraint | Requirement | Notes |
|------------|-------------|-------|
| No new heavy deps | tqdm optional | Keep lean |
| Works with local env | Direct SQLEnvironment | Primary use case |
| Seedable | Reproducible results | Random policy + env seed |

---

## 6. Open Questions

| Question | Why It Matters | Who Can Answer |
|----------|----------------|----------------|
| Module location: `green_agent.py` at root? | Naming | Recommend root, matches concept doc |
| Should RandomPolicy use schema info for smarter random? | Baseline quality | Recommend simple random |

---

## 7. Context Sources

| Source | Type | Notes |
|--------|------|-------|
| `docs_draft/SQLEnv_Concept_v1.md` Appendix C | Doc | SQLGreenAgent sketch |
| `server/sql_environment.py` | Code | reset()/step() API |
| `models.py` | Code | SQLAction, SQLObservation |