# Research Summary

**Project:** SQLEnv
**Change:** F005 — Green Agent Wrapper (automated evaluation)
**Date:** 2026-03-27
**Status:** Draft

---

## 1. Change Overview

### What We're Changing
Create an automated evaluation wrapper that runs N episodes with a given policy and reports aggregate metrics (`success_rate`, `avg_reward`, `avg_steps`). The wrapper ships with a built-in random baseline policy and follows the OpenEnv Green Agent pattern.

### Why We're Changing It
Automated evaluation is required by the competition evaluation criteria. It also enables before/after training comparisons of the form "random policy gets 5% success, trained model gets 40%", produced by a single command with structured output.

### Success Criteria
- Single function call, e.g. `evaluate(env, policy, n_episodes=100)`, returns a clean metrics dict
- Built-in random policy for instant baseline comparison
- Results include per-episode breakdown for analysis
- A failure partway through a run does not discard the results already collected

---

## 2. System Context

### Current Behavior
No evaluation wrapper exists. Manual testing only via `tests/test_smoke.py`.

### Architecture Context
```
evaluate(env, policy, n_episodes)
  ├── for each episode:
  │   ├── env.reset()
  │   ├── while not done: policy.select_action(obs) → env.step(action)
  │   └── collect {correct, total_reward, steps}
  └── aggregate → {success_rate, avg_reward, avg_steps, per_episode}
```

This is a client-side component: it uses the environment only through its public `reset()`/`step()` API.
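The loop in the diagram can be sketched as follows. This is a minimal illustration, not the planned implementation: `ToyEnv`, `ToyObservation`, and `AlwaysAnswer` are hypothetical stand-ins, and the `done`, `reward`, and `correct` attributes are assumptions about what the real `SQLObservation` exposes.

```python
from dataclasses import dataclass


@dataclass
class ToyObservation:
    """Hypothetical stand-in for SQLObservation; the real fields may differ."""
    done: bool
    reward: float
    correct: bool = False


class ToyEnv:
    """Hypothetical stand-in for SQLEnvironment: every episode ends after 3 steps."""

    def reset(self) -> ToyObservation:
        self._steps = 0
        return ToyObservation(done=False, reward=0.0)

    def step(self, action: str) -> ToyObservation:
        self._steps += 1
        done = self._steps >= 3
        # Toy rule: the episode counts as correct if the final action is "answer".
        correct = done and action == "answer"
        return ToyObservation(done=done, reward=1.0 if correct else 0.0, correct=correct)


class AlwaysAnswer:
    """Hypothetical policy that always emits the same action."""

    def select_action(self, obs: ToyObservation) -> str:
        return "answer"


def evaluate(env, policy, n_episodes: int) -> dict:
    per_episode = []
    for _ in range(n_episodes):
        obs = env.reset()
        total_reward, steps = 0.0, 0
        while not obs.done:
            obs = env.step(policy.select_action(obs))
            total_reward += obs.reward
            steps += 1
        per_episode.append(
            {"correct": obs.correct, "total_reward": total_reward, "steps": steps}
        )
    n = len(per_episode)
    return {
        # Guard against division by zero when n_episodes == 0.
        "success_rate": sum(e["correct"] for e in per_episode) / n if n else 0.0,
        "avg_reward": sum(e["total_reward"] for e in per_episode) / n if n else 0.0,
        "avg_steps": sum(e["steps"] for e in per_episode) / n if n else 0.0,
        "per_episode": per_episode,
    }


metrics = evaluate(ToyEnv(), AlwaysAnswer(), n_episodes=4)
```

Because the loop touches only `reset()`, `step()`, and `select_action()`, the same function would work unchanged against any environment and policy that honor those three calls.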

### Entry Points

| Entry Point | Trigger | Current Flow |
|-------------|---------|--------------|
| `evaluate()` | Training script or CLI | **To be created** |
| `RandomPolicy.select_action()` | Called by evaluate loop | **To be created** |

### Data Flow

| Data | Source | Shape/Type | Destination |
|------|--------|------------|-------------|
| Observation | `env.reset()` / `env.step()` | `SQLObservation` | Policy |
| Action | Policy | `SQLAction` | `env.step()` |
| Episode results | Loop | `list[EpisodeResult]` | Aggregation |
| Metrics | Aggregation | `dict` | Caller |
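The last two rows of the table (episode results into aggregation, aggregation into a metrics dict) might look like the sketch below. The `EpisodeResult` field names are taken from the architecture diagram; the `aggregate` helper name and the zero-valued fallback for an empty list are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EpisodeResult:
    correct: bool
    total_reward: float
    steps: int


def aggregate(results: list[EpisodeResult]) -> dict:
    """Reduce per-episode results to the summary metrics (hypothetical helper)."""
    n = len(results)
    if n == 0:
        # Assumed convention for an empty run: zeroed aggregates, empty breakdown.
        return {"success_rate": 0.0, "avg_reward": 0.0, "avg_steps": 0.0, "per_episode": []}
    return {
        "success_rate": sum(r.correct for r in results) / n,
        "avg_reward": sum(r.total_reward for r in results) / n,
        "avg_steps": sum(r.steps for r in results) / n,
        # Plain dicts keep the output easy to serialize (e.g. to JSON).
        "per_episode": [asdict(r) for r in results],
    }


sample = [
    EpisodeResult(correct=True, total_reward=1.0, steps=2),
    EpisodeResult(correct=False, total_reward=0.0, steps=4),
    EpisodeResult(correct=False, total_reward=0.0, steps=6),
    EpisodeResult(correct=True, total_reward=1.0, steps=4),
]
summary = aggregate(sample)
```

Keeping aggregation as a pure function over `list[EpisodeResult]` makes the "metrics match manual calculation" check straightforward to unit-test.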

---

## 3. Dependencies

### Code We Depend On

| Dependency | What We Use | Risk if Changed |
|------------|-------------|-----------------|
| `models.py:SQLAction, SQLObservation` | Action/observation types | Stable (F001 complete) |
| `server/sql_environment.py:SQLEnvironment` | `reset()`, `step()` API | Stable (F001 complete) |

### Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
|-----------|-----------------|---------------------|
| F006 (GRPO Training) | Baseline comparison + evaluation | Provides metrics API |
| F007 (HF Submission) | Demo results for blog | Produces numbers |

---

## 4. Risks & Edge Cases

### Identified Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Evaluation crashes partway | Medium | Results collected so far are lost | Collect incrementally, return partial results on error |
| No progress indicator | Medium | User may assume the run has hung | Optional tqdm or progress callback |

### Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
|-----------|------------------|-------------------|
| `n_episodes=0` | N/A | Return metrics with an empty per-episode list; avoid division by zero |
| Policy exception mid-episode | N/A | Catch, record the episode as failed, continue with the next episode |
| Environment reset fails | N/A | Skip the episode, log a warning, continue |
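One way to satisfy all three rows is to wrap `reset()` and the step loop in separate `try` blocks. The sketch below assumes dict-shaped observations and uses deliberately flaky stand-ins (`FlakyEnv`, `FlakyPolicy`, and `run_episodes` are hypothetical names); the rule that an episode is "correct" when its reward is positive is a toy convention, not the project's definition.

```python
import logging

logger = logging.getLogger("green_agent_sketch")


class FlakyEnv:
    """Hypothetical env: the 2nd reset() raises; each episode lasts one step."""

    def __init__(self):
        self.resets = 0

    def reset(self):
        self.resets += 1
        if self.resets == 2:
            raise RuntimeError("reset failed")
        return {"done": False, "reward": 0.0}

    def step(self, action):
        return {"done": True, "reward": 1.0}


class FlakyPolicy:
    """Hypothetical policy: the 2nd select_action() call raises."""

    def __init__(self):
        self.calls = 0

    def select_action(self, obs):
        self.calls += 1
        if self.calls == 2:
            raise ValueError("policy error")
        return "noop"


def run_episodes(env, policy, n_episodes):
    results, skipped = [], 0
    for i in range(n_episodes):  # n_episodes == 0 falls straight through
        try:
            obs = env.reset()
        except Exception:
            logger.warning("episode %d: reset failed, skipping", i)
            skipped += 1
            continue
        total_reward, steps, failed = 0.0, 0, False
        try:
            while not obs["done"]:
                obs = env.step(policy.select_action(obs))
                total_reward += obs["reward"]
                steps += 1
        except Exception:
            # Covers policy errors (and, in this sketch, step() errors too).
            logger.warning("episode %d: failed mid-episode", i)
            failed = True
        # Results are appended per episode, so earlier ones survive later failures.
        results.append(
            {"correct": not failed and total_reward > 0,
             "total_reward": total_reward, "steps": steps}
        )
    return results, skipped


results, skipped = run_episodes(FlakyEnv(), FlakyPolicy(), n_episodes=4)
```

In this run one episode is skipped (failed reset), one is recorded as failed (policy exception), and the remaining two complete normally.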

### Invariants to Preserve

- [ ] Evaluation is read-only: it interacts with the environment only through `reset()`/`step()` and does not otherwise modify it between episodes
- [ ] Random policy is deterministic given a seed
- [ ] Metrics match manual calculation
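The seed-determinism invariant can be met by giving the policy its own `random.Random` instance instead of using the module-level generator. In this sketch the action vocabulary is a hypothetical placeholder; the real policy would build `SQLAction` objects, whose fields are not specified here.

```python
import random


class RandomPolicy:
    # Hypothetical placeholder actions; the real action space comes from SQLAction.
    ACTIONS = ("explore", "query", "answer")

    def __init__(self, seed=None):
        # A private generator keeps the policy independent of global random state.
        self._rng = random.Random(seed)

    def select_action(self, observation):
        return self._rng.choice(self.ACTIONS)


def rollout(seed, n=20):
    """Collect n actions from a freshly seeded policy."""
    policy = RandomPolicy(seed=seed)
    return [policy.select_action(None) for _ in range(n)]


first_run = rollout(seed=7)
second_run = rollout(seed=7)
```

Two policies built with the same seed produce the same action sequence, which is what makes a random baseline comparable across runs.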

---

## 4b. Code Shape & Design Target

### Target Shape

| Component | Purpose | Why This Boundary |
|-----------|---------|-------------------|
| `evaluate(env, policy, n_episodes, seed)` | Main entry | Single public function |
| `RandomPolicy` | Built-in random baseline | Needed for comparison |
| `Policy` (Protocol) | Type hint for custom policies | Duck typing |
| `EpisodeResult` (dataclass) | Per-episode metrics | Clean structure |
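The `Policy` row relies on structural typing: any object with a `select_action` method qualifies, with no base class required. A minimal sketch using `typing.Protocol` is shown below; `Any` stands in for `SQLObservation` and `SQLAction`, and `ConstantPolicy` is a hypothetical example class.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Policy(Protocol):
    """Structural type: anything with select_action(observation) is a Policy."""

    def select_action(self, observation: Any) -> Any: ...


class ConstantPolicy:
    """Hypothetical custom policy; note that it does not inherit from Policy."""

    def select_action(self, observation: Any) -> str:
        return "noop"


is_policy = isinstance(ConstantPolicy(), Policy)
is_not_policy = isinstance(object(), Policy)
```

`runtime_checkable` only verifies that the method exists, not its signature, so it is a sanity check rather than a substitute for a static type checker.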

### Abstraction Level

- **Recommendation:** One module, `green_agent.py`, at the project root, containing the `evaluate` function, the `EpisodeResult` dataclass, and the `RandomPolicy` class.

### Anti-Patterns to Avoid

- Don't create an elaborate policy class hierarchy
- Don't couple to the WebSocket transport; work with the local env directly
- Don't add visualization/plotting (MVP)

---

## 5. Constraints

| Constraint | Requirement | Notes |
|------------|-------------|-------|
| No new heavy deps | tqdm optional | Keep lean |
| Works with local env | Direct SQLEnvironment | Primary use case |
| Seedable | Reproducible results | Random policy + env seed |
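The "tqdm optional" constraint is commonly handled with a guarded import that falls back to the plain iterable. The `progress` helper name and its `enabled` flag below are assumptions for illustration.

```python
try:
    from tqdm import tqdm  # optional dependency; not required to run evaluation
except ImportError:
    tqdm = None


def progress(iterable, enabled=True, **tqdm_kwargs):
    """Wrap iterable in a progress bar when tqdm is installed and enabled."""
    if enabled and tqdm is not None:
        return tqdm(iterable, **tqdm_kwargs)
    return iterable


# Without tqdm, or with enabled=False, this is just the original iterable.
episodes_seen = [i for i in progress(range(3), enabled=False)]
```

The evaluation loop would then iterate over `progress(range(n_episodes))` and behave identically whether or not tqdm is present.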

---

## 6. Open Questions

| Question | Why It Matters | Recommendation |
|----------|----------------|----------------|
| Module location: `green_agent.py` at root? | Naming | Recommend root, matches concept doc |
| Should RandomPolicy use schema info for smarter random? | Baseline quality | Recommend simple random |

---

## 7. Context Sources

| Source | Type | Notes |
|--------|------|-------|
| `docs_draft/SQLEnv_Concept_v1.md` Appendix C | Doc | SQLGreenAgent sketch |
| `server/sql_environment.py` | Code | reset()/step() API |
| `models.py` | Code | SQLAction, SQLObservation |