# System Behavior: evaluation

> Living document. Updated by `/archive-spec` when features are completed.
> Last archived: F011 on 2026-04-07

---

## Added

### Automated multi-episode evaluation

The system accepts an environment, a policy, and an episode count, then produces an `EvaluationResult` containing `success_rate`, `avg_reward`, `avg_steps`, and a per-episode breakdown. Evaluation runs all requested episodes and returns structured metrics in a single call.

### Incremental result collection on failure

When an individual episode fails (environment error or policy error), the system records the failure in the per-episode breakdown and continues evaluating the remaining episodes. Partial results are never lost.

### Random baseline policy

The system provides a built-in random policy that accepts an `SQLObservation` and returns a random `SQLAction`. Given the same seed, the random policy produces identical action sequences across runs.

### Progress callback during evaluation

The `evaluate` function accepts an optional progress callback that receives `(current_episode, total_episodes)` after each episode completes, enabling progress reporting for long evaluation runs.

### Oracle policy baseline available for evaluation

The evaluation module accepts an `OraclePolicy` that, given the same question list as the environment, produces a deterministic optimal action sequence per episode (DESCRIBE the relevant tables, QUERY with the gold SQL, ANSWER with the gold answer). When run through `evaluate()`, the oracle achieves a near-perfect success rate and ~1.3 total reward, serving as an upper-bound baseline for comparison against random and trained policies.

### Oracle graceful fallback on unknown questions

When the oracle encounters a question not present in its lookup, it returns an ANSWER action with an empty string rather than raising an error. The episode is marked incorrect, but the evaluation run continues without interruption.
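The evaluation loop described above (structured results, failure recording, progress callback) can be sketched as follows. This is a minimal illustration, not the codebase's implementation: `evaluate` and `EvaluationResult` follow the names and fields given in this document, while `EpisodeRecord` and the `run_episode` hook are hypothetical stand-ins for the environment/policy rollout.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class EpisodeRecord:
    # Hypothetical per-episode breakdown entry.
    episode: int
    success: bool
    reward: float
    steps: int
    error: Optional[str] = None  # populated when the episode failed


@dataclass
class EvaluationResult:
    # Fields named in this document.
    success_rate: float
    avg_reward: float
    avg_steps: float
    episodes: List[EpisodeRecord] = field(default_factory=list)


def evaluate(
    run_episode: Callable[[int], tuple],
    num_episodes: int,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> EvaluationResult:
    """Run all requested episodes, recording failures in the breakdown
    instead of raising, and invoke the optional progress callback with
    (current_episode, total_episodes) after each episode completes."""
    records: List[EpisodeRecord] = []
    for i in range(num_episodes):
        try:
            success, reward, steps = run_episode(i)
            records.append(EpisodeRecord(i, success, reward, steps))
        except Exception as exc:  # environment error or policy error
            records.append(EpisodeRecord(i, False, 0.0, 0, error=str(exc)))
        if progress_callback is not None:
            progress_callback(i + 1, num_episodes)
    n = len(records)
    return EvaluationResult(
        success_rate=sum(r.success for r in records) / n,
        avg_reward=sum(r.reward for r in records) / n,
        avg_steps=sum(r.steps for r in records) / n,
        episodes=records,
    )
```

Because failed episodes are appended to the breakdown rather than aborting the loop, partial results survive any individual environment or policy error, and the aggregate metrics still cover every requested episode.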
### Compare-methods notebook produces prompting-vs-GRPO accuracy view

The system provides a notebook evaluation flow that runs a shared held-out subset and renders a side-by-side comparison table and bar chart for zero-shot, 1-shot, 3-shot, GRPO no-think, and GRPO thinking conditions.

### GRPO checkpoint evaluation degrades gracefully when repos are unavailable

When a configured GRPO checkpoint cannot be loaded from HF Hub, the notebook emits a warning and skips that condition while continuing the remaining evaluations, so users still receive partial comparison output.

### LLM tool-calling policy converts model outputs into SQL actions

The notebook policy feeds tool schemas through the chat template, parses the JSON tool-call blocks emitted by the model, and maps them to structured `SQLAction` objects; unparseable generations fall back to an ANSWER action so evaluation continues without crashing.
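The parse-with-fallback step of the tool-calling policy can be sketched like this. It is an assumption-laden illustration: the regex extraction, the `SQLAction` fields, and the action names are hypothetical and may not match the notebook's exact implementation.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class SQLAction:
    # Hypothetical shape; the real SQLAction may differ.
    kind: str     # e.g. "DESCRIBE", "QUERY", "ANSWER"
    payload: str


# Grab the first JSON object in the generation (greedy, across newlines).
TOOL_CALL_RE = re.compile(r"\{.*\}", re.DOTALL)


def parse_action(generation: str) -> SQLAction:
    """Map a raw model generation to an SQLAction. If no valid JSON
    tool call can be extracted, fall back to an ANSWER action so the
    evaluation loop continues without crashing."""
    match = TOOL_CALL_RE.search(generation)
    if match:
        try:
            call = json.loads(match.group(0))
            return SQLAction(
                kind=call["name"].upper(),
                payload=str(call.get("arguments", "")),
            )
        except (json.JSONDecodeError, KeyError, AttributeError):
            pass  # malformed JSON or missing/odd fields -> fall through
    return SQLAction(kind="ANSWER", payload=generation.strip())
```

The design choice here mirrors the document's fallback behavior: an unparseable generation is treated as a final answer attempt rather than an error, trading a likely-incorrect episode for an uninterrupted evaluation run.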