test-rl-hackathon-budget


Akshay Babbar committed 30 days ago

Commit

98a5a8c

0 Parent(s):

chore: HF Space export (size filter)

This view is limited to 50 files because it contains too many changes. See raw diff

Files changed (50)

.cursor/BUGBOT.md +102 -0
.dockerignore +28 -0
.gitignore +31 -0
.openenvignore +8 -0
Dockerfile +40 -0
LICENSE.md +201 -0
README.md +344 -0
REPRODUCIBILITY.md +406 -0
__init__.py +0 -0
app_gradio.py +408 -0
blog.md +257 -0
budget_router/__init__.py +17 -0
budget_router/client.py +29 -0
budget_router/environment.py +515 -0
budget_router/models.py +212 -0
budget_router/policies.py +141 -0
budget_router/reward.py +281 -0
budget_router/tasks.py +108 -0
budget_router/tests/__init__.py +1 -0
budget_router/tests/test_environment.py +502 -0
budget_router/tests/test_eval_all_seed_selection.py +41 -0
budget_router/tests/test_grpo_training_reward.py +154 -0
budget_router/tests/test_inference_prompt.py +94 -0
budget_router/tests/test_trace_episode.py +43 -0
budget_router/tests/test_validation.py +140 -0
budget_router/validation.py +424 -0
check_leak.py +181 -0
client.py +3 -0
eval/eval_all.py +306 -0
eval/eval_all.sh +116 -0
eval/outputs/prompt_audit/belief_v1_dev10/eval_results_20260425_160429.json +1188 -0
eval/outputs/prompt_audit/belief_v1_dev10/eval_summary_20260425_160429.md +5 -0
eval/outputs/prompt_audit/belief_v1_heldout5/eval_results_20260425_160016.json +615 -0
eval/outputs/prompt_audit/belief_v1_heldout5/eval_summary_20260425_160016.md +5 -0
eval/outputs/prompt_audit/budget_guard_alltasks_dev3/eval_results_20260425_165910.json +1468 -0
eval/outputs/prompt_audit/budget_guard_alltasks_dev3/eval_summary_20260425_165910.md +8 -0
eval/outputs/prompt_audit/budget_guard_dev10/eval_results_20260425_164343.json +1202 -0
eval/outputs/prompt_audit/budget_guard_dev10/eval_summary_20260425_164343.md +5 -0
eval/outputs/prompt_audit/budget_guard_heldout5/eval_results_20260425_163956.json +617 -0
eval/outputs/prompt_audit/budget_guard_heldout5/eval_summary_20260425_163956.md +5 -0
eval/outputs/trace_compare/eval_seed101/eval_results_20260425_192545.json +149 -0
eval/outputs/trace_compare/eval_seed101/eval_results_20260425_192656.json +149 -0
eval/outputs/trace_compare/eval_seed101/eval_summary_20260425_192545.md +5 -0
eval/outputs/trace_compare/eval_seed101/eval_summary_20260425_192656.md +5 -0
eval/trace_episode.py +357 -0
eval_sft.py +488 -0
generate_sft_data.py +361 -0
gradio_ui/__init__.py +0 -0
gradio_ui/config.py +19 -0
gradio_ui/legacy_api.py +56 -0

.cursor/BUGBOT.md ADDED

	@@ -0,0 +1,102 @@

+# Bugbot review charter — Budget Router (OpenEnv)
+## North star
+This codebase is an **OpenEnv-style RL / agent environment**: correctness of the
+simulation, inference path, and evaluation harness is **non-negotiable**. Treat
+every change first as a **risk to invariants**, second as a product of intent.
+**Priority order (strict):**
+1. **Factual and behavioral accuracy** — claims, metrics, seeds, APIs, and
+   documented procedures must remain true and reproducible.
+2. **Regression safety** — no silent change to reward semantics, observation
+   space, routing contracts, seed selection, or eval aggregation unless
+   explicitly justified and reflected in docs.
+3. **Everything else** — including new features, refactors, and ergonomics —
+   only after the above are satisfied.
+If a change improves developer experience or adds capability but **weakens
+traceability, determinism, or agreement with the published contract**, treat that
+as a **defect**, not a win.
+---
+## Evidence contract
+`README_v1.md` is the **published evidence layer** for this repository: benchmark
+definitions, honest scope, statistical reporting, seed buckets, and
+environmental assumptions. It is not marketing copy; it is the **external
+interface of trust**.
+When reviewing a pull request:
+- Assume reviewers and downstream users will reconcile the diff against
+  **`README_v1.md`** and the **test suite**, not against intent expressed only in
+  comments or chat.
+- Flag any drift between **implementation**, **eval scripts**, and **documented
+  claims** as a **primary finding**, not a footnote.
+- Prefer **blocking** feedback when the PR could make a true statement in
+  `README_v1.md` false, ambiguous, or non-reproducible without a coordinated doc
+  update.
+---
+## Regression lens (main code and agent path)
+Evaluate from the perspective of **“what breaks for callers?”** — the Gradio /
+server surface, the environment stepping contract, inference and routing logic,
+and anything an **agent** (heuristic, LLM, or RL policy) depends on.
+Elevate severity when the change touches or could affect:
+- **Reward / termination / budget / SLA semantics** — any path that alters
+  episode economics without a clear, tested migration story.
+- **Observations and action validity** — shapes, bounds, masking, or
+  interpretation of noisy signals the agent is documented to use.
+- **Provider degradation or non-stationarity** — ordering, timing, or randomness
+  that shifts the task without explicit versioning or changelog discipline.
+- **Evaluation** — `eval/` entrypoints, seed handling, aggregation, baselines, and
+  anything that feeds headline numbers or comparisons in `README_v1.md`.
+- **Determinism and auditability** — anything that makes prior results
+  incomparable across commits without saying so.
+Ask explicitly: **If we merge this, can a user still run the same commands and
+obtain a result that is fairly comparable to what the README describes?** If the
+answer is “only sometimes” or “only with undocumented flags,” that is a **merge
+risk**.
+---
+## Code review bar
+Hold the diff to a **high-trust research engineering** standard:
+- **Invariants first** — state what must remain true; show how the change
+  preserves or formally relaxes it.
+- **Proof over taste** — prefer runnable tests, property checks, or minimal
+  reproductions over stylistic preference. Style matters only where it prevents
+  bugs (e.g., unclear units, magic numbers without provenance).
+- **Minimal blast radius** — favor localized, reversible changes; be skeptical of
+  drive-by refactors bundled with behavioral edits.
+- **Failure modes** — consider partial deploys, missing API keys, degraded
+  backends, and off-by-one episode boundaries as first-class scenarios when
+  relevant.
+Do **not** optimize review comments for velocity of shipping features. Optimize
+for **confidence that main remains a reliable substrate for agents and eval**.
+---
+## What “approve” means here
+A non-issue or acceptable change is one that **preserves or strengthens** the
+truth and stability story relative to `README_v1.md` and existing tests.
+A blocking issue is one that **could** — even in edge cases — produce **wrong
+results**, **misleading comparisons**, **undocumented behavior change**, or
+**silent regression** in core or agent-facing paths without a commensurate,
+explicit update to evidence and tests.
+When uncertain, **assume the worst plausible interpretation** for merge safety,
+state the assumption, and recommend what evidence would resolve it.

.dockerignore ADDED

	@@ -0,0 +1,28 @@

+.venv/
+.git/
+__pycache__/
+*.pyc
+.pytest_cache/
+.ruff_cache/
+docs/*.png
+**/*.png
+*.tar.gz
+*.egg-info/
+.env
+.claude/
+.windsurf/
+.hf_private/
+.DS_Store
+artifacts/
+*.zip
+*.json
+**/*.json
+*.txt
+**/*.txt
+.hackathon_context/
+README_archived.md
+llm_stderr.log
+pre_validation.sh
+test_docker_step.sh
+trained_models/

.gitignore ADDED

	@@ -0,0 +1,31 @@

+.venv/
+__pycache__/
+*.pyc
+*.pyo
+*.pyd
+.Python
+*.so
+.env
+.hf_private/
+.windsurf/
+.matplotlib/
+.DS_Store
+*.egg-info/
+dist/
+build/
+*.png
+!figures/budget_router_evidence.png
+._*
+merged_codebase.txt
+docs/
+/*.json
+*.tar.gz
+/*.txt
+README_archived.md
+llm_stderr.log
+pre_validation.sh
+test_docker_step.sh
+.hackathon_context/
+outputs/*/
+.cursor/
+.docs/

.openenvignore ADDED

	@@ -0,0 +1,8 @@

+*.png
+*.tar.gz
+README_archived.md
+*.json
+docs/
+*.txt
+*.zip
+trained_models/

Dockerfile ADDED

	@@ -0,0 +1,40 @@

+ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+FROM ${BASE_IMAGE} AS builder
+WORKDIR /app
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git curl && \
+    rm -rf /var/lib/apt/lists/*
+COPY . /app/env
+WORKDIR /app/env
+RUN if ! command -v uv >/dev/null 2>&1; then \
+        curl -LsSf https://astral.sh/uv/install.sh | sh && \
+        mv /root/.local/bin/uv /usr/local/bin/uv && \
+        mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+    fi
+RUN --mount=type=cache,target=/root/.cache/uv \
+    uv sync --extra training --no-install-project --no-editable
+RUN --mount=type=cache,target=/root/.cache/uv \
+    uv sync --extra training --no-editable
+FROM ${BASE_IMAGE}
+WORKDIR /app
+COPY --from=builder /app/env/.venv /app/.venv
+COPY --from=builder /app/env /app/env
+ENV PATH="/app/.venv/bin:$PATH"
+ENV PYTHONPATH="/app/env:$PYTHONPATH"
+EXPOSE 8000
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD python -c "import os, urllib.request; port = os.environ.get('PORT', '8000'); urllib.request.urlopen(f'http://127.0.0.1:{port}/health', timeout=2)"
+CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port ${PORT:-8000} --proxy-headers --forwarded-allow-ips='*'"]

LICENSE.md ADDED

	@@ -0,0 +1,201 @@

+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+   1. Definitions.
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+   END OF TERMS AND CONDITIONS
+   APPENDIX: How to apply the Apache License to your work.
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+   Copyright [yyyy] [name of copyright owner]
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+       http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.

README.md ADDED

	@@ -0,0 +1,344 @@

+---
+title: "Budget Router"
+emoji: "⚙️"
+colorFrom: purple
+colorTo: indigo
+sdk: docker
+app_port: 8000
+base_path: /web
+pinned: false
+---
+# Budget Router (OpenEnv)
+Budget Router is an OpenEnv-compliant RL environment where an agent routes requests to one of three providers (A/B/C) or sheds load under a tight **cost–reliability–SLA** trade-off. Providers degrade non-stationarily within an episode; the agent observes only a noisy windowed success signal (rolling success rate), not true internal health.
+[![HF Space](https://img.shields.io/badge/🤗-Live%20Demo-yellow)](https://huggingface.co/spaces/akshay4/budget-router-openenv)
+## TL;DR
+**Hard_Multi is the headline scenario**: when Provider A degrades from step 0 and
+Provider B cascades at step 10, reactive policies go negative while adaptive ones
+stay positive. Three policy families, each stronger than the last, validated
+across **30 paired seeds** in three independent buckets (dev, heldout, fresh):
+| Policy | Hard_Multi grader | vs heuristic | Statistical evidence |
+|---|---:|---|---|
+| Heuristic (reactive) | 0.6076 ± 0.0361 (n=30) | — | — |
+| LLM — Qwen2.5-72B + budget-guard | 0.6515 ± 0.0523 (n=30) | **+7.2 %** | Cohen's d = **1.135** (large), paired one-sided p < 1×10⁻⁶, 24/30 wins, bootstrap 95 % CI on Δ = [0.031, 0.058] |
+| PPO — SB3, 100k steps | **0.6907 ± 0.0326** (n=10 dev) | **+13.6 %** | 95 % CI [0.667, 0.714], **non-overlapping with heuristic**, 10/10 wins |
+**Mechanism** (PPO): the agent learned to route A→B early and conserve budget
+before B's cascade at step 10, pushing `adaptation_score` from 0.6907 (heuristic)
+to **0.9328** — a +0.2421 gain on the grader's most diagnostic sub-score. The
+LLM achieves a milder version of the same effect (+0.124 adaptation gain
+across n=30) by anticipating the cascade in-context.
+**Environment hardness**: heuristic reward goes negative (−2.97) on
+Hard_Multi while oracle reaches +4.10 — a 7.07-point gap (≈238 % of the
+heuristic's absolute reward) that confirms the cascade task is hard enough
+to require RL/in-context reasoning and learnable enough to reward it.
+**Honest scope** (explicitly disclosed):
+- The LLM uses a deterministic **budget-safety guard** that vetoes routes which
+  would bankrupt the budget — a standard agentic-system pattern (LLM for
+  high-level decisions, deterministic layer for arithmetic-critical safety).
+  Without the guard, raw LLM occasionally exhausts budget and incurs the −10
+  cliff penalty.
+- LLM (with guard) wins on **3 of 4 task tiers**: Medium (+5.8 %), Hard (+7.5 %),
+  Hard_Multi (+11.0 %). Loses Easy by −4.6 % — on a task with no degradation,
+  the budget-conservative heuristic is near-optimal and the LLM's added
+  flexibility is unhelpful.
+- PPO is trained and evaluated on **Hard_Multi only**; not a general-purpose
+  policy. This is a deliberate choice: Hard_Multi has the largest absolute
+  oracle/heuristic reward gap in the suite (7.07 points, 238 % of the
+  heuristic's absolute reward), so RL signal is highest there.
+- All non-trivial improvement claims come from seeds the policy never saw
+  during design (heldout 100–109, fresh 200–209). Dev-seed wins are reported
+  separately and never used to make the headline claim.
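The budget-safety guard described in the first bullet above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `inference.py::_apply_budget_safety_guard`: the name `guard_action`, the cost table (only `cost_c = 0.10` is stated in this README; the A and B values are placeholders), and the choice to shed load on a veto are all hypothetical.

```python
# Hypothetical per-request costs in dollars. Only cost_c = 0.10 comes from the
# README; the values for A and B are placeholders for illustration.
ASSUMED_COSTS = {"route_to_a": 0.02, "route_to_b": 0.05, "route_to_c": 0.10}


def guard_action(proposed: str, budget_remaining: float,
                 costs: dict = ASSUMED_COSTS) -> str:
    """Return `proposed` unless paying for it would overdraw the budget."""
    if proposed == "shed_load":
        return proposed  # shedding is treated as free in this sketch
    # A tiny tolerance avoids vetoing a route because of float rounding.
    if costs[proposed] <= budget_remaining + 1e-9:
        return proposed
    return "shed_load"  # assumed fallback: veto by shedding the request


print(guard_action("route_to_c", budget_remaining=0.25))
print(guard_action("route_to_c", budget_remaining=0.05))
```

The point of the pattern is that the veto is pure arithmetic on `budget_remaining`, so it behaves identically regardless of what the LLM proposes.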
+## Run locally
+**Enable LLM policy locally**:
+```bash
+export API_BASE_URL="https://<openai-compatible-endpoint>/v1"  # e.g. https://router.huggingface.co/v1
+export API_KEY="<your_key>"
+export MODEL_NAME="<model_id>"  # optional (e.g. Qwen/Qwen2.5-72B-Instruct)
+```
+```bash
+uv sync --extra training
+uv run server
+```
+Then open `http://127.0.0.1:8000/web` for the Gradio dashboard.
+To **reproduce or regenerate** the evaluation numbers, traces, PPO workflow, and optional GRPO checks, follow the command checklist in [`REPRODUCIBILITY.md`](REPRODUCIBILITY.md) (companion to the optional `<details>` blocks below).
+## Benchmark results
+Three policies evaluated, plus a privileged oracle used for validation only:
+- **Heuristic**: budget-aware, cheapest-viable baseline using only public
+  observations (`budget_router/policies.py`).
+- **LLM**: Qwen2.5-72B via HuggingFace Inference Router, wrapped with a
+  deterministic budget-safety guard (`inference.py::_apply_budget_safety_guard`).
+- **PPO**: MlpPolicy trained with Stable-Baselines3 on Hard_Multi (100k steps,
+  4 parallel envs). See `train/train_ppo_hard_multi.py`.
+- **Oracle†**: privileged upper-bound with internal-state access,
+  validation-only, not reported in tables.
+**Dev seeds (0–9), full task suite** — `outputs/freeze_check_alltasks_dev10/eval_summary_*.md`:
+| Task | Heuristic | LLM | PPO | LLM Δ vs heuristic |
+|---|---:|---:|---:|---|
+| Easy | 0.7718 | 0.7360 | — | −4.6 %  *(7 losses, 0 wins, 3 ties)* |
+| Medium | 0.6852 | 0.7250 | — | **+5.8 %**  *(9 wins, 0 losses, 1 tie)* |
+| Hard | 0.6354 | 0.6832 | — | **+7.5 %**  *(8 wins, 2 losses, 0 ties)* |
+| Hard_Multi | 0.6078 | 0.6746 | **0.6907** | **+11.0 %**  *(8 wins, 1 loss, 1 tie)* |
+PPO was trained and evaluated on Hard_Multi only; Easy/Medium/Hard cells are
+intentionally blank (no model for those tasks).
+**Statistical evidence — Hard_Multi** (`outputs/freeze_check_*/eval_results_*.json`,
+`outputs/ppo_hard_multi_eval.json`):
+| | Heuristic | LLM | PPO |
+|---|---|---|---|
+| Mean grader | 0.6076 ± 0.0361 (n=30) | 0.6515 ± 0.0523 (n=30) | 0.6907 ± 0.0326 (n=10) |
+| Bootstrap 95 % CI | [0.595, 0.620] | [0.633, 0.670] | [0.667, 0.714] |
+| Paired Δ vs heuristic | — | +0.0440 (boot 95 % CI [0.031, 0.058]) | +0.0829 |
+| **Cohen's d (paired)** | — | **1.135  (LARGE)** | **≈ 2.4  (HUGE)** |
+| Paired one-sided p | — | **< 1 × 10⁻⁶** (paired t = 6.22, df = 29) | (10/10 wins) |
+| Sign-test wins / ties / losses | — | **24 / 3 / 3** | 10 / 0 / 0 |
+| P(LLM > heuristic) — Agarwal 2021 | — | **0.80** | 1.00 |
+| IQM of paired Δ — Agarwal 2021 | — | +0.040 (trimmed 25 %) | — |
+| 95 % CI overlap with heuristic | — | None on the Δ | **None on the means** |
+| Adaptation sub-score (mean) | 0.6878 | 0.8115 | **0.9328** |
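For readers who want to recompute the paired statistics in the table above from their own episode-level results, here is a minimal sketch using only the standard library. The function name `paired_summary` and the four example scores are made up for illustration; exact-equality ties are an assumption, since the README does not say how ties are detected. For paired samples the t-statistic equals the paired Cohen's d times √n, which is why d = 1.135 with n = 30 lines up with the reported t = 6.22 at df = 29.

```python
import math
import statistics


def paired_summary(baseline, treatment):
    """Paired effect size, t-statistic, and win/tie/loss counts."""
    deltas = [t - b for b, t in zip(baseline, treatment)]
    n = len(deltas)
    mean_delta = statistics.fmean(deltas)
    sd_delta = statistics.stdev(deltas)   # sample SD of the differences
    cohens_d = mean_delta / sd_delta      # paired Cohen's d
    t_stat = cohens_d * math.sqrt(n)      # paired t with n - 1 degrees of freedom
    wins = sum(d > 0 for d in deltas)
    ties = sum(d == 0 for d in deltas)    # assumption: ties are exact equality
    losses = sum(d < 0 for d in deltas)
    return {"d": cohens_d, "t": t_stat, "df": n - 1,
            "wins_ties_losses": (wins, ties, losses)}


# Made-up scores for four seeds, only to exercise the function.
heuristic = [0.60, 0.62, 0.58, 0.61]
llm = [0.65, 0.64, 0.58, 0.60]
print(paired_summary(heuristic, llm))

# Consistency check on the reported values: d = 1.135 with n = 30.
print(round(1.135 * math.sqrt(30), 2))  # 6.22
```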
+**Per-bucket reproduction** (each row independent; LLM and heuristic share seeds,
+so deltas are paired):
+| Bucket | Seeds | Heuristic | LLM | Δ (rel %) | Wins / Ties / Losses |
+|---|---|---:|---:|---:|---:|
+| Dev | 0–9 | 0.6078 ± 0.0382 | 0.6746 ± 0.0486 | +0.0668 (+11.0 %) | 8 / 1 / 1 |
+| **Heldout** | 100–109 | 0.6064 ± 0.0419 | 0.6454 ± 0.0497 | **+0.0390 (+6.4 %)** | **8 / 2 / 0** |
+| **Fresh** | 200–209 | 0.6086 ± 0.0314 | 0.6347 ± 0.0551 | **+0.0261 (+4.3 %)** | **8 / 0 / 2** |
+| **Combined non-dev** | 100–109 + 200–209 | 0.6075 | 0.6401 | **+0.0326 (+5.4 %)** | **16 / 2 / 2** |
+![Budget Router Evidence](figures/budget_router_evidence.png)
+*Figure: (top-left) LLM advantage grows with task difficulty; (top-right)
+three-policy ordering on Hard_Multi with non-overlapping 95% CIs;
+(bottom-left) generalization across independent seed buckets including
+post-freeze fresh seeds; (bottom-right) adaptation sub-score is the
+primary driver of LLM and PPO gains over the reactive heuristic.*
+The fresh-seed bucket (200–209) was added *after* the LLM prompt and budget
+guard were frozen. It exists specifically to falsify a "tuned-on-heldout"
+critique. The effect persists: the bootstrap CI on Δ excludes zero.
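The relative gains and the combined win/tie/loss counts in the per-bucket table can be re-derived from the bucket means alone. The short check below copies the means and counts from that table; the variable and function names are illustrative only.

```python
# Means and win/tie/loss counts copied from the per-bucket table.
buckets = {
    "dev": {"heuristic": 0.6078, "llm": 0.6746, "wtl": (8, 1, 1)},
    "heldout": {"heuristic": 0.6064, "llm": 0.6454, "wtl": (8, 2, 0)},
    "fresh": {"heuristic": 0.6086, "llm": 0.6347, "wtl": (8, 0, 2)},
}


def relative_gain_pct(heuristic: float, llm: float) -> float:
    """Relative improvement of the LLM over the heuristic, in percent."""
    return round(100 * (llm - heuristic) / heuristic, 1)


gains = {name: relative_gain_pct(b["heuristic"], b["llm"])
         for name, b in buckets.items()}
# Column-wise sum of the (wins, ties, losses) tuples across all buckets.
total_wtl = tuple(sum(col) for col in zip(*(b["wtl"] for b in buckets.values())))

print(gains)
print(total_wtl)
```

The summed counts across the three buckets match the 24 / 3 / 3 sign-test row in the statistical-evidence table.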
+<details>
+<summary>🔬 Reproducing PPO Results (Optional)</summary>
+The trained PPO policy for the hard_multi scenario is included at
+`trained_models/ppo_hard_multi_100k.zip` (143 KB, trained 100k steps).
+To reproduce the 10-seed evaluation locally:
+```bash
+# Install dependencies
+uv sync --extra training
+# Run evaluation (writes to outputs/ppo_hard_multi_eval.json)
+uv run python train/eval_hard_multi.py
+```
+Expected output: PPO mean = 0.691 ± 0.033 vs Heuristic mean = 0.608 ± 0.038,
+win_rate = 1.0 (10/10 seeds), non-overlapping 95 % CIs.
+> The deployed `inference.py` uses the LLM policy (Qwen2.5-72B + budget guard)
+> as required by the hackathon specification. PPO was trained offline to
+> validate environment depth and demonstrate that the task rewards genuine
+> RL learning beyond reactive or in-context policies.
+</details>
+<details>
+<summary>🔬 Reproducing LLM rigorous-stats Results (Optional)</summary>
+```bash
+# Dev (seeds 0-9), full task suite
+uv run python eval/eval_all.py \
+  --tasks easy --tasks medium --tasks hard --tasks hard_multi \
+  --policies heuristic --policies llm \
+  --seeds 10 --seed-set dev \
+  --out-dir outputs/freeze_check_alltasks_dev10
+# Heldout (seeds 100-109), Hard_Multi
+uv run python eval/eval_all.py \
+  --tasks hard_multi --policies heuristic --policies llm \
+  --seeds 10 --seed-set heldout \
+  --out-dir outputs/freeze_check_heldout10
+# Fresh (seeds 200-209), Hard_Multi — uses --seed-values for arbitrary seeds
+uv run python eval/eval_all.py \
+  --tasks hard_multi --policies heuristic --policies llm \
+  --seed-values "200,201,202,203,204,205,206,207,208,209" \
+  --out-dir outputs/freeze_check_fresh_200_209
+```
+All three runs combined produce the n=30 rigorous-stats table above.
+Episode-level JSON (per-step actions, rewards, sub-scores) is preserved
+under each `outputs/freeze_check_*/` directory.
+</details>
+## Why this benchmark has substance
+- **Partial observability**: the agent-visible observation contains only `provider_a/b/c_status`, `budget_remaining`, `queue_backlog`, `system_latency`, and `step_count` (`budget_router/models.py`). True provider health is internal.
+- **Non-stationarity**: task difficulty is created by explicit degradation schedules, culminating in Hard_Multi where A degrades from step 0 and B degrades from step 10 (`budget_router/tasks.py`).
+- **Coupled constraints**: queue backlog amplifies latency, so routing errors create downstream SLA pressure rather than just local failures (`budget_router/environment.py`).
+- **Meaningful evaluation**: the grader separately scores success, latency, budget, SLA, and adaptation; for Hard_Multi, adaptation is explicitly split across the two degradation windows (`budget_router/reward.py`).
+- **RL learnability confirmed**: a PPO agent trained from scratch in 100k steps
+  achieves non-overlapping 95 % CIs above the heuristic on Hard_Multi
+  (`train/eval_hard_multi.py`), confirming the cascade signal is learnable
+  beyond reactive or in-context policies.
+- **Anti-gaming, anti-overfitting tested**: 41 unit tests + 36 hard validation
+  assertions including degenerate-policy guards (always-A, always-B, always-shed
+  all dominated by baseline), grader-exploit guards (pure abstention scores
+  below 0.40 on Easy), heldout stability checks, and zero-NaN/zero-crash
+  invariants across 315 episodes.
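To make the partial-observability point concrete, the sketch below models the agent-visible observation with the field names listed above and runs a toy "cheapest viable" rule over it. It is an assumption-laden illustration, not the policy in `budget_router/policies.py`: the field types, the status threshold `MIN_STATUS`, the A and B costs, and the function `cheapest_viable` are all hypothetical (only `cost_c = 0.10` appears in this README).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    # Field names follow the README; the types are assumptions.
    provider_a_status: float  # noisy rolling success rate, assumed in [0, 1]
    provider_b_status: float
    provider_c_status: float
    budget_remaining: float   # assumed to be in dollars
    queue_backlog: float
    system_latency: float
    step_count: int


# Hypothetical costs (only C's 0.10 is from the README) and threshold.
ASSUMED_COSTS = {"a": 0.02, "b": 0.05, "c": 0.10}
MIN_STATUS = 0.5


def cheapest_viable(obs: Observation) -> str:
    """Pick the cheapest provider that looks healthy and is affordable."""
    statuses = {"a": obs.provider_a_status,
                "b": obs.provider_b_status,
                "c": obs.provider_c_status}
    for name in sorted(ASSUMED_COSTS, key=ASSUMED_COSTS.get):
        healthy = statuses[name] >= MIN_STATUS
        affordable = ASSUMED_COSTS[name] <= obs.budget_remaining
        if healthy and affordable:
            return f"route_to_{name}"
    return "shed_load"


obs = Observation(0.2, 0.9, 0.95, 1.0, 0.0, 0.1, 3)
print(cheapest_viable(obs))
```

A rule like this only reacts once the rolling status has already dropped, which is the kind of lag the cascade in Hard_Multi is designed to punish.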
+### Oracle–Baseline reward gap (verified, n=10 seeds each, dev set)
+| Scenario | Oracle† | Heuristic | Gap | Signal |
+|---|---|---|---|---|
+| Easy | +10.10 | +6.98 | 3.12 (45 %) | Heuristic competitive |
+| Medium | +9.49 | +2.53 | 6.96 (275 %) | Meaningful headroom |
+| Hard | +6.54 | +0.88 | 5.66 (643 %) | Heuristic nearly fails |
+| **Hard_Multi** | **+4.10** | **−2.97** | **7.07 (238 % of \|baseline\|)** | **Heuristic actively harmful** |
+*† Oracle has privileged access to internal provider health — theoretical ceiling only, not a deployable policy.*
+On Hard_Multi the heuristic reward goes negative (−2.97): the rule-based
+policy exhausts budget mid-cascade and actively destroys episode value.
+Oracle stays strongly positive (+4.10). The 7.07-point gap (238 % of the
+heuristic's absolute reward) is what produces the large advantage signal that
+allows PPO to find a meaningful gradient in 100k steps and the LLM to find a
+Cohen's-d ≈ 1.1 effect zero-shot.
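The Gap column can be re-derived from the Oracle and Heuristic columns. The sketch below recomputes each gap and its size relative to the absolute heuristic reward. The `rows` literal only copies the table above, and the helper name `gap_and_ratio` is illustrative, not something defined in the repository.

```python
# Values copied from the Oracle/Heuristic table above (dev set, n=10 seeds each).
rows = {
    "Easy": (10.10, 6.98),
    "Medium": (9.49, 2.53),
    "Hard": (6.54, 0.88),
    "Hard_Multi": (4.10, -2.97),
}


def gap_and_ratio(oracle: float, heuristic: float) -> tuple[float, int]:
    """Return (oracle - heuristic, gap as a whole-number % of |heuristic|)."""
    gap = oracle - heuristic
    # abs() matters for Hard_Multi, where the heuristic reward is negative.
    return round(gap, 2), round(100 * gap / abs(heuristic))


for name, (oracle, heuristic) in rows.items():
    gap, pct = gap_and_ratio(oracle, heuristic)
    print(f"{name:<11} gap={gap:.2f} ({pct} % of |heuristic|)")
```

Dividing by the absolute value is why the Hard_Multi entry reads as a percentage of `|baseline|` rather than as a relative improvement.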
+```mermaid
+flowchart LR
+    subgraph Policy["Policy Layer"]
+        H["Heuristic"]
+        L["LLM (Qwen2.5-72B + budget guard)"]
+        P["PPO (SB3, Hard_Multi)"]
+    end
+    subgraph Env["BudgetRouterEnv (OpenEnv)"]
+        direction TB
+        O["Observation: provider_statuses, budget, backlog, latency, step"]
+        A["Actions: route_to_a, route_to_b, route_to_c, shed_load"]
+        R["Reward: success/fail + cost + SLA penalty, -10 on budget exhaustion"]
+        G["Episode grader: success, adaptation, latency, budget, SLA"]
+        O --> A --> R --> G
+    end
+    subgraph Tasks["Task presets"]
+        E["Easy"]
+        M["Medium"]
+        Hd["Hard"]
+        HM["Hard_Multi (cascade)"]
+    end
+    Policy -->|"action"| Env
+    Env -->|"obs + reward"| Policy
+    Tasks -->|"scenario config"| Env
+```
+## Tasks (what changes across difficulty)
+| Task | Budget ($) | Degradation schedule |
+|---|---:|---|
+| Easy | 1.00 | None (`degradation_start_step=999`) |
+| Medium | 0.95 | A degrades after step 5 (`rate=0.15`) |
+| Hard | 0.85 | A degrades from step 0 (`rate=0.15`) |
+| Hard_Multi | 1.10 | A degrades from step 0 (`rate=0.12`), then B from step 10 (`rate=0.10`) |
+Hard_Multi is the headline scenario: once B starts degrading at step 10, C becomes the only consistently reliable option. Since `cost_c=$0.10/request`, the final 10 steps alone can consume `$1.00` of the `$1.10` budget, making **early budget conservation** a binding constraint.
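The budget arithmetic behind that constraint can be checked directly. This is a minimal sketch, assuming a 20-step episode (the validation notes mention that episodes do not exceed 20 steps) and assuming every step from 10 onward is routed to C. The function name `early_budget_allowance` is hypothetical.

```python
from decimal import Decimal

EPISODE_STEPS = 20          # assumption: episode length cap
B_DEGRADES_AT = 10          # from the task table (Hard_Multi)
BUDGET = Decimal("1.10")    # Hard_Multi budget from the task table
COST_C = Decimal("0.10")    # cost_c per request, as stated above


def early_budget_allowance(budget, cost_c, total_steps, cascade_step):
    """Return (cost of routing every late step to C, budget left for early steps)."""
    late_steps = total_steps - cascade_step
    late_cost = cost_c * late_steps
    return late_cost, budget - late_cost


late_cost, early_allowance = early_budget_allowance(
    BUDGET, COST_C, EPISODE_STEPS, B_DEGRADES_AT
)
print(f"steps 10-19 on C cost ${late_cost}; ${early_allowance} remains for steps 0-9")
```

`Decimal` is used only to keep the currency arithmetic exact; under these assumptions the first ten steps share roughly a tenth of a dollar.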
+## Grader (episode score)
+The episode grader is a weighted score in `[0,1]`:
+`overall = 0.30·success + 0.20·latency + 0.15·budget + 0.15·SLA + 0.20·adaptation`
+Notes (from `budget_router/reward.py`):
+- `success_score` is computed over **all episode steps** (shed-load/abstention is penalized).
+- `adaptation_score` evaluates post-degradation success. For Hard_Multi it is a blended window: 0.5×(after A degrades, before B) + 0.5×(after B degrades).
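The weighted formula and the Hard_Multi blend can be sketched as follows. This is an illustration built from the formula and notes above, not a copy of `budget_router/reward.py`; the names `WEIGHTS`, `overall_score`, and `hard_multi_adaptation` are hypothetical, and the sub-scores in the example are made up.

```python
# Weights copied from the formula above; they sum to 1.0.
WEIGHTS = {
    "success": 0.30,
    "latency": 0.20,
    "budget": 0.15,
    "sla": 0.15,
    "adaptation": 0.20,
}


def hard_multi_adaptation(after_a_before_b: float, after_b: float) -> float:
    """Blend the two degradation windows with equal weight."""
    return 0.5 * after_a_before_b + 0.5 * after_b


def overall_score(sub_scores: dict[str, float]) -> float:
    """Weighted sum of sub-scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


# Made-up sub-scores, purely to show the call shape.
example = {
    "success": 0.8,
    "latency": 0.7,
    "budget": 0.9,
    "sla": 1.0,
    "adaptation": hard_multi_adaptation(0.6, 0.4),
}
print(f"overall = {overall_score(example):.3f}")
```

Because the weights sum to one and each sub-score is bounded, the overall score stays in `[0,1]`.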
+## Evaluation protocol (reproducibility)
+- **Three independent seed buckets**: dev (0–9) used during policy design;
+  heldout (100–109) used to falsify dev-seed overfitting; fresh (200–209)
+  added *after* the LLM and PPO were frozen to falsify "tuned-on-heldout"
+  concerns. See `eval/eval_all.py::SEED_SETS` and the `--seed-values` CLI
+  option for arbitrary seed lists.
+- **Scripted runs**: `eval/eval_all.py` writes timestamped artifacts under
+  `outputs/`. Per-episode JSON includes per-step `actions`, `rewards`, and
+  the full grader sub-score breakdown.
+- **Statistical reporting**: We report Cohen's d, paired Welch t-test,
+  bootstrap 95 % confidence intervals, IQM, and probability of improvement
+  in line with [Agarwal et al. 2021 (NeurIPS Outstanding Paper)](https://arxiv.org/abs/2108.13264)
+  and [Henderson et al. 2018](https://arxiv.org/abs/1709.06560)'s reproducibility
+  recommendations. Sample size n=30 (combined buckets) exceeds the Colas
+  et al. 2018 recommended power-analysis floor for our observed effect size.
+- **Anti-cheating tests**: `budget_router/tests/test_environment.py::TestGraderSemantics`
+  verifies that pure abstention scores below 0.40 on Easy and that
+  partial abstention always scores worse than full service.
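For readers unfamiliar with the statistics named above, here is a standard-library sketch of three of them: Cohen's d (pooled standard deviation form), a percentile bootstrap confidence interval for a mean, and the interquartile mean. It is not the project's evaluation code, the function names are hypothetical, and the sample numbers are synthetic.

```python
import random
import statistics


def cohens_d(a: list[float], b: list[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled


def bootstrap_ci(xs: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of xs."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(xs, k=len(xs))) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = max(lo_idx, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


def iqm(xs: list[float]) -> float:
    """Interquartile mean: average of the middle 50 % of sorted values."""
    s = sorted(xs)
    k = len(s) // 4
    return statistics.fmean(s[k:len(s) - k])


# Synthetic scores, for illustration only.
policy_a = [2.0, 4.0, 6.0]
policy_b = [1.0, 3.0, 5.0]
print(cohens_d(policy_a, policy_b))
print(iqm([1, 2, 3, 4, 5, 6, 7, 100]))
```

The interquartile mean discards the extreme quarters, which is why a single outlier such as `100` in the last call does not move it.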
+## Getting started
+1. Install dependencies:
+```bash
+uv sync
+```
+2. (Optional, for LLM policy) set an OpenAI-compatible endpoint:
+```bash
+export API_BASE_URL=https://router.huggingface.co/v1
+export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
+export HF_TOKEN=...   # or API_KEY
+```
+3. Run evaluation (writes to `outputs/`):
+```bash
+# Single-task heldout reproduction
+uv run python eval/eval_all.py \
+  --tasks hard_multi --seed-set heldout --seeds 10 \
+  --policies heuristic --policies llm \
+  --out-dir outputs/heldout_repro
+# Full task suite, dev
+uv run python eval/eval_all.py \
+  --tasks easy --tasks medium --tasks hard --tasks hard_multi \
+  --policies heuristic --policies llm \
+  --seeds 10 --seed-set dev \
+  --out-dir outputs/dev_repro
+```
+## References
+- Altman (1999): *Constrained Markov Decision Processes*.
+- Henderson, Islam, Bachman, Pineau, Precup, Meger ([arXiv:1709.06560](https://arxiv.org/abs/1709.06560), AAAI 2018): *Deep Reinforcement Learning that Matters* — foundational reproducibility study; motivated multi-bucket seed evaluation here.
+- Colas, Sigaud, Oudeyer ([arXiv:1806.08295](https://arxiv.org/abs/1806.08295), 2018): *How Many Random Seeds? Statistical Power Analysis in Deep RL Experiments* — power-analysis basis for n=30.
+- Agarwal, Schwarzer, Castro, Courville, Bellemare ([arXiv:2108.13264](https://arxiv.org/abs/2108.13264), NeurIPS 2021 Outstanding Paper): *Deep RL at the Edge of the Statistical Precipice* — IQM, bootstrap CIs, probability-of-improvement adopted in the statistical-evidence table.

REPRODUCIBILITY.md ADDED Viewed


	@@ -0,0 +1,406 @@

+# Budget Router Reproducibility Guide
+This guide is a Pareto-optimal falsification checklist for Budget Router. Its goal is not to run every possible experiment; it is to quickly answer the questions most likely to invalidate the project claims:
+- Does the environment still behave like the source describes?
+- Does the grader still resist reward gaming and abstention exploits?
+- Does the heuristic remain a real baseline rather than a degenerate trick?
+- Does the LLM policy beat the heuristic for the right reasons, not just prompt or seed overfitting?
+- Does PPO still demonstrate learnability beyond reactive heuristics on `hard_multi`?
+Use the active `README.md` only as a claim surface and intuition source. The source of truth is the code: `budget_router/environment.py`, `budget_router/reward.py`, `budget_router/policies.py`, `budget_router/tasks.py`, `inference.py`, `eval/eval_all.py`, `eval/trace_episode.py`, `budget_router/validation.py`, and the tests under `budget_router/tests/`. Do not use archived README files for this analysis.
+## Mental Model
+Budget Router is a partially observable routing environment. A policy chooses one of:
+- `route_to_a`
+- `route_to_b`
+- `route_to_c`
+- `shed_load`
+The policy sees normalized public observations only: provider rolling success estimates, remaining budget, queue backlog, latency, and progress. It does not see true provider health. Provider status `0.5` means unprobed/unknown, not healthy.
+The environment has two scoring layers:
+- Step reward in `budget_router/reward.py::step_reward`: dense learning signal with success/failure, cost, SLA penalty, and a catastrophic budget-exhaustion path in `BudgetRouterEnv.step`.
+- Episode grader in `budget_router/reward.py::grade_episode`: semantic benchmark score in `[0, 1]` using success, latency, budget, SLA, and adaptation.
+This distinction matters. Reward hacking usually appears when a policy optimizes a shaped reward or loophole that does not match the semantic grader. The most important checks below are designed to catch that quickly.
+## The 20-30% Command Ladder
+Run these in order when you want high confidence fast. Stop at the first failure and inspect before spending tokens or API calls on larger experiments.
+### 1. Install the Base Environment
+```bash
+uv sync
+```
+Why: this is the minimal dependency set for unit tests, heuristic policy checks, environment validation, and non-LLM traces. It does not require API keys or training dependencies.
+Red flags:
+- dependency resolution fails
+- imports fail for `openenv_core`, `typer`, or local `budget_router`
+- tests below require hidden setup not documented in code
+### 2. Run the Unit and Regression Tests
+```bash
+uv run pytest budget_router/tests
+```
+Why: this is the fastest broad guardrail. It covers deterministic resets, observation bounds, reward sanity, anti-abstention grader semantics, `hard_multi` adaptation windows, seed selection, LLM prompt structure, trace output shape, and GRPO reward behavior.
+Highest-value test areas:
+- `test_environment.py::TestGraderSemantics`: catches reward gaming by always shedding or partially abstaining.
+- `test_environment.py::TestBehavioralGuards`: catches heuristic budget-exhaustion regressions on `hard_multi`.
+- `test_eval_all_seed_selection.py`: catches seed-bucket drift and explicit fresh-seed parsing regressions.
+- `test_inference_prompt.py`: catches LLM prompt regressions around budget runway, noise calibration, task name, and bankruptcy warnings.
+- `test_grpo_training_reward.py`: catches GRPO reward mistakes where incomplete episodes get full grader credit.
+Red flags:
+- pure abstention scores too high
+- partial abstention beats full service
+- `hard_multi` adaptation ignores the secondary degradation window
+- explicit seeds no longer override named seed sets
+- LLM prompt loses `0.500 = unobserved`, budget runway, or bankruptcy constraints
+- GRPO partial episodes get the full episode grader
+### 3. Run No-API Environment Validation
+```bash
+uv run python -m budget_router.validation
+```
+Why: this compares random, heuristic, oracle, and degenerate policies across tasks and seed sets without calling an LLM. It is the best single command for environment validity, reward-gaming resistance, and oracle-vs-baseline headroom.
+What it checks from source:
+- `random_policy`: lower-bound behavior.
+- `heuristic_baseline_policy`: public-observation, cheapest-viable baseline.
+- `debug_upper_bound_policy`: oracle/debug policy with privileged internal health access.
+- degenerate policies: always A, always B, always C, always shed.
+- hard assertions: baseline beats random on core tasks, oracle beats baseline, degenerate policies do not all dominate, heldout behavior is stable, rewards are not NaN, episodes do not exceed 20 steps.
+How to interpret:
+- Oracle above heuristic means the environment has exploitable headroom.
+- Heuristic above random means the benchmark is not noise.
+- Degenerate policies failing to dominate means the grader is not trivially gameable.
+- Heldout stability means basic environment behavior is not seed-fragile.
+Red flags:
+- oracle no longer beats heuristic on any meaningful task
+- random beats heuristic broadly outside the intentionally hard `hard_multi` caveat
+- always shed or always C dominates the heuristic
+- validation passes only because assertions were weakened
+- reward means shift sharply without a corresponding intentional source change in `tasks.py`, `environment.py`, or `reward.py`
+### 4. Inspect Exact-Seed Behavior With Traces
+Use traces when aggregate numbers move or when you suspect reward hacking. Start with heuristic because it is deterministic and no-API.
+**Progress while the episode runs:** By default, `eval/trace_episode.py` prints nothing until the episode completes (then it prints the full table and optional JSON). For **~20 sequential LLM calls**, that can look “stuck.” Pass `--verbose` or `-v` to print one `[trace]` line per environment step as it happens (`step`, `action`, step `reward`, cumulative reward, `done`). For `--policy llm`, you also get a `[trace] begin …` line before the first network call, and `llm_error=…` when a step falls back after an API error.
+```bash
+uv run python eval/trace_episode.py \
+  --task hard_multi \
+  --seed 3 \
+  --policy heuristic \
+  --verbose \
+  --output-json outputs/trace_heuristic_hard_multi_seed3.json
+```
+If training extras are installed: use the bundled `trained_models/ppo_hard_multi_100k.zip`, or train from scratch first (overwrites the default save path used by the trace script):
+```bash
+uv sync --extra training
+# Recreate the checkpoint from scratch (optional if zip already present and trusted)
+uv run python train/train_ppo_hard_multi.py
+uv run python eval/trace_episode.py \
+  --task hard_multi \
+  --seed 3 \
+  --policy ppo \
+  --verbose \
+  --output-json outputs/trace_ppo_hard_multi_seed3.json
+```
+If API credentials are configured:
+```bash
+export API_BASE_URL="https://router.huggingface.co/v1"
+export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
+export HF_TOKEN="<your-token>"
+uv run python eval/trace_episode.py \
+  --task hard_multi \
+  --seed 3 \
+  --policy llm \
+  --verbose \
+  --output-json outputs/trace_llm_hard_multi_seed3.json
+```
+Why: After the episode, `eval/trace_episode.py` prints the public observation before each action plus action, provider, success, reward, cumulative reward, cost, budget, latency, and final grader breakdown. With `--verbose`, you also see per-step progress during the run (recommended for LLM). This is the fastest way to see whether a policy is actually adapting or merely exploiting a scoring artifact.
+Red flags:
+- policy sheds many steps but grader remains high
+- policy burns budget early and still scores well
+- policy never probes unknown providers but appears to infer hidden health
+- LLM repeatedly switches on one noisy failure despite the prompt's noise calibration
+- PPO repeatedly chooses a degenerate sequence such as always C or always shed
+- traces expose hidden provider health to the acting policy; the trace may display evidence after the fact, but policy inputs should remain public observations
+### 5. Reproduce Heuristic vs LLM Claims by Seed Bucket
+Set credentials only for LLM runs:
+```bash
+export API_BASE_URL="https://router.huggingface.co/v1"
+export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
+export HF_TOKEN="<your-token>"
+```
+Dev full-suite check:
+```bash
+uv run python eval/eval_all.py \
+  --tasks easy --tasks medium --tasks hard --tasks hard_multi \
+  --policies heuristic --policies llm \
+  --seeds 10 \
+  --seed-set dev \
+  --out-dir outputs/repro_dev_alltasks
+```
+Heldout `hard_multi` check:
+```bash
+uv run python eval/eval_all.py \
+  --tasks hard_multi \
+  --policies heuristic --policies llm \
+  --seeds 10 \
+  --seed-set heldout \
+  --out-dir outputs/repro_heldout_hard_multi
+```
+Fresh arbitrary-seed check:
+```bash
+uv run python eval/eval_all.py \
+  --tasks hard_multi \
+  --policies heuristic --policies llm \
+  --seed-values "200,201,202,203,204,205,206,207,208,209" \
+  --out-dir outputs/repro_fresh_200_209_hard_multi
+```
+Why: `eval/eval_all.py` writes timestamped JSON and Markdown summaries. Its seed logic has explicit named buckets for `dev` and `heldout`, plus `--seed-values` for arbitrary fresh seeds. Fresh seeds are the main defense against "tuned on heldout" critiques.
+How to interpret:
+- Dev is useful for smoke and comparison with existing README claims.
+- Heldout is the first real overfitting check.
+- Fresh seeds are the strongest quick falsifier of prompt/guard overfitting.
+- Compare paired seeds, not just aggregate means; LLM and heuristic should be evaluated on the same seeds.
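A paired comparison along these lines can be sketched as below. It assumes you have already extracted one score per seed for each policy into plain dictionaries; the JSON layout written by `eval/eval_all.py` is not reproduced here, and `paired_summary` plus the sample numbers are hypothetical.

```python
from statistics import fmean


def paired_summary(llm: dict[int, float], heuristic: dict[int, float]) -> dict:
    """Compare two policies on the seeds they share."""
    seeds = sorted(llm.keys() & heuristic.keys())
    if not seeds:
        raise ValueError("no shared seeds to compare")
    diffs = {s: llm[s] - heuristic[s] for s in seeds}
    # Drop the single most favourable seed to see whether the win survives.
    best_seed = max(diffs, key=diffs.get)
    rest = [d for s, d in diffs.items() if s != best_seed]
    return {
        "n": len(seeds),
        "mean_diff": fmean(diffs.values()),
        "wins": sum(d > 0 for d in diffs.values()),
        "best_seed": best_seed,
        "mean_diff_without_best": fmean(rest) if rest else float("nan"),
    }


# Synthetic scores: the apparent win comes entirely from seed 200.
llm_scores = {200: 0.9, 201: 0.5, 202: 0.5}
heuristic_scores = {200: 0.3, 201: 0.5, 202: 0.6}
print(paired_summary(llm_scores, heuristic_scores))
```

A positive `mean_diff` that turns negative in `mean_diff_without_best` is the outlier pattern to watch for.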
+Red flags:
+- LLM only wins on dev and collapses on heldout/fresh
+- LLM improvement comes mostly from one outlier seed
+- LLM loses the `hard_multi` adaptation sub-score while gaining budget score via excessive shedding
+- LLM invalid outputs are silently converted to `shed_load` too often
+- API/model changes make results incomparable without recording `MODEL_NAME`, endpoint, date, and prompt mode
+Optional raw LLM audit:
+```bash
+LLM_LOG_RAW=1 LLM_LOG_RAW_MAX_CHARS=400 \
+uv run python eval/eval_all.py \
+  --tasks hard_multi \
+  --policies heuristic --policies llm \
+  --seed-values "200,201,202" \
+  --out-dir outputs/repro_llm_raw_audit
+```
+Why: this helps distinguish real policy behavior from parser/guard artifacts. The parser in `inference.py` extracts a valid action string when present and falls back to `shed_load` when parsing fails.
+### 6. Evaluate the Included PPO Hard_Multi Policy
+```bash
+uv sync --extra training
+uv run python train/eval_hard_multi.py
+```
+Why: this is the source-backed PPO comparison path for `hard_multi`. It loads `trained_models/ppo_hard_multi_100k.zip`, evaluates deterministic PPO on seeds `0-9`, evaluates the heuristic on the same seeds, reports mean/std/95% CI/win rate/subscores, and writes `outputs/ppo_hard_multi_eval.json`.
+Red flags:
+- model file is missing
+- PPO no longer beats heuristic on most paired seeds
+- PPO wins only by budget preservation while success/adaptation collapse
+- PPO traces reveal degenerate always-action behavior
+- PPO results are compared against a different seed set than heuristic
+Important limitation: `eval/eval_all.py` accepts `--policies ppo` but currently only warns that PPO is not wired there. Use `train/eval_hard_multi.py` or `eval/trace_episode.py --policy ppo` (optional `--verbose`) for PPO evidence.
+### 7. Retrain PPO Only When You Need to Revalidate Learnability
+```bash
+uv sync --extra training
+uv run python train/train_ppo_hard_multi.py
+uv run python train/eval_hard_multi.py
+```
+Why: training is expensive relative to the other checks. Run it when source changes touch `environment.py`, `reward.py`, `tasks.py`, `train/gym_wrapper.py`, or PPO hyperparameters. The current training script uses Stable-Baselines3 PPO, `MlpPolicy`, 4 parallel envs, 100k steps, and saves `trained_models/ppo_hard_multi_100k.zip`.
+Red flags:
+- PPO cannot improve after training
+- training reward improves but grader does not
+- policy learns to terminate early or exploit budget scoring
+- learned behavior is strong on dev seeds but weak on exact fresh traces
+### 8. GRPO/Tool-Calling Smoke Checks
+Use this only if you are touching GRPO/training-wrapper code:
+```bash
+# Blocked for now until GRPO is fixed.
+# uv sync --extra grpo
+# PYTORCH_ENABLE_MPS_FALLBACK=1 uv run python train/smoke_test.py
+```
+Why: this validates model-to-tool-to-environment-to-reward plumbing. It is not evidence of learning. The unit tests around `train/grpo_env.py` and `train/learn_experiment.py` are more important for reward correctness.
+Red flags:
+- model makes no tool calls and receives nonzero reward
+- incomplete episodes receive full grader score
+- tool wrapper constructs custom history instead of delegating to `BudgetRouterEnv.step`
+- action-sequence diversity collapses before learning is expected
+## Policy Definitions
+Heuristic policy:
+- Defined in `budget_router/policies.py::heuristic_baseline_policy`.
+- Uses only public `Observation`.
+- Chooses the cheapest provider with status above `0.52` or unprobed `0.5`.
+- Applies a simple low-budget guard that excludes expensive C below `0.10` budget fraction.
+- This is a reactive baseline, not an oracle.
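The selection rule described above can be sketched as follows. This is not the code in `budget_router/policies.py`: the costs for A and B are placeholder values (only `cost_c = 0.10` appears in the README), the fallback to `shed_load` when nothing is viable is an assumption, and `heuristic_sketch` is a hypothetical name.

```python
# Placeholder per-request costs; only the value for "c" comes from the README.
COSTS = {"a": 0.01, "b": 0.05, "c": 0.10}

UNPROBED = 0.5              # status value meaning "never tried", not "healthy"
VIABLE_ABOVE = 0.52         # status threshold from the description above
LOW_BUDGET_FRACTION = 0.10  # below this, the expensive provider C is excluded


def heuristic_sketch(status: dict[str, float], budget_fraction: float) -> str:
    """Pick the cheapest provider that looks viable from public observations."""
    for provider, _cost in sorted(COSTS.items(), key=lambda kv: kv[1]):
        if provider == "c" and budget_fraction < LOW_BUDGET_FRACTION:
            continue  # low-budget guard
        s = status[provider]
        if s > VIABLE_ABOVE or s == UNPROBED:
            return f"route_to_{provider}"
    return "shed_load"  # assumption: abstain when no provider qualifies


print(heuristic_sketch({"a": 0.5, "b": 0.5, "c": 0.5}, budget_fraction=1.0))
print(heuristic_sketch({"a": 0.2, "b": 0.4, "c": 0.9}, budget_fraction=0.05))
```

Treating `0.5` as viable makes the rule probe unknown providers instead of avoiding them, which matches the "unprobed/unknown" reading of that status value.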
+Oracle/debug upper-bound policy:
+- Defined in `budget_router/policies.py::debug_upper_bound_policy`.
+- Uses privileged `InternalState`, including true provider health and remaining budget.
+- It is validation-only and should never be presented as a deployable policy.
+- Its purpose is to prove there is headroom above the public-observation heuristic.
+LLM policy:
+- Defined in `inference.py::LLMRouter`.
+- Uses an OpenAI-compatible chat API.
+- Prompt requires exactly one action string.
+- Adds trend text, budget runway, task name, and optional previous-step feedback.
+- Applies `_apply_budget_safety_guard`, which vetoes actions that would immediately exhaust public remaining budget.
+- Parser fallback is `shed_load`; frequent fallback is a red flag, not a win.
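The parser fallback and the reason frequent fallbacks matter can be illustrated with a small sketch. It is not the parser in `inference.py`: the regular expression, the case-insensitive matching, and the names `parse_action` and `fallback_rate` are assumptions made for this example.

```python
import re

VALID_ACTIONS = ("route_to_a", "route_to_b", "route_to_c", "shed_load")
_ACTION_RE = re.compile("|".join(VALID_ACTIONS))


def parse_action(raw: str) -> tuple[str, bool]:
    """Return (action, used_fallback) for one raw model reply."""
    match = _ACTION_RE.search(raw.lower())
    if match:
        return match.group(0), False
    # No valid action string found: fall back to shedding load.
    return "shed_load", True


def fallback_rate(raw_outputs: list[str]) -> float:
    """Fraction of replies that needed the fallback."""
    if not raw_outputs:
        return 0.0
    return sum(parse_action(r)[1] for r in raw_outputs) / len(raw_outputs)


replies = ["I choose route_to_b.", "shed_load", "???"]
print([parse_action(r) for r in replies])
print(fallback_rate(replies))
```

Tracking the fallback flag separately from the action keeps a deliberate `shed_load` distinguishable from a parse failure, which is the distinction the raw LLM audit is meant to expose.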
+PPO policy:
+- Training path: `train/train_ppo_hard_multi.py`.
+- Evaluation path: `train/eval_hard_multi.py`.
+- Trace path: `eval/trace_episode.py --policy ppo` (optional `--verbose` / `-v` for per-step lines during the run).
+- Gym wrapper: `train/gym_wrapper.py`.
+- Current headline PPO scope is `hard_multi`, not all tasks.
+Degenerate policies:
+- Defined in `budget_router/policies.py`.
+- Always A, always B, always C, always shed.
+- These are not competitors; they are exploit detectors.
+## What Counts as "Results Still Stand"
+The README claims are still credible only if the following all hold:
+1. Unit tests pass, especially grader semantics and seed-selection tests.
+2. `budget_router/validation.py` still shows non-triviality, oracle headroom, degenerate-policy resistance, heldout stability, no NaNs, and no >20-step episodes.
+3. Exact traces show plausible adaptation rather than abstention, parser fallback, or hidden-state leakage.
+4. LLM vs heuristic remains positive on paired heldout and fresh `hard_multi` seeds, not just dev.
+5. PPO evaluation through `train/eval_hard_multi.py` still beats heuristic on paired dev seeds if PPO claims are retained.
+6. Any material drift is reflected in `README.md`; do not preserve old claims if the source-backed commands contradict them.
+## Fast Failure Triage
+If unit tests fail:
+- Inspect `reward.py` first for grader regressions.
+- Inspect `environment.py` next for step history, budget exhaustion, termination, observation bounds, and degradation timing.
+- Inspect `tasks.py` if task difficulty or seed outcomes moved unexpectedly.
+If validation fails:
+- Compare random, heuristic, oracle, and degenerate rows.
+- If degenerate policies dominate, the grader or task economics are probably gameable.
+- If oracle has no headroom, the task is too easy or the oracle/health dynamics changed.
+- If heuristic is unstable across seed sets, check degradation jitter and stochastic success paths.
+If LLM results fail:
+- Confirm `MODEL_NAME`, endpoint, prompt mode, and credentials.
+- Run a one-seed trace with `--policy llm` (add `--verbose` so each step logs while waiting on the API).
+- Enable `LLM_LOG_RAW=1` for a small seed slice.
+- Check whether failures are reasoning failures, parser failures, safety-guard interventions, or API/model drift.
+If PPO results fail:
+- Confirm `trained_models/ppo_hard_multi_100k.zip` exists.
+- Run `eval/trace_episode.py --policy ppo` (optionally `--verbose`) on a winning and losing seed.
+- Check whether `train/gym_wrapper.py` observation/action mapping still matches `BudgetRouterEnv`.
+- Retrain only after source-level checks pass.
+## Minimum Evidence Bundle for a PR or Submission
+For a fast but serious evidence package, save outputs from:
+```bash
+uv run pytest budget_router/tests
+uv run python -m budget_router.validation
+uv run python eval/trace_episode.py \
+  --task hard_multi \
+  --seed 3 \
+  --policy heuristic \
+  --verbose \
+  --output-json outputs/evidence_trace_heuristic_hard_multi_seed3.json
+uv run python eval/eval_all.py \
+  --tasks hard_multi \
+  --policies heuristic --policies llm \
+  --seeds 10 \
+  --seed-set heldout \
+  --out-dir outputs/evidence_heldout_hard_multi
+uv run python eval/eval_all.py \
+  --tasks hard_multi \
+  --policies heuristic --policies llm \
+  --seed-values "200,201,202,203,204,205,206,207,208,209" \
+  --out-dir outputs/evidence_fresh_hard_multi
+uv sync --extra training
+uv run python train/eval_hard_multi.py
+```
+This bundle covers correctness, anti-gaming, environment validity, exact behavior, heldout/fresh LLM comparison, and PPO learnability. It is small enough to run before a merge, but broad enough to catch most ways the published claims could become false.

__init__.py ADDED Viewed

File without changes

app_gradio.py ADDED Viewed

	@@ -0,0 +1,408 @@

+"""
+Budget Router — Gradio Visualization Dashboard
+Run: python app_gradio.py  (launches on http://localhost:7860)
+"""
+from __future__ import annotations
+import math
+import time
+from typing import Dict, Optional, Tuple
+import gradio as gr
+from budget_router.environment import BudgetRouterEnv
+from budget_router.models import Action, ActionType
+from budget_router.tasks import TASK_PRESETS
+from gradio_ui.config import MAX_STEPS as _MAX_STEPS, POLICY_CHOICES, SCENARIOS
+from gradio_ui.policies import get_policy_runner
+from gradio_ui.renderers import (
+    _kpi_grid,
+    render_incident_timeline,
+    render_side_panel,
+    render_grader_plot,
+    MISSION_SCORE_HELP,
+    MISSION_SCORE_LABEL,
+    _GRADER_PENDING,
+    _PROVIDER_EMPTY,
+    render_history_table_compare,
+)
+from gradio_ui.state import fresh_side_state, _observation_to_dict, record_step
+from gradio_ui.theme import LIGHT_CSS, THEME
+MAX_STEPS = _MAX_STEPS
+# Compatibility: preserve module-level MAX_STEPS for callers.
+# ─── UI Build ─────────────────────────────────────────────────────────────────
+def build_app() -> gr.Blocks:
+    def _normalize_seed(seed: object, default: int = 42) -> int:
+        if seed is None:
+            return default
+        try:
+            val = float(seed)  # type: ignore[arg-type]
+        except Exception:
+            return default
+        if math.isnan(val) or math.isinf(val):
+            return default
+        try:
+            return int(val)
+        except Exception:
+            return default
+    with gr.Blocks(title="Budget Router — Policy Comparison", theme=THEME, css=LIGHT_CSS) as demo:
+        left_state = gr.State(fresh_side_state())
+        right_state = gr.State(fresh_side_state())
+        run_state = gr.State({"running": False, "scenario": "easy", "seed": 42, "step": 0})
+        gr.Markdown(
+            "# Budget Router — Policy Comparison\n"
+            "_Select 2 policies · start episode · step or finish episode · compare outcomes_"
+        )
+        with gr.Row():
+            with gr.Column(scale=1):
+                left_title = gr.Markdown("## Policy A")
+                left_policy = gr.Dropdown(choices=POLICY_CHOICES, value=None, label="Select policy")
+                left_status = gr.Textbox(label="Status", interactive=False, lines=2)
+                left_providers = gr.HTML(_PROVIDER_EMPTY())
+                left_budget = gr.HTML("")
+                left_kpis = gr.HTML(
+                    _kpi_grid(
+                        [
+                            ("Step", "—"),
+                            ("Last action", "—"),
+                            ("Latency (ms)", "—"),
+                            ("Budget remaining", "—"),
+                            ("Reward", "—"),
+                            ("Adaptation", "—"),
+                        ]
+                    )
+                )
+                left_badges = gr.HTML("")
+                left_summary = gr.HTML(
+                    _kpi_grid(
+                        [
+                            ("Failed %", "—"),
+                            ("SLA breach %", "—"),
+                            ("Avg latency (ms)", "—"),
+                        ]
+                    )
+                )
+            with gr.Column(scale=1):
+                right_title = gr.Markdown("## Policy B")
+                right_policy = gr.Dropdown(choices=POLICY_CHOICES, value=None, label="Select policy")
+                right_status = gr.Textbox(label="Status", interactive=False, lines=2)
+                right_providers = gr.HTML(_PROVIDER_EMPTY())
+                right_budget = gr.HTML("")
+                right_kpis = gr.HTML(
+                    _kpi_grid(
+                        [
+                            ("Step", "—"),
+                            ("Last action", "—"),
+                            ("Latency (ms)", "—"),
+                            ("Budget remaining", "—"),
+                            ("Reward", "—"),
+                            ("Adaptation", "—"),
+                        ]
+                    )
+                )
+                right_badges = gr.HTML("")
+                right_summary = gr.HTML(
+                    _kpi_grid(
+                        [
+                            ("Failed %", "—"),
+                            ("SLA breach %", "—"),
+                            ("Avg latency (ms)", "—"),
+                        ]
+                    )
+                )
+        with gr.Row():
+            with gr.Column(scale=2):
+                gr.Markdown("### Episode Controls")
+                scenario_sel = gr.Radio(SCENARIOS, value="easy", label="Scenario")
+                seed_inp = gr.Number(value=42, label="Seed", precision=0)
+                start_btn = gr.Button("▶ Start Episode", variant="primary", interactive=False)
+                with gr.Row():
+                    step_btn = gr.Button("→ Step", variant="secondary", interactive=False)
+                    fast_btn = gr.Button("⚡ Fast-forward", interactive=False)
+                    finish_btn = gr.Button("⏩ Finish Episode", interactive=False)
+        gr.Markdown(f"### {MISSION_SCORE_LABEL} (comparison)\n_{MISSION_SCORE_HELP}_")
+        grader_plot = gr.Plot()
+        with gr.Row(elem_classes=["episode-history-row"]):
+            with gr.Column(scale=1):
+                left_history_title = gr.Markdown("### Step History — Policy A")
+                left_history_tbl = gr.HTML(render_history_table_compare([]), elem_classes=["episode-history-table"])
+            with gr.Column(scale=1):
+                right_history_title = gr.Markdown("### Step History — Policy B")
+                right_history_tbl = gr.HTML(render_history_table_compare([]), elem_classes=["episode-history-table"])
+        with gr.Row():
+            with gr.Column(scale=1):
+                left_grade_title = gr.Markdown(f"### {MISSION_SCORE_LABEL} — Policy A")
+                left_grade = gr.HTML(_GRADER_PENDING())
+            with gr.Column(scale=1):
+                right_grade_title = gr.Markdown(f"### {MISSION_SCORE_LABEL} — Policy B")
+                right_grade = gr.HTML(_GRADER_PENDING())
+        gr.Markdown("### Incident Timeline")
+        incidents_html = gr.HTML(render_incident_timeline("easy"))
+        def _render_side(side: Dict, run: Dict, scenario_name: str) -> Tuple[str, str, str, str, str, str, str, str]:
+            return render_side_panel(side, run, scenario_name)
+        def _render_all(ls: Dict, rs: Dict, run: Dict) -> tuple:
+            scenario_name = str(run.get("scenario", "easy") or "easy")
+            l_out = _render_side(ls, run, scenario_name)
+            r_out = _render_side(rs, run, scenario_name)
+            plot = render_grader_plot(
+                ls.get("history", []) or [],
+                rs.get("history", []) or [],
+                left_name=str(ls.get("policy_name") or ""),
+                right_name=str(rs.get("policy_name") or ""),
+            )
+            incidents = render_incident_timeline(scenario_name)
+            running = bool(run.get("running", False))
+            btn_update = gr.update(interactive=running)
+            config_update = gr.update(interactive=(not running))
+            return (
+                ls,
+                rs,
+                run,
+                l_out[0],
+                l_out[1],
+                l_out[2],
+                l_out[3],
+                l_out[4],
+                l_out[5],
+                r_out[0],
+                r_out[1],
+                r_out[2],
+                r_out[3],
+                r_out[4],
+                r_out[5],
+                l_out[6],
+                r_out[6],
+                l_out[7],
+                r_out[7],
+                plot,
+                incidents,
+                config_update,
+                config_update,
+                config_update,
+                config_update,
+                config_update,
+                btn_update,
+                btn_update,
+                btn_update,
+            )
+        OUTPUTS = [
+            left_state,
+            right_state,
+            run_state,
+            left_status,
+            left_providers,
+            left_budget,
+            left_kpis,
+            left_badges,
+            left_summary,
+            right_status,
+            right_providers,
+            right_budget,
+            right_kpis,
+            right_badges,
+            right_summary,
+            left_history_tbl,
+            right_history_tbl,
+            left_grade,
+            right_grade,
+            grader_plot,
+            incidents_html,
+            left_policy,
+            right_policy,
+            scenario_sel,
+            seed_inp,
+            start_btn,
+            step_btn,
+            fast_btn,
+            finish_btn,
+        ]
+        GRADER_PLOT_IDX = OUTPUTS.index(grader_plot)
+        def _update_start_enabled(p1: Optional[str], p2: Optional[str], run: Dict):
+            left_name = str(p1 or "Policy A")
+            right_name = str(p2 or "Policy B")
+            running = bool((run or {}).get("running", False))
+            ok = (bool(p1) and bool(p2)) and (not running)
+            return (
+                gr.update(interactive=ok),
+                f"## {left_name}",
+                f"## {right_name}",
+                f"### Step History — {left_name}",
+                f"### Step History — {right_name}",
+                f"### {MISSION_SCORE_LABEL} — {left_name}",
+                f"### {MISSION_SCORE_LABEL} — {right_name}",
+            )
+        left_policy.change(
+            _update_start_enabled,
+            inputs=[left_policy, right_policy, run_state],
+            outputs=[start_btn, left_title, right_title, left_history_title, right_history_title, left_grade_title, right_grade_title],
+        )
+        right_policy.change(
+            _update_start_enabled,
+            inputs=[left_policy, right_policy, run_state],
+            outputs=[start_btn, left_title, right_title, left_history_title, right_history_title, left_grade_title, right_grade_title],
+        )
+        scenario_sel.change(lambda s: render_incident_timeline(s), inputs=[scenario_sel], outputs=[incidents_html])
+        def do_start(p1: str, p2: str, scenario: str, seed: Optional[float], _ls: Dict, _rs: Dict, _run: Dict):
+            ls = fresh_side_state()
+            rs = fresh_side_state()
+            seed_int = _normalize_seed(seed, default=42)
+            if not p1 or not p2:
+                run = {"running": False, "scenario": scenario, "seed": seed_int, "step": 0}
+                ls["status"] = "Select both policies to start."
+                rs["status"] = "Select both policies to start."
+                return _render_all(ls, rs, run)
+            runner_l, err_l = get_policy_runner(p1)
+            runner_r, err_r = get_policy_runner(p2)
+            if err_l or err_r or runner_l is None or runner_r is None:
+                ls["status"] = f"❌ {err_l}" if err_l else ""
+                rs["status"] = f"❌ {err_r}" if err_r else ""
+                run = {"running": False, "scenario": scenario, "seed": seed_int, "step": 0}
+                return _render_all(ls, rs, run)
+            env_l = BudgetRouterEnv()
+            env_r = BudgetRouterEnv()
+            obs_l = env_l.reset(seed=seed_int, scenario=scenario)
+            obs_r = env_r.reset(seed=seed_int, scenario=scenario)
+            # Best-effort reset: a failing runner reset must not block the episode from starting.
+            try:
+                runner_l.reset(scenario)
+            except Exception:
+                pass
+            try:
+                runner_r.reset(scenario)
+            except Exception:
+                pass
+            ls.update(
+                {
+                    "env": env_l,
+                    "policy_name": p1,
+                    "policy_runner": runner_l,
+                    "obs": _observation_to_dict(obs_l),
+                    "status": f"✅ Running · {p1}",
+                }
+            )
+            rs.update(
+                {
+                    "env": env_r,
+                    "policy_name": p2,
+                    "policy_runner": runner_r,
+                    "obs": _observation_to_dict(obs_r),
+                    "status": f"✅ Running · {p2}",
+                }
+            )
+            run = {"running": True, "scenario": scenario, "seed": seed_int, "step": 0}
+            return _render_all(ls, rs, run)
+        def _apply_local_step(side: Dict, scenario_name: str, global_step: int) -> Dict:
+            if side.get("done"):
+                return side
+            env = side.get("env")
+            runner = side.get("policy_runner")
+            if env is None or runner is None:
+                side["done"] = True
+                side["status"] = "❌ Not initialized"
+                return side
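+            try:
+                action_str = runner.choose_action(side.get("obs", {}) or {})
+                # Build the action inside the try block so an unknown action string
+                # is reported as a policy error instead of raising out of the UI callback.
+                action = Action(action_type=ActionType(action_str))
+            except Exception as exc:
+                side["done"] = True
+                side["status"] = f"❌ Policy error: {exc}"
+                return side
+            pre_obs = dict(side.get("obs", {}) or {})
+            obs_obj = env.step(action)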
+            obs = _observation_to_dict(obs_obj)
+            reward = float(obs.get("reward", 0.0) or 0.0)
+            meta = dict(obs.get("metadata", {}) or {})
+            done = bool(obs.get("done", False))
+            side["history"].append(record_step(global_step, action_str, obs, reward, meta, health_obs=pre_obs))
+            side["obs"] = obs
+            side["cumulative_reward"] = float(side.get("cumulative_reward", 0.0) or 0.0) + reward
+            side["done"] = done
+            side["status"] = "✅ Done" if done else str(side.get("status", ""))
+            return side
+        def do_step(ls: Dict, rs: Dict, run: Dict):
+            if not bool(run.get("running", False)):
+                return _render_all(ls, rs, run)
+            if int(run.get("step", 0) or 0) >= MAX_STEPS:
+                run["running"] = False
+                return _render_all(ls, rs, run)
+            next_step = int(run.get("step", 0) or 0) + 1
+            scenario = str(run.get("scenario", "easy") or "easy")
+            ls = _apply_local_step(ls, scenario, next_step)
+            rs = _apply_local_step(rs, scenario, next_step)
+            run["step"] = next_step
+            if next_step >= MAX_STEPS or (ls.get("done") and rs.get("done")):
+                run["running"] = False
+            return _render_all(ls, rs, run)
+        def _stream_to_end(ls: Dict, rs: Dict, run: Dict):
+            if not bool(run.get("running", False)):
+                yield _render_all(ls, rs, run)
+                return
+            frozen = _render_all(ls, rs, run)
+            frozen_grader_plot = frozen[GRADER_PLOT_IDX]
+            while bool(run.get("running", False)) and int(run.get("step", 0) or 0) < MAX_STEPS:
+                out = do_step(ls, rs, run)
+                ls, rs, run = out[0], out[1], out[2]
+                out_list = list(out)
+                out_list[GRADER_PLOT_IDX] = frozen_grader_plot
+                yield tuple(out_list)
+                time.sleep(0.12)
+                if not bool(run.get("running", False)):
+                    break
+            yield _render_all(ls, rs, run)
+        def do_fast_forward(ls: Dict, rs: Dict, run: Dict):
+            yield from _stream_to_end(ls, rs, run)
+        def do_finish(ls: Dict, rs: Dict, run: Dict):
+            yield from _stream_to_end(ls, rs, run)
+        start_btn.click(do_start, inputs=[left_policy, right_policy, scenario_sel, seed_inp, left_state, right_state, run_state], outputs=OUTPUTS)
+        step_btn.click(do_step, inputs=[left_state, right_state, run_state], outputs=OUTPUTS)
+        fast_btn.click(do_fast_forward, inputs=[left_state, right_state, run_state], outputs=OUTPUTS)
+        finish_btn.click(do_finish, inputs=[left_state, right_state, run_state], outputs=OUTPUTS)
+    return demo
+if __name__ == "__main__":
+    app = build_app()
+    app.queue()
+    app.launch(server_port=7860)

blog.md ADDED

	@@ -0,0 +1,257 @@

+# Budget Router: Teaching Agents to Survive Cascading API Failures Under Budget
+Production AI systems do not fail politely.
+An application may depend on several LLM or API providers, each with different cost, latency, and reliability profiles. One provider becomes flaky. Traffic shifts. The next fallback becomes overloaded or starts degrading. The system still has a budget, users still expect latency, and the router never sees the true internal health of the providers. It only sees noisy public signals: recent success rates, backlog, latency, and remaining budget.
+That is the problem Budget Router is built to study.
+Budget Router is an OpenEnv-compliant reinforcement learning environment where an agent routes each request to Provider A, B, C, or sheds load. A is cheap, B is moderate, C is reliable but expensive. The agent's job is not simply to pick the best provider now. It must preserve enough budget to survive what happens later.
+The interesting case is `Hard_Multi`: Provider A degrades from the beginning, and Provider B cascades later in the episode. This creates a two-phase incident. A naive router can look reasonable early and still fail late because it spent too much budget before the real cascade arrived.
+This is a small environment, but it captures a real infrastructure question:
+> Can an agent learn budget-aware reliability behavior under partial observability and non-stationary provider degradation?
+## TL;DR
+Budget Router is not a claim that a 20-step toy simulation is production routing. It is a compact, reproducible benchmark for a production-shaped failure mode: budgeted API routing under cascading degradation.
+On the headline `Hard_Multi` task, we compare three policy families:
+| Policy | What it is | Hard_Multi grader | Main takeaway |
+|---|---|---:|---|
+| Heuristic | Hand-coded reactive baseline | ~0.61 | A real baseline, but brittle under cascade failure |
+| Zero-shot LLM | Qwen2.5-72B with a deterministic budget guard | ~0.65 | In-context reasoning helps when observations are semantically meaningful |
+| PPO | Small SB3 MLP trained on the environment | ~0.69 | The reward signal is learnable and stronger than hand rules |
+```mermaid
+flowchart LR
+    H["Heuristic baseline<br/>0.61<br/>hand-coded rules"] --> L["Zero-shot LLM<br/>0.65<br/>Qwen2.5-72B + budget guard"]
+    L --> P["Trained PPO<br/>0.69<br/>SB3 MLP, 100k steps"]
+```
+We also ran post-training experiments beyond PPO:
+- SFT on Qwen2.5-1.5B via Hugging Face Jobs completed end-to-end, but did **not** beat the heuristic on the latest 10-seed evaluation: `0.577` vs `0.596`, with 3/10 wins.
+- GRPO was attempted, but did not converge reliably in our setup.
+- The negative result is useful: this environment rewards sequential credit assignment, probing, recovery, and budget conservation. Plain behavioral cloning can imitate action patterns without learning why those actions matter.
+![Budget Router evidence](figures/budget_router_evidence.png)
+*Figure: README evidence summary. The strongest claims are the three-policy ordering on `Hard_Multi`, heldout/fresh seed generalization for the LLM, and adaptation-score gains over the reactive heuristic.*
+## The Environment
+Budget Router exposes a simple action space:
+- `route_to_a`
+- `route_to_b`
+- `route_to_c`
+- `shed_load`
+The observation is intentionally public and partial. The policy sees:
+- rolling provider success estimates,
+- remaining budget,
+- queue backlog,
+- system latency,
+- episode progress.
+It does **not** see the true hidden provider health. This makes the problem a partially observable decision problem rather than a lookup table. The agent has to infer whether a provider is actually degrading or whether it just saw noise.
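A rolling success estimate of this kind can be sketched as follows. This is an illustration under stated assumptions, not the environment's own code: the class name `RollingSuccess` is hypothetical, the window of 5 mirrors the `window_size=5` set in `environment.py`, and returning a neutral `0.5` for a provider that has not been probed yet is an assumption.

```python
from collections import deque


class RollingSuccess:
    """Windowed success-rate estimate for one provider (hypothetical helper)."""

    def __init__(self, window_size: int = 5, prior: float = 0.5) -> None:
        # Only the most recent `window_size` outcomes are kept.
        self._outcomes = deque(maxlen=window_size)
        self._prior = prior

    def record(self, succeeded: bool) -> None:
        self._outcomes.append(1.0 if succeeded else 0.0)

    def estimate(self) -> float:
        # With no observations the estimate is only a prior, not evidence of health.
        if not self._outcomes:
            return self._prior
        return sum(self._outcomes) / len(self._outcomes)


tracker = RollingSuccess()
unprobed = tracker.estimate()
for outcome in (True, True, False, True, False, False):
    tracker.record(outcome)
after_six = tracker.estimate()
print(unprobed, after_six)
```

Because the window is short, a single failure moves the estimate by 0.2, which is why one noisy outcome can look like the start of a degradation.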
+The task suite escalates difficulty:
+| Task | Degradation pattern | Why it matters |
+|---|---|---|
+| `Easy` | No degradation | Budget-conservative rules are hard to beat |
+| `Medium` | A degrades after step 5 | Reactive switching begins to matter |
+| `Hard` | A degrades from step 0 | Early adaptation matters |
+| `Hard_Multi` | A degrades from step 0, B from step 10 | Cascade failure forces budget-aware anticipation |
+`Hard_Multi` is the core benchmark. If the router burns money on expensive fallbacks too early, it may have no budget left when B starts failing. If it stays cheap for too long, it loses success and SLA. If it sheds load too often, it avoids cost but fails the user.
+That is the point: there is no single dominant action.
+## The Grader
+The episode grader is a weighted score in `[0, 1]`:
+```text
+overall = 0.30 * success
+        + 0.20 * latency
+        + 0.15 * budget
+        + 0.15 * SLA
+        + 0.20 * adaptation
+```
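As a worked example of the weighted sum above, the sketch below combines five sub-scores into one overall score. The weights are taken from the formula; the function name `overall_score` and the sub-score values are hypothetical and chosen only for illustration, and the real grader in `budget_router/reward.py` may compute and clamp its sub-scores differently.

```python
from typing import Dict

# Weights copied from the grader formula; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "success": 0.30,
    "latency": 0.20,
    "budget": 0.15,
    "sla": 0.15,
    "adaptation": 0.20,
}


def overall_score(sub_scores: Dict[str, float]) -> float:
    """Weighted sum of sub-scores, each clamped to [0, 1] (illustrative only)."""
    missing = set(WEIGHTS) - set(sub_scores)
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    return sum(
        weight * min(max(float(sub_scores[name]), 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )


# Hypothetical episode: decent success and latency, weak adaptation.
example = {"success": 0.7, "latency": 0.8, "budget": 0.5, "sla": 0.6, "adaptation": 0.4}
score = overall_score(example)
print(f"{score:.3f}")
```

Since success and adaptation together carry half the weight, a policy that scores zero on both cannot exceed 0.50 regardless of how well it does on latency, budget, and SLA.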
+The grader is designed so that obvious reward hacks are unattractive:
+| Shortcut | Why it fails |
+|---|---|
+| Always route to C | Good latency, but expensive and budget-risky |
+| Always shed load | Avoids cost, but earns no success or adaptation |
+| Always use A | Cheap, but collapses once A degrades |
+| Switch only after failure | Too late in `Hard_Multi`, because budget and latency errors compound |
+This is best understood as a soft-constraint MDP. Budget and SLA pressure are real and measured, but they are encoded through reward terms rather than enforced through a full constrained-MDP Lagrangian. That distinction matters. The environment is honest about tradeoffs instead of pretending the constraint design is solved.
+## What Worked
+### 1. The heuristic is a real baseline, not a strawman
+The heuristic uses public observations and chooses the cheapest viable provider. It is budget-aware and reactive. On easy settings, this is exactly the kind of policy that should be strong.
+That is important for judge trust. If a learned policy only beats random or a broken baseline, the environment is not very informative. Budget Router's baseline is good enough to make improvement nontrivial, but limited enough that cascade failure exposes its weakness.
+On `Hard_Multi`, the heuristic reaches roughly `0.61`. It is not useless; it is just too reactive for a delayed cascade.
+### 2. Zero-shot LLM routing improves because the state is semantically meaningful
+The LLM policy is not trained on Budget Router. It receives structured observations with meaningful field names:
+```text
+provider_a_status: 0.42
+budget_remaining: 0.31
+queue_backlog: 0.20
+system_latency: 0.55
+step_count: 0.60
+```
+That matters. A language model can reason about "budget remaining," "provider status," and "latency" without gradient updates. The prompt also includes practical routing guidance: do not treat an unprobed `0.500` status as confirmed health, pay attention to trends, and avoid bankruptcy.
+The production-facing LLM policy includes a deterministic budget-safety guard. This is not hidden. It is a deliberate agentic-system pattern: use the model for high-level routing judgment, and use deterministic code for arithmetic-critical safety. Without this guard, raw LLM behavior can sometimes spend itself into the budget cliff.
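The guard pattern can be sketched as below. This is not the project's guard: the function name `guard_action`, the per-request costs, and the reserve rule are all assumptions made for illustration. The sketch only shows the division of labor, where the model proposes an action and deterministic code checks the arithmetic.

```python
from typing import Dict

# Hypothetical per-request costs, expressed as fractions of the initial budget.
COSTS: Dict[str, float] = {"route_to_a": 0.02, "route_to_b": 0.05, "route_to_c": 0.10}


def guard_action(
    proposed: str,
    budget_remaining: float,
    steps_left: int,
    reserve_per_step: float = 0.02,
) -> str:
    """Override a proposed action when it would eat into the budget reserve."""
    if proposed == "shed_load":
        return proposed  # Shedding costs nothing, so it is always allowed.
    if proposed not in COSTS:
        raise ValueError(f"unknown action: {proposed}")
    # Keep enough budget to afford the cheapest provider on every later step.
    reserve = reserve_per_step * max(steps_left - 1, 0)
    if COSTS[proposed] <= budget_remaining - reserve:
        return proposed
    # Otherwise downgrade to the cheapest provider, or shed if even that is unaffordable.
    cheapest = min(COSTS, key=lambda name: COSTS[name])
    if COSTS[cheapest] <= budget_remaining:
        return cheapest
    return "shed_load"


print(guard_action("route_to_c", budget_remaining=0.12, steps_left=5))
```

Keeping the check outside the model means a budget violation cannot depend on how the prompt was phrased or how the model sampled.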
+On the README's combined `Hard_Multi` evaluation, the LLM improves over the heuristic across dev, heldout, and fresh seed buckets. The important claim is not that the LLM is magical. The claim is that semantically self-describing environments let a foundation model bring useful priors to a new control problem.
+### 3. PPO proves the environment is learnable
+PPO is a small neural policy trained directly on environment interaction. It is not an LLM, and it is not the post-training story. Its role is scientific: if a small policy-gradient method can improve over the heuristic, the reward signal has enough structure to optimize.
+The PPO policy uses the same environment mechanics through a Gym wrapper. The wrapper converts OpenEnv-style typed observations into arrays for Stable-Baselines3, but PPO still routes through the same `BudgetRouterEnv.step()` dynamics and grader.
+On `Hard_Multi`, PPO reaches roughly `0.69` and beats the heuristic across the reported seeds. The adaptation sub-score is the clearest mechanism: PPO learns to preserve budget early and route more effectively when the cascade arrives.
+The honest limitation is that PPO sees `step_count`. In a fixed 20-step task, it may learn a schedule keyed partly to the clock: switch away from A early, prepare for B around step 10. That is still useful environment-validation evidence, but it is not the same as proving open-ended reactive reasoning. The LLM result is the stronger evidence for in-context reactive use of semantic observations.
+## What Did Not Work
+The post-training experiments are just as important as the wins.
+### SFT: the pipeline worked, the policy did not improve enough
+We built a full supervised fine-tuning pipeline:
+1. Generate trajectories from a stronger teacher policy.
+2. Convert observations and actions into chat-style training examples.
+3. Push the dataset to Hugging Face.
+4. Train a LoRA adapter on `Qwen/Qwen2.5-1.5B-Instruct` using Hugging Face Jobs.
+5. Merge and push the model.
+6. Evaluate against the heuristic baseline.
+The operational pipeline worked. The HF Jobs flow trained and evaluated the model on GPU infrastructure. This matters for reproducibility: the fine-tuning path is not a sketch; it is runnable through `generate_sft_data.py`, `train_sft.py`, `eval_sft.py`, and `scripts/submit_sft_hf_jobs.sh`.
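Steps 1 and 2 of the list above can be sketched as follows. The helper names `top_fraction` and `to_chat_example`, the episode dictionary layout, and the prompt format are assumptions for illustration; the actual logic lives in `generate_sft_data.py` and may differ.

```python
import math
from typing import Dict, List


def top_fraction(episodes: List[dict], fraction: float) -> List[dict]:
    """Keep the best-scoring share of teacher episodes (hypothetical filter)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, math.ceil(len(episodes) * fraction))
    ranked = sorted(episodes, key=lambda ep: ep["score"], reverse=True)
    return ranked[:keep]


def to_chat_example(observation: Dict[str, float], action: str) -> dict:
    """Turn one (observation, action) pair into a chat-style training record."""
    obs_text = "\n".join(f"{name}: {value:.3f}" for name, value in observation.items())
    return {
        "messages": [
            {"role": "user", "content": obs_text},
            {"role": "assistant", "content": action},
        ]
    }


# Ten fake episodes with scores 0.0, 0.1, ..., 0.9; keep the top 30%.
episodes = [{"id": i, "score": i / 10} for i in range(10)]
kept = top_fraction(episodes, 0.30)
record = to_chat_example(
    {"provider_a_status": 0.42, "budget_remaining": 0.31}, "route_to_b"
)
print([ep["id"] for ep in kept], record["messages"][1]["content"])
```

Filtering by episode score keeps only trajectories that ended well, which is also why the resulting dataset contains no examples of what a bad decision costs.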
+But the latest SFT evaluation did not beat the heuristic. On 10 `Hard_Multi` seeds, SFT scored `0.577` vs heuristic `0.596`, winning 3/10 seeds.
+That is not a result to hide. It is the most useful negative result in the project.
+The likely reason is that behavioral cloning sees only good-looking actions, not the counterfactuals. It can learn "route to B often" or "avoid C when budget is low," but it does not directly learn why a near-miss action is bad, how budget errors compound, or when probing is worth the short-term risk.
+In Budget Router, the objective is episodic. One bad switch can erase a good early trajectory. A static label does not carry the full consequence of that decision.
+### GRPO: promising direction, not a successful result yet
+We also attempted GRPO-style reward optimization for an LLM policy. That is the more natural post-training direction for an OpenEnv agent, because the model can interact with the environment and receive reward from actual consequences.
+In our current run, GRPO did not produce a reliable improvement. Our pitch notes record a downward reward trend, weak rollout quality, and mode collapse in the attempted setup. The practical lesson is that GRPO needs more than a valid environment wrapper. It needs enough reward variance, enough model capacity, stable rollouts, and careful exploration.
+So the honest conclusion is:
+> PPO shows the environment is learnable. Zero-shot LLM shows semantic observations are useful. SFT shows imitation alone is not enough. GRPO remains the right research direction, but not a claimed win in this submission.
+## Why This Is Still a Strong Result
+The strongest version of Budget Router is not "we found one trick that wins." It is this:
+```mermaid
+flowchart TD
+    E["OpenEnv environment<br/>partial observability + cascade failure"] --> G["Five-part grader<br/>success, latency, budget, SLA, adaptation"]
+    G --> B["Heuristic baseline<br/>cheap reactive policy"]
+    G --> L["Zero-shot LLM<br/>semantic reasoning + budget guard"]
+    G --> P["PPO<br/>reward-aware optimization"]
+    P --> S["SFT/GRPO attempts<br/>negative results and future direction"]
+```
+Budget Router has the properties a useful post-training environment should have:
+| Property | Evidence |
+|---|---|
+| Non-trivial | Heuristic beats random but leaves headroom; oracle gap is largest on `Hard_Multi` |
+| Learnable | PPO improves over heuristic on the hardest task |
+| Semantically agentic | Zero-shot LLM improves because observations are meaningful |
+| Not trivially gameable | Always-shed and always-expensive policies are penalized |
+| Reproducible | README and `REPRODUCIBILITY.md` describe seed buckets, traces, saved JSON, and command paths |
+| Honest | SFT and GRPO attempts are reported without overstating them |
+That combination is rare in hackathon environments. Many environments are easy to demo but hard to falsify. Budget Router is designed to be falsified: run the seeds, inspect the traces, compare sub-scores, and check whether improvement comes from adaptation rather than a loophole.
+## Reproducibility
+The repo is structured so judges can inspect both aggregate results and exact behavior.
+Key artifacts:
+- `README.md`: headline benchmark tables and evidence figure.
+- `REPRODUCIBILITY.md`: command checklist and falsification guide.
+- `eval/eval_all.py`: heuristic vs LLM evaluation across task and seed buckets.
+- `eval/trace_episode.py`: step-by-step episode traces.
+- `train/eval_hard_multi.py`: PPO evaluation path.
+- `generate_sft_data.py`: SFT dataset generation from teacher trajectories.
+- `train_sft.py`: LoRA SFT training script for Hugging Face Jobs.
+- `eval_sft.py`: SFT model evaluation against the heuristic.
+- `scripts/submit_sft_hf_jobs.sh`: orchestration for data, training, and evaluation jobs.
+For the SFT pipeline, the intended run looks like:
+```bash
+export TEACHER_POLICY=ppo
+export HF_JOB_FLAVOR=a10g-large
+export HF_JOB_NAMESPACE=akshay4
+export DATASET_REPO=akshay4/budget-router-sft-data
+export OUTPUT_REPO=akshay4/budget-router-sft-qwen1.5b
+export SFT_MODEL_REPO=$OUTPUT_REPO
+export SFT_N_EPISODES=100
+export SFT_TOP_FRACTION=0.30
+export NUM_EPOCHS=3
+export N_SEEDS=10
+./scripts/submit_sft_hf_jobs.sh
+```
+The important point is not that this SFT model won. It did not. The important point is that the environment can produce training data, launch model training, push artifacts, and evaluate the resulting policy. That closes the environment-to-training-to-evaluation loop, even when the experimental result is negative.
+## The Research Lesson
+Budget Router is a reminder that post-training methods should match the task.
+For static classification, supervised fine-tuning may be enough. For sequential decision-making under budget constraints, static imitation is often too weak. The agent needs to learn from consequences: what happens after a risky fallback, what happens when it fails to probe, what happens when it saves budget early, and what happens when it arrives at the cascade with no runway left.
+That is why PPO worked better than SFT here. PPO receives feedback from the environment. It optimizes the episode objective directly. The zero-shot LLM also performs well because it brings external priors about risk, cost, and reliability to a semantically described state.
+The next research step is not to pretend SFT solved the problem. It is to use SFT as a warm start or distillation layer, then apply environment-aware RL with better rollout diversity and reward normalization.
+## Conclusion
+Budget Router is an incident-commander environment for budgeted API reliability. It asks a simple question with real consequences:
+> When providers degrade and budget is running out, can an agent adapt before the cascade breaks the system?
+The answer from our experiments is nuanced:
+- hand-coded rules are strong but brittle,
+- zero-shot LLM reasoning helps when the observation schema is meaningful,
+- PPO confirms the environment has a learnable reward signal,
+- SFT and GRPO are not claimed wins, but they reveal where the hard part actually is.
+That is the story we think is worth submitting: a reproducible environment, a real baseline, measurable improvement, and enough intellectual honesty that the failures make the benchmark more credible rather than less.

budget_router/__init__.py ADDED

	@@ -0,0 +1,17 @@

+"""Budget Router Environment - package init."""
+from .environment import BudgetRouterEnv
+from .models import Action, ActionType, EnvState, Observation, TaskConfig
+from .tasks import EASY, HARD, MEDIUM
+__all__ = [
+    "BudgetRouterEnv",
+    "Action",
+    "ActionType",
+    "Observation",
+    "EnvState",
+    "TaskConfig",
+    "EASY",
+    "MEDIUM",
+    "HARD",
+]

budget_router/client.py ADDED

	@@ -0,0 +1,29 @@

+from dataclasses import asdict
+from typing import Any, Dict
+from openenv_core import HTTPEnvClient
+from openenv_core.client_types import StepResult
+from .models import Action, EnvState, Observation
+class BudgetRouterClient(HTTPEnvClient[Action, Observation]):
+    def _step_payload(self, action: Action) -> Dict[str, Any]:
+        return asdict(action)
+    def _parse_result(self, payload: Dict[str, Any]) -> StepResult[Observation]:
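+        # Copy so the caller's payload is not mutated by the pops below.
+        observation_payload = dict(payload.get("observation", payload))
+        # Remove fields that are passed explicitly; leaving them in would make
+        # Observation(**observation_payload, done=...) raise a duplicate-keyword TypeError.
+        obs_done = observation_payload.pop("done", False)
+        obs_reward = observation_payload.pop("reward", None)
+        obs_metadata = observation_payload.pop("metadata", payload.get("metadata", {}))
+        observation = Observation(
+            **observation_payload,
+            done=payload.get("done", obs_done),
+            reward=payload.get("reward", obs_reward),
+            metadata=obs_metadata,
+        )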
+        return StepResult(
+            observation=observation,
+            reward=observation.reward,
+            done=observation.done,
+        )
+    def _parse_state(self, payload: Dict[str, Any]) -> EnvState:
+        return EnvState(**payload)

budget_router/environment.py ADDED

	@@ -0,0 +1,515 @@

+"""
+Budget Router Environment — Core RL environment.
+Extends openenv-core Environment base class with the standard
+reset(), step(), state interface. Processes one request per step
+through 3 providers under budget, latency, reliability, and
+degradation constraints.
+"""
+from __future__ import annotations
+import json
+import math
+import random
+import uuid
+from typing import Any, Dict, Optional, Tuple
+from openenv_core.env_server import Environment
+from openenv_core.env_server.types import Action as OpenEnvAction
+from .models import (
+    Action,
+    ActionType,
+    EnvState,
+    InternalState,
+    Observation,
+    ProviderState,
+    TaskConfig,
+)
+from .reward import grade_episode, step_reward
+from .tasks import EASY
+BACKLOG_LATENCY_PER_ITEM_MS = 8.0
+def _reported_score(value: float) -> float:
+    return min(max(float(value), 0.001), 0.999)
+class BudgetRouterEnv(Environment):
+    """
+    Incident Commander for Budgeted Tool/API Reliability.
+    An agent routes incoming requests to one of 3 providers (A, B, C)
+    or sheds load, under budget, latency, and reliability constraints.
+    Extends OpenEnv Environment base class with proper type parameters.
+    Interface:
+        reset(seed, scenario) -> Observation
+        step(action) -> Observation  (reward in obs.reward, done in obs.done)
+        state -> EnvState
+    """
+    def __init__(self, emit_structured_logs: bool = False) -> None:
+        super().__init__()
+        self._internal: InternalState = InternalState()
+        self._config: TaskConfig = EASY
+        self._rng: random.Random = random.Random()
+        self._episode_id: str = ""
+        self._cumulative_reward: float = 0.0
+        self._emit_structured_logs = emit_structured_logs
+        self._episode_number = 0
+        self._current_seed: Optional[int] = None
+    def _emit_log(self, prefix: str, payload: Dict[str, Any]) -> None:
+        if self._emit_structured_logs:
+            print(f"{prefix} {json.dumps(payload)}", flush=True)
+    def _observation_payload(self, observation: Observation) -> Dict[str, float]:
+        return {
+            "provider_a_status": float(observation.provider_a_status),
+            "provider_b_status": float(observation.provider_b_status),
+            "provider_c_status": float(observation.provider_c_status),
+            "budget_remaining": float(observation.budget_remaining),
+            "queue_backlog": float(observation.queue_backlog),
+            "system_latency": float(observation.system_latency),
+            "step_count": float(observation.step_count),
+        }
+    # ─── OpenEnv interface ──────────────────────────────────────────────
+    def reset(
+        self,
+        seed: Optional[int] = None,
+        episode_id: Optional[str] = None,
+        scenario: Optional[TaskConfig] = None,
+        **kwargs: Any,
+    ) -> Observation:
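+        """Reset the environment to its initial state.
+        `scenario` may be a TaskConfig or the name of a preset in TASK_PRESETS;
+        unknown names fall back to EASY.
+        """
+        # `scenario` is a named parameter, so it can never also arrive via **kwargs.
+        config = scenario or EASY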
+        if isinstance(config, str):
+            from .tasks import TASK_PRESETS
+            config = TASK_PRESETS.get(config, EASY)
+        self._config = config
+        # Seed the RNG
+        if seed is not None:
+            self._rng = random.Random(seed)
+        else:
+            self._rng = random.Random()
+        self._episode_id = episode_id or str(uuid.uuid4())
+        self._episode_number += 1
+        self._current_seed = seed
+        self._cumulative_reward = 0.0
+        # Initialize providers
+        providers = {
+            "A": ProviderState(
+                name="A",
+                base_reliability=config.reliability_a,
+                current_health=config.reliability_a,
+                cost_per_request=config.cost_a,
+                base_latency_ms=config.latency_a,
+            ),
+            "B": ProviderState(
+                name="B",
+                base_reliability=config.reliability_b,
+                current_health=config.reliability_b,
+                cost_per_request=config.cost_b,
+                base_latency_ms=config.latency_b,
+            ),
+            "C": ProviderState(
+                name="C",
+                base_reliability=config.reliability_c,
+                current_health=config.reliability_c,
+                cost_per_request=config.cost_c,
+                base_latency_ms=config.latency_c,
+            ),
+        }
+        # Resolve jittered degradation onsets for this episode
+        _j1 = (self._rng.randint(-config.degradation_start_jitter,
+                                  config.degradation_start_jitter)
+               if config.degradation_start_jitter > 0 else 0)
+        _j2 = (self._rng.randint(-config.secondary_degradation_start_jitter,
+                                  config.secondary_degradation_start_jitter)
+               if config.secondary_degradation_start_jitter > 0 else 0)
+        _actual_primary = max(0, config.degradation_start_step + _j1)
+        _actual_secondary = max(0, config.secondary_degradation_start_step + _j2)
+        self._internal = InternalState(
+            providers=providers,
+            budget_dollars=config.initial_budget,
+            initial_budget_dollars=config.initial_budget,
+            queue_backlog_count=0,
+            max_queue_backlog=config.max_queue_backlog,
+            last_latency_ms=config.latency_a,  # initial non-zero latency
+            sla_ceiling_ms=config.sla_ceiling_ms,
+            current_step=0,
+            max_steps=config.max_steps,
+            episode_done=False,
+            history=[],
+            provider_window={"A": [], "B": [], "C": []},
+            window_size=5,
+            actual_degradation_start=_actual_primary,
+            actual_secondary_degradation_start=_actual_secondary,
+        )
+        observation = self._get_obs()
+        self._emit_log(
+            "[START]",
+            {
+                "task": self._config.name,
+                "seed": int(seed) if seed is not None else -1,
+                "episode": self._episode_number,
+            },
+        )
+        return observation
+    def step(
+        self,
+        action: OpenEnvAction,
+        timeout_s: Optional[float] = None,
+        **kwargs: Any,
+    ) -> Observation:
+        """
+        Execute one step: route a request or shed load.
+        Returns:
+            Observation with reward set, done flag, and metadata dict.
+        """
+        if self._internal.episode_done:
+            # Already done — return terminal observation
+            obs = self._get_obs()
+            obs.done = True
+            obs.reward = 0.0
+            return obs
+        if not isinstance(action, Action):
+            action = Action(
+                action_type=getattr(action, "action_type"),
+                metadata=getattr(action, "metadata", {}),
+            )
+        if not self._internal.providers:
+            self.reset(seed=self._current_seed, scenario=self._config)
+        self._internal.current_step += 1
+        action_type = action.action_type.value
+        # ── Apply degradation BEFORE processing the request ──
+        self._degrade()
+        # ── Process the action ──
+        step_info: Dict[str, Any] = {
+            "step": self._internal.current_step,
+            "action_type": action_type,
+            "sla_ceiling_ms": self._config.sla_ceiling_ms,
+            "initial_budget": self._internal.initial_budget_dollars,
+            "degradation_start_step": self._internal.actual_degradation_start,
+            "secondary_degradation_start_step": (self._internal.actual_secondary_degradation_start
+                                                  if self._config.secondary_degradation_target else None),
+        }
+        if action_type == "shed_load":
+            # Shed load: no routing, flat penalty
+            reward = step_reward(
+                action_type="shed_load",
+                request_succeeded=False,
+                provider_cost=0.0,
+                initial_budget=self._internal.initial_budget_dollars,
+                latency_ms=0.0,
+                sla_ceiling_ms=self._config.sla_ceiling_ms,
+            )
+            # Queue pressure decreases slightly when shedding
+            self._internal.queue_backlog_count = max(
+                0, self._internal.queue_backlog_count - 1
+            )
+            # Latency set to 0 for shed (no request processed)
+            self._internal.last_latency_ms = 0.0
+            step_info.update(
+                {
+                    "request_succeeded": False,
+                    "cost": 0.0,
+                    "latency_ms": 0.0,
+                    "reward": reward,
+                    "provider": None,
+                    "queue_overflow": False,
+                }
+            )
+        else:
+            # Route to a provider
+            provider_name = {"route_to_a": "A", "route_to_b": "B", "route_to_c": "C"}[
+                action_type
+            ]
+            provider = self._internal.providers[provider_name]
+            self._internal.probed_providers.add(provider_name)
+            # Deduct cost
+            cost = provider.cost_per_request
+            self._internal.budget_dollars -= cost
+            # Check budget exhaustion
+            if self._internal.budget_dollars <= 0:
+                self._internal.budget_dollars = max(0.0, self._internal.budget_dollars)
+                # Terminal penalty
+                reward = -10.0
+                self._internal.episode_done = True
+                self._internal.last_latency_ms = 0.0
+                step_info.update(
+                    {
+                        "request_succeeded": False,
+                        "cost": cost,
+                        "latency_ms": 0.0,
+                        "reward": reward,
+                        "provider": provider_name,
+                        "queue_overflow": False,
+                        "budget_exhausted": True,
+                    }
+                )
+                self._internal.history.append(step_info)
+                self._cumulative_reward += reward
+                obs = self._get_obs()
+                obs.done = True
+                obs.reward = reward
+                obs.metadata = step_info
+                self._emit_log(
+                    "[STEP]",
+                    {
+                        "step": self._internal.current_step,
+                        "action": action_type,
+                        "reward": float(reward),
+                        "done": bool(obs.done),
+                        "observation": self._observation_payload(obs),
+                    },
+                )
+                self._emit_log(
+                    "[END]",
+                    {
+                        "task": self._config.name,
+                        "seed": int(self._current_seed) if self._current_seed is not None else -1,
+                        "episode": self._episode_number,
+                        "total_reward": round(float(self._cumulative_reward), 4),
+                        "score": _reported_score(float(grade_episode(self._internal.history)["overall_score"])),
+                    },
+                )
+                return obs
+            # Determine if request succeeds (based on current_health)
+            request_succeeded = self._rng.random() < provider.current_health
+            provider.total_requests += 1
+            # Update windowed tracking
+            window = self._internal.provider_window[provider_name]
+            window.append(request_succeeded)
+            if len(window) > self._internal.window_size:
+                window.pop(0)
+            if request_succeeded:
+                provider.successful_requests += 1
+            # Compute latency
+            base_lat = provider.base_latency_ms
+            noise = self._rng.gauss(0, self._config.latency_noise_std)
+            # Queue backlog amplifies latency multiplicatively.
+            # At max backlog (norm=1.0), latency increases by 50%.
+            # This makes queue_backlog a causally relevant observation
+            # by indirectly coupling it to reward via SLA breaches.
+            queue_norm = (
+                self._internal.queue_backlog_count / self._internal.max_queue_backlog
+                if self._internal.max_queue_backlog > 0 else 0.0
+            )
+            backlog_amplifier = 1.0 + 0.5 * queue_norm
+            # Failed requests have higher latency (timeout-like behavior)
+            if not request_succeeded:
+                actual_latency = (base_lat + abs(noise) + 200.0) * backlog_amplifier
+            else:
+                actual_latency = max(10.0, (base_lat + noise) * backlog_amplifier)
+            self._internal.last_latency_ms = actual_latency
+            # Queue backlog: failures increase pressure
+            queue_overflow = False
+            if not request_succeeded:
+                self._internal.queue_backlog_count = min(
+                    self._internal.max_queue_backlog,
+                    self._internal.queue_backlog_count + 2,
+                )
+                if (
+                    self._internal.queue_backlog_count
+                    >= self._internal.max_queue_backlog
+                ):
+                    queue_overflow = True
+            else:
+                # Successful request drains queue slightly
+                self._internal.queue_backlog_count = max(
+                    0, self._internal.queue_backlog_count - 1
+                )
+            # Compute reward
+            reward = step_reward(
+                action_type=action_type,
+                request_succeeded=request_succeeded,
+                provider_cost=cost,
+                initial_budget=self._internal.initial_budget_dollars,
+                latency_ms=actual_latency,
+                sla_ceiling_ms=self._config.sla_ceiling_ms,
+            )
+            step_info.update(
+                {
+                    "request_succeeded": request_succeeded,
+                    "cost": cost,
+                    "latency_ms": round(actual_latency, 2),
+                    "reward": reward,
+                    "provider": provider_name,
+                    "queue_overflow": queue_overflow,
+                }
+            )
+        # ── Record history ──
+        self._internal.history.append(step_info)
+        self._cumulative_reward += reward
+        # ── Check episode end ──
+        if self._internal.current_step >= self._internal.max_steps:
+            self._internal.episode_done = True
+        # ── Build observation ──
+        obs = self._get_obs()
+        obs.done = self._internal.episode_done
+        obs.reward = reward
+        obs.metadata = step_info
+        self._emit_log(
+            "[STEP]",
+            {
+                "step": self._internal.current_step,
+                "action": action_type,
+                "reward": float(reward),
+                "done": bool(obs.done),
+                "observation": self._observation_payload(obs),
+            },
+        )
+        if obs.done:
+            self._emit_log(
+                "[END]",
+                {
+                    "task": self._config.name,
+                    "seed": int(self._current_seed) if self._current_seed is not None else -1,
+                    "episode": self._episode_number,
+                    "total_reward": round(float(self._cumulative_reward), 4),
+                    "score": _reported_score(float(grade_episode(self._internal.history)["overall_score"])),
+                },
+            )
+        return obs
+    @property
+    def state(self) -> EnvState:
+        """OpenEnv-compatible state property."""
+        return EnvState(
+            episode_id=self._episode_id,
+            step_count=self._internal.current_step,
+            scenario_name=self._config.name,
+            is_done=self._internal.episode_done,
+        )
+    # ─── Internal methods ──────────────────────────────────────────────
+    def _get_obs(self) -> Observation:
+        """Convert internal state to normalized [0,1] observation."""
+        s = self._internal
+        # Provider status: 0.5 for unprobed (max uncertainty), windowed rate if probed
+        def _probed_status(name: str) -> float:
+            if name not in s.probed_providers:
+                return 0.5
+            return s.get_windowed_success_rate(name)
+        a_status = _probed_status("A")
+        b_status = _probed_status("B")
+        c_status = _probed_status("C")
+        # Budget: fraction remaining
+        if s.initial_budget_dollars > 0:
+            budget_frac = max(0.0, s.budget_dollars / s.initial_budget_dollars)
+        else:
+            budget_frac = 0.0
+        # Queue backlog: normalized
+        if s.max_queue_backlog > 0:
+            queue_norm = s.queue_backlog_count / s.max_queue_backlog
+        else:
+            queue_norm = 0.0
+        # Latency: normalized to SLA ceiling
+        if s.sla_ceiling_ms > 0:
+            latency_norm = s.last_latency_ms / s.sla_ceiling_ms
+        else:
+            latency_norm = 0.0
+        # Step progress
+        if s.max_steps > 0:
+            step_norm = s.current_step / s.max_steps
+        else:
+            step_norm = 0.0
+        return Observation(
+            provider_a_status=max(0.0, min(1.0, a_status)),
+            provider_b_status=max(0.0, min(1.0, b_status)),
+            provider_c_status=max(0.0, min(1.0, c_status)),
+            budget_remaining=max(0.0, min(1.0, budget_frac)),
+            queue_backlog=max(0.0, min(1.0, queue_norm)),
+            system_latency=max(0.0, min(1.0, latency_norm)),
+            step_count=max(0.0, min(1.0, step_norm)),
+        )
+    def _degrade(self) -> None:
+        """
+        Apply stochastic degradation to configured provider(s).
+        From actual_degradation_start onward (jittered per episode), the target
+        provider's health drops on every step by:
+        - degradation_rate from the TaskConfig
+        - plus a small Gaussian perturbation (std 0.02)
+        Health is floored at 0.05.
+        Supports secondary degradation for multi-provider scenarios.
+        """
+        config = self._config
+        step = self._internal.current_step
+        # Primary degradation
+        if step >= self._internal.actual_degradation_start:
+            target = config.degradation_target
+            provider = self._internal.providers.get(target)
+            if provider is not None:
+                noise = self._rng.gauss(0, 0.02)
+                health_reduction = config.degradation_rate + noise
+                provider.current_health = max(
+                    0.05,
+                    provider.current_health - health_reduction,
+                )
+        # Secondary degradation (for multi-provider scenarios)
+        if (
+            config.secondary_degradation_target
+            and step >= self._internal.actual_secondary_degradation_start
+        ):
+            target = config.secondary_degradation_target
+            provider = self._internal.providers.get(target)
+            if provider is not None:
+                noise = self._rng.gauss(0, 0.02)
+                health_reduction = config.secondary_degradation_rate + noise
+                provider.current_health = max(
+                    0.05,
+                    provider.current_health - health_reduction,
+                )
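
Review note: the latency model used in `step()` can be checked in isolation. The sketch below re-implements only the backlog amplifier and the failure surcharge as a standalone function. The name `amplified_latency` and its parameters are hypothetical and not part of the repository; the constants (0.5 amplification at full backlog, 200 ms failure surcharge, 10 ms floor) are copied from the diff above.

```python
def amplified_latency(
    base_latency_ms: float,
    noise_ms: float,
    backlog: int,
    max_backlog: int,
    succeeded: bool,
) -> float:
    """Hypothetical standalone version of the latency computation in step()."""
    # Normalise the backlog to [0, 1]; guard against a zero-sized queue.
    queue_norm = backlog / max_backlog if max_backlog > 0 else 0.0
    # At full backlog the latency is multiplied by 1.5.
    amplifier = 1.0 + 0.5 * queue_norm
    if not succeeded:
        # Failed requests behave like timeouts: noise only ever adds latency,
        # and a fixed 200 ms surcharge is applied before amplification.
        return (base_latency_ms + abs(noise_ms) + 200.0) * amplifier
    # Successful requests are floored at 10 ms.
    return max(10.0, (base_latency_ms + noise_ms) * amplifier)


if __name__ == "__main__":
    # A failed request to a 200 ms provider at full backlog, with zero noise.
    print(amplified_latency(200.0, 0.0, backlog=10, max_backlog=10, succeeded=False))
```

Under these assumptions, a failed request to a provider with a 200 ms base latency at full backlog lands above the default 500 ms SLA ceiling even with zero noise, which is how the queue backlog feeds back into the reward through SLA breaches.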

budget_router/models.py ADDED

	@@ -0,0 +1,212 @@

+from dataclasses import dataclass, field
+from enum import Enum
+from typing import Any, Dict, List, Literal, Optional
+from openenv_core.env_server.types import (
+    Action as BaseAction,
+    Observation as BaseObservation,
+    State as BaseState,
+)
+# =============================================================================
+# Action — extends OpenEnv Action
+# =============================================================================
+class ActionType(str, Enum):
+    """The four possible routing actions."""
+    ROUTE_TO_A = "route_to_a"
+    ROUTE_TO_B = "route_to_b"
+    ROUTE_TO_C = "route_to_c"
+    SHED_LOAD = "shed_load"
+@dataclass(kw_only=True)
+class Action(BaseAction):
+    """
+    Agent action: route a request to a provider or shed load.
+    Extends OpenEnv Action (which provides `metadata` field).
+    """
+    action_type: Literal["route_to_a", "route_to_b", "route_to_c", "shed_load"]
+    def __post_init__(self) -> None:
+        if isinstance(self.action_type, str):
+            self.action_type = ActionType(self.action_type)
+# =============================================================================
+# Observation — extends OpenEnv Observation
+# =============================================================================
+@dataclass(kw_only=True)
+class Observation(BaseObservation):
+    """
+    Agent-visible observation. ALL numeric fields are normalized to [0.0, 1.0].
+    Extends OpenEnv Observation (which provides `done`, `reward`, `metadata` fields).
+    """
+    # Provider health (windowed success rates; 0.5 until a provider is first routed to)
+    provider_a_status: float
+    provider_b_status: float
+    provider_c_status: float
+    # Resource state
+    budget_remaining: float
+    queue_backlog: float
+    system_latency: float
+    # Episode progress
+    step_count: float
+    def __post_init__(self) -> None:
+        for field_name in (
+            "provider_a_status",
+            "provider_b_status",
+            "provider_c_status",
+            "budget_remaining",
+            "queue_backlog",
+            "system_latency",
+            "step_count",
+        ):
+            setattr(self, field_name, max(0.0, min(1.0, getattr(self, field_name))))
+# =============================================================================
+# Internal State (raw units, for debugging / trace only)
+# =============================================================================
+@dataclass
+class ProviderState:
+    """Internal state of a single provider in raw units."""
+    name: str
+    base_reliability: float  # initial reliability [0, 1]
+    current_health: float  # current health [0, 1]
+    cost_per_request: float  # dollars
+    base_latency_ms: float  # base latency in ms
+    total_requests: int = 0
+    successful_requests: int = 0
+    @property
+    def observed_success_rate(self) -> float:
+        """Cumulative success rate over all requests to this provider (not windowed)."""
+        if self.total_requests == 0:
+            return self.base_reliability
+        return self.successful_requests / self.total_requests
+@dataclass
+class InternalState:
+    """
+    Full internal state in raw units. NOT exposed to the agent.
+    Used for manual trace, debugging, and the oracle policy.
+    """
+    providers: Dict[str, ProviderState] = field(default_factory=dict)
+    budget_dollars: float = 0.0
+    initial_budget_dollars: float = 0.0
+    queue_backlog_count: int = 0
+    max_queue_backlog: int = 10
+    last_latency_ms: float = 0.0
+    sla_ceiling_ms: float = 500.0
+    current_step: int = 0
+    max_steps: int = 20
+    episode_done: bool = False
+    history: List[Dict[str, Any]] = field(default_factory=list)
+    # Windowed success tracking (last N requests per provider)
+    provider_window: Dict[str, List[bool]] = field(default_factory=dict)
+    window_size: int = 5
+    # Probed providers: tracks which providers have been routed to at least once
+    probed_providers: set = field(default_factory=set)
+    # Resolved (jittered) degradation onsets for this episode
+    actual_degradation_start: int = 0
+    actual_secondary_degradation_start: int = 999
+    def get_windowed_success_rate(self, provider_name: str) -> float:
+        """Get success rate over the last `window_size` requests for a provider."""
+        window = self.provider_window.get(provider_name, [])
+        if not window:
+            return self.providers[provider_name].base_reliability
+        return sum(window) / len(window)
+# =============================================================================
+# Task Configuration
+# =============================================================================
+@dataclass
+class TaskConfig:
+    """
+    Configuration for a task scenario. Passed to reset(scenario=config).
+    NOT a subclass — just a data container.
+    """
+    name: str
+    description: str
+    # Budget
+    initial_budget: float = 5.0  # dollars
+    # Provider costs (per request, dollars)
+    cost_a: float = 0.01
+    cost_b: float = 0.05
+    cost_c: float = 0.10
+    # Provider base reliability
+    reliability_a: float = 0.70
+    reliability_b: float = 0.90
+    reliability_c: float = 0.99
+    # Provider base latency (ms)
+    latency_a: float = 100.0
+    latency_b: float = 150.0
+    latency_c: float = 200.0
+    # SLA
+    sla_ceiling_ms: float = 500.0
+    # Degradation config (primary)
+    degradation_start_step: int = 0  # step at which degradation begins
+    degradation_rate: float = 0.0  # health reduction per step for the degradation_target provider
+    degradation_target: str = "A"  # which provider degrades
+    degradation_start_jitter: int = 0  # ±jitter applied per episode to degradation_start_step
+    # Secondary degradation (for multi-provider scenarios)
+    secondary_degradation_start_step: int = 999  # 999 = no secondary degradation
+    secondary_degradation_rate: float = 0.0
+    secondary_degradation_target: str = ""  # empty = no secondary degradation
+    secondary_degradation_start_jitter: int = 0  # ±jitter applied per episode to secondary_degradation_start_step
+    # Episode
+    max_steps: int = 20
+    max_queue_backlog: int = 10
+    # Stochastic noise
+    latency_noise_std: float = 30.0  # ms std dev added to base latency
+# =============================================================================
+# OpenEnv State — extends BaseState
+# =============================================================================
+@dataclass
+class EnvState(BaseState):
+    """
+    OpenEnv-compatible state object returned by the `state` property.
+    Extends BaseState (which provides `episode_id`, `step_count` fields).
+    """
+    scenario_name: str = ""
+    is_done: bool = False
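
Review note: `InternalState` keeps the last `window_size` outcomes per provider, and the environment maintains that window with a list, `append`, and `pop(0)`. As an illustration only, the same behaviour can be sketched with `collections.deque(maxlen=...)`, which drops the oldest entry automatically. `SuccessWindow` and its `prior` argument are hypothetical names, with `prior` standing in for the `base_reliability` fallback used when the window is empty.

```python
from collections import deque


class SuccessWindow:
    """Hypothetical fixed-size window of request outcomes for one provider."""

    def __init__(self, size: int = 5, prior: float = 0.5) -> None:
        # deque(maxlen=size) discards the oldest outcome once the window is full.
        self._outcomes: deque = deque(maxlen=size)
        self._prior = prior

    def record(self, succeeded: bool) -> None:
        self._outcomes.append(succeeded)

    def rate(self) -> float:
        # Fall back to the prior until at least one outcome has been recorded.
        if not self._outcomes:
            return self._prior
        return sum(self._outcomes) / len(self._outcomes)


if __name__ == "__main__":
    window = SuccessWindow(size=5, prior=0.7)
    for outcome in (True, False, False, True, True, True):
        window.record(outcome)
    # Only the last five outcomes contribute to the rate.
    print(window.rate())
```

With a window of five, a single early success or failure stops influencing the observed status after five further requests to the same provider, so the observation tracks recent health rather than lifetime health.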

budget_router/policies.py ADDED

	@@ -0,0 +1,141 @@

+"""
+Policies for the Budget Router environment.
+7 policies:
+- random_policy: uniform random baseline (lower bound)
+- heuristic_baseline_policy: stateless cheapest-viable routing
+- debug_upper_bound_policy: oracle with internal state access (test only)
+- always_route_a_policy: degenerate (always cheapest)
+- always_route_b_policy: degenerate (always balanced fallback)
+- always_route_c_policy: degenerate (always most reliable)
+- always_shed_load_policy: degenerate (always shed)
+"""
+from __future__ import annotations
+import random as stdlib_random
+from typing import Optional
+from .models import Action, ActionType, InternalState, Observation
+from .reward import BUDGET_WEIGHT
+def random_policy(obs: Observation, rng: Optional[stdlib_random.Random] = None) -> Action:
+    """Uniform random over all 4 actions. No state awareness."""
+    r = rng or stdlib_random.Random()
+    choice = r.choice(list(ActionType))
+    return Action(action_type=choice)
+def heuristic_baseline_policy(obs: Observation) -> Action:
+    """
+    Stateless heuristic: prefer cheapest provider with status > threshold.
+    Fallback to next cheapest. shed_load only if ALL below threshold.
+    Budget-aware: when budget is critically low, only use the cheapest
+    viable provider or shed load to avoid the -10 budget exhaustion penalty.
+    No privileged information. Uses only what the agent can observe.
+    """
+    threshold = 0.52
+    # Providers ordered by cost (cheapest first): A, B, C
+    providers = [
+        ("route_to_a", obs.provider_a_status),
+        ("route_to_b", obs.provider_b_status),
+        ("route_to_c", obs.provider_c_status),
+    ]
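+    # Budget safety: when the budget is critically low, exclude the most
+    # expensive provider to avoid the -10.0 terminal budget exhaustion penalty.
+    # budget_remaining is a normalized fraction, so this triggers below 10% of
+    # the initial budget (not below $0.10).
+    if obs.budget_remaining < 0.10:
+        # Only consider A ($0.01 by default) and B ($0.05 by default); skip C
+        for action_name, status in providers[:2]:
+            # status == 0.5 is the value the environment reports for an unprobed
+            # provider (a probed provider whose windowed rate is exactly 0.5 also matches)
+            if status > threshold or status == 0.5: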
+                return Action(action_type=ActionType(action_name))
+        return Action(action_type=ActionType.SHED_LOAD)
+    for action_name, status in providers:
+        if status > threshold or status == 0.5:
+            return Action(action_type=ActionType(action_name))
+    # All providers below threshold → shed load
+    return Action(action_type=ActionType.SHED_LOAD)
+def debug_upper_bound_policy(obs: Observation, internal_state: InternalState) -> Action:
+    """
+    Oracle policy with access to true internal health values.
+    Used ONLY for debugging and validation — NOT a fair benchmark.
+    Strategy: expected-value routing using true health, with hard budget
+    feasibility constraint. Routes to the cheapest provider whose health
+    is high enough, but won't pick an expensive provider if it would
+    exhaust the budget.
+    """
+    initial_budget = internal_state.initial_budget_dollars
+    if initial_budget <= 0:
+        initial_budget = 1.0
+    budget_dollars = internal_state.budget_dollars
+    remaining_steps = max(1, internal_state.max_steps - internal_state.current_step)
+    providers_info = [
+        ("route_to_a", internal_state.providers["A"].current_health,
+         internal_state.providers["A"].cost_per_request),
+        ("route_to_b", internal_state.providers["B"].current_health,
+         internal_state.providers["B"].cost_per_request),
+        ("route_to_c", internal_state.providers["C"].current_health,
+         internal_state.providers["C"].cost_per_request),
+    ]
+    best_action = None
+    best_ev = float("-inf")
+    for action_name, health, cost in providers_info:
+        # Hard feasibility: can we afford this provider for all remaining steps?
+        # The environment treats a balance of exactly 0 as exhausted (-10 penalty),
+        # so a provider that would spend the budget down to 0 is skipped as well.
+        if cost * remaining_steps >= budget_dollars:
+            continue
+        # Expected per-step reward matching reward.py:
+        # P(success) * 1.0 + P(fail) * -2.0 - (cost/initial_budget) * BUDGET_WEIGHT
+        ev = health * 1.0 + (1.0 - health) * (-2.0) - (cost / initial_budget) * BUDGET_WEIGHT
+        if ev > best_ev:
+            best_ev = ev
+            best_action = action_name
+    if best_action is None:
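+        # No provider is affordable for every remaining step: fall back to the
+        # best-EV provider that can be paid once while leaving a positive balance
+        # (the environment treats a balance of exactly 0 as exhausted).
+        for action_name, health, cost in providers_info:
+            if cost < budget_dollars: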
+                ev = health * 1.0 + (1.0 - health) * (-2.0) - (cost / initial_budget) * BUDGET_WEIGHT
+                if ev > best_ev:
+                    best_ev = ev
+                    best_action = action_name
+    if best_action is None or best_ev < -0.5:
+        return Action(action_type=ActionType.SHED_LOAD)
+    return Action(action_type=ActionType(best_action))
+def always_route_a_policy(obs: Observation) -> Action:
+    """Degenerate: always route to cheapest provider A."""
+    return Action(action_type=ActionType.ROUTE_TO_A)
+def always_route_b_policy(obs: Observation) -> Action:
+    """Degenerate: always route to balanced provider B."""
+    return Action(action_type=ActionType.ROUTE_TO_B)
+def always_route_c_policy(obs: Observation) -> Action:
+    """Degenerate: always route to most expensive/reliable provider C."""
+    return Action(action_type=ActionType.ROUTE_TO_C)
+def always_shed_load_policy(obs: Observation) -> Action:
+    """Degenerate: always shed load (never routes)."""
+    return Action(action_type=ActionType.SHED_LOAD)
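
Review note: the expected-value expression used by `debug_upper_bound_policy` can be worked through by hand. The sketch below evaluates it for the default `TaskConfig` values and solves for the health at which routing is worth exactly the shed penalty. `expected_step_reward` and `break_even_health` are hypothetical helper names, `BUDGET_WEIGHT` is redefined locally so the block runs on its own, and the latency-breach term is ignored, as it is in the policy's own expression.

```python
BUDGET_WEIGHT = 5.0  # copied from reward.py so this sketch is self-contained


def expected_step_reward(health: float, cost: float, initial_budget: float) -> float:
    """Expected routing reward, ignoring the latency-breach term."""
    return health * 1.0 + (1.0 - health) * (-2.0) - (cost / initial_budget) * BUDGET_WEIGHT


def break_even_health(cost: float, initial_budget: float, shed_penalty: float = -0.5) -> float:
    """Health at which routing is worth exactly the shed penalty.

    Solves 3*h - 2 - cost_term = shed_penalty for h.
    """
    cost_term = (cost / initial_budget) * BUDGET_WEIGHT
    return (shed_penalty + 2.0 + cost_term) / 3.0


if __name__ == "__main__":
    # Default TaskConfig values: (reliability, cost per request), budget of 5.0.
    defaults = {"A": (0.70, 0.01), "B": (0.90, 0.05), "C": (0.99, 0.10)}
    for name, (health, cost) in defaults.items():
        ev = expected_step_reward(health, cost, initial_budget=5.0)
        h_star = break_even_health(cost, initial_budget=5.0)
        print(f"{name}: EV={ev:.2f}, break-even health={h_star:.3f}")
```

With the defaults the expected values come out at roughly 0.09, 0.65, and 0.87 for A, B, and C, so at full health the oracle favours C whenever the budget allows it. The break-even health sits slightly above 0.5 for all three providers, which is in the same range as the 0.52 threshold used by `heuristic_baseline_policy`; whether that threshold was derived this way is not stated in the source.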

budget_router/reward.py ADDED

	@@ -0,0 +1,281 @@

+"""
+Reward computation for the Budget Router environment.
+Per-step reward (at most 3 additive terms for routing actions) and episode-level grader metrics.
+"""
+from __future__ import annotations
+import math
+from typing import Any, Dict, List
+BUDGET_WEIGHT = 5.0  # Scales cost penalty so it's meaningful vs success/failure signal
+def step_reward(
+    action_type: str,
+    request_succeeded: bool,
+    provider_cost: float,
+    initial_budget: float,
+    latency_ms: float,
+    sla_ceiling_ms: float,
+) -> float:
+    """
+    Compute single-step reward. At most 3 additive terms for routing actions.
+    For shed_load: fixed penalty of -0.5 (replaces routing terms).
+    For routing actions:
+      +1.0 if request succeeded, -2.0 if failed
+      -(provider_cost / initial_budget) * BUDGET_WEIGHT as cost penalty
+      -(excess_latency / sla_ceiling_ms) if latency exceeds SLA
+    Returns:
+        float: The step reward. Never returns NaN.
+    """
+    # Safety: prevent NaN from division by zero
+    if initial_budget <= 0:
+        initial_budget = 1.0
+    if sla_ceiling_ms <= 0:
+        sla_ceiling_ms = 1.0
+    # shed_load: flat penalty, no routing terms
+    if action_type == "shed_load":
+        return -0.5
+    reward = 0.0
+    # Term 1: Success / failure
+    if request_succeeded:
+        reward += 1.0
+    else:
+        reward += -2.0
+    # Term 2: Cost penalty (always applied for routing actions)
+    cost_penalty = -(provider_cost / initial_budget) * BUDGET_WEIGHT
+    reward += cost_penalty
+    # Term 3: Latency breach penalty
+    if latency_ms > sla_ceiling_ms:
+        excess = latency_ms - sla_ceiling_ms
+        latency_penalty = -(excess / sla_ceiling_ms)
+        reward += latency_penalty
+    # Safety: NaN guard
+    if math.isnan(reward):
+        reward = -2.0
+    return reward
+def episode_metrics(history: List[Dict[str, Any]]) -> Dict[str, Any]:
+    """
+    Compute deterministic episode-level grader metrics.
+    Args:
+        history: List of step info dicts from the episode.
+    Returns:
+        Dict with grader metrics:
+        - total_reward
+        - success_rate
+        - total_cost_spent
+        - average_latency_ms
+        - sla_met (bool)
+        - queue_overflow_events (int)
+    """
+    if not history:
+        return {
+            "total_reward": 0.0,
+            "success_rate": 0.0,
+            "total_cost_spent": 0.0,
+            "average_latency_ms": 0.0,
+            "sla_met": True,
+            "queue_overflow_events": 0,
+        }
+    total_reward = sum(h.get("reward", 0.0) for h in history)
+    # Only count routing steps (not shed_load) for success rate
+    routing_steps = [h for h in history if h.get("action_type") != "shed_load"]
+    if routing_steps:
+        successes = sum(1 for h in routing_steps if h.get("request_succeeded", False))
+        success_rate = successes / len(routing_steps)
+    else:
+        success_rate = 0.0
+    total_cost = sum(h.get("cost", 0.0) for h in history)
+    latencies = [h.get("latency_ms", 0.0) for h in routing_steps]
+    avg_latency = sum(latencies) / len(latencies) if latencies else 0.0
+    sla_ceiling = history[0].get("sla_ceiling_ms", 500.0)
+    sla_met = all(lat <= sla_ceiling for lat in latencies) if latencies else True
+    queue_overflows = sum(1 for h in history if h.get("queue_overflow", False))
+    return {
+        "total_reward": round(total_reward, 4),
+        "success_rate": round(success_rate, 4),
+        "total_cost_spent": round(total_cost, 4),
+        "average_latency_ms": round(avg_latency, 2),
+        "sla_met": sla_met,
+        "queue_overflow_events": queue_overflows,
+    }
+def grade_episode(history: List[Dict[str, Any]]) -> Dict[str, Any]:
+    """
+    Compute episode-level grader score in [0, 1] with weighted breakdown.
+    overall = 0.30 × success_score
+            + 0.20 × latency_score
+            + 0.15 × budget_score
+            + 0.15 × sla_score
+            + 0.20 × adaptation_score
+    Component definitions:
+        success_score: Fraction of ALL episode steps with a successful routed request.
+            Denominator = total steps (not routed steps), so partial abstention is penalised.
+        latency_score: 1.0 - (avg_latency / sla_ceiling), clamped to [0, 1].
+        budget_score:  Fraction of initial budget NOT spent, clamped to [0, 1].
+        sla_score:     Fraction of routed requests with latency <= sla_ceiling.
+        adaptation_score: Post-degradation success rate — measures whether the
+            agent detected and adapted to provider degradation.
+    Adaptation score window semantics by task:
+        - easy (no degradation):  No post-degradation window exists.
+            adaptation_score = 1.0 (adaptation not required → full marks).
+        - medium (A degrades after step 5): Window = routing steps with
+            step > 5. Measures success rate after A begins failing.
+        - hard (A degrades from step 0): Window = routing steps with
+            step > 1 (one warm-up step allowed). Covers nearly all steps.
+        - hard_multi (A from step 0, B from step 10): Blended score:
+            0.5 × primary_adaptation (steps between primary and secondary)
+            + 0.5 × secondary_adaptation (steps after secondary event).
+    All component scores are clamped to [0.0, 1.0].
+    Args:
+        history: List of step info dicts from the episode.
+    Returns:
+        Dict with 'overall_score' and per-component breakdown.
+    """
+    # Note: step_reward() is shaped for learning signal (dense + budget cliff).
+    # grade_episode() is the semantic evaluation metric. Divergence is intentional.
+    if not history:
+        return {
+            "overall_score": 0.0,
+            "success_score": 0.0,
+            "latency_score": 0.0,
+            "budget_score": 0.0,
+            "sla_score": 0.0,
+            "adaptation_score": 0.0,
+        }
+    metrics = episode_metrics(history)
+    # success_score: fraction of ALL episode steps that resulted in a successful routed request.
+    # Denominator is total steps, not routed steps, so partial abstention is penalised.
+    # A policy that serves 10/20 and succeeds each time scores 0.50, not 1.0.
+    total_steps = len(history)
+    routing_steps = [h for h in history if h.get("action_type") != "shed_load"]
+    routed_successes = sum(1 for h in routing_steps if h.get("request_succeeded", False))
+    success_score = routed_successes / total_steps if total_steps > 0 else 0.0
+    sla_ceiling_ms = float(history[0].get("sla_ceiling_ms", 500.0) or 500.0)
+    avg_latency_ms = float(metrics.get("average_latency_ms", 0.0))
+    if sla_ceiling_ms <= 0:
+        sla_ceiling_ms = 1.0
+    # Fix 1: No routing attempts = no service delivered. Quality scores must reflect this.
+    if routing_steps:
+        latency_score = 1.0 - min(1.0, avg_latency_ms / sla_ceiling_ms)
+        sla_ok = sum(1 for h in routing_steps if float(h.get("latency_ms", 0.0)) <= sla_ceiling_ms)
+        sla_score = sla_ok / len(routing_steps)
+    else:
+        latency_score = 0.0
+        sla_score = 0.0
+    # Budget score: penalize spending relative to initial budget, not theoretical max
+    total_cost = float(metrics.get("total_cost_spent", 0.0))
+    initial_budget = float(history[0].get("initial_budget", 1.0) or 1.0)
+    budget_score = max(0.0, 1.0 - total_cost / initial_budget)
+    # Adaptation score: measures post-degradation success rate.
+    # Directly measures whether the agent detected and adapted to degradation.
+    adaptation_score = 0.0
+    _raw_degrade = history[0].get("degradation_start_step")
+    degradation_start = int(_raw_degrade) if _raw_degrade is not None else 999
+    _raw_secondary = history[0].get("secondary_degradation_start_step")
+    secondary_start = int(_raw_secondary) if _raw_secondary is not None else None
+    if degradation_start < 999:
+        if secondary_start is not None:
+            # Fix 2: hard_multi — blended adaptation across primary and secondary windows
+            primary_window = [h for h in routing_steps
+                              if int(h.get("step", 0)) > max(degradation_start, 1)
+                              and int(h.get("step", 0)) <= secondary_start]
+            secondary_window = [h for h in routing_steps
+                                if int(h.get("step", 0)) > secondary_start]
+            if primary_window:
+                primary_adaptation = sum(1 for h in primary_window if h.get("request_succeeded", False)) / len(primary_window)
+            else:
+                primary_adaptation = 0.0
+            if secondary_window:
+                secondary_adaptation = sum(1 for h in secondary_window if h.get("request_succeeded", False)) / len(secondary_window)
+            else:
+                secondary_adaptation = 0.0
+            if not primary_window and not secondary_window:
+                adaptation_score = 0.0
+            else:
+                adaptation_score = 0.5 * primary_adaptation + 0.5 * secondary_adaptation
+        else:
+            # Single degradation event: existing logic unchanged
+            # Use max(degradation_start, 1) to ensure at least one warm-up step
+            # before post-degradation tracking, even when degradation_start=0
+            post_degrade = [h for h in routing_steps
+                            if int(h.get("step", 0)) > max(degradation_start, 1)]
+            if post_degrade:
+                post_successes = sum(1 for h in post_degrade if h.get("request_succeeded", False))
+                adaptation_score = post_successes / len(post_degrade)
+    else:
+        # No degradation event. Award adaptation based on routing quality instead.
+        # A do-nothing (always shed_load) policy gets 0, not 1.0.
+        if routing_steps:
+            quality_successes = sum(1 for h in routing_steps if h.get("request_succeeded", False))
+            adaptation_score = quality_successes / total_steps  # total_steps denominator penalizes abstention
+        else:
+            adaptation_score = 0.0
+    overall = (
+        0.3 * success_score
+        + 0.2 * latency_score
+        + 0.15 * budget_score
+        + 0.15 * sla_score
+        + 0.2 * adaptation_score
+    )
+    # Hard penalty for budget exhaustion: incomplete episodes are not reliable systems.
+    # A policy that routes aggressively and goes bankrupt at step 17 should not outscore
+    # one that completes all 20 steps. 0.75x preserves partial credit for good routing
+    # before exhaustion, but makes budget-exhausted policies non-competitive.
+    episode_terminated_early = any(h.get('budget_exhausted', False) for h in history)
+    if episode_terminated_early:
+        overall = overall * 0.75
+    overall = max(0.0, min(1.0, overall))
+    return {
+        "overall_score": round(overall, 4),
+        "success_score": round(max(0.0, min(1.0, success_score)), 4),
+        "latency_score": round(max(0.0, min(1.0, latency_score)), 4),
+        "budget_score": round(max(0.0, min(1.0, budget_score)), 4),
+        "sla_score": round(max(0.0, min(1.0, sla_score)), 4),
+        "adaptation_score": round(max(0.0, min(1.0, adaptation_score)), 4),
+    }
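
Reviewer note: the final weighting in `grade_episode()` is easier to check in isolation. The sketch below reproduces only the weighted sum, the 0.75x budget-exhaustion penalty, and the final clamp visible in the diff above. `combine_scores` and `WEIGHTS` are hypothetical names introduced for illustration and are not part of the repo.

```python
# Illustrative sketch only: mirrors the weights and the 0.75x exhaustion
# penalty shown in grade_episode(). combine_scores is a hypothetical helper.
WEIGHTS = {
    "success_score": 0.30,
    "latency_score": 0.20,
    "budget_score": 0.15,
    "sla_score": 0.15,
    "adaptation_score": 0.20,
}


def combine_scores(components, budget_exhausted):
    """Weighted sum of component scores, penalised if the budget ran out."""
    overall = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    if budget_exhausted:
        overall *= 0.75
    return round(max(0.0, min(1.0, overall)), 4)


# Pure abstention on a no-degradation task: only budget_score is non-zero.
abstain = {name: 0.0 for name in WEIGHTS}
abstain["budget_score"] = 1.0
print(combine_scores(abstain, budget_exhausted=False))  # 0.15

# A perfect component vector that still exhausted its budget is capped at 0.75.
perfect = {name: 1.0 for name in WEIGHTS}
print(combine_scores(perfect, budget_exhausted=True))  # 0.75
```

Under these weights an always-shed policy can collect at most the 0.15 budget term, which is why the grader tests can require it to stay well below 0.40.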

budget_router/tasks.py ADDED

	@@ -0,0 +1,108 @@

+"""
+Task preset configurations: EASY, MEDIUM, HARD, HARD_MULTI.
+Each is a TaskConfig instance passed to reset(scenario=config).
+"""
+from .models import TaskConfig
+EASY = TaskConfig(
+    name="easy",
+    description="Stable providers. Cheapest is viable but not dominant. Smart routing wins.",
+    initial_budget=1.0,
+    cost_a=0.01,
+    cost_b=0.05,
+    cost_c=0.10,
+    reliability_a=0.76,           # lowered so always-A isn't dominant; forces routing quality to matter
+    reliability_b=0.92,
+    reliability_c=0.99,
+    latency_a=100.0,
+    latency_b=150.0,
+    latency_c=200.0,
+    sla_ceiling_ms=500.0,
+    degradation_start_step=999,   # effectively no degradation
+    degradation_rate=0.0,
+    degradation_target="A",
+    max_steps=20,
+    max_queue_backlog=10,
+    latency_noise_std=30.0,
+)
+MEDIUM = TaskConfig(
+    name="medium",
+    description="Provider A degrades sharply after step 5. Must adapt routing.",
+    initial_budget=0.95,
+    cost_a=0.01,
+    cost_b=0.05,
+    cost_c=0.10,
+    reliability_a=0.85,
+    reliability_b=0.92,
+    reliability_c=0.99,
+    latency_a=100.0,
+    latency_b=150.0,
+    latency_c=200.0,
+    sla_ceiling_ms=500.0,
+    degradation_start_step=5,
+    degradation_rate=0.15,        # sharp drop after step 5
+    degradation_target="A",
+    max_steps=20,
+    max_queue_backlog=10,
+    latency_noise_std=30.0,
+)
+HARD = TaskConfig(
+    name="hard",
+    description="Provider A degrades aggressively from step 0. Tight budget. High noise. Must diversify immediately.",
+    initial_budget=0.85,
+    cost_a=0.01,
+    cost_b=0.05,
+    cost_c=0.10,
+    reliability_a=0.85,
+    reliability_b=0.92,
+    reliability_c=0.99,
+    latency_a=100.0,
+    latency_b=150.0,
+    latency_c=200.0,
+    sla_ceiling_ms=500.0,
+    degradation_start_step=0,     # degrades from the start
+    degradation_start_jitter=3,
+    degradation_rate=0.15,        # same rate as MEDIUM (was 0.08)
+    degradation_target="A",
+    max_steps=20,
+    max_queue_backlog=10,
+    latency_noise_std=50.0,       # significantly more noise (was 40.0)
+)
+HARD_MULTI = TaskConfig(
+    name="hard_multi",
+    description="A degrades from step 0, B degrades from step 10. Multi-provider cascade. Slightly wider budget to reward efficient routing.",
+    initial_budget=1.10,
+    cost_a=0.01,
+    cost_b=0.05,
+    cost_c=0.10,
+    reliability_a=0.85,
+    reliability_b=0.92,
+    reliability_c=0.99,
+    latency_a=100.0,
+    latency_b=150.0,
+    latency_c=200.0,
+    sla_ceiling_ms=500.0,
+    degradation_start_step=0,
+    degradation_start_jitter=3,
+    degradation_rate=0.12,
+    degradation_target="A",
+    secondary_degradation_start_step=10,
+    secondary_degradation_start_jitter=3,
+    secondary_degradation_rate=0.10,
+    secondary_degradation_target="B",
+    max_steps=20,
+    max_queue_backlog=10,
+    latency_noise_std=50.0,
+)
+TASK_PRESETS = {"easy": EASY, "medium": MEDIUM, "hard": HARD, "hard_multi": HARD_MULTI}
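
Reviewer note: the budgets and per-request costs in these presets determine how long a single-provider policy can keep routing. The sketch below is a back-of-the-envelope check, assuming the listed cost is charged on every routed request whether or not it succeeds; the environment's actual charging and termination rules live in `environment.py` and are not shown in this hunk. Values are copied from the presets and expressed as integer hundredths of a budget unit to avoid floating-point floor errors. `affordable_steps` is a hypothetical helper.

```python
# Budgets and costs copied from the presets above, in hundredths of a budget unit.
BUDGET = {"easy": 100, "medium": 95, "hard": 85, "hard_multi": 110}
COST = {"A": 1, "B": 5, "C": 10}
MAX_STEPS = 20


def affordable_steps(task, provider):
    """Requests a single-provider policy can pay for, capped at MAX_STEPS."""
    return min(MAX_STEPS, BUDGET[task] // COST[provider])


for task in BUDGET:
    print(task, {p: affordable_steps(task, p) for p in COST})
```

Under that assumption, always routing to C cannot cover a 20-step episode on any preset, and always routing to B falls short on medium and hard, so a policy has to mix providers to finish an episode within budget.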

budget_router/tests/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Tests for the Budget Router environment - package init."""

budget_router/tests/test_environment.py ADDED

	@@ -0,0 +1,502 @@

+"""
+Tests for the Budget Router environment core correctness and reward sanity.
+All tests from <test_requirements> are implemented here.
+"""
+import math
+import random
+import pytest
+from budget_router.environment import BudgetRouterEnv
+from budget_router.models import Action, ActionType, Observation
+from budget_router.policies import (
+    always_route_a_policy,
+    always_route_b_policy,
+    always_route_c_policy,
+    always_shed_load_policy,
+    heuristic_baseline_policy,
+    random_policy,
+)
+from budget_router.reward import step_reward
+from budget_router.tasks import EASY, HARD, HARD_MULTI, MEDIUM
+# ─── Helpers ────────────────────────────────────────────────────────────
+def run_full_episode(env, policy_fn, seed, scenario, policy_name=""):
+    """Run a full episode and return (observations, rewards, done_flag, steps)."""
+    obs = env.reset(seed=seed, scenario=scenario)
+    observations = [obs]
+    rewards = []
+    steps = 0
+    rng = random.Random(seed + 10000) if "random" in policy_name else None
+    while not obs.done and steps < scenario.max_steps:
+        if "random" in policy_name:
+            action = policy_fn(obs, rng=rng)
+        else:
+            action = policy_fn(obs)
+        obs = env.step(action)
+        observations.append(obs)
+        rewards.append(obs.reward)
+        steps += 1
+    return observations, rewards, obs.done, steps
+# ─── Core Correctness Tests ────────────────────────────────────────────
+class TestCoreCorrectness:
+    """Core environment correctness tests."""
+    def test_reset_returns_valid_observation(self):
+        """reset() returns Observation with ALL values in [0.0, 1.0]."""
+        env = BudgetRouterEnv()
+        obs = env.reset(seed=42, scenario=EASY)
+        assert isinstance(obs, Observation)
+        assert 0.0 <= obs.provider_a_status <= 1.0
+        assert 0.0 <= obs.provider_b_status <= 1.0
+        assert 0.0 <= obs.provider_c_status <= 1.0
+        assert 0.0 <= obs.budget_remaining <= 1.0
+        assert 0.0 <= obs.queue_backlog <= 1.0
+        assert 0.0 <= obs.system_latency <= 1.0
+        assert 0.0 <= obs.step_count <= 1.0
+    def test_step_after_reset_no_crash(self):
+        """step() after reset() does not crash and returns valid types."""
+        env = BudgetRouterEnv()
+        obs = env.reset(seed=42, scenario=EASY)
+        action = Action(action_type=ActionType.ROUTE_TO_A)
+        obs = env.step(action)
+        assert isinstance(obs, Observation)
+        assert isinstance(obs.done, bool)
+        assert isinstance(obs.reward, (int, float))
+    def test_step_before_reset_no_crash(self):
+        """step() before reset() auto-initializes so the default OpenEnv web UI is safe."""
+        env = BudgetRouterEnv()
+        action = Action(action_type=ActionType.ROUTE_TO_A)
+        obs = env.step(action)
+        assert isinstance(obs, Observation)
+        assert isinstance(obs.done, bool)
+        assert isinstance(obs.reward, (int, float))
+    def test_episode_terminates_at_or_before_20(self):
+        """Episode terminates at or before step 20."""
+        env = BudgetRouterEnv()
+        for scenario in [EASY, MEDIUM, HARD]:
+            obs = env.reset(seed=42, scenario=scenario)
+            steps = 0
+            while not obs.done and steps < 25:  # give extra margin to catch bugs
+                action = Action(action_type=ActionType.ROUTE_TO_B)
+                obs = env.step(action)
+                steps += 1
+            assert steps <= 20, f"Episode ran {steps} steps on {scenario.name}"
+    def test_deterministic_trajectories_same_seed(self):
+        """Two reset() calls with same seed produce identical reward trajectories."""
+        env = BudgetRouterEnv()
+        # Run 1
+        obs1_list, rewards1, _, _ = run_full_episode(
+            env, heuristic_baseline_policy, seed=42, scenario=MEDIUM
+        )
+        # Run 2
+        obs2_list, rewards2, _, _ = run_full_episode(
+            env, heuristic_baseline_policy, seed=42, scenario=MEDIUM
+        )
+        assert len(rewards1) == len(rewards2)
+        for r1, r2 in zip(rewards1, rewards2):
+            assert r1 == r2, f"Rewards differ: {r1} vs {r2}"
+    def test_budget_remaining_never_nan(self):
+        """budget_remaining never returns NaN."""
+        env = BudgetRouterEnv()
+        for scenario in [EASY, MEDIUM, HARD]:
+            observations, _, _, _ = run_full_episode(
+                env, heuristic_baseline_policy, seed=42, scenario=scenario
+            )
+            for obs in observations:
+                assert not math.isnan(obs.budget_remaining), "budget_remaining is NaN"
+    def test_provider_status_in_bounds(self):
+        """All provider_status values stay in [0.0, 1.0] throughout episode."""
+        env = BudgetRouterEnv()
+        for scenario in [EASY, MEDIUM, HARD]:
+            observations, _, _, _ = run_full_episode(
+                env, heuristic_baseline_policy, seed=0, scenario=scenario
+            )
+            for obs in observations:
+                assert 0.0 <= obs.provider_a_status <= 1.0
+                assert 0.0 <= obs.provider_b_status <= 1.0
+                assert 0.0 <= obs.provider_c_status <= 1.0
+    def test_system_latency_not_always_zero(self):
+        """system_latency is NOT always 0.0 across a full episode (dead channel guard)."""
+        env = BudgetRouterEnv()
+        observations, _, _, _ = run_full_episode(
+            env, heuristic_baseline_policy, seed=42, scenario=MEDIUM
+        )
+        # Skip first observation (from reset) — latency may be initial value
+        latencies = [obs.system_latency for obs in observations[1:]]
+        assert any(lat > 0.0 for lat in latencies), "system_latency is always 0.0 — dead channel"
+    def test_all_observation_fields_in_range(self):
+        """All Observation fields remain within [0.0, 1.0] at every step."""
+        env = BudgetRouterEnv()
+        for scenario in [EASY, MEDIUM, HARD]:
+            for seed in [0, 1, 2]:
+                observations, _, _, _ = run_full_episode(
+                    env, heuristic_baseline_policy, seed=seed, scenario=scenario
+                )
+                for obs in observations:
+                    assert 0.0 <= obs.provider_a_status <= 1.0
+                    assert 0.0 <= obs.provider_b_status <= 1.0
+                    assert 0.0 <= obs.provider_c_status <= 1.0
+                    assert 0.0 <= obs.budget_remaining <= 1.0
+                    assert 0.0 <= obs.queue_backlog <= 1.0
+                    assert 0.0 <= obs.system_latency <= 1.0
+                    assert 0.0 <= obs.step_count <= 1.0
+# ─── Reward Sanity Tests ───────────────────────────────────────────────
+class TestRewardSanity:
+    """Reward correctness tests."""
+    def test_shed_load_reward_less_than_successful_route_c(self):
+        """shed_load reward < successful route_to_c reward (holding all else equal)."""
+        shed_r = step_reward("shed_load", False, 0.0, 5.0, 0.0, 500.0)
+        route_c_r = step_reward("route_to_c", True, 0.10, 5.0, 200.0, 500.0)
+        assert shed_r < route_c_r, f"shed ({shed_r}) >= route_c success ({route_c_r})"
+    def test_failed_route_less_than_successful_route(self):
+        """Failed route reward < successful route reward."""
+        failed_r = step_reward("route_to_a", False, 0.01, 5.0, 300.0, 500.0)
+        success_r = step_reward("route_to_a", True, 0.01, 5.0, 100.0, 500.0)
+        assert failed_r < success_r, f"failed ({failed_r}) >= success ({success_r})"
+    def test_route_a_cost_less_than_route_c_cost(self):
+        """route_to_a cost < route_to_c cost in the observation metadata."""
+        env = BudgetRouterEnv()
+        env.reset(seed=42, scenario=EASY)
+        obs_a = env.step(Action(action_type=ActionType.ROUTE_TO_A))
+        cost_a = obs_a.metadata.get("cost", 0)
+        env.reset(seed=42, scenario=EASY)
+        obs_c = env.step(Action(action_type=ActionType.ROUTE_TO_C))
+        cost_c = obs_c.metadata.get("cost", 0)
+        assert cost_a < cost_c, f"cost_a ({cost_a}) >= cost_c ({cost_c})"
+    def test_route_a_under_hard_degradation_lower_cumulative(self):
+        """route_to_a under hard degradation gets lower cumulative reward than route_to_c."""
+        env = BudgetRouterEnv()
+        seeds = [0, 1, 2, 3, 4]
+        total_a = 0.0
+        total_c = 0.0
+        for seed in seeds:
+            _, rewards_a, _, _ = run_full_episode(
+                env, always_route_a_policy, seed=seed, scenario=HARD
+            )
+            total_a += sum(r or 0 for r in rewards_a)
+            _, rewards_c, _, _ = run_full_episode(
+                env, always_route_c_policy, seed=seed, scenario=HARD
+            )
+            total_c += sum(r or 0 for r in rewards_c)
+        assert total_a < total_c, (
+            f"always_route_a ({total_a:.2f}) >= always_route_c ({total_c:.2f}) on HARD"
+        )
+# ─── Degenerate Policy Sanity ──────────────────────────────────────────
+class TestDegeneratePolicySanity:
+    """Degenerate policy tests."""
+    def test_always_route_a_does_not_dominate_baseline_medium(self):
+        """always_route_a does not dominate heuristic baseline on medium across dev seeds."""
+        env = BudgetRouterEnv()
+        seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+        baseline_rewards = []
+        always_a_rewards = []
+        for seed in seeds:
+            _, rewards, _, _ = run_full_episode(
+                env, heuristic_baseline_policy, seed=seed, scenario=MEDIUM
+            )
+            baseline_rewards.append(sum(r or 0 for r in rewards))
+            _, rewards, _, _ = run_full_episode(
+                env, always_route_a_policy, seed=seed, scenario=MEDIUM
+            )
+            always_a_rewards.append(sum(r or 0 for r in rewards))
+        baseline_mean = sum(baseline_rewards) / len(baseline_rewards)
+        always_a_mean = sum(always_a_rewards) / len(always_a_rewards)
+        assert baseline_mean >= always_a_mean, (
+            f"always_route_a ({always_a_mean:.2f}) dominates baseline ({baseline_mean:.2f}) on medium"
+        )
+    def test_always_route_c_does_not_dominate_baseline_overall(self):
+        """always_route_c does not dominate heuristic baseline across all tasks."""
+        env = BudgetRouterEnv()
+        seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+        baseline_total = 0.0
+        always_c_total = 0.0
+        for scenario in [EASY, MEDIUM, HARD]:
+            for seed in seeds:
+                _, rewards, _, _ = run_full_episode(
+                    env, heuristic_baseline_policy, seed=seed, scenario=scenario
+                )
+                baseline_total += sum(r or 0 for r in rewards)
+                _, rewards, _, _ = run_full_episode(
+                    env, always_route_c_policy, seed=seed, scenario=scenario
+                )
+                always_c_total += sum(r or 0 for r in rewards)
+        assert baseline_total >= always_c_total, (
+            f"always_route_c ({always_c_total:.2f}) dominates baseline ({baseline_total:.2f}) overall"
+        )
+    def test_always_route_b_does_not_dominate_baseline_medium(self):
+        """always_route_b does not dominate heuristic baseline on medium across dev seeds."""
+        env = BudgetRouterEnv()
+        seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+        baseline_rewards = []
+        always_b_rewards = []
+        for seed in seeds:
+            _, rewards, _, _ = run_full_episode(
+                env, heuristic_baseline_policy, seed=seed, scenario=MEDIUM
+            )
+            baseline_rewards.append(sum(r or 0 for r in rewards))
+            _, rewards, _, _ = run_full_episode(
+                env, always_route_b_policy, seed=seed, scenario=MEDIUM
+            )
+            always_b_rewards.append(sum(r or 0 for r in rewards))
+        baseline_mean = sum(baseline_rewards) / len(baseline_rewards)
+        always_b_mean = sum(always_b_rewards) / len(always_b_rewards)
+        assert baseline_mean >= always_b_mean, (
+            f"always_route_b ({always_b_mean:.2f}) dominates baseline ({baseline_mean:.2f}) on medium"
+        )
+    def test_always_shed_load_worse_than_baseline_easy(self):
+        """always_shed_load performs materially worse than heuristic baseline on easy."""
+        env = BudgetRouterEnv()
+        seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+        baseline_rewards = []
+        always_shed_rewards = []
+        for seed in seeds:
+            _, rewards, _, _ = run_full_episode(
+                env, heuristic_baseline_policy, seed=seed, scenario=EASY
+            )
+            baseline_rewards.append(sum(r or 0 for r in rewards))
+            _, rewards, _, _ = run_full_episode(
+                env, always_shed_load_policy, seed=seed, scenario=EASY
+            )
+            always_shed_rewards.append(sum(r or 0 for r in rewards))
+        baseline_mean = sum(baseline_rewards) / len(baseline_rewards)
+        shed_mean = sum(always_shed_rewards) / len(always_shed_rewards)
+        assert baseline_mean > shed_mean, (
+            f"always_shed ({shed_mean:.2f}) >= baseline ({baseline_mean:.2f}) on easy"
+        )
+class TestBehavioralGuards:
+    """Behavioral regression tests for the repo's core adaptation claims."""
+    def test_heuristic_outperforms_always_route_a_on_hard_dev_seeds(self):
+        """On HARD, reactive routing must beat the cheapest non-adaptive baseline."""
+        env = BudgetRouterEnv()
+        seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+        heuristic_rewards = []
+        always_a_rewards = []
+        for seed in seeds:
+            _, rewards, _, _ = run_full_episode(
+                env, heuristic_baseline_policy, seed=seed, scenario=HARD
+            )
+            heuristic_rewards.append(sum(r or 0 for r in rewards))
+            _, rewards, _, _ = run_full_episode(
+                env, always_route_a_policy, seed=seed, scenario=HARD
+            )
+            always_a_rewards.append(sum(r or 0 for r in rewards))
+        heuristic_mean = sum(heuristic_rewards) / len(heuristic_rewards)
+        always_a_mean = sum(always_a_rewards) / len(always_a_rewards)
+        assert heuristic_mean > always_a_mean, (
+            f"heuristic ({heuristic_mean:.2f}) must beat always_route_a "
+            f"({always_a_mean:.2f}) on hard dev seeds"
+        )
+    def test_heuristic_completes_hard_multi_without_budget_exhaustion(self):
+        """On HARD_MULTI dev seeds, the baseline should finish without budget bankruptcy."""
+        env = BudgetRouterEnv()
+        seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+        for seed in seeds:
+            _, _, done, steps = run_full_episode(
+                env, heuristic_baseline_policy, seed=seed, scenario=HARD_MULTI
+            )
+            exhausted = any(
+                step.get("budget_exhausted", False) for step in env._internal.history
+            )
+            assert done, f"heuristic did not terminate on hard_multi seed={seed}"
+            assert steps == HARD_MULTI.max_steps, (
+                f"heuristic ended after {steps} steps, expected {HARD_MULTI.max_steps} "
+                f"on hard_multi seed={seed}"
+            )
+            assert not exhausted, (
+                f"heuristic hit budget exhaustion on hard_multi seed={seed}"
+            )
+# ─── Grader Semantic Tests ──────────────────────────────────────────────
+class TestGraderSemantics:
+    """Pin the exact grader semantics changed by the abstention and hard_multi fixes.
+    These tests defend against regressions to grade_episode() — the most
+    judge-sensitive function in the repo.
+    """
+    def _make_step(self, step, action, succeeded, cost, latency, degrade=999, secondary=None):
+        return {
+            "step": step, "action_type": action,
+            "request_succeeded": succeeded, "cost": cost,
+            "latency_ms": latency, "reward": 0.9,
+            "sla_ceiling_ms": 500.0, "initial_budget": 1.0,
+            "degradation_start_step": degrade,
+            "secondary_degradation_start_step": secondary,
+        }
+    def test_pure_abstention_scores_below_0_40_on_easy(self):
+        """A policy that sheds all load must score < 0.40 overall on easy.
+        Before the fix this scored ~0.70 (sla=1.0, latency=1.0 on empty routing set).
+        """
+        from budget_router.reward import grade_episode
+        history = [
+            self._make_step(i, "shed_load", False, 0.0, 0.0, degrade=999)
+            for i in range(1, 21)
+        ]
+        result = grade_episode(history)
+        assert result["overall_score"] < 0.40, (
+            f"Pure abstention scored {result['overall_score']} >= 0.40 on easy "
+            f"(grader exploit not fixed)"
+        )
+        assert result["sla_score"] == 0.0, "sla_score should be 0.0 when no requests routed"
+        assert result["latency_score"] == 0.0, "latency_score should be 0.0 when no requests routed"
+        assert result["success_score"] == 0.0, "success_score should be 0.0 when no requests routed"
+        assert result["budget_score"] == 1.0, "budget_score should be 1.0 when nothing spent"
+        assert result["adaptation_score"] == 0.0, (
+            "adaptation_score should be 0.0 on easy when the policy only sheds load"
+        )
+    def test_partial_abstention_scores_less_than_full_service(self):
+        """A policy that sheds 50% of load must score < a policy that serves all 20 steps.
+        Before the success_score denominator fix, partial abstention could outscore
+        full service because budget_score rewarded not spending.
+        """
+        from budget_router.reward import grade_episode
+        # Mixed: 10 sheds then 10 successful routes
+        mixed = (
+            [self._make_step(i, "shed_load", False, 0.0, 0.0) for i in range(1, 11)]
+            + [self._make_step(i, "route_to_a", True, 0.01, 110.0) for i in range(11, 21)]
+        )
+        # Full service: 20 successful routes
+        full = [self._make_step(i, "route_to_a", True, 0.01, 110.0) for i in range(1, 21)]
+        r_mixed = grade_episode(mixed)
+        r_full = grade_episode(full)
+        assert r_mixed["overall_score"] < r_full["overall_score"], (
+            f"Partial abstention ({r_mixed['overall_score']}) >= full service "
+            f"({r_full['overall_score']}) — grader still rewards low-throughput"
+        )
+        assert r_mixed["success_score"] < r_full["success_score"], (
+            f"success_score should be lower for 10/20 served ({r_mixed['success_score']}) "
+            f"than 20/20 served ({r_full['success_score']})"
+        )
+    def test_hard_multi_adaptation_uses_secondary_window(self):
+        """grade_episode computes blended adaptation for hard_multi (secondary window included).
+        Verifies that secondary_degradation_start_step=10 in step_info causes
+        grade_episode to split the adaptation window at step 10 and blend 0.5/0.5.
+        """
+        from budget_router.reward import grade_episode
+        # Build a hard_multi episode: steps 1-10 primary window (route A, succeeds),
+        # steps 11-20 secondary window (route A, fails — B degraded, agent stuck)
+        history = []
+        for i in range(1, 11):
+            history.append(self._make_step(i, "route_to_a", True, 0.01, 110.0, degrade=0, secondary=10))
+        for i in range(11, 21):
+            history.append(self._make_step(i, "route_to_a", False, 0.01, 700.0, degrade=0, secondary=10))
+        result = grade_episode(history)
+        # primary_window: steps > max(0,1)=1 and <= 10 → steps 2..10 → 9 steps, all succeed → 1.0
+        # secondary_window: steps > 10 → steps 11..20 → 10 steps, all fail → 0.0
+        # blended = 0.5 * 1.0 + 0.5 * 0.0 = 0.5
+        expected_adaptation = 0.5
+        assert abs(result["adaptation_score"] - expected_adaptation) < 0.01, (
+            f"hard_multi blended adaptation expected ~{expected_adaptation}, "
+            f"got {result['adaptation_score']}"
+        )
+        # Compare with an equivalent hard (non-multi) episode to confirm they diverge
+        history_hard = []
+        for i in range(1, 11):
+            history_hard.append(self._make_step(i, "route_to_a", True, 0.01, 110.0, degrade=0, secondary=None))
+        for i in range(11, 21):
+            history_hard.append(self._make_step(i, "route_to_a", False, 0.01, 700.0, degrade=0, secondary=None))
+        result_hard = grade_episode(history_hard)
+        # hard (no secondary): post_degrade = steps > max(0,1)=1 → steps 2..20 → 19 steps
+        # 9 succeed (steps 2-10), 10 fail (steps 11-20) → 9/19 ≈ 0.474
+        assert result["adaptation_score"] != result_hard["adaptation_score"], (
+            f"hard_multi and hard got identical adaptation_score={result['adaptation_score']} "
+            f"— secondary window not being used"
+        )
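
Reviewer note: the window arithmetic in `test_hard_multi_adaptation_uses_secondary_window` can be reproduced without the environment. The sketch below is a simplified stand-in for the adaptation branch of `grade_episode()` when a degradation event is present; it takes `(step, succeeded)` pairs for routed requests only and does not cover the no-degradation branch. The function name `adaptation` is hypothetical.

```python
def adaptation(steps, degrade_start, secondary_start=None):
    """Post-degradation success rate over (step, succeeded) pairs.

    Simplified stand-in for the windowing in grade_episode(); routed steps only.
    """
    def rate(window):
        return sum(ok for _, ok in window) / len(window) if window else 0.0

    first = max(degrade_start, 1)  # always allow one warm-up step
    if secondary_start is None:
        return rate([s for s in steps if s[0] > first])
    primary = [s for s in steps if first < s[0] <= secondary_start]
    secondary = [s for s in steps if s[0] > secondary_start]
    return 0.5 * rate(primary) + 0.5 * rate(secondary)


# Same shape as the test: steps 1-10 succeed, steps 11-20 fail.
episode = [(i, i <= 10) for i in range(1, 21)]
print(adaptation(episode, 0, 10))        # 0.5 (blended windows)
print(round(adaptation(episode, 0), 4))  # 0.4737 (single window, 9/19)
```

The two values differ because the blended score weights each window equally regardless of its length, whereas the single-window score pools all 19 post-warm-up steps.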

budget_router/tests/test_eval_all_seed_selection.py ADDED

	@@ -0,0 +1,41 @@

+import importlib.util
+from pathlib import Path
+import pytest
+def _load_eval_all():
+    path = Path(__file__).resolve().parents[2] / "eval" / "eval_all.py"
+    spec = importlib.util.spec_from_file_location("eval_all", path)
+    module = importlib.util.module_from_spec(spec)
+    assert spec.loader is not None
+    spec.loader.exec_module(module)
+    return module
+def test_seed_values_override_named_seed_set():
+    eval_all = _load_eval_all()
+    assert eval_all.select_seeds(
+        seed_set="dev",
+        seeds=3,
+        seed_values="200,201,202",
+    ) == [200, 201, 202]
+def test_seed_values_accept_commas_and_whitespace():
+    eval_all = _load_eval_all()
+    assert eval_all.select_seeds(
+        seed_set="heldout",
+        seeds=1,
+        seed_values="200, 201  202",
+    ) == [200, 201, 202]
+def test_seed_values_reject_empty_input():
+    eval_all = _load_eval_all()
+    with pytest.raises(ValueError, match="No explicit seeds"):
+        eval_all.select_seeds(seed_set="dev", seeds=3, seed_values=" , ")
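
Reviewer note: the three tests above pin down how explicit seed strings are expected to parse, but `eval/eval_all.py` itself is outside this hunk. The sketch below is one parser that would satisfy them; it is an assumption about the behaviour, not the repo's implementation. `parse_seed_values` is a hypothetical name, and the error text is chosen only so that it matches the `"No explicit seeds"` pattern the test checks for.

```python
import re
from typing import List


def parse_seed_values(seed_values: str) -> List[int]:
    """Split a seed string on commas and/or whitespace and return ints."""
    tokens = [tok for tok in re.split(r"[,\s]+", seed_values) if tok]
    if not tokens:
        # Message wording is an assumption; the test only matches this prefix.
        raise ValueError("No explicit seeds provided")
    return [int(tok) for tok in tokens]


print(parse_seed_values("200,201,202"))
print(parse_seed_values("200, 201  202"))
```

Filtering empty tokens before the emptiness check is what makes an input such as `" , "` raise rather than silently fall back to a named seed set.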

budget_router/tests/test_grpo_training_reward.py ADDED

	@@ -0,0 +1,154 @@

+import pytest
+# GRPO tests import train/learn_experiment.py, which loads torch, datasets, peft,
+# transformers, trl at module import. Those live under `--extra grpo` (torch alone
+# may exist via `--extra training`, which is not enough).
+for _grpo_mod in ("torch", "datasets", "peft", "transformers", "trl"):
+    pytest.importorskip(_grpo_mod)
+from budget_router.reward import grade_episode
+from train.grpo_env import BudgetRouterGRPOEnv
+from train.learn_experiment import build_dataset, build_system_prompt, reward_func, summarize_training_rollout
+def _step_once(env: BudgetRouterGRPOEnv) -> None:
+    # Any routing action is fine; we just need non-empty history.
+    # Use B as a reasonably stable default.
+    try:
+        env.route_to_b()
+    except ValueError as e:
+        # If an episode somehow terminates early, that's fine for the test harness,
+        # but it would make the "partial episode" test invalid.
+        raise AssertionError(f"Episode ended unexpectedly after one step: {e}") from e
+def _run_to_completion(env: BudgetRouterGRPOEnv) -> None:
+    # Drive the episode until the GRPO wrapper signals completion.
+    while True:
+        try:
+            env.route_to_b()
+        except ValueError:
+            return
+def test_reward_func_empty_history_returns_zero():
+    env = BudgetRouterGRPOEnv()
+    env.reset(scenario="hard_multi", seed=0)
+    rewards = reward_func([env])
+    assert rewards == [0.0]
+def test_reward_func_partial_episode_is_progress_scaled_not_full_grader():
+    env = BudgetRouterGRPOEnv()
+    env.reset(scenario="hard_multi", seed=0)
+    _step_once(env)
+    internal = env._env._internal
+    assert internal.history, "test precondition: history must be non-empty"
+    assert not internal.episode_done, "test precondition: episode must be incomplete"
+    grader = float(grade_episode(internal.history)["overall_score"])
+    progress = internal.current_step / max(1, internal.max_steps)
+    expected = grader * progress
+    # This is the critical regression guard: training reward must not be equal
+    # to the raw grader when the episode is incomplete.
+    rewards = reward_func([env])
+    assert rewards == [pytest.approx(expected, abs=1e-6)]
+    assert rewards[0] != pytest.approx(grader, abs=1e-6)
+def test_reward_func_complete_episode_equals_full_grader():
+    env = BudgetRouterGRPOEnv()
+    env.reset(scenario="hard_multi", seed=0)
+    _run_to_completion(env)
+    internal = env._env._internal
+    assert internal.history, "test precondition: history must be non-empty"
+    assert internal.episode_done, "test precondition: episode must be complete"
+    grader = float(grade_episode(internal.history)["overall_score"])
+    rewards = reward_func([env])
+    assert rewards == [pytest.approx(grader, abs=1e-6)]
+def test_training_rollout_summary_exposes_partial_episode_health():
+    env = BudgetRouterGRPOEnv()
+    env.reset(scenario="hard_multi", seed=0)
+    _step_once(env)
+    _step_once(env)
+    summary = summarize_training_rollout([env])
+    assert summary["env_steps_mean"] == pytest.approx(2.0)
+    assert summary["env_steps_min"] == 2
+    assert summary["env_steps_max"] == 2
+    assert summary["episode_completion_rate"] == 0.0
+    assert summary["progress_mean"] == pytest.approx(0.1)
+    assert summary["raw_grader_mean"] > summary["training_reward_mean"]
+def test_training_rollout_summary_exposes_action_sequence_diversity():
+    same_a = BudgetRouterGRPOEnv()
+    same_b = BudgetRouterGRPOEnv()
+    different = BudgetRouterGRPOEnv()
+    for env in (same_a, same_b, different):
+        env.reset(scenario="hard_multi", seed=0)
+    same_a.route_to_b()
+    same_a.route_to_b()
+    same_b.route_to_b()
+    same_b.route_to_b()
+    different.route_to_a()
+    different.route_to_a()
+    summary = summarize_training_rollout([same_a, same_b, different])
+    assert summary["action_sequences"] == [
+        "route_to_b route_to_b",
+        "route_to_b route_to_b",
+        "route_to_a route_to_a",
+    ]
+    assert summary["unique_action_sequences"] == 2
+    assert summary["action_sequence_counts"] == {
+        "route_to_b route_to_b": 2,
+        "route_to_a route_to_a": 1,
+    }
+def test_grpo_tool_feedback_is_compact_for_multi_turn_budget():
+    env = BudgetRouterGRPOEnv()
+    env.reset(scenario="hard_multi", seed=0)
+    feedback = env.route_to_b()
+    assert len(feedback) < 180
+    assert "steps_left=" in feedback
+    assert "health=" in feedback
+def test_explore_prompt_preserves_tool_format_without_deterministic_policy():
+    prompt = build_system_prompt("explore")
+    assert "<tool_call>" in prompt
+    assert '"name": "route_to_a"' in prompt
+    assert "route_to_a" in prompt
+    assert "route_to_b" in prompt
+    assert "route_to_c" in prompt
+    assert "shed_load" in prompt
+    assert "0.52" not in prompt
+    assert "cheapest healthy provider" not in prompt.lower()
+    assert "Observation:" not in prompt
+    assert "route_to_a route_to_b route_to_c" not in prompt
+def test_build_dataset_uses_requested_prompt_style():
+    dataset = build_dataset(n=1, prompt_style="explore")
+    system_prompt = dataset[0]["prompt"][0]["content"]
+    assert system_prompt == build_system_prompt("explore")
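
Review note: the partial-episode tests in this file pin the training reward to the raw grader score scaled by episode progress, with zero for an empty history and the unscaled score once the episode is done. The sketch below restates that arithmetic on its own. `progress_scaled_reward` is a hypothetical helper written for illustration, not the repository's `reward_func`, and the 20-step horizon is an assumption inferred from the summary test (2 steps giving a progress of 0.1).

```python
def progress_scaled_reward(
    grader_score: float,
    current_step: int,
    max_steps: int,
    episode_done: bool,
) -> float:
    """Hypothetical helper: scale the grader score by episode progress."""
    if current_step == 0:
        # No history yet, so there is nothing to grade.
        return 0.0
    if episode_done:
        # Finished episodes receive the unscaled grader score.
        return grader_score
    # Incomplete episodes are discounted by the fraction of steps taken.
    progress = current_step / max(1, max_steps)
    return grader_score * progress


# Assumed values: grader score 0.8, horizon of 20 steps.
partial = progress_scaled_reward(0.8, current_step=2, max_steps=20, episode_done=False)
complete = progress_scaled_reward(0.8, current_step=20, max_steps=20, episode_done=True)
```

Under these assumptions a rollout that stops after two steps earns about a tenth of what the grader alone would report, which is the gap the regression guard is meant to protect.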

budget_router/tests/test_inference_prompt.py ADDED

	@@ -0,0 +1,94 @@

+from budget_router.models import Observation
+from inference import LLMRouter, SYSTEM_PROMPT
+def test_system_prompt_has_required_structural_sections():
+    upper_prompt = SYSTEM_PROMPT.upper()
+    assert "GOLDEN RULE" in upper_prompt or "DEFAULT STRATEGY" in upper_prompt
+    assert "BUDGET RUNWAY" in upper_prompt
+    assert "TASK PROFILE" in upper_prompt
+    assert "NOISE CALIBRATION" in upper_prompt
+def test_system_prompt_communicates_bankruptcy_consequence():
+    assert "-10" in SYSTEM_PROMPT or "bankruptcy" in SYSTEM_PROMPT.lower()
+    assert "0.500" in SYSTEM_PROMPT or "unobserved" in SYSTEM_PROMPT.lower()
+class _FakeResponse:
+    def __init__(self, content: str) -> None:
+        self.choices = [type("Choice", (), {"message": type("Message", (), {"content": content})()})()]
+class _FakeClient:
+    def with_options(self, **kwargs):
+        return self
+    @property
+    def chat(self):
+        return self
+    @property
+    def completions(self):
+        return self
+    def create(self, **kwargs):
+        return _FakeResponse("route_to_a")
+def test_llm_router_preserves_task_name_on_first_step():
+    router = LLMRouter(api_base_url="https://example.com/v1", model_name="test-model", api_key="test-key")
+    router._client = _FakeClient()
+    router.reset(task_name="hard_multi")
+    obs = Observation(
+        provider_a_status=0.5,
+        provider_b_status=0.5,
+        provider_c_status=0.5,
+        budget_remaining=1.0,
+        queue_backlog=0.0,
+        system_latency=0.2,
+        step_count=0.0,
+    )
+    router.choose_action(obs)
+    assert router._task_name == "hard_multi"
+    assert "task: hard_multi" in router._messages[-2]["content"]
+def test_objective_feedback_mode_includes_previous_step_feedback():
+    router = LLMRouter(
+        api_base_url="https://example.com/v1",
+        model_name="test-model",
+        api_key="test-key",
+        prompt_mode="objective_feedback",
+    )
+    router._client = _FakeClient()
+    router.reset(task_name="hard_multi")
+    obs = Observation(
+        provider_a_status=0.4,
+        provider_b_status=0.7,
+        provider_c_status=0.9,
+        budget_remaining=0.8,
+        queue_backlog=0.1,
+        system_latency=0.4,
+        step_count=0.5,
+        reward=-2.05,
+        metadata={
+            "action_type": "route_to_a",
+            "request_succeeded": False,
+            "cost": 0.01,
+            "latency_ms": 620.0,
+        },
+    )
+    router.choose_action(obs)
+    prompt = router._messages[-2]["content"]
+    assert "previous_step_feedback:" in prompt
+    assert "previous_action: route_to_a" in prompt
+    assert "previous_reward: -2.05" in prompt
+    assert "previous_success: false" in prompt
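
Review note: `_FakeClient` works because every attribute in the `client.chat.completions.create(...)` chain returns the same object, so only `create` needs real behavior. A minimal standalone sketch of that pattern follows. `FakeChatClient` and its `calls` list are hypothetical names for illustration, and `SimpleNamespace` is swapped in for the `type(...)` one-liner used in the test file.

```python
from types import SimpleNamespace


class FakeChatClient:
    """Hypothetical stand-in for a client exposing client.chat.completions.create(...)."""

    def __init__(self, reply: str) -> None:
        self._reply = reply
        self.calls = []  # keyword arguments of each create() call, for later assertions

    def with_options(self, **kwargs):
        # Options are ignored; return self so the call chain keeps working.
        return self

    @property
    def chat(self):
        return self

    @property
    def completions(self):
        return self

    def create(self, **kwargs):
        self.calls.append(kwargs)
        message = SimpleNamespace(content=self._reply)
        # Mirror the response.choices[0].message.content access path.
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


client = FakeChatClient("route_to_a")
response = client.with_options(timeout=5).chat.completions.create(
    model="test-model",
    messages=[{"role": "user", "content": "hi"}],
)
content = response.choices[0].message.content
```

Recording the call arguments is optional, but it lets a test inspect the exact messages sent without reaching into router internals.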

budget_router/tests/test_trace_episode.py ADDED

	@@ -0,0 +1,43 @@

+import importlib.util
+from pathlib import Path
+def _load_trace_episode():
+    path = Path(__file__).resolve().parents[2] / "eval" / "trace_episode.py"
+    spec = importlib.util.spec_from_file_location("trace_episode", path)
+    module = importlib.util.module_from_spec(spec)
+    assert spec.loader is not None
+    spec.loader.exec_module(module)
+    return module
+def test_trace_episode_returns_step_rows_and_scores_for_heuristic():
+    trace_episode = _load_trace_episode()
+    result = trace_episode.trace_episode(task_name="hard_multi", seed=3, policy_name="heuristic")
+    assert result["task"] == "hard_multi"
+    assert result["seed"] == 3
+    assert result["policy"] == "heuristic"
+    assert result["steps"]
+    assert len(result["steps"]) == result["episode_length"]
+    assert result["total_reward"] == round(sum(step["reward"] for step in result["steps"]), 4)
+    assert 0.0 <= result["grader"]["overall_score"] <= 1.0
+    assert {"success_rate", "total_cost_spent", "average_latency_ms"}.issubset(result["metrics"])
+    assert {
+        "provider_a_status",
+        "provider_b_status",
+        "provider_c_status",
+        "observed_budget_remaining",
+    }.issubset(result["steps"][0])
+def test_trace_episode_rejects_unknown_policy():
+    trace_episode = _load_trace_episode()
+    try:
+        trace_episode.trace_episode(task_name="hard_multi", seed=3, policy_name="unknown")
+    except ValueError as exc:
+        assert "Unknown policy" in str(exc)
+    else:
+        raise AssertionError("unknown policy should raise ValueError")
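
Review note: `_load_trace_episode` imports a script by file path because `eval/` is not an importable package. The sketch below shows the same `importlib.util` sequence against a throwaway file. `load_module_from_path` and `demo_script.py` are hypothetical names used only for this illustration.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module_from_path(name: str, path: Path):
    """Hypothetical helper: import a Python file that is not on sys.path."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot build an import spec for {path}")
    module = importlib.util.module_from_spec(spec)
    # Executing the module runs its top-level code and populates its namespace.
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "demo_script.py"
    script.write_text("def answer():\n    return 42\n")
    demo = load_module_from_path("demo_script", script)
    result = demo.answer()
```

The module is not registered in `sys.modules` here, which keeps repeated loads in a test session independent of each other.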

budget_router/tests/test_validation.py ADDED

	@@ -0,0 +1,140 @@

+"""
+Tests for the validation harness.
+Covers: policy ordering, solvability, NaN safety, baseline stability,
+and hard task crash resistance.
+"""
+import math
+import random
+import pytest
+from budget_router.environment import BudgetRouterEnv
+from budget_router.models import Action, ActionType
+from budget_router.policies import (
+    always_route_a_policy,
+    always_route_b_policy,
+    always_route_c_policy,
+    always_shed_load_policy,
+    debug_upper_bound_policy,
+    heuristic_baseline_policy,
+    random_policy,
+)
+from budget_router.tasks import EASY, HARD, MEDIUM
+from budget_router.validation import DEVELOPMENT_SEEDS, HELDOUT_SEEDS, run_episode
+# ─── Helpers ────────────────────────────────────────────────────────────
+def mean_reward_over_seeds(policy_fn, scenario, seeds, policy_name=""):
+    """Compute mean total reward for a policy across seeds."""
+    env = BudgetRouterEnv()
+    rewards = []
+    for seed in seeds:
+        metrics = run_episode(env, policy_fn, seed, scenario, policy_name=policy_name)
+        rewards.append(metrics["total_reward"])
+    return sum(rewards) / len(rewards), rewards
+# ─── Validation Tests ──────────────────────────────────────────────────
+class TestValidation:
+    """Validation-level tests."""
+    def test_baseline_beats_random_easy_dev(self):
+        """Baseline beats random on easy task across development seeds."""
+        baseline_mean, _ = mean_reward_over_seeds(
+            heuristic_baseline_policy, EASY, DEVELOPMENT_SEEDS
+        )
+        random_mean, _ = mean_reward_over_seeds(
+            random_policy, EASY, DEVELOPMENT_SEEDS, policy_name="random"
+        )
+        assert baseline_mean > random_mean, (
+            f"baseline ({baseline_mean:.2f}) <= random ({random_mean:.2f}) on easy"
+        )
+    def test_upper_bound_beats_baseline_easy_dev(self):
+        """Upper bound beats or matches baseline on easy task across dev seeds."""
+        baseline_mean, _ = mean_reward_over_seeds(
+            heuristic_baseline_policy, EASY, DEVELOPMENT_SEEDS
+        )
+        ub_mean, _ = mean_reward_over_seeds(
+            debug_upper_bound_policy, EASY, DEVELOPMENT_SEEDS, policy_name="upper_bound"
+        )
+        assert ub_mean >= baseline_mean, (
+            f"oracle ({ub_mean:.2f}) < baseline ({baseline_mean:.2f}) on easy"
+        )
+    def test_easy_solvable_positive_reward(self):
+        """Easy task is solvable: baseline achieves positive total reward on seed=42."""
+        env = BudgetRouterEnv()
+        metrics = run_episode(env, heuristic_baseline_policy, seed=42, scenario=EASY)
+        assert metrics["total_reward"] > 0, (
+            f"baseline achieves {metrics['total_reward']:.2f} on easy/seed=42"
+        )
+    def test_hard_no_crash_dev_seeds(self):
+        """Hard task terminates without environment crash on development_seeds."""
+        env = BudgetRouterEnv()
+        for seed in DEVELOPMENT_SEEDS:
+            try:
+                metrics = run_episode(
+                    env, heuristic_baseline_policy, seed=seed, scenario=HARD
+                )
+                assert metrics["episode_length"] <= 20
+            except Exception as e:
+                pytest.fail(f"Hard task crashed on seed {seed}: {e}")
+    def test_no_nan_rewards_all_combos(self):
+        """No reward is NaN for any (task, policy) pair on EASY/MEDIUM/HARD over the first three development seeds."""
+        env = BudgetRouterEnv()
+        policies = {
+            "random": random_policy,
+            "heuristic_baseline": heuristic_baseline_policy,
+            "upper_bound": debug_upper_bound_policy,
+            "always_route_a": always_route_a_policy,
+            "always_route_b": always_route_b_policy,
+            "always_route_c": always_route_c_policy,
+            "always_shed_load": always_shed_load_policy,
+        }
+        for scenario in [EASY, MEDIUM, HARD]:
+            for policy_name, policy_fn in policies.items():
+                for seed in DEVELOPMENT_SEEDS[:3]:  # subset for speed
+                    metrics = run_episode(
+                        env, policy_fn, seed, scenario, policy_name=policy_name
+                    )
+                    assert not math.isnan(metrics["total_reward"]), (
+                        f"NaN reward: {scenario.name}/{policy_name}/seed={seed}"
+                    )
+    def test_baseline_stability_heldout(self):
+        """Baseline remains within reasonable stability margin on heldout seeds."""
+        for scenario in [EASY, MEDIUM, HARD]:
+            dev_mean, _ = mean_reward_over_seeds(
+                heuristic_baseline_policy, scenario, DEVELOPMENT_SEEDS
+            )
+            heldout_mean, _ = mean_reward_over_seeds(
+                heuristic_baseline_policy, scenario, HELDOUT_SEEDS
+            )
+            margin = max(2.0, 0.40 * abs(dev_mean))
+            assert abs(heldout_mean - dev_mean) <= margin, (
+                f"Baseline unstable on {scenario.name}: "
+                f"dev={dev_mean:.2f}, heldout={heldout_mean:.2f}, margin={margin:.2f}"
+            )
+    def test_baseline_beats_always_route_b_dev(self):
+        """Baseline beats always_route_b on all tasks across development seeds."""
+        for scenario in [EASY, MEDIUM, HARD]:
+            baseline_mean, _ = mean_reward_over_seeds(
+                heuristic_baseline_policy, scenario, DEVELOPMENT_SEEDS
+            )
+            always_b_mean, _ = mean_reward_over_seeds(
+                always_route_b_policy, scenario, DEVELOPMENT_SEEDS
+            )
+            assert baseline_mean >= always_b_mean, (
+                f"baseline ({baseline_mean:.2f}) < always_route_b ({always_b_mean:.2f}) on {scenario.name}"
+            )
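
Review note: the held-out stability test uses a margin that is the larger of an absolute floor (2.0) and a relative tolerance (40% of the development mean). A small sketch of that check, with `is_stable` as a hypothetical helper name and made-up reward values:

```python
def is_stable(
    dev_mean: float,
    heldout_mean: float,
    abs_floor: float = 2.0,
    rel_tol: float = 0.40,
) -> bool:
    """Hypothetical helper: True when the held-out mean stays within the margin."""
    # The floor keeps the check meaningful when dev_mean is close to zero.
    margin = max(abs_floor, rel_tol * abs(dev_mean))
    return abs(heldout_mean - dev_mean) <= margin


small_scale = is_stable(1.0, 2.5)     # margin 2.0 (floor), gap 1.5
large_ok = is_stable(10.0, 13.5)      # margin 4.0 (relative), gap 3.5
large_drift = is_stable(10.0, 15.0)   # margin 4.0 (relative), gap 5.0
```

The absolute floor matters mostly for tasks where the baseline reward hovers near zero, where a purely relative tolerance would collapse to almost nothing.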

budget_router/validation.py ADDED

	@@ -0,0 +1,424 @@

+"""
+Validation harness for the Budget Router environment.
+- run_validation(): runs all policies across all tasks and seed sets
+- run_manual_trace(): step-by-step debug trace
+- assert_all_checks(): hard assertions that must pass before submission
+- print_results_table(): formatted results display
+"""
+from __future__ import annotations
+import math
+import random
+from typing import Any, Callable, Dict, List, Optional, Tuple
+from .environment import BudgetRouterEnv
+from .models import Action, ActionType, InternalState, Observation, TaskConfig
+from .policies import (
+    always_route_a_policy,
+    always_route_b_policy,
+    always_route_c_policy,
+    always_shed_load_policy,
+    debug_upper_bound_policy,
+    heuristic_baseline_policy,
+    random_policy,
+)
+from .reward import episode_metrics
+from .tasks import EASY, HARD, HARD_MULTI, MEDIUM, TASK_PRESETS
+# ─── Seed sets ──────────────────────────────────────────────────────────
+DEVELOPMENT_SEEDS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+HELDOUT_SEEDS = [100, 101, 102, 103, 104]
+# ─── Episode runner ─────────────────────────────────────────────────────
+def run_episode(
+    env: BudgetRouterEnv,
+    policy_fn: Callable,
+    seed: int,
+    scenario: TaskConfig,
+    policy_name: str = "",
+) -> Dict[str, Any]:
+    """Run a single episode and return metrics."""
+    obs = env.reset(seed=seed, scenario=scenario)
+    # For random policy, seed a separate RNG
+    policy_rng = random.Random(seed + 10000) if "random" in policy_name else None
+    total_reward = 0.0
+    steps = 0
+    while not obs.done and steps < scenario.max_steps:
+        # Select action based on policy
+        if "upper_bound" in policy_name:
+            action = policy_fn(obs, env._internal)
+        elif "random" in policy_name:
+            action = policy_fn(obs, rng=policy_rng)
+        else:
+            action = policy_fn(obs)
+        obs = env.step(action)
+        total_reward += (obs.reward or 0.0)
+        steps += 1
+    metrics = episode_metrics(env._internal.history)
+    metrics["total_reward"] = round(total_reward, 4)
+    metrics["episode_length"] = steps
+    return metrics
+# ─── Validation runner ──────────────────────────────────────────────────
+def run_validation(seed_set_name: str = "development") -> Dict[str, Dict[str, Dict[str, Any]]]:
+    """
+    Run all 7 policies on all 4 tasks for the given seed set.
+    Returns:
+        Nested dict: results[task_name][policy_name] = {
+            "mean_reward", "std_reward", "min_reward", "max_reward",
+            "success_rate", "average_cost", "average_latency",
+            "all_rewards", "all_lengths"
+        }
+    """
+    seeds = DEVELOPMENT_SEEDS if seed_set_name == "development" else HELDOUT_SEEDS
+    policies = {
+        "random": random_policy,
+        "heuristic_baseline": heuristic_baseline_policy,
+        "upper_bound": debug_upper_bound_policy,
+        "always_route_a": always_route_a_policy,
+        "always_route_b": always_route_b_policy,
+        "always_route_c": always_route_c_policy,
+        "always_shed_load": always_shed_load_policy,
+    }
+    tasks = {"easy": EASY, "medium": MEDIUM, "hard": HARD, "hard_multi": HARD_MULTI}
+    results: Dict[str, Dict[str, Dict[str, Any]]] = {}
+    env = BudgetRouterEnv()
+    for task_name, task_config in tasks.items():
+        results[task_name] = {}
+        for policy_name, policy_fn in policies.items():
+            all_rewards = []
+            all_success_rates = []
+            all_costs = []
+            all_latencies = []
+            all_lengths = []
+            for seed in seeds:
+                metrics = run_episode(
+                    env, policy_fn, seed, task_config, policy_name=policy_name
+                )
+                all_rewards.append(metrics["total_reward"])
+                all_success_rates.append(metrics["success_rate"])
+                all_costs.append(metrics["total_cost_spent"])
+                all_latencies.append(metrics["average_latency_ms"])
+                all_lengths.append(metrics["episode_length"])
+            mean_r = sum(all_rewards) / len(all_rewards)
+            std_r = (
+                sum((r - mean_r) ** 2 for r in all_rewards) / len(all_rewards)
+            ) ** 0.5
+            results[task_name][policy_name] = {
+                "mean_reward": round(mean_r, 4),
+                "std_reward": round(std_r, 4),
+                "min_reward": round(min(all_rewards), 4),
+                "max_reward": round(max(all_rewards), 4),
+                "success_rate": round(
+                    sum(all_success_rates) / len(all_success_rates), 4
+                ),
+                "average_cost": round(sum(all_costs) / len(all_costs), 4),
+                "average_latency": round(
+                    sum(all_latencies) / len(all_latencies), 2
+                ),
+                "all_rewards": all_rewards,
+                "all_lengths": all_lengths,
+            }
+    return results
+# ─── Results printer ────────────────────────────────────────────────────
+def print_results_table(results: Dict, seed_set_name: str = "development") -> None:
+    """Print formatted results table."""
+    print(f"\n{'='*90}")
+    print(f"  VALIDATION RESULTS — {seed_set_name.upper()} SEEDS")
+    print(f"{'='*90}")
+    for task_name, policies in results.items():
+        print(f"\n  Task: {task_name.upper()}")
+        print(f"  {'Policy':<20} {'Mean':>8} {'Std':>8} {'Min':>8} {'Max':>8} {'SucRate':>8} {'Cost':>8} {'Lat(ms)':>8}")
+        print(f"  {'-'*76}")
+        for policy_name, stats in policies.items():
+            print(
+                f"  {policy_name:<20} "
+                f"{stats['mean_reward']:>8.2f} "
+                f"{stats['std_reward']:>8.2f} "
+                f"{stats['min_reward']:>8.2f} "
+                f"{stats['max_reward']:>8.2f} "
+                f"{stats['success_rate']:>8.2f} "
+                f"{stats['average_cost']:>8.4f} "
+                f"{stats['average_latency']:>8.1f}"
+            )
+    print(f"\n{'='*90}")
+# ─── Manual Trace ──────────────────────────────────────────────────────
+def run_manual_trace(
+    seed: int = 42,
+    scenario_name: str = "medium",
+    policy_fn: Optional[Callable] = None,
+    policy_name: str = "heuristic_baseline",
+) -> None:
+    """
+    Run a single episode with step-by-step trace in raw internal units.
+    PRIMARY debugging tool.
+    """
+    scenario = TASK_PRESETS[scenario_name]
+    policy = policy_fn or heuristic_baseline_policy
+    env = BudgetRouterEnv()
+    obs = env.reset(seed=seed, scenario=scenario)
+    policy_rng = random.Random(seed + 10000)
+    print(f"\n{'─'*95}")
+    print(f"  MANUAL TRACE — Scenario: {scenario_name.upper()}, Seed: {seed}, Policy: {policy_name}")
+    print(f"{'─'*95}")
+    print(
+        f"  {'Step':>4} | {'Action':<10} | {'A_health':>8} | {'B_health':>8} | {'C_health':>8} | "
+        f"{'Latency':>8} | {'Budget$':>8} | {'Reward':>7} | {'Cumul':>7}"
+    )
+    print(f"  {'─'*91}")
+    cumulative = 0.0
+    steps = 0
+    while not obs.done and steps < scenario.max_steps:
+        if "upper_bound" in policy_name:
+            action = policy(obs, env._internal)
+        elif "random" in policy_name:
+            action = policy(obs, rng=policy_rng)
+        else:
+            action = policy(obs)
+        obs = env.step(action)
+        steps += 1
+        reward = obs.reward or 0.0
+        cumulative += reward
+        # Read raw internal state for trace
+        s = env._internal
+        a_health = s.providers["A"].current_health
+        b_health = s.providers["B"].current_health
+        c_health = s.providers["C"].current_health
+        latency_ms = s.last_latency_ms
+        budget = s.budget_dollars
+        print(
+            f"  {steps:>4} | {action.action_type.value:<10} | "
+            f"{a_health:>8.3f} | {b_health:>8.3f} | {c_health:>8.3f} | "
+            f"{latency_ms:>6.0f}ms | ${budget:>7.2f} | "
+            f"{reward:>+7.2f} | {cumulative:>+7.2f}"
+        )
+    print(f"  {'─'*91}")
+    metrics = episode_metrics(env._internal.history)
+    print(
+        f"  EPISODE END | "
+        f"success_rate={metrics['success_rate']:.2f} | "
+        f"total_cost=${metrics['total_cost_spent']:.4f} | "
+        f"sla_met={metrics['sla_met']} | "
+        f"total_reward={cumulative:.2f}"
+    )
+    print(f"{'─'*95}\n")
+# ─── Hard Assertions ───────────────────────────────────────────────────
+def assert_all_checks(
+    dev_results: Dict[str, Dict[str, Dict[str, Any]]],
+    heldout_results: Dict[str, Dict[str, Dict[str, Any]]],
+) -> None:
+    """
+    Run all hard assertions. All must pass before submission.
+    If any fails, fix the environment — do not weaken the assertion.
+    """
+    print("\n" + "=" * 60)
+    print("  RUNNING HARD ASSERTION CHECKS")
+    print("=" * 60)
+    passed = 0
+    failed = 0
+    total = 0
+    def check(condition: bool, msg: str) -> None:
+        nonlocal passed, failed, total
+        total += 1
+        if condition:
+            passed += 1
+            print(f"  ✅ PASS: {msg}")
+        else:
+            failed += 1
+            print(f"  ❌ FAIL: {msg}")
+    # ── Policy ordering (BOTH seed sets) ──
+    # easy/medium/hard: baseline > random and oracle >= baseline.
+    # hard_multi: only oracle >= baseline is checked. baseline > random is not
+    # asserted on either seed set, because random can occasionally beat the
+    # deterministic heuristic on hard_multi.
+    for seed_set_name, results in [("dev", dev_results), ("heldout", heldout_results)]:
+        for task in ["easy", "medium", "hard"]:
+            baseline_mean = results[task]["heuristic_baseline"]["mean_reward"]
+            random_mean = results[task]["random"]["mean_reward"]
+            upper_bound_mean = results[task]["upper_bound"]["mean_reward"]
+            check(
+                baseline_mean > random_mean,
+                f"[{seed_set_name}/{task}] baseline ({baseline_mean:.2f}) > random ({random_mean:.2f})",
+            )
+            check(
+                upper_bound_mean >= baseline_mean,
+                f"[{seed_set_name}/{task}] oracle ({upper_bound_mean:.2f}) >= baseline ({baseline_mean:.2f})",
+            )
+        # hard_multi: only check oracle >= baseline (heuristic fails by design)
+        hm_baseline = results["hard_multi"]["heuristic_baseline"]["mean_reward"]
+        hm_oracle = results["hard_multi"]["upper_bound"]["mean_reward"]
+        check(
+            hm_oracle >= hm_baseline,
+            f"[{seed_set_name}/hard_multi] oracle ({hm_oracle:.2f}) >= baseline ({hm_baseline:.2f})",
+        )
+    # ── Non-triviality ──
+    found_nontrivial = False
+    for task in ["easy", "medium", "hard", "hard_multi"]:
+        baseline_mean = dev_results[task]["heuristic_baseline"]["mean_reward"]
+        random_mean = dev_results[task]["random"]["mean_reward"]
+        if abs(random_mean) > 0:
+            gap = (baseline_mean - random_mean) / abs(random_mean)
+        else:
+            gap = abs(baseline_mean - random_mean)
+        if gap > 0.20:
+            found_nontrivial = True
+            break
+    check(found_nontrivial, "At least one task has >20% gap between baseline and random")
+    # ── Solvability ──
+    easy_ub_reward = dev_results["easy"]["upper_bound"]["mean_reward"]
+    easy_ub_sr = dev_results["easy"]["upper_bound"]["success_rate"]
+    check(easy_ub_reward > 0, f"Oracle positive reward on easy ({easy_ub_reward:.2f})")
+    check(easy_ub_sr > 0.5, f"Oracle success rate on easy ({easy_ub_sr:.2f}) > 0.5")
+    # ── Anti-gaming checks (hard_multi excluded — heuristic fails by design) ──
+    for task in ["easy", "medium", "hard"]:
+        baseline_mean = dev_results[task]["heuristic_baseline"]["mean_reward"]
+        always_a_mean = dev_results[task]["always_route_a"]["mean_reward"]
+        always_b_mean = dev_results[task]["always_route_b"]["mean_reward"]
+        always_shed_mean = dev_results[task]["always_shed_load"]["mean_reward"]
+        check(
+            baseline_mean >= always_a_mean,
+            f"[dev/{task}] baseline ({baseline_mean:.2f}) >= always_a ({always_a_mean:.2f})",
+        )
+        check(
+            baseline_mean >= always_b_mean,
+            f"[dev/{task}] baseline ({baseline_mean:.2f}) >= always_b ({always_b_mean:.2f})",
+        )
+        check(
+            baseline_mean >= always_shed_mean,
+            f"[dev/{task}] baseline ({baseline_mean:.2f}) >= always_shed ({always_shed_mean:.2f})",
+        )
+    # Check that NOT all degenerate policies dominate baseline
+    for task in ["easy", "medium", "hard", "hard_multi"]:
+        baseline_mean = dev_results[task]["heuristic_baseline"]["mean_reward"]
+        always_a = dev_results[task]["always_route_a"]["mean_reward"]
+        always_b = dev_results[task]["always_route_b"]["mean_reward"]
+        always_c = dev_results[task]["always_route_c"]["mean_reward"]
+        always_shed = dev_results[task]["always_shed_load"]["mean_reward"]
+        check(
+            not (
+                always_a >= baseline_mean
+                and always_b >= baseline_mean
+                and always_c >= baseline_mean
+                and always_shed >= baseline_mean
+            ),
+            f"[dev/{task}] heuristic provides strategic advantage over degenerate policies",
+        )
+    # ── Held-out robustness ──
+    for task in ["easy", "medium", "hard", "hard_multi"]:
+        baseline_dev = dev_results[task]["heuristic_baseline"]["mean_reward"]
+        baseline_heldout = heldout_results[task]["heuristic_baseline"]["mean_reward"]
+        margin = max(2.0, 0.40 * abs(baseline_dev))
+        check(
+            abs(baseline_heldout - baseline_dev) <= margin,
+            f"[{task}] baseline stable: dev={baseline_dev:.2f}, heldout={baseline_heldout:.2f}, margin={margin:.2f}",
+        )
+    # ── Safety: NaN, budget explosion, infinite loops ──
+    all_rewards = []
+    all_lengths = []
+    for seed_set_name, results in [("dev", dev_results), ("heldout", heldout_results)]:
+        for task in ["easy", "medium", "hard", "hard_multi"]:
+            for policy_name, stats in results[task].items():
+                all_rewards.extend(stats["all_rewards"])
+                all_lengths.extend(stats["all_lengths"])
+    check(
+        all(not math.isnan(r) for r in all_rewards),
+        f"No NaN rewards across {len(all_rewards)} episodes",
+    )
+    check(
+        all(ep_len <= 20 for ep_len in all_lengths),
+        f"No episode exceeds 20 steps (max seen: {max(all_lengths) if all_lengths else 0})",
+    )
+    # ── Summary ──
+    print(f"\n{'='*60}")
+    print(f"  RESULTS: {passed}/{total} passed, {failed}/{total} failed")
+    print(f"{'='*60}")
+    if failed > 0:
+        print(f"\n  ⚠️  {failed} assertion(s) FAILED. Fix the environment before submission.")
+    else:
+        print("\n  🎉 All assertions passed! Environment is ready for submission.")
+# ─── Main entry point ──────────────────────────────────────────────────
+def main() -> None:
+    """Run full validation suite."""
+    # Run both seed sets
+    print("Running validation on DEVELOPMENT seeds...")
+    dev_results = run_validation("development")
+    print_results_table(dev_results, "development")
+    print("\nRunning validation on HELD-OUT seeds...")
+    heldout_results = run_validation("heldout")
+    print_results_table(heldout_results, "heldout")
+    # Manual trace
+    run_manual_trace(seed=42, scenario_name="medium")
+    run_manual_trace(seed=42, scenario_name="hard_multi")
+    # Hard assertions
+    assert_all_checks(dev_results, heldout_results)
+if __name__ == "__main__":
+    main()
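
Review note: the non-triviality check in `assert_all_checks` measures the baseline's lead over random as a relative gap, falling back to an absolute difference when the random mean is exactly zero. The sketch below isolates that calculation. `relative_gap` is a hypothetical helper name and the reward values are made up.

```python
import math


def relative_gap(baseline_mean: float, random_mean: float) -> float:
    """Hypothetical helper: baseline's lead over random, relative to |random_mean|."""
    if abs(random_mean) > 0:
        return (baseline_mean - random_mean) / abs(random_mean)
    # Avoid dividing by zero: fall back to the absolute difference.
    return abs(baseline_mean - random_mean)


positive_case = relative_gap(6.0, 5.0)    # (6 - 5) / 5
negative_case = relative_gap(-4.0, -5.0)  # (-4 + 5) / 5
zero_case = relative_gap(1.0, 0.0)        # absolute fallback
```

Dividing by the absolute value keeps the sign of the gap meaningful when both means are negative, which is common for penalty-heavy tasks. The check in the diff requires the gap to be strictly greater than 0.20, so a gap of exactly 20% would not count.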

check_leak.py ADDED

	@@ -0,0 +1,181 @@

+"""
+check_leak.py — Validates BudgetRouterGRPOEnv before GRPO training.
+Checks:
+  1. Tool methods return strings (and do not crash).
+  2. Episode ends gracefully via ValueError (TRL-idiomatic done signal).
+  3. Reward is a float in [0, 1] — not a dict, not NaN.
+  4. History uses actual_degradation_start (jittered) — NOT the config constant.
+     This proves grade_episode() will compute correct adaptation windows.
+  5. 10-step reward trajectory printed: verify no explosion/vanishing.
+  6. Provider status IS present in tool responses (intentional — text interface needs it).
+Run:
+    uv run python check_leak.py
+"""
+import sys
+def main() -> None:
+    try:
+        from train.grpo_env import BudgetRouterGRPOEnv
+        from budget_router.reward import grade_episode
+        from budget_router.tasks import HARD_MULTI
+    except ImportError as e:
+        print(f"[FAIL] Import error: {e}")
+        sys.exit(1)
+    print("=" * 60)
+    print("BudgetRouterGRPOEnv — Pre-training Validation")
+    print("=" * 60)
+    # ── Check 0: transformers version (soft warning — required for environment_factory) ──
+    print("\n[CHECK 0] transformers version (required for environment_factory)...")
+    try:
+        import transformers
+        ver_str = transformers.__version__
+        # TRL's environment_factory requires transformers >= 4.47.0 (confirmed shipping in
+        # stable builds as of Apr 2026). Exact minimum threshold is version-specific to TRL.
+        # If not installed, training will fail at import time — caught here early.
+        print(f"  ✅ transformers=={ver_str} installed.")
+        # Soft check: warn if below 4.47 (minimum known to ship environment_factory support)
+        major, minor = int(ver_str.split(".")[0]), int(ver_str.split(".")[1])
+        if major < 4 or (major == 4 and minor < 47):
+            print(
+                f"  ⚠️  WARNING: transformers {ver_str} may be too old for environment_factory.\n"
+                f"     Recommended: pip install 'transformers>=4.47.0' or install from main."
+            )
+    except ImportError:
+        print(
+            "  ⚠️  WARNING: transformers is NOT installed in this venv.\n"
+            "     Install before GRPO training: pip install 'transformers>=4.47.0' trl accelerate peft"
+        )
+    # ── Check 1: reset() returns a non-empty string ─────────────────────
+    print("\n[CHECK 1] reset() returns rich text observation...")
+    env = BudgetRouterGRPOEnv()
+    obs_text = env.reset(scenario="hard_multi", seed=42)
+    assert isinstance(obs_text, str) and len(obs_text) > 10, \
+        f"reset() should return non-empty string, got: {obs_text!r}"
+    assert "Budget" in obs_text, "reset() should mention Budget"
+    assert "Provider" in obs_text, "reset() should include provider status (text interface, not sanitized)"
+    print(f"  ✅ reset() returned {len(obs_text)} chars. Provider status PRESENT (correct for text interface).")
+    print(f"  Preview: {obs_text[:120].replace(chr(10), ' ')}...")
+    # ── Check 2: Tool methods return strings step-by-step ───────────────
+    print("\n[CHECK 2] Tool methods return strings and accumulate history...")
+    env2 = BudgetRouterGRPOEnv()
+    env2.reset(scenario="hard_multi", seed=42)
+    step_results = []
+    episode_done = False
+    for step in range(25):  # more than max_steps to test guard
+        action_fn = [env2.route_to_a, env2.route_to_b, env2.shed_load, env2.route_to_b][step % 4]
+        try:
+            result = action_fn()
+            assert isinstance(result, str), f"Tool method should return str, got {type(result)}"
+            step_results.append(result)
+            print(f"  Step {step + 1:02d}: ✅ {result[:80].replace(chr(10), ' ')}...")
+        except ValueError as e:
+            episode_done = True
+            print(f"  Step {step + 1:02d}: ✅ Episode ended via ValueError (TRL-idiomatic): {str(e)[:80]}...")
+            break
+    assert episode_done, "Episode should end with ValueError before step 25"
+    assert len(step_results) > 0, "At least one tool step should complete"
+    print(f"  ✅ Episode ended correctly after {len(step_results)} tool calls.")
+    # ── Check 3: Reward is float in [0, 1] ──────────────────────────────
+    print("\n[CHECK 3] Reward is float in [0, 1]...")
+    assert isinstance(env2.reward, float), \
+        f"env.reward should be float, got {type(env2.reward)}: {env2.reward!r}"
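+    import math
+    # Check NaN first: a NaN reward also fails the range comparison below,
+    # which would otherwise mask this more specific message.
+    assert not math.isnan(env2.reward), "env.reward is NaN — grade_episode bug"
+    assert 0.0 <= env2.reward <= 1.0, \
+        f"env.reward should be in [0, 1], got {env2.reward}"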
+    print(f"  ✅ env.reward = {env2.reward:.4f} (float, in [0,1], not NaN)")
+    # ── Check 4: History uses actual jittered degradation_start_step ────
+    print("\n[CHECK 4] History contains jittered actual_degradation_start (not config constant)...")
+    history = env2._env._internal.history
+    assert len(history) > 0, "History should not be empty after episode"
+    # Read degradation_start_step from step_info (written by environment.py)
+    step_info_degrade_start = history[0].get("degradation_start_step")
+    # Read the actual jittered value from internal state
+    actual_jittered_start = env2._env._internal.actual_degradation_start
+    # Config constant for hard_multi
+    config_constant = HARD_MULTI.degradation_start_step  # = 0
+    print(f"  Config constant (degradation_start_step): {config_constant}")
+    print(f"  step_info[degradation_start_step]: {step_info_degrade_start}")
+    print(f"  internal.actual_degradation_start: {actual_jittered_start}")
+    assert step_info_degrade_start is not None, \
+        "step_info missing degradation_start_step — grade_episode() will break"
+    assert step_info_degrade_start == actual_jittered_start, \
+        (f"step_info uses wrong degradation onset! "
+         f"Got {step_info_degrade_start}, expected {actual_jittered_start}. "
+         f"This would corrupt adaptation scores in grade_episode().")
+    print(f"  ✅ Jittered onset correctly propagated through step_info.")
+    # ── Check 5: grade_episode() on history returns consistent score ─────
+    print("\n[CHECK 5] grade_episode(history) matches env.reward...")
+    grader_result = grade_episode(history)
+    assert isinstance(grader_result, dict), "grade_episode should return dict"
+    grader_score = float(grader_result["overall_score"])
+    assert abs(grader_score - env2.reward) < 1e-6, \
+        f"env.reward ({env2.reward}) != grade_episode score ({grader_score}). Mismatch."
+    print(f"  ✅ grade_episode overall_score = {grader_score:.4f}, env.reward = {env2.reward:.4f}. Match confirmed.")
+    # ── Check 6: 10-episode reward trajectory ────────────────────────────
+    print("\n[CHECK 6] 10-episode reward trajectory (hard_multi, varying seeds)...")
+    print("  Episode | Seed | Steps | Score | Reward-in-range")
+    rewards = []
+    for ep, seed in enumerate(range(10)):
+        env3 = BudgetRouterGRPOEnv()
+        env3.reset(scenario="hard_multi", seed=seed)
+        done = False
+        steps = 0
+        while not done and steps < 30:
+            # Alternate actions: A, B, A, B... (simple test policy)
+            action_fn = env3.route_to_a if steps % 2 == 0 else env3.route_to_b
+            try:
+                action_fn()
+                steps += 1
+            except ValueError:
+                done = True
+        reward = env3.reward
+        rewards.append(reward)
+        in_range = "✅" if 0.0 <= reward <= 1.0 else "❌"
+        print(f"  Ep {ep+1:02d}     | {seed:4d} | {steps:5d} | {reward:.4f} | {in_range}")
+    import statistics
+    if len(rewards) > 1:
+        std = statistics.stdev(rewards)
+        mean = statistics.mean(rewards)
+        print(f"\n  Mean reward: {mean:.4f} | Std: {std:.4f}")
+        if std < 0.03:
+            print(
+                f"  ⚠️  WARNING: Low reward variance (std={std:.4f}). GRPO may get weak gradient signal.\n"
+                f"     Mitigation: Use num_generations=8, hard_multi scenario, and a small LLM\n"
+                f"     at initialization that makes diverse routing decisions."
+            )
+        else:
+            print(f"  ✅ Reward variance is sufficient for GRPO learning (std={std:.4f} >= 0.03).")
+    print("\n" + "=" * 60)
+    print("✅ ALL CHECKS PASSED — BudgetRouterGRPOEnv is ready for GRPO training.")
+    print("=" * 60)
+    print("\nRecommended training config (Mac MPS / Colab):")
+    print("  scenario: hard_multi")
+    print("  num_generations: 8")
+    print("  model: Qwen2.5-1.5B (Mac 16GB) / Qwen2.5-7B (Colab T4)")
+    print("  Mac: TRL + PyTorch MPS (set PYTORCH_ENABLE_MPS_FALLBACK=1)")
+    print("  Colab: Unsloth + vLLM on NVIDIA T4/A100")
+if __name__ == "__main__":
+    main()
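
Check 6 warns when the reward spread across episodes is small. The reason can be sketched with the group-relative advantage that GRPO-style training uses: each completion's reward is compared with the mean of its sampling group, so a group with identical rewards contributes no learning signal. The snippet below is a simplified illustration, not TRL's implementation; the function name `group_relative_advantages`, the `eps` value, and the sample scores are assumptions made for this example.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalise each reward against its group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the sampled group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical episode scores for one prompt sampled four times.
flat_group = [0.60, 0.60, 0.60, 0.60]    # every completion scored the same
spread_group = [0.45, 0.60, 0.70, 0.85]  # completions disagree

flat_adv = group_relative_advantages(flat_group)
spread_adv = group_relative_advantages(spread_group)
print(flat_adv)    # all zeros: nothing to learn from this group
print(spread_adv)  # negative for below-average, positive for above-average
```

Under these assumptions, a policy that produces varied routing decisions at initialization gives non-zero advantages, which is what the `std < 0.03` warning is guarding.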

client.py ADDED

	@@ -0,0 +1,3 @@


+from budget_router.client import BudgetRouterClient
+
+__all__ = ["BudgetRouterClient"]

eval/eval_all.py ADDED

	@@ -0,0 +1,306 @@

+#!/usr/bin/env python3
+"""
+eval_all.py — Budget Router Consolidated Evaluator
+====================================================
+Runs heuristic + LLM policies across all tasks and seeds (the ppo policy name is accepted but only prints a warning).
+Outputs a Markdown table + per-episode JSON to outputs/.
+Usage:
+    # Quick (3 seeds, heuristic + LLM):
+    uv run python eval_all.py
+    # Full (10 seeds, heuristic + LLM); list options are repeated once per value:
+    uv run python eval_all.py --seeds 10 --policies heuristic --policies llm
+    # Heuristic only (no API needed):
+    uv run python eval_all.py --policies heuristic
+    # Specific tasks:
+    uv run python eval_all.py --tasks hard --tasks hard_multi --seeds 5
+    # Explicit fresh seed bucket:
+    uv run python eval_all.py --tasks hard_multi --seed-values "200,201,202"
+Prerequisites:
+    export HF_TOKEN=<your_hf_token>          # required for LLM policy
+    export API_BASE_URL=https://router.huggingface.co/v1  # default
+    export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct           # default
+Output:
+    outputs/eval_results_<timestamp>.json    — full per-episode data
+    outputs/eval_summary_<timestamp>.md      — markdown table for README
+"""
+import json
+import os
+import sys
+from datetime import datetime
+from pathlib import Path
+from typing import Dict, List, Optional
+import typer
+# ── Add repo root to path so we can import budget_router ───────────────────
+# This file lives in eval/, so the repo root is two levels up from __file__.
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from budget_router.environment import BudgetRouterEnv
+from budget_router.models import Action, ActionType, Observation, TaskConfig
+from budget_router.policies import heuristic_baseline_policy
+from budget_router.reward import episode_metrics, grade_episode
+from budget_router.tasks import EASY, HARD, HARD_MULTI, MEDIUM
+from inference import LLMRouter
+# ── Config ──────────────────────────────────────────────────────────────────
+TASKS: Dict[str, TaskConfig] = {
+    "easy":       EASY,
+    "medium":     MEDIUM,
+    "hard":       HARD,
+    "hard_multi": HARD_MULTI,
+}
+SEED_SETS: Dict[str, List[int]] = {
+    "dev":    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
+    "heldout": [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
+}
+API_KEY      = os.getenv("API_KEY") or os.getenv("HF_TOKEN")
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME   = os.getenv("MODEL_NAME",   "Qwen/Qwen2.5-72B-Instruct")
+LLM_LOG_RAW = (os.getenv("LLM_LOG_RAW") or "").strip().lower() in {"1", "true", "yes", "y", "on"}
+LLM_LOG_RAW_MAX_CHARS = int(os.getenv("LLM_LOG_RAW_MAX_CHARS") or "220")
+def select_seeds(seed_set: str, seeds: int, seed_values: Optional[str] = None) -> List[int]:
+    """Resolve either a named seed set or an explicit comma/space-separated seed list."""
+    if seed_values is not None:
+        parsed = [int(part) for part in seed_values.replace(",", " ").split()]
+        if not parsed:
+            raise ValueError("No explicit seeds provided in --seed-values")
+        return parsed
+    if seed_set not in SEED_SETS:
+        raise ValueError(f"Unknown seed set: {seed_set}. Choose from: {list(SEED_SETS)}")
+    named_seeds = SEED_SETS[seed_set]
+    return named_seeds[:max(1, min(seeds, len(named_seeds)))]
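+def _single_line(value: Optional[str]) -> str:
+    """Collapse newlines so a value fits on one log line; falsy values become 'null'."""
+    if not value:
+        return "null"
+    return str(value).replace("\n", " ").replace("\r", " ")
+def _truncate(value: Optional[str], max_chars: int) -> str:
+    """Return the single-line form of value, cut to max_chars with a trailing '...'."""
+    s = _single_line(value).strip()
+    if len(s) <= max_chars:
+        return s
+    return s[: max(0, max_chars - 3)] + "..."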
+# ── Policies ────────────────────────────────────────────────────────────────
+def _llm_choose_action(policy: LLMRouter, obs: Observation) -> str:
+    action = policy.choose_action(obs)
+    return action.action_type.value
+def _heuristic(obs: Observation) -> str:
+    return heuristic_baseline_policy(obs).action_type.value
+# ── Episode runner ───────────────────────────────────────────────────────────
+def run_one_episode(
+    task_name: str,
+    task_cfg: TaskConfig,
+    seed: int,
+    policy_name: str,
+    policy,  # None for the heuristic, or an LLMRouter instance
+) -> Dict:
+    env = BudgetRouterEnv()
+    if policy_name == "llm":
+        policy.reset(task_name=task_name)
+    obs = env.reset(seed=seed, scenario=task_cfg)
+    rewards = []
+    actions = []
+    while not obs.done:
+        if policy_name == "heuristic":
+            action_str = _heuristic(obs)
+        else:
+            action_str = _llm_choose_action(policy, obs)
+        obs = env.step(Action(action_type=ActionType(action_str)))
+        reward = float(obs.reward or 0.0)
+        rewards.append(reward)
+        actions.append(action_str)
+        if policy_name == "llm" and LLM_LOG_RAW:
+            llm_raw = getattr(policy, "last_raw_output", None)
+            llm_parsed = getattr(policy, "last_parsed_action", None)
+            typer.echo(
+                f"[LLM] step={env._internal.current_step} action={action_str} "
+                f"reward={reward:+.2f} llm_raw={_truncate(llm_raw, max(20, LLM_LOG_RAW_MAX_CHARS))} "
+                f"llm_parsed={_single_line(llm_parsed)}"
+            )
+    grader = grade_episode(env._internal.history)
+    metrics = episode_metrics(env._internal.history)
+    return {
+        "task":          task_name,
+        "seed":          seed,
+        "policy":        policy_name,
+        "total_reward":  round(sum(rewards), 4),
+        "grader_score":  round(grader["overall_score"], 4),
+        "success_score": round(grader["success_score"], 4),
+        "budget_score":  round(grader["budget_score"], 4),
+        "adaptation_score": round(grader["adaptation_score"], 4),
+        "latency_score": round(grader["latency_score"], 4),
+        "sla_score":     round(grader["sla_score"], 4),
+        "success_rate":  round(metrics["success_rate"], 4),
+        "steps":         len(rewards),
+        "actions":       actions,
+        "rewards":       rewards,
+    }
+# ── Summary helpers ──────────────────────────────────────────────────────────
+def _mean(vals: List[float]) -> float:
+    return round(sum(vals) / len(vals), 4) if vals else 0.0
+def build_summary(results: List[Dict]) -> Dict:
+    summary = {}
+    for r in results:
+        key = (r["task"], r["policy"])
+        summary.setdefault(key, []).append(r)
+    return {
+        f"{task}|{pol}": {
+            "grader_mean":   _mean([e["grader_score"] for e in eps]),
+            "reward_mean":   _mean([e["total_reward"] for e in eps]),
+            "success_rate":  _mean([e["success_rate"] for e in eps]),
+            "adaptation":    _mean([e["adaptation_score"] for e in eps]),
+            "n":             len(eps),
+        }
+        for (task, pol), eps in summary.items()
+    }
+def render_markdown_table(summary: Dict, policies: List[str], tasks: List[str]) -> str:
+    task_labels = {"easy": "Easy", "medium": "Medium", "hard": "Hard", "hard_multi": "Hard_Multi"}
+    pol_headers = " | ".join(f"{p.upper()} Grader" for p in policies)
+    lines = [
+        f"| Task | {pol_headers} | Notes |",
+        "|" + "---|" * (len(policies) + 2),
+    ]
+    for task in tasks:
+        scores = []
+        for p in policies:
+            key = f"{task}|{p}"
+            s = summary.get(key, {})
+            if s:
+                n = s["n"]
+                scores.append(f"{s['grader_mean']:.4f} (n={n})")
+            else:
+                scores.append("—")
+        note = ""
+        if task == "hard_multi" and len(policies) >= 2:
+            k0 = f"{task}|{policies[0]}"
+            k1 = f"{task}|{policies[1]}"
+            if k0 in summary and k1 in summary:
+                diff = summary[k1]["grader_mean"] - summary[k0]["grader_mean"]
+                if diff > 0:
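+                    # Label with the actual policy names instead of assuming (heuristic, llm) order.
+                    note = f"{policies[1].upper()} +{diff*100:.1f} points vs {policies[0]}"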
+        line = f"| {task_labels.get(task, task)} | {' | '.join(scores)} | {note} |"
+        lines.append(line)
+    return "\n".join(lines)
+# ── CLI ──────────────────────────────────────────────────────────────────────
+app = typer.Typer(add_completion=False)
+@app.command()
+def main(
+    policies: List[str] = typer.Option(["heuristic", "llm"], help="Policies to run"),
+    tasks:    List[str] = typer.Option(["easy", "medium", "hard", "hard_multi"], help="Tasks"),
+    seeds:    int       = typer.Option(3, help="Number of dev seeds (1-10, costs scale with LLM)"),
+    seed_set: str       = typer.Option("dev", help="Seed set: dev | heldout"),
+    seed_values: Optional[str] = typer.Option(None, help="Explicit comma/space-separated seeds; overrides --seed-set/--seeds"),
+    out_dir:  Path      = typer.Option(Path("outputs"), help="Output directory"),
+) -> None:
+    """Run Budget Router evaluation across policies, tasks, and seeds."""
+    out_dir.mkdir(parents=True, exist_ok=True)
+    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
+    try:
+        selected_seeds = select_seeds(seed_set=seed_set, seeds=seeds, seed_values=seed_values)
+    except ValueError as e:
+        typer.echo(str(e), err=True)
+        raise typer.Exit(1) from e
+    selected_tasks = {t: TASKS[t] for t in tasks if t in TASKS}
+    if not selected_tasks:
+        typer.echo(f"No valid tasks. Choose from: {list(TASKS)}", err=True)
+        raise typer.Exit(1)
+    # Build policy instances
+    policy_instances = {}
+    for p in policies:
+        if p == "heuristic":
+            policy_instances["heuristic"] = None  # uses _heuristic() directly
+        elif p == "llm":
+            try:
+                if not API_KEY:
+                    raise RuntimeError("No API key found. Set HF_TOKEN or API_KEY env var.")
+                policy_instances["llm"] = LLMRouter(
+                    api_base_url=API_BASE_URL, model_name=MODEL_NAME, api_key=API_KEY
+                )
+                typer.echo(f"LLM policy: {MODEL_NAME} via {API_BASE_URL}")
+            except RuntimeError as e:
+                typer.echo(f"[WARN] LLM policy unavailable: {e} — skipping", err=True)
+        elif p == "ppo":
+            typer.echo("[WARN] PPO eval not yet wired in this script — run your train_ppo.py separately", err=True)
+    all_results = []
+    total_episodes = len(policy_instances) * len(selected_tasks) * len(selected_seeds)
+    done = 0
+    for pol_name, pol_obj in policy_instances.items():
+        for task_name, task_cfg in selected_tasks.items():
+            for seed in selected_seeds:
+                typer.echo(f"[{done+1}/{total_episodes}] {pol_name:10s} | {task_name:12s} | seed={seed} ...", nl=False)
+                try:
+                    result = run_one_episode(task_name, task_cfg, seed, pol_name, pol_obj)
+                    all_results.append(result)
+                    typer.echo(f" grader={result['grader_score']:.4f}  reward={result['total_reward']:+.2f}")
+                except Exception as e:
+                    typer.echo(f" ERROR: {e}", err=True)
+                done += 1
+    if not all_results:
+        typer.echo("No results produced.", err=True)
+        raise typer.Exit(1)
+    # Save JSON
+    json_path = out_dir / f"eval_results_{ts}.json"
+    summary = build_summary(all_results)
+    output = {
+        "metadata": {"timestamp": ts, "policies": policies, "tasks": tasks, "seeds": selected_seeds},
+        "summary": summary,
+        "episodes": all_results,
+    }
+    json_path.write_text(json.dumps(output, indent=2))
+    typer.echo(f"\nResults saved to {json_path}")
+    # Save markdown table
+    md_table = render_markdown_table(summary, list(policy_instances.keys()), list(selected_tasks.keys()))
+    md_path = out_dir / f"eval_summary_{ts}.md"
+    md_path.write_text(f"# Budget Router Evaluation — {ts}\n\n{md_table}\n")
+    typer.echo(f"Markdown table saved to {md_path}")
+    typer.echo(f"\n{md_table}")
+if __name__ == "__main__":
+    app()
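
The JSON written by this script keeps every episode alongside the summary, so the summary can be re-derived offline. The sketch below mirrors the grouping done in `build_summary` for the `grader_mean` field only. The helper name `grader_means` is hypothetical, and the payload is an illustrative stand-in with the same keys as `eval_results_<timestamp>.json`; its scores are made up for the example and are not taken from a real run.

```python
from collections import defaultdict
from typing import Dict, List


def grader_means(payload: Dict) -> Dict[str, float]:
    """Recompute the mean grader_score per 'task|policy' key from the episodes list."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for episode in payload["episodes"]:
        key = f"{episode['task']}|{episode['policy']}"
        buckets[key].append(episode["grader_score"])
    return {key: round(sum(vals) / len(vals), 4) for key, vals in buckets.items()}


# Illustrative payload only: same keys as the real output, invented scores.
payload = {
    "episodes": [
        {"task": "hard_multi", "policy": "heuristic", "grader_score": 0.50},
        {"task": "hard_multi", "policy": "heuristic", "grader_score": 0.70},
        {"task": "hard_multi", "policy": "llm", "grader_score": 0.80},
    ]
}
means = grader_means(payload)
print(means)
```

With a real results file, the same function could be fed `json.loads(path.read_text())`, and the recomputed values should agree with the `summary` block stored in that file.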

eval/eval_all.sh ADDED

	@@ -0,0 +1,116 @@

+#!/usr/bin/env bash
+# eval_all.sh — Budget Router Evaluator Wrapper
+# ==============================================
+# Runs heuristic + LLM eval and saves results to outputs/.
+#
+# Usage:
+#   chmod +x eval_all.sh
+#   ./eval_all.sh                      # quick: 3 seeds, heuristic + LLM
+#   ./eval_all.sh --seeds 10           # full dev set
+#   ./eval_all.sh --policies heuristic # no LLM (no API needed)
+#   ./eval_all.sh --tasks hard hard_multi --seeds 5
+#
+# Prerequisites:
+#   export HF_TOKEN=<your_huggingface_token>
+#   export API_BASE_URL=https://router.huggingface.co/v1  (default)
+#   export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct           (default)
+#   uv pip install -e .  or  pip install -e .   (installs the budget_router package)
+#
+# Outputs (in outputs/ directory):
+#   eval_results_<timestamp>.json    — full per-episode grader breakdown
+#   eval_summary_<timestamp>.md      — markdown table ready for README
+set -euo pipefail
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
+# ── Defaults ────────────────────────────────────────────────────────────────
+SEEDS=3
+POLICIES="heuristic llm"
+TASKS="easy medium hard hard_multi"
+SEED_SET="dev"
+OUT_DIR="$REPO_ROOT/outputs"
+EXTRA_ARGS=()
+# ── Parse CLI args ──────────────────────────────────────────────────────────
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --seeds)      SEEDS="$2";    shift 2 ;;
+        --seed-set)   SEED_SET="$2"; shift 2 ;;
+        --out-dir)    OUT_DIR="$2";  shift 2 ;;
+        --policies)
+            POLICIES=""
+            shift
+            while [[ $# -gt 0 && ! "$1" =~ ^-- ]]; do
+                POLICIES="$POLICIES $1"; shift
+            done
+            ;;
+        --tasks)
+            TASKS=""
+            shift
+            while [[ $# -gt 0 && ! "$1" =~ ^-- ]]; do
+                TASKS="$TASKS $1"; shift
+            done
+            ;;
+        *) EXTRA_ARGS+=("$1"); shift ;;
+    esac
+done
+# ── Validate environment ─────────────────────────────────────────────────────
+echo ""
+echo "╔══════════════════════════════════════════════╗"
+echo "║   Budget Router Evaluator                   ║"
+echo "╚══════════════════════════════════════════════╝"
+echo ""
+echo "Config:"
+echo "  Policies:  $POLICIES"
+echo "  Tasks:     $TASKS"
+echo "  Seeds:     $SEEDS (seed_set=$SEED_SET)"
+echo "  Output:    $OUT_DIR/"
+echo ""
+# Check HF_TOKEN if LLM in policies
+if echo "$POLICIES" | grep -q "llm"; then
+    if [[ -z "${HF_TOKEN:-}" && -z "${API_KEY:-}" ]]; then
+        echo "⚠️  WARNING: HF_TOKEN and API_KEY not set."
+        echo "   LLM policy will be skipped. Set HF_TOKEN to enable."
+        echo ""
+    else
+        TOKEN_PREVIEW="${HF_TOKEN:-${API_KEY:-}}"
+        # Report only the length so no part of the secret reaches logs.
+        echo "  API key:   set (${#TOKEN_PREVIEW} chars)"
+        echo "  Model:     ${MODEL_NAME:-Qwen/Qwen2.5-72B-Instruct}"
+        echo "  Endpoint:  ${API_BASE_URL:-https://router.huggingface.co/v1}"
+        echo ""
+    fi
+fi
+# ── Build typer args ─────────────────────────────────────────────────────────
+TYPER_ARGS=(
+    "--seeds" "$SEEDS"
+    "--seed-set" "$SEED_SET"
+    "--out-dir" "$OUT_DIR"
+)
+for p in $POLICIES; do
+    TYPER_ARGS+=("--policies" "$p")
+done
+for t in $TASKS; do
+    TYPER_ARGS+=("--tasks" "$t")
+done
+# ── Run ──────────────────────────────────────────────────────────────────────
+cd "$SCRIPT_DIR"
+if command -v uv &>/dev/null; then
+    uv run python eval_all.py "${TYPER_ARGS[@]}" "${EXTRA_ARGS[@]+"${EXTRA_ARGS[@]}"}"
+elif command -v python3 &>/dev/null; then
+    python3 eval_all.py "${TYPER_ARGS[@]}" "${EXTRA_ARGS[@]+"${EXTRA_ARGS[@]}"}"
+else
+    echo "Error: neither uv nor python3 found." >&2
+    exit 1
+fi
+echo ""
+echo "✅ Evaluation complete. Results in $OUT_DIR/"
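
The wrapper accepts space-separated lists (`--policies heuristic llm`) and expands them into one flag per value before calling `eval_all.py`, because Typer list options are passed by repeating the option. A minimal Python rendering of that expansion is sketched below; the function name `build_typer_args` is hypothetical and exists only for this illustration.

```python
from typing import List


def build_typer_args(policies: str, tasks: str, seeds: int, seed_set: str, out_dir: str) -> List[str]:
    """Expand space-separated lists into one --policies/--tasks flag per value."""
    args = ["--seeds", str(seeds), "--seed-set", seed_set, "--out-dir", out_dir]
    for policy in policies.split():
        args += ["--policies", policy]
    for task in tasks.split():
        args += ["--tasks", task]
    return args


argv = build_typer_args("heuristic llm", "hard hard_multi", 5, "dev", "outputs")
print(argv)
```

The resulting list has the same shape as the `TYPER_ARGS` array built by the script, so it can be compared against the script's behaviour when debugging argument handling.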

eval/outputs/prompt_audit/belief_v1_dev10/eval_results_20260425_160429.json ADDED

	@@ -0,0 +1,1188 @@

+{
+  "metadata": {
+    "timestamp": "20260425_160429",
+    "policies": [
+      "heuristic",
+      "llm"
+    ],
+    "tasks": [
+      "hard_multi"
+    ],
+    "seeds": [
+      0,
+      1,
+      2,
+      3,
+      4,
+      5,
+      6,
+      7,
+      8,
+      9
+    ]
+  },
+  "summary": {
+    "hard_multi|heuristic": {
+      "grader_mean": 0.6078,
+      "reward_mean": -2.9709,
+      "success_rate": 0.6998,
+      "adaptation": 0.6907,
+      "n": 10
+    },
+    "hard_multi|llm": {
+      "grader_mean": 0.6218,
+      "reward_mean": 1.3455,
+      "success_rate": 0.8535,
+      "adaptation": 0.8635,
+      "n": 10
+    }
+  },
+  "episodes": [
+    {
+      "task": "hard_multi",
+      "seed": 0,
+      "policy": "heuristic",
+      "total_reward": -4.4659,
+      "grader_score": 0.5569,
+      "success_score": 0.65,
+      "budget_score": 0.0364,
+      "adaptation_score": 0.6032,
+      "latency_score": 0.4686,
+      "sla_score": 0.9474,
+      "success_rate": 0.6842,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        -2.3750364951788474,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 1,
+      "policy": "heuristic",
+      "total_reward": -2.7727,
+      "grader_score": 0.6077,
+      "success_score": 0.7,
+      "budget_score": 0.0455,
+      "adaptation_score": 0.6833,
+      "latency_score": 0.5213,
+      "sla_score": 1.0,
+      "success_rate": 0.7,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 2,
+      "policy": "heuristic",
+      "total_reward": -2.0,
+      "grader_score": 0.6165,
+      "success_score": 0.7,
+      "budget_score": 0.2,
+      "adaptation_score": 0.6357,
+      "latency_score": 0.4967,
+      "sla_score": 1.0,
+      "success_rate": 0.7,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 3,
+      "policy": "heuristic",
+      "total_reward": -1.9895,
+      "grader_score": 0.6289,
+      "success_score": 0.7,
+      "budget_score": 0.2091,
+      "adaptation_score": 0.6833,
+      "latency_score": 0.5416,
+      "sla_score": 0.95,
+      "success_rate": 0.7,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.262190038025986,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 4,
+      "policy": "heuristic",
+      "total_reward": -4.0909,
+      "grader_score": 0.5933,
+      "success_score": 0.65,
+      "budget_score": 0.0818,
+      "adaptation_score": 0.6625,
+      "latency_score": 0.5175,
+      "sla_score": 1.0,
+      "success_rate": 0.6842,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 5,
+      "policy": "heuristic",
+      "total_reward": -1.4024,
+      "grader_score": 0.607,
+      "success_score": 0.65,
+      "budget_score": 0.0364,
+      "adaptation_score": 0.8125,
+      "latency_score": 0.5142,
+      "sla_score": 0.9412,
+      "success_rate": 0.7647,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.311463428136077,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 6,
+      "policy": "heuristic",
+      "total_reward": -3.7273,
+      "grader_score": 0.6546,
+      "success_score": 0.65,
+      "budget_score": 0.4545,
+      "adaptation_score": 0.6458,
+      "latency_score": 0.5611,
+      "sla_score": 1.0,
+      "success_rate": 0.65,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 7,
+      "policy": "heuristic",
+      "total_reward": 0.1818,
+      "grader_score": 0.6477,
+      "success_score": 0.7,
+      "budget_score": 0.0364,
+      "adaptation_score": 0.85,
+      "latency_score": 0.5613,
+      "sla_score": 1.0,
+      "success_rate": 0.7778,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 8,
+      "policy": "heuristic",
+      "total_reward": -8.3509,
+      "grader_score": 0.5338,
+      "success_score": 0.6,
+      "budget_score": 0.2,
+      "adaptation_score": 0.5682,
+      "latency_score": 0.4135,
+      "sla_score": 0.85,
+      "success_rate": 0.6,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        -2.1359667972034475,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.4320645744998868,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2828540896362535,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 9,
+      "policy": "heuristic",
+      "total_reward": -1.0909,
+      "grader_score": 0.6315,
+      "success_score": 0.7,
+      "budget_score": 0.0818,
+      "adaptation_score": 0.7625,
+      "latency_score": 0.5336,
+      "sla_score": 1.0,
+      "success_rate": 0.7368,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 0,
+      "policy": "llm",
+      "total_reward": -3.7273,
+      "grader_score": 0.5176,
+      "success_score": 0.8333,
+      "budget_score": 0.0,
+      "adaptation_score": 0.7946,
+      "latency_score": 0.656,
+      "sla_score": 1.0,
+      "success_rate": 0.8333,
+      "steps": 18,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -10.0
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 1,
+      "policy": "llm",
+      "total_reward": 6.1364,
+      "grader_score": 0.6994,
+      "success_score": 0.8,
+      "budget_score": 0.0273,
+      "adaptation_score": 0.8786,
+      "latency_score": 0.648,
+      "sla_score": 1.0,
+      "success_rate": 0.8889,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 2,
+      "policy": "llm",
+      "total_reward": -1.5455,
+      "grader_score": 0.5204,
+      "success_score": 0.85,
+      "budget_score": 0.0,
+      "adaptation_score": 0.8071,
+      "latency_score": 0.6372,
+      "sla_score": 1.0,
+      "success_rate": 0.85,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -10.0
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 3,
+      "policy": "llm",
+      "total_reward": 9.0455,
+      "grader_score": 0.7388,
+      "success_score": 0.9,
+      "budget_score": 0.0091,
+      "adaptation_score": 0.8944,
+      "latency_score": 0.6926,
+      "sla_score": 1.0,
+      "success_rate": 0.9,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 4,
+      "policy": "llm",
+      "total_reward": -9.6364,
+      "grader_score": 0.4732,
+      "success_score": 0.7222,
+      "budget_score": 0.0,
+      "adaptation_score": 0.7083,
+      "latency_score": 0.6132,
+      "sla_score": 1.0,
+      "success_rate": 0.7222,
+      "steps": 18,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -10.0
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 5,
+      "policy": "llm",
+      "total_reward": 1.9091,
+      "grader_score": 0.6665,
+      "success_score": 0.65,
+      "budget_score": 0.0818,
+      "adaptation_score": 0.9375,
+      "latency_score": 0.6085,
+      "sla_score": 1.0,
+      "success_rate": 0.8667,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 6,
+      "policy": "llm",
+      "total_reward": 9.3182,
+      "grader_score": 0.7535,
+      "success_score": 0.9,
+      "budget_score": 0.0636,
+      "adaptation_score": 0.9444,
+      "latency_score": 0.6755,
+      "sla_score": 1.0,
+      "success_rate": 0.9,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 7,
+      "policy": "llm",
+      "total_reward": 3.1818,
+      "grader_score": 0.673,
+      "success_score": 0.7,
+      "budget_score": 0.0364,
+      "adaptation_score": 0.9375,
+      "latency_score": 0.6004,
+      "sla_score": 1.0,
+      "success_rate": 0.875,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 8,
+      "policy": "llm",
+      "total_reward": 3.3636,
+      "grader_score": 0.6573,
+      "success_score": 0.7,
+      "budget_score": 0.0727,
+      "adaptation_score": 0.8661,
+      "latency_score": 0.5661,
+      "sla_score": 1.0,
+      "success_rate": 0.875,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 9,
+      "policy": "llm",
+      "total_reward": -4.5909,
+      "grader_score": 0.5185,
+      "success_score": 0.8235,
+      "budget_score": 0.0,
+      "adaptation_score": 0.8667,
+      "latency_score": 0.605,
+      "sla_score": 1.0,
+      "success_rate": 0.8235,
+      "steps": 17,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -10.0
+      ]
+    }
+  ]
+}

eval/outputs/prompt_audit/belief_v1_dev10/eval_summary_20260425_160429.md ADDED

	@@ -0,0 +1,5 @@

+# Budget Router Evaluation — 20260425_160429
+| Task | HEURISTIC Grader | LLM Grader | Notes |
+|---|---|---|---|
+| Hard_Multi | 0.6078 (n=10) | 0.6218 (n=10) | LLM +0.0140 grader score (+1.4 points on a 0-100 scale) vs heuristic |
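The LLM column and the delta in the Notes column can be reproduced from the per-episode `grader_score` values in the matching `eval_results_20260425_160429.json`. The sketch below is not taken from `eval/eval_all.py`; it is an illustrative check that assumes the summary is a plain arithmetic mean rounded to four decimals and that "points" means the difference multiplied by 100. The variable names are hypothetical.

```python
from statistics import mean

# LLM grader scores for hard_multi, seeds 0-9, copied from the results JSON.
llm_grader_scores = [
    0.5176, 0.6994, 0.5204, 0.7388, 0.4732,
    0.6665, 0.7535, 0.673, 0.6573, 0.5185,
]

# Heuristic mean as reported in the summary table (assumed correct here,
# since only some heuristic episodes are shown in this part of the diff).
heuristic_grader_mean = 0.6078

# Assumption: the summary uses an unweighted mean rounded to 4 decimals.
llm_grader_mean = round(mean(llm_grader_scores), 4)

# Assumption: "points" in the Notes column is the raw difference times 100.
delta = round(llm_grader_mean - heuristic_grader_mean, 4)
delta_points = round(delta * 100, 1)

print(f"LLM grader mean: {llm_grader_mean}")
print(f"Delta vs heuristic: {delta} ({delta_points} points)")
```

With n=10 per policy and a gap this small, the per-seed spread (0.4732 to 0.7535 for the LLM policy) is much larger than the reported difference, so the delta is best read alongside the individual episodes.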

eval/outputs/prompt_audit/belief_v1_heldout5/eval_results_20260425_160016.json ADDED

	@@ -0,0 +1,615 @@

+{
+  "metadata": {
+    "timestamp": "20260425_160016",
+    "policies": [
+      "heuristic",
+      "llm"
+    ],
+    "tasks": [
+      "hard_multi"
+    ],
+    "seeds": [
+      100,
+      101,
+      102,
+      103,
+      104
+    ]
+  },
+  "summary": {
+    "hard_multi|heuristic": {
+      "grader_mean": 0.6175,
+      "reward_mean": -2.1399,
+      "success_rate": 0.7108,
+      "adaptation": 0.7001,
+      "n": 5
+    },
+    "hard_multi|llm": {
+      "grader_mean": 0.6297,
+      "reward_mean": 1.4818,
+      "success_rate": 0.8462,
+      "adaptation": 0.8568,
+      "n": 5
+    }
+  },
+  "episodes": [
+    {
+      "task": "hard_multi",
+      "seed": 100,
+      "policy": "heuristic",
+      "total_reward": -7.0629,
+      "grader_score": 0.5459,
+      "success_score": 0.6,
+      "budget_score": 0.0909,
+      "adaptation_score": 0.6111,
+      "latency_score": 0.4399,
+      "sla_score": 0.9474,
+      "success_rate": 0.6316,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2447259319640143,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 101,
+      "policy": "heuristic",
+      "total_reward": 3.4091,
+      "grader_score": 0.6753,
+      "success_score": 0.8,
+      "budget_score": 0.0818,
+      "adaptation_score": 0.7857,
+      "latency_score": 0.5795,
+      "sla_score": 1.0,
+      "success_rate": 0.8,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 102,
+      "policy": "heuristic",
+      "total_reward": -2.5909,
+      "grader_score": 0.6228,
+      "success_score": 0.7,
+      "budget_score": 0.0818,
+      "adaptation_score": 0.6932,
+      "latency_score": 0.5593,
+      "sla_score": 1.0,
+      "success_rate": 0.7,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 103,
+      "policy": "heuristic",
+      "total_reward": -2.8182,
+      "grader_score": 0.6003,
+      "success_score": 0.65,
+      "budget_score": 0.0364,
+      "adaptation_score": 0.75,
+      "latency_score": 0.4991,
+      "sla_score": 1.0,
+      "success_rate": 0.7222,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 104,
+      "policy": "heuristic",
+      "total_reward": -1.6364,
+      "grader_score": 0.6432,
+      "success_score": 0.7,
+      "budget_score": 0.2727,
+      "adaptation_score": 0.6607,
+      "latency_score": 0.5509,
+      "sla_score": 1.0,
+      "success_rate": 0.7,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 100,
+      "policy": "llm",
+      "total_reward": -1.1364,
+      "grader_score": 0.6114,
+      "success_score": 0.55,
+      "budget_score": 0.0727,
+      "adaptation_score": 0.8889,
+      "latency_score": 0.5387,
+      "sla_score": 1.0,
+      "success_rate": 0.8462,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 101,
+      "policy": "llm",
+      "total_reward": -2.5909,
+      "grader_score": 0.5212,
+      "success_score": 0.8421,
+      "budget_score": 0.0,
+      "adaptation_score": 0.8333,
+      "latency_score": 0.6282,
+      "sla_score": 1.0,
+      "success_rate": 0.8421,
+      "steps": 19,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -10.0
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 102,
+      "policy": "llm",
+      "total_reward": 6.0909,
+      "grader_score": 0.707,
+      "success_score": 0.8,
+      "budget_score": 0.0182,
+      "adaptation_score": 0.9091,
+      "latency_score": 0.6621,
+      "sla_score": 1.0,
+      "success_rate": 0.8889,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 103,
+      "policy": "llm",
+      "total_reward": -1.3182,
+      "grader_score": 0.6135,
+      "success_score": 0.65,
+      "budget_score": 0.0364,
+      "adaptation_score": 0.7946,
+      "latency_score": 0.5208,
+      "sla_score": 1.0,
+      "success_rate": 0.7647,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 104,
+      "policy": "llm",
+      "total_reward": 6.3636,
+      "grader_score": 0.6953,
+      "success_score": 0.8,
+      "budget_score": 0.0727,
+      "adaptation_score": 0.8583,
+      "latency_score": 0.6138,
+      "sla_score": 1.0,
+      "success_rate": 0.8889,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5
+      ]
+    }
+  ]
+}

eval/outputs/prompt_audit/belief_v1_heldout5/eval_summary_20260425_160016.md ADDED

	@@ -0,0 +1,5 @@

+# Budget Router Evaluation — 20260425_160016
+| Task | HEURISTIC Grader | LLM Grader | Notes |
+|---|---|---|---|
+| Hard_Multi | 0.6175 (n=5) | 0.6297 (n=5) | LLM +1.2 points vs heuristic |
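
The Notes column above reads as the difference between the two grader means, scaled to a 0-100 range. A minimal sketch of that arithmetic follows, assuming this interpretation is correct. The helpers `points_delta` and `format_note` are hypothetical names for illustration and are not taken from `eval/eval_all.py`.

```python
def points_delta(heuristic_mean: float, llm_mean: float) -> float:
    """Grader-score difference (LLM minus heuristic) on a 0-100 scale.

    Hypothetical helper: assumes grader means lie in [0, 1] and that one
    "point" equals 0.01 of grader score.
    """
    return round((llm_mean - heuristic_mean) * 100, 1)


def format_note(heuristic_mean: float, llm_mean: float) -> str:
    """Render the delta in the same wording as the summary table's Notes cell."""
    delta = points_delta(heuristic_mean, llm_mean)
    return f"LLM {delta:+.1f} points vs heuristic"


# Grader means copied from the Hard_Multi row of the summary table (n=5 each).
note = format_note(0.6175, 0.6297)
print(note)
```

With only five seeds per policy, a gap of this size is small relative to the per-seed spread visible in the episode records, so it is best read as indicative rather than conclusive.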

eval/outputs/prompt_audit/budget_guard_alltasks_dev3/eval_results_20260425_165910.json ADDED

	@@ -0,0 +1,1468 @@

+{
+  "metadata": {
+    "timestamp": "20260425_165910",
+    "policies": [
+      "heuristic",
+      "llm"
+    ],
+    "tasks": [
+      "easy",
+      "medium",
+      "hard",
+      "hard_multi"
+    ],
+    "seeds": [
+      0,
+      1,
+      2
+    ]
+  },
+  "summary": {
+    "easy|heuristic": {
+      "grader_mean": 0.7734,
+      "reward_mean": 9.4667,
+      "success_rate": 0.8833,
+      "adaptation": 0.8833,
+      "n": 3
+    },
+    "medium|heuristic": {
+      "grader_mean": 0.6187,
+      "reward_mean": -1.5088,
+      "success_rate": 0.75,
+      "adaptation": 0.7333,
+      "n": 3
+    },
+    "hard|heuristic": {
+      "grader_mean": 0.5491,
+      "reward_mean": -4.7909,
+      "success_rate": 0.7593,
+      "adaptation": 0.7868,
+      "n": 3
+    },
+    "hard_multi|heuristic": {
+      "grader_mean": 0.5937,
+      "reward_mean": -3.0795,
+      "success_rate": 0.6947,
+      "adaptation": 0.6407,
+      "n": 3
+    },
+    "easy|llm": {
+      "grader_mean": 0.7044,
+      "reward_mean": 7.4833,
+      "success_rate": 0.8952,
+      "adaptation": 0.8167,
+      "n": 3
+    },
+    "medium|llm": {
+      "grader_mean": 0.6559,
+      "reward_mean": 1.1492,
+      "success_rate": 0.8061,
+      "adaptation": 0.8111,
+      "n": 3
+    },
+    "hard|llm": {
+      "grader_mean": 0.6196,
+      "reward_mean": -0.7321,
+      "success_rate": 0.8446,
+      "adaptation": 0.8796,
+      "n": 3
+    },
+    "hard_multi|llm": {
+      "grader_mean": 0.6989,
+      "reward_mean": 6.2879,
+      "success_rate": 0.8887,
+      "adaptation": 0.8675,
+      "n": 3
+    }
+  },
+  "episodes": [
+    {
+      "task": "easy",
+      "seed": 0,
+      "policy": "heuristic",
+      "total_reward": 12.2,
+      "grader_score": 0.7709,
+      "success_score": 0.95,
+      "budget_score": 0.04,
+      "adaptation_score": 0.95,
+      "latency_score": 0.6993,
+      "sla_score": 1.0,
+      "success_rate": 0.95,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b"
+      ],
+      "rewards": [
+        -2.05,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75
+      ]
+    },
+    {
+      "task": "easy",
+      "seed": 1,
+      "policy": "heuristic",
+      "total_reward": 10.0,
+      "grader_score": 0.8422,
+      "success_score": 0.85,
+      "budget_score": 0.8,
+      "adaptation_score": 0.85,
+      "latency_score": 0.7358,
+      "sla_score": 1.0,
+      "success_rate": 0.85,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a"
+      ],
+      "rewards": [
+        0.95,
+        0.95,
+        0.95,
+        -2.05,
+        0.95,
+        0.95,
+        -2.05,
+        0.95,
+        0.95,
+        0.95,
+        0.95,
+        0.95,
+        0.95,
+        0.95,
+        0.95,
+        0.95,
+        0.95,
+        0.95,
+        -2.05,
+        0.95
+      ]
+    },
+    {
+      "task": "easy",
+      "seed": 2,
+      "policy": "heuristic",
+      "total_reward": 6.2,
+      "grader_score": 0.7071,
+      "success_score": 0.85,
+      "budget_score": 0.04,
+      "adaptation_score": 0.85,
+      "latency_score": 0.6306,
+      "sla_score": 1.0,
+      "success_rate": 0.85,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b"
+      ],
+      "rewards": [
+        -2.05,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        -2.25,
+        -2.25,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75
+      ]
+    },
+    {
+      "task": "medium",
+      "seed": 0,
+      "policy": "heuristic",
+      "total_reward": 1.2105,
+      "grader_score": 0.6776,
+      "success_score": 0.75,
+      "budget_score": 0.2421,
+      "adaptation_score": 0.7333,
+      "latency_score": 0.5979,
+      "sla_score": 1.0,
+      "success_rate": 0.75,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b"
+      ],
+      "rewards": [
+        0.9473684210526315,
+        0.9473684210526315,
+        0.9473684210526315,
+        0.9473684210526315,
+        -2.0526315789473686,
+        -2.0526315789473686,
+        -2.0526315789473686,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        -2.263157894736842,
+        0.7368421052631579,
+        -2.263157894736842
+      ]
+    },
+    {
+      "task": "medium",
+      "seed": 1,
+      "policy": "heuristic",
+      "total_reward": -0.9474,
+      "grader_score": 0.6688,
+      "success_score": 0.7,
+      "budget_score": 0.4105,
+      "adaptation_score": 0.6667,
+      "latency_score": 0.5696,
+      "sla_score": 1.0,
+      "success_rate": 0.7,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b"
+      ],
+      "rewards": [
+        0.9473684210526315,
+        0.9473684210526315,
+        0.9473684210526315,
+        0.9473684210526315,
+        -2.0526315789473686,
+        0.9473684210526315,
+        0.9473684210526315,
+        0.9473684210526315,
+        -2.0526315789473686,
+        -2.0526315789473686,
+        -2.0526315789473686,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        -2.263157894736842,
+        0.7368421052631579,
+        -2.263157894736842,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579
+      ]
+    },
+    {
+      "task": "medium",
+      "seed": 2,
+      "policy": "heuristic",
+      "total_reward": -4.7895,
+      "grader_score": 0.5097,
+      "success_score": 0.8,
+      "budget_score": 0.0,
+      "adaptation_score": 0.8,
+      "latency_score": 0.6483,
+      "sla_score": 1.0,
+      "success_rate": 0.8,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b"
+      ],
+      "rewards": [
+        -2.0526315789473686,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        -2.263157894736842,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        -2.263157894736842,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        -10.0
+      ]
+    },
+    {
+      "task": "hard",
+      "seed": 0,
+      "policy": "heuristic",
+      "total_reward": -7.8824,
+      "grader_score": 0.4999,
+      "success_score": 0.75,
+      "budget_score": 0.0,
+      "adaptation_score": 0.8235,
+      "latency_score": 0.6343,
+      "sla_score": 1.0,
+      "success_rate": 0.75,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b"
+      ],
+      "rewards": [
+        0.9411764705882353,
+        -2.0588235294117645,
+        -2.0588235294117645,
+        0.7058823529411764,
+        0.7058823529411764,
+        -2.2941176470588234,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        -2.2941176470588234,
+        -10.0
+      ]
+    },
+    {
+      "task": "hard",
+      "seed": 1,
+      "policy": "heuristic",
+      "total_reward": 0.2941,
+      "grader_score": 0.6506,
+      "success_score": 0.75,
+      "budget_score": 0.0588,
+      "adaptation_score": 0.7368,
+      "latency_score": 0.5973,
+      "sla_score": 1.0,
+      "success_rate": 0.75,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b"
+      ],
+      "rewards": [
+        0.9411764705882353,
+        0.9411764705882353,
+        -2.0588235294117645,
+        -2.0588235294117645,
+        -2.0588235294117645,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        -2.2941176470588234,
+        0.7058823529411764,
+        0.7058823529411764,
+        -2.2941176470588234,
+        0.7058823529411764,
+        0.7058823529411764
+      ]
+    },
+    {
+      "task": "hard",
+      "seed": 2,
+      "policy": "heuristic",
+      "total_reward": -6.7845,
+      "grader_score": 0.4969,
+      "success_score": 0.7778,
+      "budget_score": 0.0,
+      "adaptation_score": 0.8,
+      "latency_score": 0.6374,
+      "sla_score": 0.9444,
+      "success_rate": 0.7778,
+      "steps": 18,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b"
+      ],
+      "rewards": [
+        -2.0588235294117645,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        -2.3138982657484135,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        -2.2941176470588234,
+        0.7058823529411764,
+        0.7058823529411764,
+        -10.0
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 0,
+      "policy": "heuristic",
+      "total_reward": -4.4659,
+      "grader_score": 0.5569,
+      "success_score": 0.65,
+      "budget_score": 0.0364,
+      "adaptation_score": 0.6032,
+      "latency_score": 0.4686,
+      "sla_score": 0.9474,
+      "success_rate": 0.6842,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        -2.3750364951788474,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 1,
+      "policy": "heuristic",
+      "total_reward": -2.7727,
+      "grader_score": 0.6077,
+      "success_score": 0.7,
+      "budget_score": 0.0455,
+      "adaptation_score": 0.6833,
+      "latency_score": 0.5213,
+      "sla_score": 1.0,
+      "success_rate": 0.7,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 2,
+      "policy": "heuristic",
+      "total_reward": -2.0,
+      "grader_score": 0.6165,
+      "success_score": 0.7,
+      "budget_score": 0.2,
+      "adaptation_score": 0.6357,
+      "latency_score": 0.4967,
+      "sla_score": 1.0,
+      "success_rate": 0.7,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "easy",
+      "seed": 0,
+      "policy": "llm",
+      "total_reward": 12.2,
+      "grader_score": 0.7709,
+      "success_score": 0.95,
+      "budget_score": 0.04,
+      "adaptation_score": 0.95,
+      "latency_score": 0.6993,
+      "sla_score": 1.0,
+      "success_rate": 0.95,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b"
+      ],
+      "rewards": [
+        -2.05,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75
+      ]
+    },
+    {
+      "task": "easy",
+      "seed": 1,
+      "policy": "llm",
+      "total_reward": 12.8,
+      "grader_score": 0.7902,
+      "success_score": 0.95,
+      "budget_score": 0.16,
+      "adaptation_score": 0.95,
+      "latency_score": 0.7058,
+      "sla_score": 1.0,
+      "success_rate": 0.95,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b"
+      ],
+      "rewards": [
+        0.95,
+        0.95,
+        0.95,
+        -2.05,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75
+      ]
+    },
+    {
+      "task": "easy",
+      "seed": 2,
+      "policy": "llm",
+      "total_reward": -2.55,
+      "grader_score": 0.552,
+      "success_score": 0.55,
+      "budget_score": 0.09,
+      "adaptation_score": 0.55,
+      "latency_score": 0.5677,
+      "sla_score": 1.0,
+      "success_rate": 0.7857,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.05,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        0.75,
+        -2.25,
+        -2.25,
+        0.5,
+        0.5,
+        0.5,
+        0.5,
+        0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "medium",
+      "seed": 0,
+      "policy": "llm",
+      "total_reward": 9.2632,
+      "grader_score": 0.7476,
+      "success_score": 0.9,
+      "budget_score": 0.0526,
+      "adaptation_score": 0.9333,
+      "latency_score": 0.6651,
+      "sla_score": 1.0,
+      "success_rate": 0.9,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9473684210526315,
+        0.9473684210526315,
+        0.9473684210526315,
+        0.9473684210526315,
+        -2.0526315789473686,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        -2.263157894736842,
+        0.4736842105263157,
+        0.4736842105263157
+      ]
+    },
+    {
+      "task": "medium",
+      "seed": 1,
+      "policy": "llm",
+      "total_reward": -2.5789,
+      "grader_score": 0.6148,
+      "success_score": 0.7,
+      "budget_score": 0.0842,
+      "adaptation_score": 0.6667,
+      "latency_score": 0.5439,
+      "sla_score": 1.0,
+      "success_rate": 0.7,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_a",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9473684210526315,
+        0.9473684210526315,
+        0.9473684210526315,
+        0.9473684210526315,
+        -2.0526315789473686,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        -2.263157894736842,
+        -2.0526315789473686,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        -2.263157894736842,
+        -2.0526315789473686,
+        -2.263157894736842,
+        0.4736842105263157,
+        0.4736842105263157,
+        0.4736842105263157
+      ]
+    },
+    {
+      "task": "medium",
+      "seed": 2,
+      "policy": "llm",
+      "total_reward": -3.2368,
+      "grader_score": 0.6052,
+      "success_score": 0.45,
+      "budget_score": 0.2526,
+      "adaptation_score": 0.8333,
+      "latency_score": 0.5783,
+      "sla_score": 1.0,
+      "success_rate": 0.8182,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0526315789473686,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        0.7368421052631579,
+        -2.263157894736842,
+        0.4736842105263157,
+        0.4736842105263157,
+        0.4736842105263157,
+        0.4736842105263157,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard",
+      "seed": 0,
+      "policy": "llm",
+      "total_reward": -2.8235,
+      "grader_score": 0.5954,
+      "success_score": 0.5,
+      "budget_score": 0.0353,
+      "adaptation_score": 0.8889,
+      "latency_score": 0.5615,
+      "sla_score": 1.0,
+      "success_rate": 0.8333,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9411764705882353,
+        -2.0588235294117645,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        -2.2941176470588234,
+        0.4117647058823529,
+        0.4117647058823529,
+        0.4117647058823529,
+        0.4117647058823529,
+        0.4117647058823529,
+        -0.5,
+        0.4117647058823529,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard",
+      "seed": 1,
+      "policy": "llm",
+      "total_reward": 4.6176,
+      "grader_score": 0.69,
+      "success_score": 0.75,
+      "budget_score": 0.0235,
+      "adaptation_score": 0.875,
+      "latency_score": 0.6823,
+      "sla_score": 1.0,
+      "success_rate": 0.8824,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9411764705882353,
+        0.9411764705882353,
+        -2.0588235294117645,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        -2.2941176470588234,
+        0.4117647058823529,
+        0.4117647058823529,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard",
+      "seed": 2,
+      "policy": "llm",
+      "total_reward": -3.9904,
+      "grader_score": 0.5735,
+      "success_score": 0.45,
+      "budget_score": 0.1059,
+      "adaptation_score": 0.875,
+      "latency_score": 0.556,
+      "sla_score": 0.9091,
+      "success_rate": 0.8182,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0588235294117645,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        0.7058823529411764,
+        -2.3138982657484135,
+        0.4117647058823529,
+        0.4117647058823529,
+        0.4117647058823529,
+        0.4117647058823529,
+        0.4117647058823529,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 0,
+      "policy": "llm",
+      "total_reward": 4.7727,
+      "grader_score": 0.6818,
+      "success_score": 0.75,
+      "budget_score": 0.0545,
+      "adaptation_score": 0.8571,
+      "latency_score": 0.6358,
+      "sla_score": 1.0,
+      "success_rate": 0.8824,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 1,
+      "policy": "llm",
+      "total_reward": 6.1364,
+      "grader_score": 0.6994,
+      "success_score": 0.8,
+      "budget_score": 0.0273,
+      "adaptation_score": 0.8786,
+      "latency_score": 0.648,
+      "sla_score": 1.0,
+      "success_rate": 0.8889,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 2,
+      "policy": "llm",
+      "total_reward": 7.9545,
+      "grader_score": 0.7156,
+      "success_score": 0.85,
+      "budget_score": 0.0909,
+      "adaptation_score": 0.8667,
+      "latency_score": 0.6181,
+      "sla_score": 1.0,
+      "success_rate": 0.8947,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5
+      ]
+    }
+  ]
+}
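
In the episodes above, `total_reward` appears to be the sum of the per-step `rewards`, rounded to four decimals. The following is a minimal sketch of that check, using the values from the `hard_multi` / seed 0 / `llm` episode. The helper name `check_total` and the tolerance are assumptions for illustration and are not part of the repository.

```python
import math

# Per-step rewards copied from the hard_multi / seed 0 / llm episode above.
rewards = (
    [0.9545454545454546] * 3
    + [-2.0454545454545454]
    + [0.7727272727272727] * 5
    + [-2.2272727272727275]
    + [0.5454545454545454] * 7
    + [-0.5] * 3
)
reported_total = 4.7727  # "total_reward" as stored in the JSON


def check_total(rewards, reported, tol=1e-4):
    """Hypothetical helper: True if reported matches sum(rewards) within tol."""
    return math.isclose(sum(rewards), reported, abs_tol=tol)


print(len(rewards), round(sum(rewards), 4), check_total(rewards, reported_total))
```

The same check can be run over every entry in `episodes` to catch a truncated or hand-edited results file.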

eval/outputs/prompt_audit/budget_guard_alltasks_dev3/eval_summary_20260425_165910.md ADDED

	@@ -0,0 +1,8 @@

+# Budget Router Evaluation — 20260425_165910
+| Task | HEURISTIC Grader | LLM Grader | Notes |
+|---|---|---|---|
+| Easy | 0.7734 (n=3) | 0.7044 (n=3) |  |
+| Medium | 0.6187 (n=3) | 0.6559 (n=3) |  |
+| Hard | 0.5491 (n=3) | 0.6196 (n=3) |  |
+| Hard_Multi | 0.5937 (n=3) | 0.6989 (n=3) | LLM +10.5 points vs heuristic |
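
The grader columns in this table appear to be per-task, per-policy means of `grader_score`, and the "+10.5 points" note appears to be the difference of those means scaled by 100. The sketch below reproduces the `Hard_Multi` LLM cell from the three `hard_multi` LLM episodes in the results JSON above. The function name `grader_means` is an assumption for illustration, and the heuristic mean is taken from the table rather than recomputed.

```python
from collections import defaultdict
from statistics import mean

# Grader scores copied from the three hard_multi LLM episodes (seeds 0-2) above.
episodes = [
    {"task": "hard_multi", "policy": "llm", "grader_score": 0.6818},
    {"task": "hard_multi", "policy": "llm", "grader_score": 0.6994},
    {"task": "hard_multi", "policy": "llm", "grader_score": 0.7156},
]


def grader_means(episodes):
    """Hypothetical helper: mean grader_score per (task, policy), 4 decimals."""
    groups = defaultdict(list)
    for ep in episodes:
        groups[(ep["task"], ep["policy"])].append(ep["grader_score"])
    return {key: round(mean(vals), 4) for key, vals in groups.items()}


means = grader_means(episodes)
llm_mean = means[("hard_multi", "llm")]
heuristic_mean = 0.5937  # taken from the table, not recomputed here
delta_points = round((llm_mean - heuristic_mean) * 100, 1)
print(llm_mean, delta_points)
```

With only three seeds per cell, a gap of this size is better read as a direction than as a precise estimate.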

eval/outputs/prompt_audit/budget_guard_dev10/eval_results_20260425_164343.json ADDED

	@@ -0,0 +1,1202 @@

+{
+  "metadata": {
+    "timestamp": "20260425_164343",
+    "policies": [
+      "heuristic",
+      "llm"
+    ],
+    "tasks": [
+      "hard_multi"
+    ],
+    "seeds": [
+      0,
+      1,
+      2,
+      3,
+      4,
+      5,
+      6,
+      7,
+      8,
+      9
+    ]
+  },
+  "summary": {
+    "hard_multi|heuristic": {
+      "grader_mean": 0.6078,
+      "reward_mean": -2.9709,
+      "success_rate": 0.6998,
+      "adaptation": 0.6907,
+      "n": 10
+    },
+    "hard_multi|llm": {
+      "grader_mean": 0.6888,
+      "reward_mean": 4.7955,
+      "success_rate": 0.8722,
+      "adaptation": 0.8882,
+      "n": 10
+    }
+  },
+  "episodes": [
+    {
+      "task": "hard_multi",
+      "seed": 0,
+      "policy": "heuristic",
+      "total_reward": -4.4659,
+      "grader_score": 0.5569,
+      "success_score": 0.65,
+      "budget_score": 0.0364,
+      "adaptation_score": 0.6032,
+      "latency_score": 0.4686,
+      "sla_score": 0.9474,
+      "success_rate": 0.6842,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        -2.3750364951788474,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 1,
+      "policy": "heuristic",
+      "total_reward": -2.7727,
+      "grader_score": 0.6077,
+      "success_score": 0.7,
+      "budget_score": 0.0455,
+      "adaptation_score": 0.6833,
+      "latency_score": 0.5213,
+      "sla_score": 1.0,
+      "success_rate": 0.7,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 2,
+      "policy": "heuristic",
+      "total_reward": -2.0,
+      "grader_score": 0.6165,
+      "success_score": 0.7,
+      "budget_score": 0.2,
+      "adaptation_score": 0.6357,
+      "latency_score": 0.4967,
+      "sla_score": 1.0,
+      "success_rate": 0.7,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 3,
+      "policy": "heuristic",
+      "total_reward": -1.9895,
+      "grader_score": 0.6289,
+      "success_score": 0.7,
+      "budget_score": 0.2091,
+      "adaptation_score": 0.6833,
+      "latency_score": 0.5416,
+      "sla_score": 0.95,
+      "success_rate": 0.7,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.262190038025986,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 4,
+      "policy": "heuristic",
+      "total_reward": -4.0909,
+      "grader_score": 0.5933,
+      "success_score": 0.65,
+      "budget_score": 0.0818,
+      "adaptation_score": 0.6625,
+      "latency_score": 0.5175,
+      "sla_score": 1.0,
+      "success_rate": 0.6842,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 5,
+      "policy": "heuristic",
+      "total_reward": -1.4024,
+      "grader_score": 0.607,
+      "success_score": 0.65,
+      "budget_score": 0.0364,
+      "adaptation_score": 0.8125,
+      "latency_score": 0.5142,
+      "sla_score": 0.9412,
+      "success_rate": 0.7647,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.311463428136077,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 6,
+      "policy": "heuristic",
+      "total_reward": -3.7273,
+      "grader_score": 0.6546,
+      "success_score": 0.65,
+      "budget_score": 0.4545,
+      "adaptation_score": 0.6458,
+      "latency_score": 0.5611,
+      "sla_score": 1.0,
+      "success_rate": 0.65,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 7,
+      "policy": "heuristic",
+      "total_reward": 0.1818,
+      "grader_score": 0.6477,
+      "success_score": 0.7,
+      "budget_score": 0.0364,
+      "adaptation_score": 0.85,
+      "latency_score": 0.5613,
+      "sla_score": 1.0,
+      "success_rate": 0.7778,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 8,
+      "policy": "heuristic",
+      "total_reward": -8.3509,
+      "grader_score": 0.5338,
+      "success_score": 0.6,
+      "budget_score": 0.2,
+      "adaptation_score": 0.5682,
+      "latency_score": 0.4135,
+      "sla_score": 0.85,
+      "success_rate": 0.6,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        -2.1359667972034475,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.4320645744998868,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2828540896362535,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 9,
+      "policy": "heuristic",
+      "total_reward": -1.0909,
+      "grader_score": 0.6315,
+      "success_score": 0.7,
+      "budget_score": 0.0818,
+      "adaptation_score": 0.7625,
+      "latency_score": 0.5336,
+      "sla_score": 1.0,
+      "success_rate": 0.7368,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 0,
+      "policy": "llm",
+      "total_reward": 4.7727,
+      "grader_score": 0.6818,
+      "success_score": 0.75,
+      "budget_score": 0.0545,
+      "adaptation_score": 0.8571,
+      "latency_score": 0.6358,
+      "sla_score": 1.0,
+      "success_rate": 0.8824,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 1,
+      "policy": "llm",
+      "total_reward": 6.1364,
+      "grader_score": 0.6994,
+      "success_score": 0.8,
+      "budget_score": 0.0273,
+      "adaptation_score": 0.8786,
+      "latency_score": 0.648,
+      "sla_score": 1.0,
+      "success_rate": 0.8889,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 2,
+      "policy": "llm",
+      "total_reward": 7.9545,
+      "grader_score": 0.7156,
+      "success_score": 0.85,
+      "budget_score": 0.0909,
+      "adaptation_score": 0.8667,
+      "latency_score": 0.6181,
+      "sla_score": 1.0,
+      "success_rate": 0.8947,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 3,
+      "policy": "llm",
+      "total_reward": 9.0455,
+      "grader_score": 0.7388,
+      "success_score": 0.9,
+      "budget_score": 0.0091,
+      "adaptation_score": 0.8944,
+      "latency_score": 0.6926,
+      "sla_score": 1.0,
+      "success_rate": 0.9,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 4,
+      "policy": "llm",
+      "total_reward": -1.1364,
+      "grader_score": 0.624,
+      "success_score": 0.65,
+      "budget_score": 0.0727,
+      "adaptation_score": 0.75,
+      "latency_score": 0.5904,
+      "sla_score": 1.0,
+      "success_rate": 0.7647,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 5,
+      "policy": "llm",
+      "total_reward": 1.9091,
+      "grader_score": 0.6665,
+      "success_score": 0.65,
+      "budget_score": 0.0818,
+      "adaptation_score": 0.9375,
+      "latency_score": 0.6085,
+      "sla_score": 1.0,
+      "success_rate": 0.8667,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 6,
+      "policy": "llm",
+      "total_reward": 9.3182,
+      "grader_score": 0.7535,
+      "success_score": 0.9,
+      "budget_score": 0.0636,
+      "adaptation_score": 0.9444,
+      "latency_score": 0.6755,
+      "sla_score": 1.0,
+      "success_rate": 0.9,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 7,
+      "policy": "llm",
+      "total_reward": 3.1818,
+      "grader_score": 0.673,
+      "success_score": 0.7,
+      "budget_score": 0.0364,
+      "adaptation_score": 0.9375,
+      "latency_score": 0.6004,
+      "sla_score": 1.0,
+      "success_rate": 0.875,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 8,
+      "policy": "llm",
+      "total_reward": 3.3636,
+      "grader_score": 0.6573,
+      "success_score": 0.7,
+      "budget_score": 0.0727,
+      "adaptation_score": 0.8661,
+      "latency_score": 0.5661,
+      "sla_score": 1.0,
+      "success_rate": 0.875,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 9,
+      "policy": "llm",
+      "total_reward": 3.4091,
+      "grader_score": 0.6783,
+      "success_score": 0.7,
+      "budget_score": 0.0818,
+      "adaptation_score": 0.95,
+      "latency_score": 0.5803,
+      "sla_score": 1.0,
+      "success_rate": 0.875,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    }
+  ]
+}

eval/outputs/prompt_audit/budget_guard_dev10/eval_summary_20260425_164343.md ADDED

	@@ -0,0 +1,5 @@

+# Budget Router Evaluation — 20260425_164343
+| Task | HEURISTIC Grader | LLM Grader | Notes |
+|---|---|---|---|
+| Hard_Multi | 0.6078 (n=10) | 0.6888 (n=10) | LLM +8.1 points vs heuristic |
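
The Notes column appears to report the gap between the two grader means in points, where one point is 0.01 of grader score. A minimal sketch of that arithmetic, assuming the delta is simply the difference of the two rounded means times 100. The helper name `delta_points` is hypothetical and is not taken from `eval/eval_all.py`.

```python
def delta_points(llm_mean: float, heuristic_mean: float) -> float:
    """Gap between two grader means, expressed in points (1 point = 0.01)."""
    return round((llm_mean - heuristic_mean) * 100, 1)


# Grader means copied from the dev10 summary row above.
dev10_delta = delta_points(0.6888, 0.6078)
print(f"LLM {dev10_delta:+.1f} points vs heuristic")
```

The same helper applied to other summary rows gives a quick check that a Notes cell matches its two grader columns.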

eval/outputs/prompt_audit/budget_guard_heldout5/eval_results_20260425_163956.json ADDED

	@@ -0,0 +1,617 @@

+{
+  "metadata": {
+    "timestamp": "20260425_163956",
+    "policies": [
+      "heuristic",
+      "llm"
+    ],
+    "tasks": [
+      "hard_multi"
+    ],
+    "seeds": [
+      100,
+      101,
+      102,
+      103,
+      104
+    ]
+  },
+  "summary": {
+    "hard_multi|heuristic": {
+      "grader_mean": 0.6175,
+      "reward_mean": -2.1399,
+      "success_rate": 0.7108,
+      "adaptation": 0.7001,
+      "n": 5
+    },
+    "hard_multi|llm": {
+      "grader_mean": 0.6577,
+      "reward_mean": 2.3818,
+      "success_rate": 0.8196,
+      "adaptation": 0.8216,
+      "n": 5
+    }
+  },
+  "episodes": [
+    {
+      "task": "hard_multi",
+      "seed": 100,
+      "policy": "heuristic",
+      "total_reward": -7.0629,
+      "grader_score": 0.5459,
+      "success_score": 0.6,
+      "budget_score": 0.0909,
+      "adaptation_score": 0.6111,
+      "latency_score": 0.4399,
+      "sla_score": 0.9474,
+      "success_rate": 0.6316,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2447259319640143,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 101,
+      "policy": "heuristic",
+      "total_reward": 3.4091,
+      "grader_score": 0.6753,
+      "success_score": 0.8,
+      "budget_score": 0.0818,
+      "adaptation_score": 0.7857,
+      "latency_score": 0.5795,
+      "sla_score": 1.0,
+      "success_rate": 0.8,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 102,
+      "policy": "heuristic",
+      "total_reward": -2.5909,
+      "grader_score": 0.6228,
+      "success_score": 0.7,
+      "budget_score": 0.0818,
+      "adaptation_score": 0.6932,
+      "latency_score": 0.5593,
+      "sla_score": 1.0,
+      "success_rate": 0.7,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 103,
+      "policy": "heuristic",
+      "total_reward": -2.8182,
+      "grader_score": 0.6003,
+      "success_score": 0.65,
+      "budget_score": 0.0364,
+      "adaptation_score": 0.75,
+      "latency_score": 0.4991,
+      "sla_score": 1.0,
+      "success_rate": 0.7222,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 104,
+      "policy": "heuristic",
+      "total_reward": -1.6364,
+      "grader_score": 0.6432,
+      "success_score": 0.7,
+      "budget_score": 0.2727,
+      "adaptation_score": 0.6607,
+      "latency_score": 0.5509,
+      "sla_score": 1.0,
+      "success_rate": 0.7,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 100,
+      "policy": "llm",
+      "total_reward": -5.6364,
+      "grader_score": 0.5687,
+      "success_score": 0.6,
+      "budget_score": 0.0727,
+      "adaptation_score": 0.6458,
+      "latency_score": 0.493,
+      "sla_score": 1.0,
+      "success_rate": 0.6667,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 101,
+      "policy": "llm",
+      "total_reward": 6.4091,
+      "grader_score": 0.7038,
+      "success_score": 0.8,
+      "budget_score": 0.0818,
+      "adaptation_score": 0.9,
+      "latency_score": 0.6076,
+      "sla_score": 1.0,
+      "success_rate": 0.8889,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 102,
+      "policy": "llm",
+      "total_reward": 6.0909,
+      "grader_score": 0.707,
+      "success_score": 0.8,
+      "budget_score": 0.0182,
+      "adaptation_score": 0.9091,
+      "latency_score": 0.6621,
+      "sla_score": 1.0,
+      "success_rate": 0.8889,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 103,
+      "policy": "llm",
+      "total_reward": -1.3182,
+      "grader_score": 0.6135,
+      "success_score": 0.65,
+      "budget_score": 0.0364,
+      "adaptation_score": 0.7946,
+      "latency_score": 0.5208,
+      "sla_score": 1.0,
+      "success_rate": 0.7647,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5,
+        -0.5
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 104,
+      "policy": "llm",
+      "total_reward": 6.3636,
+      "grader_score": 0.6953,
+      "success_score": 0.8,
+      "budget_score": 0.0727,
+      "adaptation_score": 0.8583,
+      "latency_score": 0.6138,
+      "sla_score": 1.0,
+      "success_rate": 0.8889,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5
+      ]
+    }
+  ]
+}

eval/outputs/prompt_audit/budget_guard_heldout5/eval_summary_20260425_163956.md ADDED

	@@ -0,0 +1,5 @@

+# Budget Router Evaluation — 20260425_163956
+| Task | HEURISTIC Grader | LLM Grader | Notes |
+|---|---|---|---|
+| Hard_Multi | 0.6175 (n=5) | 0.6577 (n=5) | LLM +4.0 points vs heuristic |
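
The held-out summary can be cross-checked against the per-episode records in the matching results JSON. A minimal sketch, assuming each summary field is the plain arithmetic mean of the corresponding episode field rounded to four decimals. The scores are copied from the `hard_multi` episodes for seeds 100 to 104, and `mean4` is a hypothetical helper, not part of the repository.

```python
from statistics import fmean

# grader_score per episode (seeds 100..104), copied from the results JSON above.
grader_scores = {
    "heuristic": [0.5459, 0.6753, 0.6228, 0.6003, 0.6432],
    "llm": [0.5687, 0.7038, 0.7070, 0.6135, 0.6953],
}


def mean4(values):
    """Arithmetic mean rounded to four decimals, matching the summary precision."""
    return round(fmean(values), 4)


summary = {policy: mean4(scores) for policy, scores in grader_scores.items()}
gap_points = round((summary["llm"] - summary["heuristic"]) * 100, 1)
print(summary, gap_points)
```

With only five held-out seeds, the gap is best read as a rough indication rather than a tight estimate.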

eval/outputs/trace_compare/eval_seed101/eval_results_20260425_192545.json ADDED

	@@ -0,0 +1,149 @@

+{
+  "metadata": {
+    "timestamp": "20260425_192545",
+    "policies": [
+      "heuristic",
+      "llm"
+    ],
+    "tasks": [
+      "hard_multi"
+    ],
+    "seeds": [
+      101
+    ]
+  },
+  "summary": {
+    "hard_multi|heuristic": {
+      "grader_mean": 0.6753,
+      "reward_mean": 3.4091,
+      "success_rate": 0.8,
+      "adaptation": 0.7857,
+      "n": 1
+    },
+    "hard_multi|llm": {
+      "grader_mean": 0.7038,
+      "reward_mean": 6.4091,
+      "success_rate": 0.8889,
+      "adaptation": 0.9,
+      "n": 1
+    }
+  },
+  "episodes": [
+    {
+      "task": "hard_multi",
+      "seed": 101,
+      "policy": "heuristic",
+      "total_reward": 3.4091,
+      "grader_score": 0.6753,
+      "success_score": 0.8,
+      "budget_score": 0.0818,
+      "adaptation_score": 0.7857,
+      "latency_score": 0.5795,
+      "sla_score": 1.0,
+      "success_rate": 0.8,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 101,
+      "policy": "llm",
+      "total_reward": 6.4091,
+      "grader_score": 0.7038,
+      "success_score": 0.8,
+      "budget_score": 0.0818,
+      "adaptation_score": 0.9,
+      "latency_score": 0.6076,
+      "sla_score": 1.0,
+      "success_rate": 0.8889,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5
+      ]
+    }
+  ]
+}
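
The seed-101 LLM episode above can be used to sanity-check two of its reported fields. A minimal sketch, assuming `total_reward` is the sum of the step rewards and treating a positive step reward on a routed step as a successful request. That second assumption is an inference from the numbers in this trace, not a statement about how `grade_episode` is implemented.

```python
# Actions and step rewards for the seed-101 LLM episode, copied from the JSON above.
actions = (
    ["route_to_a"]
    + ["route_to_b"] * 14
    + ["route_to_c"] * 3
    + ["shed_load"] * 2
)
rewards = (
    [-2.0454545454545454]
    + [0.7727272727272727] * 13
    + [-2.2272727272727275]
    + [0.5454545454545454] * 3
    + [-0.5] * 2
)

# total_reward as the sum of per-step rewards.
total_reward = round(sum(rewards), 4)

# Assumed success rate: positive-reward steps over routed (non-shed) steps.
routed = [r for a, r in zip(actions, rewards) if a != "shed_load"]
successes = sum(1 for r in routed if r > 0)
success_rate = round(successes / len(routed), 4)

print(total_reward, successes, len(routed), success_rate)
```

Under these assumptions the recomputed values agree with the reported `total_reward` of 6.4091 and `success_rate` of 0.8889 for this episode.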

eval/outputs/trace_compare/eval_seed101/eval_results_20260425_192656.json ADDED

	@@ -0,0 +1,149 @@

+{
+  "metadata": {
+    "timestamp": "20260425_192656",
+    "policies": [
+      "heuristic",
+      "llm"
+    ],
+    "tasks": [
+      "hard_multi"
+    ],
+    "seeds": [
+      102
+    ]
+  },
+  "summary": {
+    "hard_multi|heuristic": {
+      "grader_mean": 0.6228,
+      "reward_mean": -2.5909,
+      "success_rate": 0.7,
+      "adaptation": 0.6932,
+      "n": 1
+    },
+    "hard_multi|llm": {
+      "grader_mean": 0.707,
+      "reward_mean": 6.0909,
+      "success_rate": 0.8889,
+      "adaptation": 0.9091,
+      "n": 1
+    }
+  },
+  "episodes": [
+    {
+      "task": "hard_multi",
+      "seed": 102,
+      "policy": "heuristic",
+      "total_reward": -2.5909,
+      "grader_score": 0.6228,
+      "success_score": 0.7,
+      "budget_score": 0.0818,
+      "adaptation_score": 0.6932,
+      "latency_score": 0.5593,
+      "sla_score": 1.0,
+      "success_rate": 0.7,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        -2.0454545454545454,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.7727272727272727,
+        -2.2272727272727275,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454
+      ]
+    },
+    {
+      "task": "hard_multi",
+      "seed": 102,
+      "policy": "llm",
+      "total_reward": 6.0909,
+      "grader_score": 0.707,
+      "success_score": 0.8,
+      "budget_score": 0.0182,
+      "adaptation_score": 0.9091,
+      "latency_score": 0.6621,
+      "sla_score": 1.0,
+      "success_rate": 0.8889,
+      "steps": 20,
+      "actions": [
+        "route_to_a",
+        "route_to_a",
+        "route_to_a",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_b",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "route_to_c",
+        "shed_load",
+        "shed_load"
+      ],
+      "rewards": [
+        0.9545454545454546,
+        0.9545454545454546,
+        -2.0454545454545454,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        0.7727272727272727,
+        -2.2272727272727275,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        0.5454545454545454,
+        -0.5,
+        -0.5
+      ]
+    }
+  ]
+}

eval/outputs/trace_compare/eval_seed101/eval_summary_20260425_192545.md ADDED

	@@ -0,0 +1,5 @@

+# Budget Router Evaluation — 20260425_192545
+| Task | HEURISTIC Grader | LLM Grader | Notes |
+|---|---|---|---|
+| Hard_Multi | 0.6753 (n=1) | 0.7038 (n=1) | LLM +2.8 points vs heuristic |

eval/outputs/trace_compare/eval_seed101/eval_summary_20260425_192656.md ADDED

	@@ -0,0 +1,5 @@

+# Budget Router Evaluation — 20260425_192656
+| Task | HEURISTIC Grader | LLM Grader | Notes |
+|---|---|---|---|
+| Hard_Multi | 0.6228 (n=1) | 0.7070 (n=1) | LLM +8.4 points vs heuristic |

eval/trace_episode.py ADDED

	@@ -0,0 +1,357 @@

+#!/usr/bin/env python3
+"""
+Trace one Budget Router episode for a chosen policy, task, and seed.
+This is a debugging/evidence tool: it prints per-step actions, step rewards,
+costs, success/failure, latency, cumulative reward, and final grader metrics.
+It does not expose hidden provider health to the policy.
+"""
+from __future__ import annotations
+import json
+import os
+import sys
+from pathlib import Path
+from typing import Any, Dict, List, Optional
+import typer
+# Ensure imports work when run as `uv run python eval/trace_episode.py`.
+sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
+from budget_router.environment import BudgetRouterEnv
+from budget_router.models import Action, ActionType, Observation, TaskConfig
+from budget_router.policies import heuristic_baseline_policy
+from budget_router.reward import episode_metrics, grade_episode
+from budget_router.tasks import TASK_PRESETS
+from inference import LLMRouter
+app = typer.Typer(add_completion=False)
+POLICIES = {"heuristic", "llm", "ppo"}
+DEFAULT_PPO_MODELS = {
+    "easy": Path("trained_models/ppo_easy_50k.zip"),
+    "hard_multi": Path("trained_models/ppo_hard_multi_100k.zip"),
+}
+# Matches train/gym_wrapper.BudgetRouterGymEnv action order (Discrete 0..3).
+_PPO_ACTION_NAMES = ("route_to_a", "route_to_b", "route_to_c", "shed_load")
+def _echo_step_progress(
+    *,
+    policy_label: str,
+    step: int,
+    action: str,
+    reward: float,
+    cumulative: float,
+    done: bool,
+    llm_error: Optional[str] = None,
+    verbose: bool,
+) -> None:
+    if not verbose:
+        return
+    err = f" llm_error={llm_error}" if llm_error else ""
+    typer.echo(
+        f"[trace] policy={policy_label} step={step} action={action} "
+        f"reward={reward:+.3f} cum={cumulative:+.3f} done={done}{err}"
+    )
+def _visible_observation_row(obs: Observation) -> Dict[str, float]:
+    """Public observation values available to the policy before it acts."""
+    return {
+        "provider_a_status": round(float(obs.provider_a_status), 4),
+        "provider_b_status": round(float(obs.provider_b_status), 4),
+        "provider_c_status": round(float(obs.provider_c_status), 4),
+        "observed_budget_remaining": round(float(obs.budget_remaining), 4),
+        "queue_backlog": round(float(obs.queue_backlog), 4),
+        "system_latency": round(float(obs.system_latency), 4),
+        "step_count": round(float(obs.step_count), 4),
+    }
+def _visible_observation_row_from_array(values: Any) -> Dict[str, float]:
+    """Public observation values from the Gym wrapper's 7-field observation array."""
+    return {
+        "provider_a_status": round(float(values[0]), 4),
+        "provider_b_status": round(float(values[1]), 4),
+        "provider_c_status": round(float(values[2]), 4),
+        "observed_budget_remaining": round(float(values[3]), 4),
+        "queue_backlog": round(float(values[4]), 4),
+        "system_latency": round(float(values[5]), 4),
+        "step_count": round(float(values[6]), 4),
+    }
+def _cumulative_step_rows(
+    history: List[Dict[str, Any]],
+    visible_observations: List[Dict[str, float]],
+) -> List[Dict[str, Any]]:
+    rows: List[Dict[str, Any]] = []
+    cumulative_reward = 0.0
+    cumulative_cost = 0.0
+    for item in history:
+        reward = float(item.get("reward", 0.0) or 0.0)
+        cost = float(item.get("cost", 0.0) or 0.0)
+        initial_budget = float(item.get("initial_budget", 0.0) or 0.0)
+        cumulative_reward += reward
+        cumulative_cost += cost
+        budget_remaining = max(0.0, initial_budget - cumulative_cost)
+        obs_row = visible_observations[len(rows)] if len(rows) < len(visible_observations) else {}
+        rows.append({
+            "step": int(item.get("step", len(rows) + 1)),
+            "action": item.get("action_type"),
+            "provider": item.get("provider"),
+            "success": bool(item.get("request_succeeded", False)),
+            "reward": round(reward, 4),
+            "cumulative_reward": round(cumulative_reward, 4),
+            "cost": round(cost, 4),
+            "budget_remaining": round(budget_remaining, 4),
+            "latency_ms": float(item.get("latency_ms", 0.0) or 0.0),
+            "queue_overflow": bool(item.get("queue_overflow", False)),
+            "budget_exhausted": bool(item.get("budget_exhausted", False)),
+            **obs_row,
+        })
+    return rows
+def _run_heuristic(
+    task_cfg: TaskConfig, seed: int, *, verbose: bool = False
+) -> tuple[BudgetRouterEnv, List[Dict[str, float]]]:
+    env = BudgetRouterEnv()
+    obs = env.reset(seed=seed, scenario=task_cfg)
+    visible_observations = []
+    cumulative = 0.0
+    while not obs.done:
+        visible_observations.append(_visible_observation_row(obs))
+        action = heuristic_baseline_policy(obs)
+        action_str = action.action_type.value
+        obs = env.step(action)
+        r = float(obs.reward or 0.0)
+        cumulative += r
+        _echo_step_progress(
+            policy_label="heuristic",
+            step=int(env._internal.current_step),
+            action=action_str,
+            reward=r,
+            cumulative=cumulative,
+            done=bool(obs.done),
+            verbose=verbose,
+        )
+    return env, visible_observations
+def _run_llm(
+    task_name: str, task_cfg: TaskConfig, seed: int, *, verbose: bool = False
+) -> tuple[BudgetRouterEnv, List[Dict[str, float]]]:
+    api_key = os.getenv("API_KEY") or os.getenv("HF_TOKEN")
+    api_base_url = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+    model_name = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+    if not api_key:
+        raise RuntimeError("LLM policy requires HF_TOKEN or API_KEY.")
+    policy = LLMRouter(api_base_url=api_base_url, model_name=model_name, api_key=api_key)
+    policy.reset(task_name=task_name)
+    env = BudgetRouterEnv()
+    obs = env.reset(seed=seed, scenario=task_cfg)
+    visible_observations = []
+    cumulative = 0.0
+    if verbose:
+        typer.echo(
+            f"[trace] begin policy=llm task={task_name} seed={seed} "
+            f"endpoint={api_base_url} model={model_name} "
+            f"(~{task_cfg.max_steps} sequential LLM calls; first call starting…)"
+        )
+    while not obs.done:
+        visible_observations.append(_visible_observation_row(obs))
+        action = policy.choose_action(obs)
+        action_str = action.action_type.value
+        obs = env.step(action)
+        r = float(obs.reward or 0.0)
+        cumulative += r
+        _echo_step_progress(
+            policy_label="llm",
+            step=int(env._internal.current_step),
+            action=action_str,
+            reward=r,
+            cumulative=cumulative,
+            done=bool(obs.done),
+            llm_error=policy.last_error,
+            verbose=verbose,
+        )
+    return env, visible_observations
+def _default_ppo_model_path(task_name: str) -> Path:
+    if task_name not in DEFAULT_PPO_MODELS:
+        raise ValueError(
+            f"No default PPO model for task '{task_name}'. "
+            "Pass --model-path explicitly, or use task easy/hard_multi."
+        )
+    return DEFAULT_PPO_MODELS[task_name]
+def _run_ppo(
+    task_name: str,
+    task_cfg: TaskConfig,
+    seed: int,
+    model_path: Optional[Path],
+    *,
+    verbose: bool = False,
+) -> tuple[BudgetRouterEnv, List[Dict[str, float]]]:
+    # Lazy import keeps heuristic/LLM tracing available without training extras.
+    try:
+        from stable_baselines3 import PPO
+        from train.gym_wrapper import BudgetRouterGymEnv
+    except ImportError as exc:
+        raise RuntimeError("PPO tracing requires training dependencies: `uv sync --extra training`.") from exc
+    resolved_model_path = model_path or _default_ppo_model_path(task_name)
+    if not resolved_model_path.exists():
+        raise FileNotFoundError(f"PPO model not found: {resolved_model_path}")
+    model = PPO.load(str(resolved_model_path))
+    gym_env = BudgetRouterGymEnv(scenario=task_cfg, seed=seed)
+    obs, _ = gym_env.reset()
+    done = False
+    visible_observations = []
+    cumulative = 0.0
+    while not done:
+        visible_observations.append(_visible_observation_row_from_array(obs))
+        action_idx, _ = model.predict(obs, deterministic=True)
+        ai = int(action_idx)
+        action_str = _PPO_ACTION_NAMES[ai] if 0 <= ai < len(_PPO_ACTION_NAMES) else str(ai)
+        obs, reward, terminated, truncated, _ = gym_env.step(ai)
+        r = float(reward)
+        cumulative += r
+        done = terminated or truncated
+        inner = gym_env._env
+        _echo_step_progress(
+            policy_label="ppo",
+            step=int(inner._internal.current_step),
+            action=action_str,
+            reward=r,
+            cumulative=cumulative,
+            done=done,
+            verbose=verbose,
+        )
+    return gym_env._env, visible_observations
+def trace_episode(
+    task_name: str,
+    seed: int,
+    policy_name: str,
+    model_path: Optional[Path] = None,
+    *,
+    verbose: bool = False,
+) -> Dict[str, Any]:
+    """Run one episode and return step rows plus final scorer outputs."""
+    if task_name not in TASK_PRESETS:
+        raise ValueError(f"Unknown task '{task_name}'. Choose from: {sorted(TASK_PRESETS)}")
+    if policy_name not in POLICIES:
+        raise ValueError(f"Unknown policy '{policy_name}'. Choose from: {sorted(POLICIES)}")
+    task_cfg = TASK_PRESETS[task_name]
+    if policy_name == "heuristic":
+        env, visible_observations = _run_heuristic(task_cfg=task_cfg, seed=seed, verbose=verbose)
+    elif policy_name == "llm":
+        env, visible_observations = _run_llm(
+            task_name=task_name, task_cfg=task_cfg, seed=seed, verbose=verbose
+        )
+    else:
+        env, visible_observations = _run_ppo(
+            task_name=task_name,
+            task_cfg=task_cfg,
+            seed=seed,
+            model_path=model_path,
+            verbose=verbose,
+        )
+    history = env._internal.history
+    steps = _cumulative_step_rows(history, visible_observations)
+    grader = {k: round(float(v), 4) for k, v in grade_episode(history).items()}
+    return {
+        "task": task_name,
+        "seed": seed,
+        "policy": policy_name,
+        "episode_length": len(steps),
+        "total_reward": round(sum(row["reward"] for row in steps), 4),
+        "grader": grader,
+        "metrics": episode_metrics(history),
+        "steps": steps,
+    }
+def _print_trace(result: Dict[str, Any]) -> None:
+    typer.echo(f"Task={result['task']}  Policy={result['policy']}  Seed={result['seed']}")
+    typer.echo(f"Episode length={result['episode_length']}  Total reward={result['total_reward']:+.4f}")
+    typer.echo("Grader:")
+    for key, value in result["grader"].items():
+        typer.echo(f"  {key}: {value:.4f}")
+    typer.echo("")
+    typer.echo(
+        "Step | A_stat | B_stat | C_stat | Action      | Provider | Success | "
+        "Reward  | CumReward | Cost | Budget | Latency | Flags"
+    )
+    typer.echo(
+        "-----|--------|--------|--------|-------------|----------|---------|"
+        "---------|-----------|------|--------|---------|------"
+    )
+    for row in result["steps"]:
+        flags = []
+        if row["queue_overflow"]:
+            flags.append("queue_overflow")
+        if row["budget_exhausted"]:
+            flags.append("budget_exhausted")
+        typer.echo(
+            f"{row['step']:>4} | {row.get('provider_a_status', 0.0):>6.3f} | "
+            f"{row.get('provider_b_status', 0.0):>6.3f} | "
+            f"{row.get('provider_c_status', 0.0):>6.3f} | "
+            f"{row['action']:<11} | {str(row['provider'] or '-'):>8} | "
+            f"{str(row['success']).lower():>7} | {row['reward']:>+7.2f} | "
+            f"{row['cumulative_reward']:>+9.2f} | {row['cost']:>4.2f} | "
+            f"{row['budget_remaining']:>6.2f} | {row['latency_ms']:>7.2f} | {','.join(flags) or '-'}"
+        )
+@app.command()
+def main(
+    task: str = typer.Option("hard_multi", help=f"Task name: {' | '.join(TASK_PRESETS)}"),
+    seed: int = typer.Option(..., help="Exact episode seed."),
+    policy: str = typer.Option("heuristic", help=f"Policy: {' | '.join(sorted(POLICIES))}"),
+    model_path: Optional[Path] = typer.Option(None, help="PPO model path. Defaults exist for easy/hard_multi."),
+    output_json: Optional[Path] = typer.Option(None, help="Optional path to save the full trace JSON."),
+    verbose: bool = typer.Option(
+        False,
+        "--verbose",
+        "-v",
+        help="Print one line per env step during the episode (useful for slow LLM runs).",
+    ),
+) -> None:
+    """Run and print a single exact-seed episode trace."""
+    result = trace_episode(
+        task_name=task,
+        seed=seed,
+        policy_name=policy,
+        model_path=model_path,
+        verbose=verbose,
+    )
+    _print_trace(result)
+    if output_json is not None:
+        output_json.parent.mkdir(parents=True, exist_ok=True)
+        output_json.write_text(json.dumps(result, indent=2) + "\n", encoding="utf-8")
+        typer.echo(f"\nSaved trace JSON: {output_json}")
+if __name__ == "__main__":
+    app()
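
Note on the runners above: `_run_heuristic`, `_run_llm`, and `_run_ppo` share one rollout shape. Each records the observation the policy saw before acting, steps the environment, and accumulates the reward carried by the next observation. The sketch below isolates that loop. `ToyEnv`, `ToyObs`, and `rollout` are hypothetical stand-ins written for illustration; they are not part of `budget_router`, and the reward rule is invented.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ToyObs:
    """Hypothetical observation: one status value plus reward/done flags."""
    status: float
    reward: float = 0.0
    done: bool = False


class ToyEnv:
    """Hypothetical fixed-length environment standing in for BudgetRouterEnv."""

    def __init__(self, max_steps: int = 3) -> None:
        self.max_steps = max_steps
        self.current_step = 0

    def reset(self) -> ToyObs:
        self.current_step = 0
        return ToyObs(status=0.5)

    def step(self, action: str) -> ToyObs:
        self.current_step += 1
        # Invented reward rule: only route_to_a pays off.
        reward = 1.0 if action == "route_to_a" else 0.0
        return ToyObs(status=0.9, reward=reward, done=self.current_step >= self.max_steps)


def rollout(env: ToyEnv, policy: Callable[[ToyObs], str]) -> Tuple[List[float], float]:
    """Run one episode; return the pre-action observations and the total reward."""
    obs = env.reset()
    visible: List[float] = []
    cumulative = 0.0
    while not obs.done:
        visible.append(obs.status)          # what the policy saw before acting
        obs = env.step(policy(obs))         # reward arrives on the next observation
        cumulative += float(obs.reward or 0.0)
    return visible, cumulative


visible, total = rollout(ToyEnv(max_steps=3), lambda obs: "route_to_a")
print(len(visible), total)
```

Because the observation is appended before `env.step`, the recorded rows line up one-to-one with the actions taken, which appears to be what `_cumulative_step_rows` relies on when it merges `history` with `visible_observations`.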

eval_sft.py ADDED

	@@ -0,0 +1,488 @@

+#!/usr/bin/env python3
+# /// script
+# dependencies = [
+#   "torch",
+#   "transformers>=4.45.0",
+#   "huggingface_hub>=0.24.0",
+#   "scipy",
+#   "budget-router @ git+https://huggingface.co/spaces/akshay4/budget-router-openenv",
+# ]
+# ///
+"""Evaluate a Budget Router SFT model against the heuristic baseline."""
+from __future__ import annotations
+import argparse
+import json
+import math
+import os
+import time
+from pathlib import Path
+from typing import Any
+import numpy as np
+from budget_router.environment import BudgetRouterEnv
+from budget_router.models import Action, ActionType, Observation, TaskConfig
+from budget_router.policies import heuristic_baseline_policy
+from budget_router.reward import episode_metrics, grade_episode
+from budget_router.tasks import HARD_MULTI, TASK_PRESETS
+try:
+    from inference import SYSTEM_PROMPT
+    _SYSTEM_PROMPT_SOURCE = "inference"
+except ModuleNotFoundError as exc:
+    if exc.name != "inference":
+        raise
+    SYSTEM_PROMPT = """
+You are a cost-aware LLM API routing agent managing a production system.
+At each step, output EXACTLY ONE action string. Nothing else.
+ENVIRONMENT:
+  Three providers: A ($0.01/req, cheapest), B ($0.05/req), C ($0.10/req, most reliable).
+  provider_X_status = windowed success rate [0=always fails, 1=always succeeds].
+    IMPORTANT: A status of exactly 0.500 means this provider has NEVER been routed to
+    in this episode — it is unobserved, not confirmed healthy. Route to it once to get
+    a real reading. Do not treat 0.500 as a health signal.
+  budget_remaining: fraction of budget left. Reaching 0 = catastrophic -10 penalty.
+  step_count [0→1], steps_remaining: episode progress (20 steps total).
+VALID ACTIONS (output ONLY one):
+  route_to_a | route_to_b | route_to_c | shed_load
+GOLDEN RULE — DEFAULT STRATEGY:
+  Stay on the CHEAPEST provider whose status > 0.52. Only deviate if there is CLEAR, SUSTAINED evidence of degradation (defined below). Unnecessary switching to expensive providers burns budget and reduces your score.
+NOISE CALIBRATION (critical):
+- Status fluctuates due to Bernoulli sampling noise. Single-step dips are not reliable signals.
+- Use the provided 2-step trend (avg/step): a sustained negative trend across multiple steps
+indicates real degradation; a trend near 0 means the provider is stable. Do NOT switch on noise.
+- REAL degradation signal: sustained negative trend AND current status is visibly declining.
+- Only when both conditions hold across consecutive observations should you consider early switching.
+- On stable tasks, trends hover near zero. Switching on noise burns budget without benefit.
+WHEN TO SWITCH (use your conversation history):
+A → B: When trend_a is clearly and consistently negative AND status_a is approaching unreliable,
+           OR status_a is already below 0.52 (failure probability exceeds success probability).
+B → C: Same principle — sustained decline signals, not single-step noise.
+Never switch based on a single bad observation — noise causes occasional dips.
+BUDGET RUNWAY — HARD CONSTRAINT:
+budget_runway_at_current_rate shows how many more steps you can afford at current spend rate.
+If budget_runway_at_current_rate < steps_remaining: switch to a cheaper provider IMMEDIATELY.
+If budget_remaining < 0.15 (less than 15% left): treat C as OFF-LIMITS unless A and B are
+  both below 0.30 status. Prefer shed_load over routing C when budget is this low.
+NEVER route to any provider if doing so would leave budget_remaining below the cost of
+that provider times the steps_remaining. The -10 bankruptcy penalty destroys all episode
+value accumulated so far — budget survival is non-negotiable.
+TASK PROFILES (the task name appears in each observation — use it):
+  easy:       Stable environment. Trend fluctuations are mostly noise. Stay on the cheapest provider unless its trend is catastrophically and sustainedly negative.
+  medium:     Dynamic environment. A provider may degrade mid-episode. Monitor trends and switch to the next cheapest healthy fallback if the primary fails.
+  hard / hard_multi: Hostile, multi-failure environments. Multiple providers may degrade at unexpected times in unpredictable sequences.
+              Your Runbook: Always map traffic to the lowest-cost healthy provider (A=$0.01, B=$0.05, C=$0.10).
+              Watch your conversation history: if your currently active provider shows a clear, sustained negative trend, switch early to the next cheapest option that is healthy.
+              CRITICAL: Before switching to expensive fallbacks (like C), use budget_runway to verify you can afford them to prevent budget exhaustion.
+Output only the action string."""
+    _SYSTEM_PROMPT_SOURCE = "embedded_fallback"
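+# Debug log path left over from a local debugging session. On machines where this
+# path does not exist, _agent_debug_ndjson falls back to printing the payload to stdout.
+_AGENT_DEBUG_LOG = "/Users/akshaybabbar/Desktop/work/.cursor/debug-e4cac3.log"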
+def _agent_debug_ndjson(payload: dict[str, object]) -> None:
+    line = json.dumps(payload)
+    try:
+        with open(_AGENT_DEBUG_LOG, "a", encoding="utf-8") as f:
+            f.write(line + "\n")
+    except OSError:
+        print(f"[agent-debug] {line}", flush=True)
+VALID_ACTIONS = ["route_to_a", "route_to_b", "route_to_c", "shed_load"]
+DEFAULT_MODEL_REPO = "akshay4/budget-router-sft-qwen1.5b"
+def _steps_remaining(obs: Observation, max_steps: int = 20) -> int:
+    elapsed = int(round(float(obs.step_count) * max_steps))
+    return max(0, max_steps - elapsed)
+def _trend_text(obs: Observation, previous_obs: Observation | None, previous2_obs: Observation | None) -> str:
+    if previous2_obs is not None:
+        ta = (obs.provider_a_status - previous2_obs.provider_a_status) / 2.0
+        tb = (obs.provider_b_status - previous2_obs.provider_b_status) / 2.0
+        tc = (obs.provider_c_status - previous2_obs.provider_c_status) / 2.0
+        return f"trend (avg/step, 2-step): A:{ta:+.3f} B:{tb:+.3f} C:{tc:+.3f}"
+    if previous_obs is not None:
+        ta = obs.provider_a_status - previous_obs.provider_a_status
+        tb = obs.provider_b_status - previous_obs.provider_b_status
+        tc = obs.provider_c_status - previous_obs.provider_c_status
+        return f"trend (1-step only, noisy): A:{ta:+.3f} B:{tb:+.3f} C:{tc:+.3f}"
+    return "trend: unavailable"
+def _budget_runway_text(obs: Observation, previous_obs: Observation | None) -> str:
+    if previous_obs is None:
+        return "budget_runway_at_current_rate: >20 steps"
+    budget_spent = float(previous_obs.budget_remaining) - float(obs.budget_remaining)
+    if budget_spent <= 0.001:
+        return "budget_runway_at_current_rate: >20 steps"
+    runway = int(float(obs.budget_remaining) / budget_spent)
+    return f"budget_runway_at_current_rate: ~{runway} steps"
+def _previous_step_feedback(obs: Observation) -> str:
+    metadata = getattr(obs, "metadata", None) or {}
+    if not metadata.get("action_type"):
+        return ""
+    parts = [
+        "previous_step_feedback:",
+        f"  previous_action: {metadata.get('action_type')}",
+    ]
+    if obs.reward is not None:
+        parts.append(f"  previous_reward: {float(obs.reward):+.2f}")
+    if metadata.get("request_succeeded") is not None:
+        parts.append(f"  previous_success: {str(bool(metadata.get('request_succeeded'))).lower()}")
+    if metadata.get("cost") is not None:
+        parts.append(f"  previous_cost: {float(metadata.get('cost')):.2f}")
+    if metadata.get("latency_ms") is not None:
+        parts.append(f"  previous_latency_ms: {float(metadata.get('latency_ms')):.2f}")
+    if metadata.get("budget_exhausted"):
+        parts.append("  previous_budget_exhausted: true")
+    return "\n".join(parts)
+def format_observation_for_sft(
+    *,
+    obs: Observation,
+    task_name: str,
+    previous_obs: Observation | None,
+    previous2_obs: Observation | None,
+) -> str:
+    lines = [
+        f"task: {task_name}",
+        f"provider_a_status: {obs.provider_a_status:.3f}",
+        f"provider_b_status: {obs.provider_b_status:.3f}",
+        f"provider_c_status: {obs.provider_c_status:.3f}",
+        f"budget_remaining: {obs.budget_remaining:.3f}",
+        f"queue_backlog: {obs.queue_backlog:.3f}",
+        f"system_latency: {obs.system_latency:.3f}",
+        f"step_count: {obs.step_count:.3f}",
+        f"steps_remaining: {_steps_remaining(obs)}",
+        _trend_text(obs, previous_obs, previous2_obs),
+        _budget_runway_text(obs, previous_obs),
+    ]
+    feedback = _previous_step_feedback(obs)
+    if feedback:
+        lines.append(feedback)
+    return "\n".join(lines)
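+def parse_action(text: str) -> tuple[str, bool]:
+    """Return (action, ok). Matches by substring in VALID_ACTIONS order; falls back to route_to_a with ok=False."""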
+    lowered = text.strip().lower()
+    for action in VALID_ACTIONS:
+        if action in lowered:
+            return action, True
+    return "route_to_a", False
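+def apply_budget_safety_guard(action_str: str, observation: Observation, task_cfg: TaskConfig) -> str:
+    """Downgrade a routing action to shed_load when its cost would use up (or exceed) the remaining budget."""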
+    if action_str == "shed_load":
+        return action_str
+    costs = {
+        "route_to_a": task_cfg.cost_a,
+        "route_to_b": task_cfg.cost_b,
+        "route_to_c": task_cfg.cost_c,
+    }
+    selected_cost = costs.get(action_str, 0.0)
+    budget_dollars = float(observation.budget_remaining) * float(task_cfg.initial_budget)
+    if selected_cost >= budget_dollars - 1e-9:
+        return "shed_load"
+    return action_str
+def run_heuristic_episode(task_cfg: TaskConfig, seed: int) -> dict[str, Any]:
+    env = BudgetRouterEnv()
+    obs = env.reset(seed=seed, scenario=task_cfg)
+    total_reward = 0.0
+    while not obs.done:
+        obs = env.step(heuristic_baseline_policy(obs))
+        total_reward += float(obs.reward or 0.0)
+    grader = grade_episode(env._internal.history)
+    metrics = episode_metrics(env._internal.history)
+    return {
+        "grader_score": float(grader["overall_score"]),
+        "total_reward": total_reward,
+        "episode_length": env._internal.current_step,
+        "grader": grader,
+        "metrics": metrics,
+    }
+class SFTPolicy:
+    def __init__(self, model_repo: str, *, token: str | None, use_budget_guard: bool) -> None:
+        import torch
+        from transformers import AutoModelForCausalLM, AutoTokenizer
+        self.device = "cuda" if torch.cuda.is_available() else "cpu"
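+        # bf16 on capable GPUs, fp16 on older GPUs, fp32 on CPU (half precision is poorly supported for CPU inference).
+        if self.device == "cuda":
+            dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
+        else:
+            dtype = torch.float32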
+        self.model = AutoModelForCausalLM.from_pretrained(model_repo, torch_dtype=dtype, token=token)
+        self.model.to(self.device)
+        self.model.eval()
+        self.tokenizer = AutoTokenizer.from_pretrained(model_repo, token=token)
+        if self.tokenizer.pad_token is None:
+            self.tokenizer.pad_token = self.tokenizer.eos_token
+        self.use_budget_guard = use_budget_guard
+        self.messages: list[dict[str, str]] = []
+        self.previous_obs: Observation | None = None
+        self.previous2_obs: Observation | None = None
+        self.parse_failures = 0
+    def reset(self) -> None:
+        self.messages = [{"role": "system", "content": SYSTEM_PROMPT}]
+        self.previous_obs = None
+        self.previous2_obs = None
+        self.parse_failures = 0
+    def choose_action(self, obs: Observation, *, task_name: str, task_cfg: TaskConfig) -> str:
+        import torch
+        obs_text = format_observation_for_sft(
+            obs=obs,
+            task_name=task_name,
+            previous_obs=self.previous_obs,
+            previous2_obs=self.previous2_obs,
+        )
+        self.messages.append({"role": "user", "content": obs_text})
+        prompt = self.tokenizer.apply_chat_template(
+            self.messages,
+            tokenize=False,
+            add_generation_prompt=True,
+        )
+        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
+        with torch.no_grad():
+            output = self.model.generate(
+                **inputs,
+                max_new_tokens=10,
+                do_sample=False,
+                pad_token_id=self.tokenizer.eos_token_id,
+            )
+        generated = self.tokenizer.decode(
+            output[0][inputs["input_ids"].shape[1] :],
+            skip_special_tokens=True,
+        )
+        action_str, ok = parse_action(generated)
+        if not ok:
+            self.parse_failures += 1
+        if self.use_budget_guard:
+            action_str = apply_budget_safety_guard(action_str, obs, task_cfg)
+        self.messages.append({"role": "assistant", "content": action_str})
+        self.previous2_obs = self.previous_obs
+        self.previous_obs = obs
+        return action_str
+def run_sft_episode(policy: SFTPolicy, task_name: str, task_cfg: TaskConfig, seed: int) -> dict[str, Any]:
+    env = BudgetRouterEnv()
+    policy.reset()
+    obs = env.reset(seed=seed, scenario=task_cfg)
+    total_reward = 0.0
+    actions: list[str] = []
+    while not obs.done:
+        action_str = policy.choose_action(obs, task_name=task_name, task_cfg=task_cfg)
+        actions.append(action_str)
+        obs = env.step(Action(action_type=ActionType(action_str)))
+        total_reward += float(obs.reward or 0.0)
+    grader = grade_episode(env._internal.history)
+    metrics = episode_metrics(env._internal.history)
+    return {
+        "grader_score": float(grader["overall_score"]),
+        "total_reward": total_reward,
+        "episode_length": env._internal.current_step,
+        "grader": grader,
+        "metrics": metrics,
+        "actions": actions,
+        "parse_failures": policy.parse_failures,
+    }
+def _mean(values: list[float]) -> float:
+    return float(sum(values) / len(values)) if values else 0.0
+def _sample_std(values: list[float]) -> float:
+    if len(values) < 2:
+        return 0.0
+    mean = _mean(values)
+    return float(math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1)))
+def compute_paired_stats(heuristic_scores: list[float], sft_scores: list[float]) -> dict[str, Any]:
+    if len(heuristic_scores) != len(sft_scores):
+        raise ValueError("Paired stats require equal-length score lists.")
+    if not heuristic_scores:
+        raise ValueError("No scores provided.")
+    diffs = [s - h for h, s in zip(heuristic_scores, sft_scores)]
+    n = len(diffs)
+    delta = _mean(diffs)
+    std_diff = _sample_std(diffs)
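+    if n < 2:
+        # A single paired difference cannot support a significance test.
+        t_stat, p_val, cohens_d = float("nan"), float("nan"), float("nan")
+    elif std_diff == 0.0: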
+        t_stat = math.inf if delta > 0 else (-math.inf if delta < 0 else 0.0)
+        p_val = 0.0 if delta > 0 else 1.0
+        cohens_d = math.inf if delta > 0 else (-math.inf if delta < 0 else 0.0)
+    else:
+        try:
+            from scipy import stats
+            t_stat, p_val = stats.ttest_rel(sft_scores, heuristic_scores, alternative="greater")
+            cohens_d = delta / std_diff
+        except Exception:
+            t_stat = delta / (std_diff / math.sqrt(n))
+            p_val = float("nan")
+            cohens_d = delta / std_diff
+    return {
+        "n_seeds": n,
+        "mean_heuristic": _mean(heuristic_scores),
+        "mean_sft": _mean(sft_scores),
+        "std_heuristic": _sample_std(heuristic_scores),
+        "std_sft": _sample_std(sft_scores),
+        "delta": delta,
+        "t_stat": float(t_stat),
+        "p_val": float(p_val),
+        "cohens_d": float(cohens_d),
+        "significant": bool(delta > 0 and p_val < 0.05),
+        "wins": sum(1 for d in diffs if d > 0),
+        "ties": sum(1 for d in diffs if d == 0),
+        "losses": sum(1 for d in diffs if d < 0),
+    }
+def _ci95(values: list[float]) -> tuple[float, float]:
+    n = len(values)
+    mean = _mean(values)
+    if n < 2:
+        return mean, mean
+    se = _sample_std(values) / math.sqrt(n)
+    try:
+        from scipy import stats
+        lo, hi = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)
+        return float(lo), float(hi)
+    except Exception:
+        return mean - 1.96 * se, mean + 1.96 * se
+def _parse_seed_values(value: str | None, n_seeds: int) -> list[int]:
+    if value:
+        return [int(part) for part in value.replace(",", " ").split()]
+    return list(range(300, 300 + n_seeds))
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Evaluate SFT Budget Router model.")
+    parser.add_argument("--model-repo", default=os.getenv("SFT_MODEL_REPO", DEFAULT_MODEL_REPO))
+    parser.add_argument("--task", default=os.getenv("TASK_NAME", "hard_multi"), choices=sorted(TASK_PRESETS))
+    parser.add_argument("--n-seeds", type=int, default=int(os.getenv("N_SEEDS", "10")))
+    parser.add_argument("--seed-values", default=os.getenv("EVAL_SEED_VALUES"))
+    parser.add_argument("--output-json", default=os.getenv("EVAL_OUTPUT_JSON", "eval_results_sft.json"))
+    parser.add_argument("--no-budget-guard", action="store_true")
+    parser.add_argument("--no-upload", action="store_true")
+    return parser.parse_args()
+def main() -> None:
+    args = parse_args()
+    token = os.environ.get("HF_TOKEN")
+    task_cfg = TASK_PRESETS[args.task]
+    seeds = _parse_seed_values(args.seed_values, args.n_seeds)
+    # #region agent log
+    _agent_debug_ndjson(
+        {
+            "sessionId": "e4cac3",
+            "runId": os.environ.get("DEBUG_RUN_ID", "eval-import-fix"),
+            "hypothesisId": "H1",
+            "location": "eval_sft.py:main",
+            "message": "eval_startup",
+            "data": {
+                "system_prompt_source": _SYSTEM_PROMPT_SOURCE,
+                "model_repo": args.model_repo,
+                "task": args.task,
+                "n_seeds": len(seeds),
+            },
+            "timestamp": int(time.time() * 1000),
+        }
+    )
+    # #endregion
+    policy = SFTPolicy(args.model_repo, token=token, use_budget_guard=not args.no_budget_guard)
+    episodes: list[dict[str, Any]] = []
+    heuristic_scores: list[float] = []
+    sft_scores: list[float] = []
+    for seed in seeds:
+        heuristic_ep = run_heuristic_episode(task_cfg, seed)
+        sft_ep = run_sft_episode(policy, args.task, task_cfg, seed)
+        heuristic_scores.append(float(heuristic_ep["grader_score"]))
+        sft_scores.append(float(sft_ep["grader_score"]))
+        episodes.append({"seed": seed, "heuristic": heuristic_ep, "sft": sft_ep})
+        print(
+            f"[eval-sft] seed={seed} heuristic={heuristic_ep['grader_score']:.4f} "
+            f"sft={sft_ep['grader_score']:.4f} delta={sft_ep['grader_score'] - heuristic_ep['grader_score']:+.4f} "
+            f"parse_failures={sft_ep['parse_failures']}",
+            flush=True,
+        )
+    stats = compute_paired_stats(heuristic_scores, sft_scores)
+    heu_ci = _ci95(heuristic_scores)
+    sft_ci = _ci95(sft_scores)
+    result = {
+        **stats,
+        "task": args.task,
+        "seeds": seeds,
+        "heuristic_scores": heuristic_scores,
+        "sft_scores": sft_scores,
+        "heuristic_ci95": heu_ci,
+        "sft_ci95": sft_ci,
+        "budget_guard": not args.no_budget_guard,
+        "episodes": episodes,
+    }
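+    # Create the output directory first, matching the trace script's behavior.
+    Path(args.output_json).parent.mkdir(parents=True, exist_ok=True)
+    Path(args.output_json).write_text(json.dumps(result, indent=2, sort_keys=True), encoding="utf-8")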
+    print()
+    print("| Policy | Mean | Std | 95% CI | vs Heuristic |")
+    print("|---|---:|---:|---|---:|")
+    print(
+        f"| Heuristic | {stats['mean_heuristic']:.3f} | {stats['std_heuristic']:.3f} | "
+        f"[{heu_ci[0]:.3f}, {heu_ci[1]:.3f}] | baseline |"
+    )
+    print(
+        f"| SFT | {stats['mean_sft']:.3f} | {stats['std_sft']:.3f} | "
+        f"[{sft_ci[0]:.3f}, {sft_ci[1]:.3f}] | {stats['delta']:+.3f} |"
+    )
+    verdict = "SIGNIFICANT" if stats["significant"] else "NOT SIGNIFICANT"
+    print(
+        f"SFT: {stats['mean_sft']:.3f} vs Heuristic: {stats['mean_heuristic']:.3f} | "
+        f"delta={stats['delta']:+.3f} | t({stats['n_seeds'] - 1})={stats['t_stat']:.2f}, "
+        f"p={stats['p_val']:.4f} | {verdict} | Cohen's d={stats['cohens_d']:.2f} | "
+        f"wins/ties/losses={stats['wins']}/{stats['ties']}/{stats['losses']}"
+    )
+    if not args.no_upload:
+        if not token:
+            raise RuntimeError("HF_TOKEN must be set to upload eval JSON. Use --no-upload to skip.")
+        from huggingface_hub import upload_file
+        upload_file(
+            path_or_fileobj=args.output_json,
+            path_in_repo=Path(args.output_json).name,
+            repo_id=args.model_repo,
+            repo_type="model",
+            token=token,
+        )
+        print(f"[eval-sft] uploaded {args.output_json} to {args.model_repo}", flush=True)
+if __name__ == "__main__":
+    main()
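
Note on `compute_paired_stats` above: the comparison is paired per seed, so the test statistic is computed on the per-seed differences rather than on the two score lists separately. The sketch below works through that arithmetic with the standard library only. `paired_summary` is a hypothetical helper and the score lists are made-up illustrative numbers, not results from this repository.

```python
import math
from typing import Dict, List


def paired_summary(baseline: List[float], candidate: List[float]) -> Dict[str, float]:
    """Paired t-statistic and Cohen's d on per-seed differences (candidate - baseline)."""
    if len(baseline) != len(candidate):
        raise ValueError("Paired stats require equal-length score lists.")
    if len(baseline) < 2:
        raise ValueError("Need at least two paired scores.")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    delta = sum(diffs) / n
    # Sample standard deviation of the differences (n - 1 in the denominator).
    std_diff = math.sqrt(sum((d - delta) ** 2 for d in diffs) / (n - 1))
    if std_diff == 0.0:
        raise ValueError("All differences are identical; t is undefined.")
    return {
        "n": float(n),
        "delta": delta,
        "t_stat": delta / (std_diff / math.sqrt(n)),  # df = n - 1
        "cohens_d": delta / std_diff,
    }


# Made-up scores for four seeds, for illustration only.
heuristic = [0.50, 0.60, 0.55, 0.65]
sft = [0.60, 0.65, 0.70, 0.70]
summary = paired_summary(heuristic, sft)
print({k: round(v, 4) for k, v in summary.items()})
```

The two statistics are linked by t = d · sqrt(n), so a large effect size on few seeds can still fall short of significance. Converting `t_stat` to a one-sided p-value needs the t distribution with n - 1 degrees of freedom, which is what the script delegates to `scipy.stats.ttest_rel`.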

generate_sft_data.py ADDED

	@@ -0,0 +1,361 @@

+#!/usr/bin/env python3
+"""
+Generate SFT data for Budget Router.
+Default path is deliberately zero-API-cost: distill the existing PPO hard_multi
+policy into chat transcripts, then push the dataset to the Hub for HF Jobs.
+Optional LLM labeling is available with --teacher llm, but it costs one large
+model call per environment step (20 calls per episode).
+"""
+from __future__ import annotations
+import argparse
+import json
+import math
+import os
+from pathlib import Path
+from typing import Any, Callable
+import numpy as np
+from budget_router.environment import BudgetRouterEnv
+from budget_router.models import Action, ActionType, Observation, TaskConfig
+from budget_router.policies import heuristic_baseline_policy
+from budget_router.reward import episode_metrics, grade_episode
+from budget_router.tasks import HARD_MULTI, TASK_PRESETS
+from inference import LLMRouter, SYSTEM_PROMPT
+VALID_ACTIONS = ["route_to_a", "route_to_b", "route_to_c", "shed_load"]
+PPO_ACTION_NAMES = ["route_to_a", "route_to_b", "route_to_c", "shed_load"]
+DEFAULT_DATASET_REPO = "akshay4/budget-router-sft-data"
+DEFAULT_PPO_MODEL_PATH = "trained_models/ppo_hard_multi_100k.zip"
+_PPO_POLICY_CACHE: dict[str, Callable[[Observation], str]] = {}
+def _obs_to_array(obs: Observation) -> np.ndarray:
+    return np.array(
+        [
+            obs.provider_a_status,
+            obs.provider_b_status,
+            obs.provider_c_status,
+            obs.budget_remaining,
+            obs.queue_backlog,
+            obs.system_latency,
+            obs.step_count,
+        ],
+        dtype=np.float32,
+    )
+def _steps_remaining(obs: Observation, max_steps: int = 20) -> int:
+    elapsed = int(round(float(obs.step_count) * max_steps))
+    return max(0, max_steps - elapsed)
+def _trend_text(obs: Observation, previous_obs: Observation | None, previous2_obs: Observation | None) -> str:
+    if previous2_obs is not None:
+        ta = (obs.provider_a_status - previous2_obs.provider_a_status) / 2.0
+        tb = (obs.provider_b_status - previous2_obs.provider_b_status) / 2.0
+        tc = (obs.provider_c_status - previous2_obs.provider_c_status) / 2.0
+        return f"trend (avg/step, 2-step): A:{ta:+.3f} B:{tb:+.3f} C:{tc:+.3f}"
+    if previous_obs is not None:
+        ta = obs.provider_a_status - previous_obs.provider_a_status
+        tb = obs.provider_b_status - previous_obs.provider_b_status
+        tc = obs.provider_c_status - previous_obs.provider_c_status
+        return f"trend (1-step only, noisy): A:{ta:+.3f} B:{tb:+.3f} C:{tc:+.3f}"
+    return "trend: unavailable"
+def _budget_runway_text(obs: Observation, previous_obs: Observation | None) -> str:
+    if previous_obs is None:
+        return "budget_runway_at_current_rate: >20 steps"
+    budget_spent = float(previous_obs.budget_remaining) - float(obs.budget_remaining)
+    if budget_spent <= 0.001:
+        return "budget_runway_at_current_rate: >20 steps"
+    runway = int(float(obs.budget_remaining) / budget_spent)
+    return f"budget_runway_at_current_rate: ~{runway} steps"
+def _previous_step_feedback(obs: Observation) -> str:
+    metadata = getattr(obs, "metadata", None) or {}
+    if not metadata.get("action_type"):
+        return ""
+    parts = [
+        "previous_step_feedback:",
+        f"  previous_action: {metadata.get('action_type')}",
+    ]
+    if obs.reward is not None:
+        parts.append(f"  previous_reward: {float(obs.reward):+.2f}")
+    if metadata.get("request_succeeded") is not None:
+        parts.append(f"  previous_success: {str(bool(metadata.get('request_succeeded'))).lower()}")
+    if metadata.get("cost") is not None:
+        parts.append(f"  previous_cost: {float(metadata.get('cost')):.2f}")
+    if metadata.get("latency_ms") is not None:
+        parts.append(f"  previous_latency_ms: {float(metadata.get('latency_ms')):.2f}")
+    if metadata.get("budget_exhausted"):
+        parts.append("  previous_budget_exhausted: true")
+    return "\n".join(parts)
+def format_observation_for_sft(
+    *,
+    obs: Observation,
+    task_name: str,
+    previous_obs: Observation | None,
+    previous2_obs: Observation | None,
+) -> str:
+    """Public observation text used consistently for SFT train/eval."""
+    lines = [
+        f"task: {task_name}",
+        f"provider_a_status: {obs.provider_a_status:.3f}",
+        f"provider_b_status: {obs.provider_b_status:.3f}",
+        f"provider_c_status: {obs.provider_c_status:.3f}",
+        f"budget_remaining: {obs.budget_remaining:.3f}",
+        f"queue_backlog: {obs.queue_backlog:.3f}",
+        f"system_latency: {obs.system_latency:.3f}",
+        f"step_count: {obs.step_count:.3f}",
+        f"steps_remaining: {_steps_remaining(obs)}",
+        _trend_text(obs, previous_obs, previous2_obs),
+        _budget_runway_text(obs, previous_obs),
+    ]
+    feedback = _previous_step_feedback(obs)
+    if feedback:
+        lines.append(feedback)
+    return "\n".join(lines)
+def run_heuristic_episode(task_cfg: TaskConfig, seed: int) -> dict[str, Any]:
+    env = BudgetRouterEnv()
+    obs = env.reset(seed=seed, scenario=task_cfg)
+    total_reward = 0.0
+    while not obs.done:
+        obs = env.step(heuristic_baseline_policy(obs))
+        total_reward += float(obs.reward or 0.0)
+    grader = grade_episode(env._internal.history)
+    return {
+        "grader_score": float(grader["overall_score"]),
+        "total_reward": total_reward,
+        "grader": grader,
+    }
+def _load_ppo_policy(model_path: str) -> Callable[[Observation], str]:
+    if model_path in _PPO_POLICY_CACHE:
+        return _PPO_POLICY_CACHE[model_path]
+    try:
+        from stable_baselines3 import PPO
+    except ImportError as exc:
+        raise RuntimeError(
+            "PPO teacher requires training dependencies. Run `uv sync --extra training` "
+            "or use --teacher heuristic/llm."
+        ) from exc
+    path = Path(model_path)
+    if not path.exists():
+        raise FileNotFoundError(f"PPO model not found: {path}")
+    model = PPO.load(str(path))
+    def choose(obs: Observation) -> str:
+        action_idx, _ = model.predict(_obs_to_array(obs), deterministic=True)
+        idx = int(action_idx)
+        return PPO_ACTION_NAMES[idx] if 0 <= idx < len(PPO_ACTION_NAMES) else "shed_load"
+    _PPO_POLICY_CACHE[model_path] = choose
+    return choose
+def _load_llm_policy(task_name: str) -> Callable[[Observation], str]:
+    api_key = os.environ.get("HF_TOKEN") or os.environ.get("API_KEY")
+    if not api_key:
+        raise RuntimeError("LLM teacher requires HF_TOKEN or API_KEY in the environment.")
+    router = LLMRouter(
+        api_base_url=os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1"),
+        model_name=os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
+        api_key=api_key,
+    )
+    router.reset(task_name=task_name)
+    def choose(obs: Observation) -> str:
+        return router.choose_action(obs).action_type.value
+    return choose
+def collect_teacher_episode(
+    *,
+    task_name: str,
+    task_cfg: TaskConfig,
+    seed: int,
+    teacher: str,
+    ppo_model_path: str,
+) -> dict[str, Any]:
+    if teacher == "ppo":
+        choose_action = _load_ppo_policy(ppo_model_path)
+    elif teacher == "heuristic":
+        choose_action = lambda obs: heuristic_baseline_policy(obs).action_type.value
+    elif teacher == "llm":
+        choose_action = _load_llm_policy(task_name)
+    else:
+        raise ValueError(f"Unknown teacher {teacher!r}")
+    env = BudgetRouterEnv()
+    obs = env.reset(seed=seed, scenario=task_cfg)
+    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
+    previous2_obs: Observation | None = None
+    previous_obs: Observation | None = None
+    actions: list[str] = []
+    total_reward = 0.0
+    while not obs.done:
+        obs_text = format_observation_for_sft(
+            obs=obs,
+            task_name=task_name,
+            previous_obs=previous_obs,
+            previous2_obs=previous2_obs,
+        )
+        action_str = choose_action(obs)
+        if action_str not in VALID_ACTIONS:
+            action_str = "shed_load"
+        messages.append({"role": "user", "content": obs_text})
+        messages.append({"role": "assistant", "content": action_str})
+        actions.append(action_str)
+        previous2_obs = previous_obs
+        previous_obs = obs
+        obs = env.step(Action(action_type=ActionType(action_str)))
+        total_reward += float(obs.reward or 0.0)
+    grader = grade_episode(env._internal.history)
+    return {
+        "seed": seed,
+        "teacher": teacher,
+        "messages": messages,
+        "actions": actions,
+        "grader_score": float(grader["overall_score"]),
+        "total_reward": total_reward,
+        "grader": grader,
+        "metrics": episode_metrics(env._internal.history),
+    }
+def select_training_rows(
+    episodes: list[dict[str, Any]],
+    *,
+    top_fraction: float,
+    min_keep: int,
+    min_delta: float,
+) -> list[dict[str, Any]]:
+    ranked = sorted(episodes, key=lambda item: float(item["delta_vs_heuristic"]), reverse=True)
+    target = max(min_keep, int(math.ceil(len(ranked) * top_fraction)))
+    positive = [ep for ep in ranked if float(ep["delta_vs_heuristic"]) >= min_delta]
+    source = positive if len(positive) >= min_keep else ranked
+    return source[: min(target, len(source))]
+def write_jsonl(path: Path, rows: list[dict[str, Any]]) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    with path.open("w", encoding="utf-8") as f:
+        for row in rows:
+            f.write(json.dumps(row, sort_keys=True) + "\n")
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Generate Budget Router SFT dataset.")
+    parser.add_argument("--teacher", choices=["ppo", "heuristic", "llm"], default=os.getenv("TEACHER_POLICY", "ppo"))
+    parser.add_argument("--task", default=os.getenv("TASK_NAME", "hard_multi"), choices=sorted(TASK_PRESETS))
+    parser.add_argument("--start-seed", type=int, default=int(os.getenv("SFT_START_SEED", "1000")))
+    parser.add_argument("--n-episodes", type=int, default=int(os.getenv("SFT_N_EPISODES", "100")))
+    parser.add_argument("--top-fraction", type=float, default=float(os.getenv("SFT_TOP_FRACTION", "0.30")))
+    parser.add_argument("--min-keep", type=int, default=int(os.getenv("SFT_MIN_KEEP", "20")))
+    parser.add_argument("--min-delta", type=float, default=float(os.getenv("SFT_MIN_DELTA", "0.0")))
+    parser.add_argument("--ppo-model-path", default=os.getenv("PPO_MODEL_PATH", DEFAULT_PPO_MODEL_PATH))
+    parser.add_argument("--dataset-repo", default=os.getenv("DATASET_REPO", DEFAULT_DATASET_REPO))
+    parser.add_argument("--local-jsonl", default=os.getenv("SFT_LOCAL_JSONL", "outputs/sft_dataset.jsonl"))
+    parser.add_argument("--no-push", action="store_true", help="Write local JSONL only; do not push to Hub.")
+    return parser.parse_args()
+def main() -> None:
+    args = parse_args()
+    task_cfg = TASK_PRESETS[args.task]
+    seeds = list(range(args.start_seed, args.start_seed + args.n_episodes))
+    if args.teacher == "llm":
+        print(
+            f"[sft-data] teacher=llm n_episodes={args.n_episodes}; "
+            f"expected large-model calls <= {args.n_episodes * task_cfg.max_steps}",
+            flush=True,
+        )
+    else:
+        print(f"[sft-data] teacher={args.teacher} uses 0 large-LLM calls", flush=True)
+    episodes: list[dict[str, Any]] = []
+    for i, seed in enumerate(seeds, start=1):
+        teacher_ep = collect_teacher_episode(
+            task_name=args.task,
+            task_cfg=task_cfg,
+            seed=seed,
+            teacher=args.teacher,
+            ppo_model_path=args.ppo_model_path,
+        )
+        heuristic_ep = run_heuristic_episode(task_cfg, seed)
+        delta = teacher_ep["grader_score"] - heuristic_ep["grader_score"]
+        teacher_ep["heuristic_score"] = heuristic_ep["grader_score"]
+        teacher_ep["delta_vs_heuristic"] = delta
+        episodes.append(teacher_ep)
+        print(
+            f"[sft-data] {i:03d}/{len(seeds)} seed={seed} "
+            f"teacher={teacher_ep['grader_score']:.4f} heuristic={heuristic_ep['grader_score']:.4f} "
+            f"delta={delta:+.4f}",
+            flush=True,
+        )
+    kept = select_training_rows(
+        episodes,
+        top_fraction=args.top_fraction,
+        min_keep=args.min_keep,
+        min_delta=args.min_delta,
+    )
+    dataset_rows = [
+        {
+            "messages": ep["messages"],
+            "seed": ep["seed"],
+            "teacher": ep["teacher"],
+            "teacher_score": ep["grader_score"],
+            "heuristic_score": ep["heuristic_score"],
+            "delta_vs_heuristic": ep["delta_vs_heuristic"],
+            "actions": ep["actions"],
+        }
+        for ep in kept
+    ]
+    write_jsonl(Path(args.local_jsonl), dataset_rows)
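+    # Guard the averages below: an empty run would otherwise raise ZeroDivisionError.
+    if not episodes or not kept:
+        raise RuntimeError("No episodes generated or kept; check --n-episodes, --top-fraction and --min-keep.")
+    mean_all = sum(float(ep["grader_score"]) for ep in episodes) / len(episodes)
+    mean_kept = sum(float(ep["grader_score"]) for ep in kept) / len(kept)
+    mean_delta = sum(float(ep["delta_vs_heuristic"]) for ep in kept) / len(kept)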
+    print(
+        "[sft-data] summary "
+        f"generated={len(episodes)} kept={len(kept)} mean_all={mean_all:.4f} "
+        f"mean_kept={mean_kept:.4f} mean_delta_kept={mean_delta:+.4f} "
+        f"local_jsonl={args.local_jsonl}",
+        flush=True,
+    )
+    if not args.no_push:
+        token = os.environ.get("HF_TOKEN")
+        if not token:
+            raise RuntimeError("HF_TOKEN must be set to push the dataset. Use --no-push for local only.")
+        from datasets import Dataset
+        Dataset.from_list(dataset_rows).push_to_hub(args.dataset_repo, token=token)
+        print(f"[sft-data] pushed dataset to https://huggingface.co/datasets/{args.dataset_repo}", flush=True)
+if __name__ == "__main__":
+    main()
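
The filtering step in `select_training_rows` has a behavior worth spelling out: when fewer than `min_keep` episodes clear `min_delta`, the function falls back to the full ranked list, so episodes that scored below the heuristic can end up in the dataset. The sketch below copies the function body from the diff above and runs it on made-up deltas (the numbers and seeds are illustrative assumptions, not results from this repository).

```python
import math
from typing import Any, Dict, List


def select_training_rows(
    episodes: List[Dict[str, Any]],
    *,
    top_fraction: float,
    min_keep: int,
    min_delta: float,
) -> List[Dict[str, Any]]:
    # Same logic as the function in the diff: rank by delta, then truncate.
    ranked = sorted(episodes, key=lambda item: float(item["delta_vs_heuristic"]), reverse=True)
    target = max(min_keep, int(math.ceil(len(ranked) * top_fraction)))
    positive = [ep for ep in ranked if float(ep["delta_vs_heuristic"]) >= min_delta]
    # Fallback: if too few episodes clear min_delta, use every ranked episode.
    source = positive if len(positive) >= min_keep else ranked
    return source[: min(target, len(source))]


# Made-up deltas for ten episodes (illustrative only).
deltas = [0.12, -0.05, 0.30, 0.01, -0.20, 0.08, 0.00, -0.01, 0.15, 0.04]
episodes = [{"seed": 1000 + i, "delta_vs_heuristic": d} for i, d in enumerate(deltas)]

# Case 1: seven episodes have delta >= 0.0, which satisfies min_keep=2,
# so only non-negative deltas are eligible and the top half (5) is kept.
kept = select_training_rows(episodes, top_fraction=0.5, min_keep=2, min_delta=0.0)
kept_deltas = [ep["delta_vs_heuristic"] for ep in kept]

# Case 2: min_keep=8 exceeds the seven eligible episodes, so the filter is
# bypassed and a negative-delta episode is included to reach eight rows.
fallback = select_training_rows(episodes, top_fraction=0.5, min_keep=8, min_delta=0.0)
fallback_deltas = [ep["delta_vs_heuristic"] for ep in fallback]

print(kept_deltas)
print(fallback_deltas)
```

With the script defaults (`--min-keep 20`, `--min-delta 0.0`), the same fallback applies whenever fewer than 20 teacher episodes match or beat the heuristic.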

gradio_ui/__init__.py ADDED

File without changes

gradio_ui/config.py ADDED

	@@ -0,0 +1,19 @@

+from __future__ import annotations
+MAX_STEPS = 20
+SCENARIOS = ["easy", "medium", "hard", "hard_multi"]
+POLICY_CHOICES = [
+    ("Heuristic", "heuristic"),
+    ("LLM", "llm"),
+]
+try:
+    import stable_baselines3  # type: ignore  # noqa: F401
+    POLICY_CHOICES.append(("PPO (hard_multi)", "ppo"))
+except Exception:
+    pass
+PPO_MODEL_PATH = "trained_models/ppo_hard_multi_100k.zip"
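
The config above offers the PPO policy only when `stable_baselines3` imports cleanly. As a rough alternative, the sketch below checks whether the package can be located without importing it, using `importlib.util.find_spec` from the standard library. `has_module` and `build_policy_choices` are hypothetical names introduced for illustration; they are not part of this repository.

```python
from importlib.util import find_spec
from typing import List, Tuple


def has_module(name: str) -> bool:
    """Hypothetical helper: True if a top-level module can be located."""
    return find_spec(name) is not None


def build_policy_choices() -> List[Tuple[str, str]]:
    """Build (label, value) pairs, adding PPO only if its dependency is present."""
    choices = [("Heuristic", "heuristic"), ("LLM", "llm")]
    if has_module("stable_baselines3"):
        choices.append(("PPO (hard_multi)", "ppo"))
    return choices


policy_choices = build_policy_choices()
print(policy_choices)
```

The trade-off: `find_spec` avoids paying the import cost at UI start-up, but it only confirms the package is installed. The `try`/`import` in the original also catches installations that are present but fail to import, which may be why it catches `Exception` rather than `ImportError`.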

gradio_ui/legacy_api.py ADDED

	@@ -0,0 +1,56 @@

+from __future__ import annotations
+from typing import Dict, Optional, Tuple
+import requests
+BASE_URL = "http://localhost:8000"
+AUTO_PLAY_DELAY = 0.5
+class APIClient:
+    """Single-responsibility HTTP client for the OpenEnv Budget Router API."""
+    def __init__(self, base_url: str = BASE_URL) -> None:
+        self.base_url = base_url.rstrip("/")
+    def _post(self, path: str, body: Dict) -> Tuple[Optional[Dict], Optional[str]]:
+        try:
+            r = requests.post(f"{self.base_url}{path}", json=body, timeout=15)
+            r.raise_for_status()
+            return r.json(), None
+        except Exception as exc:
+            return None, str(exc)
+    def _get(self, path: str) -> Tuple[Optional[Dict], Optional[str]]:
+        try:
+            r = requests.get(f"{self.base_url}{path}", timeout=10)
+            r.raise_for_status()
+            return r.json(), None
+        except Exception as exc:
+            return None, str(exc)
+    @staticmethod
+    def _normalize(payload: Dict) -> Tuple[Dict, float, Dict, bool]:
+        """Handle both flat and observation-wrapped response shapes."""
+        obs = payload.get("observation", payload)
+        reward = float(payload.get("reward", obs.get("reward", 0.0)) or 0.0)
+        meta = payload.get("metadata", obs.get("metadata", {})) or {}
+        done = bool(payload.get("done", obs.get("done", False)))
+        return obs, reward, meta, done
+    def reset(self, seed: int, scenario: str):
+        data, err = self._post("/reset", {"seed": seed, "scenario": scenario})
+        if err:
+            return None, err
+        obs, _, _, _ = self._normalize(data)
+        return obs, None
+    def step(self, action_type: str):
+        data, err = self._post("/step", {"action_type": action_type})
+        if err:
+            return None, err
+        return self._normalize(data), None
+    def state(self):
+        return self._get("/state")
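
To make the "flat and observation-wrapped" wording in `_normalize` concrete, the sketch below applies the same logic as a standalone function to two sample payloads. The payload contents are assumptions chosen for illustration; the actual server responses may carry different fields.

```python
from typing import Any, Dict, Tuple


def normalize(payload: Dict[str, Any]) -> Tuple[Dict[str, Any], float, Dict[str, Any], bool]:
    # Same logic as APIClient._normalize in the diff above.
    obs = payload.get("observation", payload)
    reward = float(payload.get("reward", obs.get("reward", 0.0)) or 0.0)
    meta = payload.get("metadata", obs.get("metadata", {})) or {}
    done = bool(payload.get("done", obs.get("done", False)))
    return obs, reward, meta, done


# Wrapped shape (assumed): observation nested, reward/done/metadata at top level.
wrapped = {
    "observation": {"budget_remaining": 0.8},
    "reward": 0.5,
    "done": False,
    "metadata": {"cost": 0.1},
}

# Flat shape (assumed): everything at top level, reward missing a value.
flat = {"budget_remaining": 0.8, "reward": None, "done": True}

w_obs, w_reward, w_meta, w_done = normalize(wrapped)
f_obs, f_reward, f_meta, f_done = normalize(flat)

print(w_obs, w_reward, w_meta, w_done)
print(f_obs, f_reward, f_meta, f_done)
```

In the flat case the payload itself is returned as the observation, a `None` reward is coerced to `0.0`, and missing metadata becomes an empty dict, so callers can unpack the four values without checking which shape the server sent.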