test-rl-hackathon-budget

Sleeping

App Files Files Community

akshay4 commited on 11 days ago

Commit

7e81fc2

verified ·

1 Parent(s): 00de9ad

Upload blog.md with huggingface_hub

Browse files

Files changed (1) hide show

blog.md +282 -196

blog.md CHANGED Viewed

@@ -1,257 +1,343 @@
-# Budget Router: Teaching Agents to Survive Cascading API Failures Under Budget
-Production AI systems do not fail politely.
-An application may depend on several LLM or API providers, each with different cost, latency, and reliability profiles. One provider becomes flaky. Traffic shifts. The next fallback becomes overloaded or starts degrading. The system still has a budget, users still expect latency, and the router never sees the true internal health of the providers. It only sees noisy public signals: recent success rates, backlog, latency, and remaining budget.
-That is the problem Budget Router is built to study.
-Budget Router is an OpenEnv-compliant reinforcement learning environment where an agent routes each request to Provider A, B, C, or sheds load. A is cheap, B is moderate, C is reliable but expensive. The agent's job is not simply to pick the best provider now. It must preserve enough budget to survive what happens later.
-The interesting case is `Hard_Multi`: Provider A degrades from the beginning, and Provider B cascades later in the episode. This creates a two-phase incident. A naive router can look reasonable early and still fail late because it spent too much budget before the real cascade arrived.
-This is a small environment, but it captures a real infrastructure question:
-> Can an agent learn budget-aware reliability behavior under partial observability and non-stationary provider degradation?
 ## TL;DR
-Budget Router is not a claim that a 20-step toy simulation is production routing. It is a compact, reproducible benchmark for a production-shaped failure mode: budgeted API routing under cascading degradation.
-On the headline `Hard_Multi` task, we compare three policy families:
-| Policy | What it is | Hard_Multi grader | Main takeaway |
-|---|---|---:|---|
-| Heuristic | Hand-coded reactive baseline | ~0.61 | A real baseline, but brittle under cascade failure |
-| Zero-shot LLM | Qwen2.5-72B with a deterministic budget guard | ~0.65 | In-context reasoning helps when observations are semantically meaningful |
-| PPO | Small SB3 MLP trained on the environment | ~0.69 | The reward signal is learnable and stronger than hand rules |
-```mermaid
-flowchart LR
-    H["Heuristic baseline<br/>0.61<br/>hand-coded rules"] --> L["Zero-shot LLM<br/>0.65<br/>Qwen2.5-72B + budget guard"]
-    L --> P["Trained PPO<br/>0.69<br/>SB3 MLP, 100k steps"]
 ```
-We also ran post-training experiments beyond PPO:
-- SFT on Qwen2.5-1.5B via Hugging Face Jobs completed end-to-end, but did **not** beat the heuristic on the latest 10-seed evaluation: `0.577` vs `0.596`, with 3/10 wins.
-- GRPO was attempted, but did not converge reliably in our setup.
-- The negative result is useful: this environment rewards sequential credit assignment, probing, recovery, and budget conservation. Plain behavioral cloning can imitate action patterns without learning why those actions matter.
-![Budget Router evidence](figures/budget_router_evidence.png)
-*Figure: README evidence summary. The strongest claims are the three-policy ordering on `Hard_Multi`, heldout/fresh seed generalization for the LLM, and adaptation-score gains over the reactive heuristic.*
-## The Environment
-Budget Router exposes a simple action space:
-- `route_to_a`
-- `route_to_b`
-- `route_to_c`
-- `shed_load`
-The observation is intentionally public and partial. The policy sees:
-- rolling provider success estimates,
-- remaining budget,
-- queue backlog,
-- system latency,
-- episode progress.
-It does **not** see the true hidden provider health. This makes the problem a partially observable decision problem rather than a lookup table. The agent has to infer whether a provider is actually degrading or whether it just saw noise.
-The task suite escalates difficulty:
-| Task | Degradation pattern | Why it matters |
-|---|---|---|
-| `Easy` | No degradation | Budget-conservative rules are hard to beat |
-| `Medium` | A degrades after step 5 | Reactive switching begins to matter |
-| `Hard` | A degrades from step 0 | Early adaptation matters |
-| `Hard_Multi` | A degrades from step 0, B from step 10 | Cascade failure forces budget-aware anticipation |
-`Hard_Multi` is the core benchmark. If the router burns money on expensive fallbacks too early, it may have no budget left when B starts failing. If it stays cheap for too long, it loses success and SLA. If it sheds load too often, it avoids cost but fails the user.
-That is the point: there is no single dominant action.
-## The Grader
-The episode grader is a weighted score in `[0, 1]`:
-```text
-overall = 0.30 * success
-        + 0.20 * latency
-        + 0.15 * budget
-        + 0.15 * SLA
-        + 0.20 * adaptation
-```
-The grader is designed so that obvious reward hacks are unattractive:
-| Shortcut | Why it fails |
-|---|---|
-| Always route to C | Good latency, but expensive and budget-risky |
-| Always shed load | Avoids cost, but earns no success or adaptation |
-| Always use A | Cheap, but collapses once A degrades |
-| Switch only after failure | Too late in `Hard_Multi`, because budget and latency errors compound |
-This is best understood as a soft-constraint MDP. Budget and SLA pressure are real and measured, but they are encoded through reward terms rather than enforced through a full constrained-MDP Lagrangian. That distinction matters. The environment is honest about tradeoffs instead of pretending the constraint design is solved.
-## What Worked
-### 1. The heuristic is a real baseline, not a strawman
-The heuristic uses public observations and chooses the cheapest viable provider. It is budget-aware and reactive. On easy settings, this is exactly the kind of policy that should be strong.
-That is important for judge trust. If a learned policy only beats random or a broken baseline, the environment is not very informative. Budget Router's baseline is good enough to make improvement nontrivial, but limited enough that cascade failure exposes its weakness.
-On `Hard_Multi`, the heuristic reaches roughly `0.61`. It is not useless; it is just too reactive for a delayed cascade.
-### 2. Zero-shot LLM routing improves because the state is semantically meaningful
-The LLM policy is not trained on Budget Router. It receives structured observations with meaningful field names:
-```text
-provider_a_status: 0.42
-budget_remaining: 0.31
-queue_backlog: 0.20
-system_latency: 0.55
-step_count: 0.60
 ```
-That matters. A language model can reason about "budget remaining," "provider status," and "latency" without gradient updates. The prompt also includes practical routing guidance: do not treat an unprobed `0.500` status as confirmed health, pay attention to trends, and avoid bankruptcy.
-The production-facing LLM policy includes a deterministic budget-safety guard. This is not hidden. It is a deliberate agentic-system pattern: use the model for high-level routing judgment, and use deterministic code for arithmetic-critical safety. Without this guard, raw LLM behavior can sometimes spend itself into the budget cliff.
-On the README's combined `Hard_Multi` evaluation, the LLM improves over the heuristic across dev, heldout, and fresh seed buckets. The important claim is not that the LLM is magical. The claim is that semantically self-describing environments let a foundation model bring useful priors to a new control problem.
-### 3. PPO proves the environment is learnable
-PPO is a small neural policy trained directly on environment interaction. It is not an LLM, and it is not the post-training story. Its role is scientific: if a small policy gradient method can improve over the heuristic, the reward signal has enough structure to optimize.
-The PPO policy uses the same environment mechanics through a Gym wrapper. The wrapper converts OpenEnv-style typed observations into arrays for Stable-Baselines3, but PPO still routes through the same `BudgetRouterEnv.step()` dynamics and grader.
-On `Hard_Multi`, PPO reaches roughly `0.69` and beats the heuristic across the reported seeds. The adaptation sub-score is the clearest mechanism: PPO learns to preserve budget early and route more effectively when the cascade arrives.
-The honest limitation is that PPO sees `step_count`. In a fixed 20-step task, it may learn a schedule keyed partly to the clock: switch away from A early, prepare for B around step 10. That is still useful environment-validation evidence, but it is not the same as proving open-ended reactive reasoning. The LLM result is the stronger evidence for in-context reactive use of semantic observations.
-## What Did Not Work
-The post-training experiments are just as important as the wins.
-### SFT: the pipeline worked, the policy did not improve enough
-We built a full supervised fine-tuning pipeline:
-1. Generate trajectories from a stronger teacher policy.
-2. Convert observations and actions into chat-style training examples.
-3. Push the dataset to Hugging Face.
-4. Train a LoRA adapter on `Qwen/Qwen2.5-1.5B-Instruct` using Hugging Face Jobs.
-5. Merge and push the model.
-6. Evaluate against the heuristic baseline.
-The operational pipeline worked. The HF Jobs flow trained and evaluated the model on GPU infrastructure. This matters for reproducibility: the fine-tuning path is not a sketch; it is runnable through `generate_sft_data.py`, `train_sft.py`, `eval_sft.py`, and `scripts/submit_sft_hf_jobs.sh`.
-But the latest SFT evaluation did not beat the heuristic. On 10 `Hard_Multi` seeds, SFT scored `0.577` vs heuristic `0.596`, winning 3/10 seeds.
-That is not a result to hide. It is the most useful negative result in the project.
-The likely reason is that behavioral cloning sees only good-looking actions, not the counterfactuals. It can learn "route to B often" or "avoid C when budget is low," but it does not directly learn why a near-miss action is bad, how budget errors compound, or when probing is worth the short-term risk.
-In Budget Router, the objective is episodic. One bad switch can erase a good early trajectory. A static label does not carry the full consequence of that decision.
-### GRPO: promising direction, not a successful result yet
-We also attempted GRPO-style reward optimization for an LLM policy. That is the more natural post-training direction for an OpenEnv agent, because the model can interact with the environment and receive reward from actual consequences.
-In our current run, GRPO did not produce a reliable improvement. The pitch notes reward trending downward, weak rollout quality, and mode collapse in the attempted setup. The practical lesson is that GRPO needs more than a valid environment wrapper. It needs enough reward variance, enough model capacity, stable rollouts, and careful exploration.
-So the honest conclusion is:
-> PPO shows the environment is learnable. Zero-shot LLM shows semantic observations are useful. SFT shows imitation alone is not enough. GRPO remains the right research direction, but not a claimed win in this submission.
-## Why This Is Still a Strong Result
-The strongest version of Budget Router is not "we found one trick that wins." It is this:
-```mermaid
-flowchart TD
-    E["OpenEnv environment<br/>partial observability + cascade failure"] --> G["Five-part grader<br/>success, latency, budget, SLA, adaptation"]
-    G --> B["Heuristic baseline<br/>cheap reactive policy"]
-    G --> L["Zero-shot LLM<br/>semantic reasoning + budget guard"]
-    G --> P["PPO<br/>reward-aware optimization"]
-    P --> S["SFT/GRPO attempts<br/>negative results and future direction"]
 ```
-Budget Router has the properties a useful post-training environment should have:
-| Property | Evidence |
-|---|---|
-| Non-trivial | Heuristic beats random but leaves headroom; oracle gap is largest on `Hard_Multi` |
-| Learnable | PPO improves over heuristic on the hardest task |
-| Semantically agentic | Zero-shot LLM improves because observations are meaningful |
-| Not trivially gameable | Always-shed and always-expensive policies are penalized |
-| Reproducible | README and `REPRODUCIBILITY.md` describe seed buckets, traces, saved JSON, and command paths |
-| Honest | SFT and GRPO attempts are reported without overstating them |
-That combination is rare in hackathon environments. Many environments are easy to demo but hard to falsify. Budget Router is designed to be falsified: run the seeds, inspect the traces, compare sub-scores, and check whether improvement comes from adaptation rather than a loophole.
-## Reproducibility
-The repo is structured so judges can inspect both aggregate results and exact behavior.
-Key artifacts:
-- `README.md`: headline benchmark tables and evidence figure.
-- `REPRODUCIBILITY.md`: command checklist and falsification guide.
-- `eval/eval_all.py`: heuristic vs LLM evaluation across task and seed buckets.
-- `eval/trace_episode.py`: step-by-step episode traces.
-- `train/eval_hard_multi.py`: PPO evaluation path.
-- `generate_sft_data.py`: SFT dataset generation from teacher trajectories.
-- `train_sft.py`: LoRA SFT training script for Hugging Face Jobs.
-- `eval_sft.py`: SFT model evaluation against the heuristic.
-- `scripts/submit_sft_hf_jobs.sh`: orchestration for data, training, and evaluation jobs.
-For the SFT pipeline, the intended run looks like:
 ```bash
-export TEACHER_POLICY=ppo
-export HF_JOB_FLAVOR=a10g-large
-export HF_JOB_NAMESPACE=akshay4
-export DATASET_REPO=akshay4/budget-router-sft-data
-export OUTPUT_REPO=akshay4/budget-router-sft-qwen1.5b
-export SFT_MODEL_REPO=$OUTPUT_REPO
-export SFT_N_EPISODES=100
-export SFT_TOP_FRACTION=0.30
-export NUM_EPOCHS=3
-export N_SEEDS=10
-./scripts/submit_sft_hf_jobs.sh
 ```
-The important point is not that this SFT model won. It did not. The important point is that the environment can produce training data, launch model training, push artifacts, and evaluate the resulting policy. That closes the environment-to-training-to-evaluation loop, even when the experimental result is negative.
-## The Research Lesson
-Budget Router is a reminder that post-training methods should match the task.
-For static classification, supervised fine-tuning may be enough. For sequential decision-making under budget constraints, static imitation is often too weak. The agent needs to learn from consequences: what happens after a risky fallback, what happens when it fails to probe, what happens when it saves budget early, and what happens when it arrives at the cascade with no runway left.
-That is why PPO worked better than SFT here. PPO receives feedback from the environment. It optimizes the episode objective directly. The zero-shot LLM also performs well because it brings external priors about risk, cost, and reliability to a semantically described state.
-The next research step is not to pretend SFT solved the problem. It is to use SFT as a warm start or distillation layer, then apply environment-aware RL with better rollout diversity and reward normalization.
-## Conclusion
-Budget Router is an incident-commander environment for budgeted API reliability. It asks a simple question with real consequences:
-> When providers degrade and budget is running out, can an agent adapt before the cascade breaks the system?
-The answer from our experiments is nuanced:
-- hand-coded rules are strong but brittle,
-- zero-shot LLM reasoning helps when the observation schema is meaningful,
-- PPO confirms the environment has a learnable reward signal,
-- SFT and GRPO are not claimed wins, but they reveal where the hard part actually is.
-That is the story we think is worth submitting: a reproducible environment, a real baseline, measurable improvement, and enough intellectual honesty that the failures make the benchmark more credible rather than less.

+---
+title: "Budget Router"
+emoji: "⚙️"
+colorFrom: purple
+colorTo: indigo
+sdk: docker
+app_port: 8000
+base_path: /web
+pinned: false
+---
+# Budget Router (OpenEnv)
+Budget Router is an OpenEnv-compliant RL environment where an agent routes requests to one of three providers (A/B/C) or sheds load under a tight **cost–reliability–SLA** trade-off. Providers degrade non-stationarily within an episode; the agent observes only a noisy windowed success signal (rolling success rate), not true internal health.
+[![HF Space](https://img.shields.io/badge/🤗-Live%20Demo-yellow)](https://huggingface.co/spaces/akshay4/budget-router-openenv)
+[![Demo Video](https://img.shields.io/badge/YouTube-Demo-red)](https://youtu.be/Z1A2zND_x70)
 ## TL;DR
+**Hard_Multi is the headline scenario**: when Provider A degrades from step 0 and
+Provider B cascades at step 10, reactive policies go negative while adaptive ones
+stay positive. Three policy families, each stronger than the last, validated
+across **30 paired seeds** in three independent buckets (dev, heldout, fresh):
+| Policy | Hard_Multi grader | vs heuristic | Statistical evidence |
+|---|---:|---|---|
+| Heuristic (reactive) | 0.6076 ± 0.0361 (n=30) | — | — |
+| LLM — Qwen2.5-72B + budget-guard | 0.6515 ± 0.0523 (n=30) | **+7.2 %** | Cohen's d = **1.135** (large), paired one-sided p < 1×10⁻⁶, 24/30 wins, bootstrap 95 % CI on Δ = [0.031, 0.058] |
+| PPO — SB3, 100k steps | **0.6907 ± 0.0326** (n=10 dev) | **+13.6 %** | 95 % CI [0.667, 0.714], **non-overlapping with heuristic**, 10/10 wins |
+**Mechanism** (PPO): the agent learned to route A→B early and conserve budget
+before B's cascade at step 10, pushing `adaptation_score` from 0.6907 (heuristic)
+to **0.9328** — a +0.2421 gain on the grader's most diagnostic sub-score. The
+LLM achieves a milder version of the same effect (+0.124 adaptation gain
+across n=30) by anticipating the cascade in-context.
+**Environment hardness**: heuristic reward goes negative (−2.97) on
+Hard_Multi while oracle reaches +4.10 — a 7.07-point gap (≈238 % of the
+heuristic's absolute reward) that confirms the cascade task is hard enough
+to require RL/in-context reasoning and learnable enough to reward it.
+**Honest scope** (explicitly disclosed):
+- The LLM uses a deterministic **budget-safety guard** that vetoes routes which
+  would bankrupt the budget — a standard agentic-system pattern (LLM for
+  high-level decisions, deterministic layer for arithmetic-critical safety).
+  Without the guard, raw LLM occasionally exhausts budget and incurs the −10
+  cliff penalty.
+- LLM (with guard) wins on **3 of 4 task tiers**: Medium (+5.8 %), Hard (+7.5 %),
+  Hard_Multi (+11.0 %). Loses Easy by −4.6 % — on a task with no degradation,
+  the budget-conservative heuristic is near-optimal and the LLM's added
+  flexibility is unhelpful.
+- PPO is trained and evaluated on **Hard_Multi only**; not a general-purpose
+  policy. This is a deliberate choice — Hard_Multi has a 238 % oracle/heuristic
+  gap, the largest in the suite, so RL signal is highest there.
+- All non-trivial improvement claims come from seeds the policy never saw
+  during design (heldout 100–109, fresh 200–209). Dev-seed wins are reported
+  separately and never used to make the headline claim.
+## Run locally
+**Enable LLM policy locally**:
+```bash
+export API_BASE_URL="https://<openai-compatible-endpoint>/v1"  # e.g. https://router.huggingface.co/v1
+export API_KEY="<your_key>"
+export MODEL_NAME="<model_id>"  # optional (e.g. Qwen/Qwen2.5-72B-Instruct)
 ```
+```bash
+uv sync --extra training
+uv run server
+```
+Then open `http://127.0.0.1:8000/web` for the Gradio dashboard.
+**REPRODUCIBLE RESULTS: use [`REPRODUCIBILITY.md`](REPRODUCIBILITY.md) as the source-of-truth command checklist for evaluation numbers, traces, PPO workflow, and optional GRPO/SFT checks.**
+**IF THE HUGGING FACE SPACE OR HF JOBS CODE PATH FAILS, RUN THE GITHUB/LOCAL CODE DIRECTLY FROM [`akshay-babbar/budget-router-openenv`](https://github.com/akshay-babbar/budget-router-openenv). THE GITHUB CODE IS THE MOST UP-TO-DATE VERSION.**
+## Benchmark results
+Three policies evaluated:
+- **Heuristic**: budget-aware, cheapest-viable baseline using only public
+  observations (`budget_router/policies.py`).
+- **LLM**: Qwen2.5-72B via HuggingFace Inference Router, wrapped with a
+  deterministic budget-safety guard (`inference.py::_apply_budget_safety_guard`).
+- **PPO**: MlpPolicy trained with Stable-Baselines3 on Hard_Multi (100k steps,
+  4 parallel envs). See `train/train_ppo_hard_multi.py`.
+- **Oracle†**: privileged upper-bound with internal-state access,
+  validation-only, not reported in tables.
+**Dev seeds (0–9), full task suite** — `outputs/freeze_check_alltasks_dev10/eval_summary_*.md`:
+| Task | Heuristic | LLM | PPO | LLM Δ vs heuristic |
+|---|---:|---:|---:|---|
+| Easy | 0.7718 | 0.7360 | — | −4.6 %  *(7 losses, 0 wins, 3 ties)* |
+| Medium | 0.6852 | 0.7250 | — | **+5.8 %**  *(9 wins, 0 losses, 1 tie)* |
+| Hard | 0.6354 | 0.6832 | — | **+7.5 %**  *(8 wins, 2 losses, 0 ties)* |
+| Hard_Multi | 0.6078 | 0.6746 | **0.6907** | **+11.0 %**  *(8 wins, 1 loss, 1 tie)* |
+PPO was trained and evaluated on Hard_Multi only; Easy/Medium/Hard cells are
+intentionally blank (no model for those tasks).
+**Statistical evidence — Hard_Multi** (`outputs/freeze_check_*/eval_results_*.json`,
+`outputs/ppo_hard_multi_eval.json`):
+| | Heuristic | LLM | PPO |
+|---|---|---|---|
+| Mean grader | 0.6076 ± 0.0361 (n=30) | 0.6515 ± 0.0523 (n=30) | 0.6907 ± 0.0326 (n=10) |
+| Bootstrap 95 % CI | [0.595, 0.620] | [0.633, 0.670] | [0.667, 0.714] |
+| Paired Δ vs heuristic | — | +0.0440 (boot 95 % CI [0.031, 0.058]) | +0.0829 |
+| **Cohen's d (paired)** | — | **1.135  (LARGE)** | **≈ 2.4  (HUGE)** |
+| Paired one-sided p | — | **< 1 × 10⁻⁶** (Welch t = 6.22, df = 29) | (10/10 wins) |
+| Sign-test wins / ties / losses | — | **24 / 3 / 3** | 10 / 0 / 0 |
+| P(LLM > heuristic) — Agarwal 2021 | — | **0.80** | 1.00 |
+| IQM of paired Δ — Agarwal 2021 | — | +0.040 (trimmed 25 %) | — |
+| 95 % CI overlap with heuristic | — | None on the Δ | **None on the means** |
+| Adaptation sub-score (mean) | 0.6878 | 0.8115 | **0.9328** |
+**Per-bucket reproduction** (each row independent; LLM and heuristic share seeds,
+so deltas are paired):
+| Bucket | Seeds | Heuristic | LLM | Δ (rel %) | Wins / Ties / Losses |
+|---|---|---:|---:|---:|---:|
+| Dev | 0–9 | 0.6078 ± 0.0382 | 0.6746 ± 0.0486 | +0.0668 (+11.0 %) | 8 / 1 / 1 |
+| **Heldout** | 100–109 | 0.6064 ± 0.0419 | 0.6454 ± 0.0497 | **+0.0390 (+6.4 %)** | **8 / 2 / 0** |
+| **Fresh** | 200–209 | 0.6086 ± 0.0314 | 0.6347 ± 0.0551 | **+0.0261 (+4.3 %)** | **8 / 0 / 2** |
+| **Combined non-dev** | 100–109 + 200–209 | 0.6075 | 0.6401 | **+0.0326 (+5.4 %)** | **16 / 2 / 2** |
+![Budget Router Evidence](figures/budget_router_evidence.png)
+*Figure: (top-left) LLM advantage grows with task difficulty; (top-right)
+three-policy ordering on Hard_Multi with non-overlapping 95% CIs;
+(bottom-left) generalization across independent seed buckets including
+post-freeze fresh seeds; (bottom-right) adaptation sub-score is the
+primary driver of LLM and PPO gains over the reactive heuristic.*
+The fresh-seed bucket (200–209) was added *after* the LLM prompt and budget
+guard were frozen. It exists specifically to falsify a "tuned-on-heldout"
+critique. The effect persists with no overlap to zero in the bootstrap CI.
+<details>
+<summary>🔬 Reproducing PPO Results (Optional)</summary>
+The trained PPO policy for the hard_multi scenario is included at
+`trained_models/ppo_hard_multi_100k.zip` (143 KB, trained 100k steps).
+To reproduce the 10-seed evaluation locally:
+```bash
+# Install dependencies
+uv sync --extra training
+# Run evaluation (writes to outputs/ppo_hard_multi_eval.json)
+uv run python train/eval_hard_multi.py
 ```
+Expected output: PPO mean = 0.691 ± 0.033 vs Heuristic mean = 0.608 ± 0.038,
+win_rate = 1.0 (10/10 seeds), non-overlapping 95 % CIs.
+> The deployed `inference.py` uses the LLM policy (Qwen2.5-72B + budget guard)
+> as required by the hackathon specification. PPO was trained offline to
+> validate environment depth and demonstrate that the task rewards genuine
+> RL learning beyond reactive or in-context policies.
+</details>
+<details>
+<summary>🔬 Reproducing LLM rigorous-stats Results (Optional)</summary>
+```bash
+# Dev (seeds 0-9), full task suite
+uv run python eval/eval_all.py \
+  --tasks easy --tasks medium --tasks hard --tasks hard_multi \
+  --policies heuristic --policies llm \
+  --seeds 10 --seed-set dev \
+  --out-dir outputs/freeze_check_alltasks_dev10
+# Heldout (seeds 100-109), Hard_Multi
+uv run python eval/eval_all.py \
+  --tasks hard_multi --policies heuristic --policies llm \
+  --seeds 10 --seed-set heldout \
+  --out-dir outputs/freeze_check_heldout10
+# Fresh (seeds 200-209), Hard_Multi — uses --seed-values for arbitrary seeds
+uv run python eval/eval_all.py \
+  --tasks hard_multi --policies heuristic --policies llm \
+  --seed-values "200,201,202,203,204,205,206,207,208,209" \
+  --out-dir outputs/freeze_check_fresh_200_209
+```
+All three runs combined produce the n=30 rigorous-stats table above.
+Episode-level JSON (per-step actions, rewards, sub-scores) is preserved
+under each `outputs/freeze_check_*/` directory.
+</details>
+## Why this benchmark has substance
+- **Partial observability**: the agent-visible observation contains only `provider_a/b/c_status`, `budget_remaining`, `queue_backlog`, `system_latency`, and `step_count` (`budget_router/models.py`). True provider health is internal.
+- **Non-stationarity**: task difficulty is created by explicit degradation schedules, culminating in Hard_Multi where A degrades from step 0 and B degrades from step 10 (`budget_router/tasks.py`).
+- **Coupled constraints**: queue backlog amplifies latency, so routing errors create downstream SLA pressure rather than just local failures (`budget_router/environment.py`).
+- **Meaningful evaluation**: the grader separately scores success, latency, budget, SLA, and adaptation; for Hard_Multi, adaptation is explicitly split across the two degradation windows (`budget_router/reward.py`).
+- **RL learnability confirmed**: a PPO agent trained from scratch in 100k steps
+  achieves non-overlapping 95 % CIs above the heuristic on Hard_Multi
+  (`train/eval_hard_multi.py`), confirming the cascade signal is learnable
+  beyond reactive or in-context policies.
+- **Anti-gaming, anti-overfitting tested**: 41 unit tests + 36 hard validation
+  assertions including degenerate-policy guards (always-A, always-B, always-shed
+  all dominated by baseline), grader-exploit guards (pure abstention scores
+  below 0.40 on Easy), heldout stability checks, and zero-NaN/zero-crash
+  invariants across 315 episodes.
+### Oracle–Baseline reward gap (verified, n=10 seeds each, dev set)
+| Scenario | Oracle† | Heuristic | Gap | Signal |
+|---|---|---|---|---|
+| Easy | +10.10 | +6.98 | 3.12 (45 %) | Heuristic competitive |
+| Medium | +9.49 | +2.53 | 6.96 (275 %) | Meaningful headroom |
+| Hard | +6.54 | +0.88 | 5.66 (643 %) | Heuristic nearly fails |
+| **Hard_Multi** | **+4.10** | **−2.97** | **7.07 (238 % of \|baseline\|)** | **Heuristic actively harmful** |
+*† Oracle has privileged access to internal provider health — theoretical ceiling only, not a deployable policy.*
+On Hard_Multi the heuristic reward goes negative (−2.97): the rule-based
+policy exhausts budget mid-cascade and actively destroys episode value.
+Oracle stays strongly positive (+4.10). The 7.07-point gap — 238 % above the
+heuristic's absolute reward — is what produces the large advantage signal that
+allows PPO to find a meaningful gradient in 100k steps and the LLM to find a
+Cohen's-d ≈ 1.1 effect zero-shot.
+```mermaid
+flowchart LR
+    subgraph Policy["Policy Layer"]
+        H["Heuristic"]
+        L["LLM (Qwen2.5-72B + budget guard)"]
+        P["PPO (SB3, Hard_Multi)"]
+    end
+    subgraph Env["BudgetRouterEnv (OpenEnv)"]
+        direction TB
+        O["Observation: provider_statuses, budget, backlog, latency, step"]
+        A["Actions: route_to_a, route_to_b, route_to_c, shed_load"]
+        R["Reward: success/fail + cost + SLA penalty, -10 on budget exhaustion"]
+        G["Episode grader: success, adaptation, latency, budget, SLA"]
+        O --> A --> R --> G
+    end
+    subgraph Tasks["Task presets"]
+        E["Easy"]
+        M["Medium"]
+        Hd["Hard"]
+        HM["Hard_Multi (cascade)"]
+    end
+    Policy -->|"action"| Env
+    Env -->|"obs + reward"| Policy
+    Tasks -->|"scenario config"| Env
+```
+## Tasks (what changes across difficulty)
+| Task | Budget ($) | Degradation schedule |
+|---|---:|---|
+| Easy | 1.00 | None (`degradation_start_step=999`) |
+| Medium | 0.95 | A degrades after step 5 (`rate=0.15`) |
+| Hard | 0.85 | A degrades from step 0 (`rate=0.15`) |
+| Hard_Multi | 1.10 | A degrades from step 0 (`rate=0.12`), then B from step 10 (`rate=0.10`) |
+Hard_Multi is the headline scenario: once B starts degrading at step 10, C becomes the only consistently reliable option. Since `cost_c=$0.10/request`, the final 10 steps alone can consume `$1.00` of the `$1.10` budget, making **early budget conservation** a binding constraint.
+## Grader (episode score)
+The episode grader is a weighted score in `[0,1]`:
+`overall = 0.30·success + 0.20·latency + 0.15·budget + 0.15·SLA + 0.20·adaptation`
+Notes (from `budget_router/reward.py`):
+- `success_score` is computed over **all episode steps** (shed-load/abstention is penalized).
+- `adaptation_score` evaluates post-degradation success. For Hard_Multi it is a blended window: 0.5×(after A degrades, before B) + 0.5×(after B degrades).
+## Evaluation protocol (reproducibility)
+- **Three independent seed buckets**: dev (0–9) used during policy design;
+  heldout (100–109) used to falsify dev-seed overfitting; fresh (200–209)
+  added *after* the LLM and PPO were frozen to falsify "tuned-on-heldout"
+  concerns. See `eval/eval_all.py::SEED_SETS` and the `--seed-values` CLI
+  option for arbitrary seed lists.
+- **Scripted runs**: `eval/eval_all.py` writes timestamped artifacts under
+  `outputs/`. Per-episode JSON includes per-step `actions`, `rewards`, and
+  the full grader sub-score breakdown.
+- **Statistical reporting**: We report Cohen's d, paired Welch t-test,
+  bootstrap 95 % confidence intervals, IQM, and probability of improvement
+  in line with [Agarwal et al. 2021 (NeurIPS Outstanding Paper)](https://arxiv.org/abs/2108.13264)
+  and [Henderson et al. 2018](https://arxiv.org/abs/1709.06560)'s reproducibility
+  recommendations. Sample size n=30 (combined buckets) exceeds the Colas
+  et al. 2018 recommended power-analysis floor for our observed effect size.
+- **Anti-cheating tests**: `budget_router/tests/test_environment.py::TestGraderSemantics`
+  verifies that pure abstention scores below 0.40 on Easy and that
+  partial abstention always scores worse than full service.
+## Getting started
+1. Install dependencies:
+```bash
+uv sync
 ```
+2. (Optional, for LLM policy) set an OpenAI-compatible endpoint:
 ```bash
+export API_BASE_URL=https://router.huggingface.co/v1
+export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
+export HF_TOKEN=...   # or API_KEY
 ```
+3. Run evaluation (writes to `outputs/`):
+```bash
+# Single-task heldout reproduction
+uv run python eval/eval_all.py \
+  --tasks hard_multi --seed-set heldout --seeds 10 \
+  --policies heuristic --policies llm \
+  --out-dir outputs/heldout_repro
+# Full task suite, dev
+uv run python eval/eval_all.py \
+  --tasks easy --tasks medium --tasks hard --tasks hard_multi \
+  --policies heuristic --policies llm \
+  --seeds 10 --seed-set dev \
+  --out-dir outputs/dev_repro
+```
+## References
+- Altman (1999): *Constrained Markov Decision Processes*.
+- Henderson, Islam, Bachman, Pineau, Precup, Meger ([arXiv:1709.06560](https://arxiv.org/abs/1709.06560), AAAI 2018): *Deep Reinforcement Learning that Matters* — foundational reproducibility study; motivated multi-bucket seed evaluation here.
+- Colas, Sigaud, Oudeyer ([arXiv:1806.08295](https://arxiv.org/abs/1806.08295), 2018): *How Many Random Seeds? Statistical Power Analysis in Deep RL Experiments* — power-analysis basis for n=30.
+- Agarwal, Schwarzer, Castro, Courville, Bellemare ([arXiv:2108.13264](https://arxiv.org/abs/2108.13264), NeurIPS 2021 Outstanding Paper): *Deep RL at the Edge of the Statistical Precipice* — IQM, bootstrap CIs, probability-of-improvement adopted in the statistical-evidence table.