akshay4 commited on
Commit
7e81fc2
·
verified ·
1 Parent(s): 00de9ad

Upload blog.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. blog.md +282 -196
blog.md CHANGED
@@ -1,257 +1,343 @@
1
- # Budget Router: Teaching Agents to Survive Cascading API Failures Under Budget
 
 
 
 
 
 
 
 
 
2
 
3
- Production AI systems do not fail politely.
4
 
5
- An application may depend on several LLM or API providers, each with different cost, latency, and reliability profiles. One provider becomes flaky. Traffic shifts. The next fallback becomes overloaded or starts degrading. The system still has a budget, users still expect latency, and the router never sees the true internal health of the providers. It only sees noisy public signals: recent success rates, backlog, latency, and remaining budget.
6
 
7
- That is the problem Budget Router is built to study.
8
-
9
- Budget Router is an OpenEnv-compliant reinforcement learning environment where an agent routes each request to Provider A, B, C, or sheds load. A is cheap, B is moderate, C is reliable but expensive. The agent's job is not simply to pick the best provider now. It must preserve enough budget to survive what happens later.
10
-
11
- The interesting case is `Hard_Multi`: Provider A degrades from the beginning, and Provider B cascades later in the episode. This creates a two-phase incident. A naive router can look reasonable early and still fail late because it spent too much budget before the real cascade arrived.
12
-
13
- This is a small environment, but it captures a real infrastructure question:
14
-
15
- > Can an agent learn budget-aware reliability behavior under partial observability and non-stationary provider degradation?
16
 
17
  ## TL;DR
18
 
19
- Budget Router is not a claim that a 20-step toy simulation is production routing. It is a compact, reproducible benchmark for a production-shaped failure mode: budgeted API routing under cascading degradation.
20
-
21
- On the headline `Hard_Multi` task, we compare three policy families:
22
-
23
- | Policy | What it is | Hard_Multi grader | Main takeaway |
24
- |---|---|---:|---|
25
- | Heuristic | Hand-coded reactive baseline | ~0.61 | A real baseline, but brittle under cascade failure |
26
- | Zero-shot LLM | Qwen2.5-72B with a deterministic budget guard | ~0.65 | In-context reasoning helps when observations are semantically meaningful |
27
- | PPO | Small SB3 MLP trained on the environment | ~0.69 | The reward signal is learnable and stronger than hand rules |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
28
 
29
- ```mermaid
30
- flowchart LR
31
- H["Heuristic baseline<br/>0.61<br/>hand-coded rules"] --> L["Zero-shot LLM<br/>0.65<br/>Qwen2.5-72B + budget guard"]
32
- L --> P["Trained PPO<br/>0.69<br/>SB3 MLP, 100k steps"]
33
  ```
34
 
35
- We also ran post-training experiments beyond PPO:
36
-
37
- - SFT on Qwen2.5-1.5B via Hugging Face Jobs completed end-to-end, but did **not** beat the heuristic on the latest 10-seed evaluation: `0.577` vs `0.596`, with 3/10 wins.
38
- - GRPO was attempted, but did not converge reliably in our setup.
39
- - The negative result is useful: this environment rewards sequential credit assignment, probing, recovery, and budget conservation. Plain behavioral cloning can imitate action patterns without learning why those actions matter.
40
-
41
- ![Budget Router evidence](figures/budget_router_evidence.png)
42
-
43
- *Figure: README evidence summary. The strongest claims are the three-policy ordering on `Hard_Multi`, heldout/fresh seed generalization for the LLM, and adaptation-score gains over the reactive heuristic.*
44
 
45
- ## The Environment
46
-
47
- Budget Router exposes a simple action space:
48
-
49
- - `route_to_a`
50
- - `route_to_b`
51
- - `route_to_c`
52
- - `shed_load`
53
 
54
- The observation is intentionally public and partial. The policy sees:
55
 
56
- - rolling provider success estimates,
57
- - remaining budget,
58
- - queue backlog,
59
- - system latency,
60
- - episode progress.
61
 
62
- It does **not** see the true hidden provider health. This makes the problem a partially observable decision problem rather than a lookup table. The agent has to infer whether a provider is actually degrading or whether it just saw noise.
63
 
64
- The task suite escalates difficulty:
65
 
66
- | Task | Degradation pattern | Why it matters |
67
- |---|---|---|
68
- | `Easy` | No degradation | Budget-conservative rules are hard to beat |
69
- | `Medium` | A degrades after step 5 | Reactive switching begins to matter |
70
- | `Hard` | A degrades from step 0 | Early adaptation matters |
71
- | `Hard_Multi` | A degrades from step 0, B from step 10 | Cascade failure forces budget-aware anticipation |
72
 
73
- `Hard_Multi` is the core benchmark. If the router burns money on expensive fallbacks too early, it may have no budget left when B starts failing. If it stays cheap for too long, it loses success and SLA. If it sheds load too often, it avoids cost but fails the user.
74
 
75
- That is the point: there is no single dominant action.
 
 
 
 
 
 
 
76
 
77
- ## The Grader
78
 
79
- The episode grader is a weighted score in `[0, 1]`:
 
 
 
 
 
80
 
81
- ```text
82
- overall = 0.30 * success
83
- + 0.20 * latency
84
- + 0.15 * budget
85
- + 0.15 * SLA
86
- + 0.20 * adaptation
87
- ```
88
 
89
- The grader is designed so that obvious reward hacks are unattractive:
 
90
 
91
- | Shortcut | Why it fails |
92
- |---|---|
93
- | Always route to C | Good latency, but expensive and budget-risky |
94
- | Always shed load | Avoids cost, but earns no success or adaptation |
95
- | Always use A | Cheap, but collapses once A degrades |
96
- | Switch only after failure | Too late in `Hard_Multi`, because budget and latency errors compound |
 
 
 
 
 
 
97
 
98
- This is best understood as a soft-constraint MDP. Budget and SLA pressure are real and measured, but they are encoded through reward terms rather than enforced through a full constrained-MDP Lagrangian. That distinction matters. The environment is honest about tradeoffs instead of pretending the constraint design is solved.
 
99
 
100
- ## What Worked
 
 
 
 
 
101
 
102
- ### 1. The heuristic is a real baseline, not a strawman
 
 
 
 
 
103
 
104
- The heuristic uses public observations and chooses the cheapest viable provider. It is budget-aware and reactive. On easy settings, this is exactly the kind of policy that should be strong.
 
 
105
 
106
- That is important for judge trust. If a learned policy only beats random or a broken baseline, the environment is not very informative. Budget Router's baseline is good enough to make improvement nontrivial, but limited enough that cascade failure exposes its weakness.
 
107
 
108
- On `Hard_Multi`, the heuristic reaches roughly `0.61`. It is not useless; it is just too reactive for a delayed cascade.
 
109
 
110
- ### 2. Zero-shot LLM routing improves because the state is semantically meaningful
111
 
112
- The LLM policy is not trained on Budget Router. It receives structured observations with meaningful field names:
 
 
113
 
114
- ```text
115
- provider_a_status: 0.42
116
- budget_remaining: 0.31
117
- queue_backlog: 0.20
118
- system_latency: 0.55
119
- step_count: 0.60
120
  ```
121
 
122
- That matters. A language model can reason about "budget remaining," "provider status," and "latency" without gradient updates. The prompt also includes practical routing guidance: do not treat an unprobed `0.500` status as confirmed health, pay attention to trends, and avoid bankruptcy.
123
-
124
- The production-facing LLM policy includes a deterministic budget-safety guard. This is not hidden. It is a deliberate agentic-system pattern: use the model for high-level routing judgment, and use deterministic code for arithmetic-critical safety. Without this guard, raw LLM behavior can sometimes spend itself into the budget cliff.
125
-
126
- On the README's combined `Hard_Multi` evaluation, the LLM improves over the heuristic across dev, heldout, and fresh seed buckets. The important claim is not that the LLM is magical. The claim is that semantically self-describing environments let a foundation model bring useful priors to a new control problem.
127
-
128
- ### 3. PPO proves the environment is learnable
129
-
130
- PPO is a small neural policy trained directly on environment interaction. It is not an LLM, and it is not the post-training story. Its role is scientific: if a small policy gradient method can improve over the heuristic, the reward signal has enough structure to optimize.
131
-
132
- The PPO policy uses the same environment mechanics through a Gym wrapper. The wrapper converts OpenEnv-style typed observations into arrays for Stable-Baselines3, but PPO still routes through the same `BudgetRouterEnv.step()` dynamics and grader.
133
-
134
- On `Hard_Multi`, PPO reaches roughly `0.69` and beats the heuristic across the reported seeds. The adaptation sub-score is the clearest mechanism: PPO learns to preserve budget early and route more effectively when the cascade arrives.
135
 
136
- The honest limitation is that PPO sees `step_count`. In a fixed 20-step task, it may learn a schedule keyed partly to the clock: switch away from A early, prepare for B around step 10. That is still useful environment-validation evidence, but it is not the same as proving open-ended reactive reasoning. The LLM result is the stronger evidence for in-context reactive use of semantic observations.
 
 
 
137
 
138
- ## What Did Not Work
139
 
140
- The post-training experiments are just as important as the wins.
 
141
 
142
- ### SFT: the pipeline worked, the policy did not improve enough
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
143
 
144
- We built a full supervised fine-tuning pipeline:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
145
 
146
- 1. Generate trajectories from a stronger teacher policy.
147
- 2. Convert observations and actions into chat-style training examples.
148
- 3. Push the dataset to Hugging Face.
149
- 4. Train a LoRA adapter on `Qwen/Qwen2.5-1.5B-Instruct` using Hugging Face Jobs.
150
- 5. Merge and push the model.
151
- 6. Evaluate against the heuristic baseline.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
152
 
153
- The operational pipeline worked. The HF Jobs flow trained and evaluated the model on GPU infrastructure. This matters for reproducibility: the fine-tuning path is not a sketch; it is runnable through `generate_sft_data.py`, `train_sft.py`, `eval_sft.py`, and `scripts/submit_sft_hf_jobs.sh`.
154
 
155
- But the latest SFT evaluation did not beat the heuristic. On 10 `Hard_Multi` seeds, SFT scored `0.577` vs heuristic `0.596`, winning 3/10 seeds.
 
 
 
 
 
156
 
157
- That is not a result to hide. It is the most useful negative result in the project.
158
 
159
- The likely reason is that behavioral cloning sees only good-looking actions, not the counterfactuals. It can learn "route to B often" or "avoid C when budget is low," but it does not directly learn why a near-miss action is bad, how budget errors compound, or when probing is worth the short-term risk.
160
 
161
- In Budget Router, the objective is episodic. One bad switch can erase a good early trajectory. A static label does not carry the full consequence of that decision.
162
 
163
- ### GRPO: promising direction, not a successful result yet
164
 
165
- We also attempted GRPO-style reward optimization for an LLM policy. That is the more natural post-training direction for an OpenEnv agent, because the model can interact with the environment and receive reward from actual consequences.
166
 
167
- In our current run, GRPO did not produce a reliable improvement. The pitch notes reward trending downward, weak rollout quality, and mode collapse in the attempted setup. The practical lesson is that GRPO needs more than a valid environment wrapper. It needs enough reward variance, enough model capacity, stable rollouts, and careful exploration.
 
168
 
169
- So the honest conclusion is:
170
 
171
- > PPO shows the environment is learnable. Zero-shot LLM shows semantic observations are useful. SFT shows imitation alone is not enough. GRPO remains the right research direction, but not a claimed win in this submission.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
172
 
173
- ## Why This Is Still a Strong Result
174
 
175
- The strongest version of Budget Router is not "we found one trick that wins." It is this:
176
 
177
- ```mermaid
178
- flowchart TD
179
- E["OpenEnv environment<br/>partial observability + cascade failure"] --> G["Five-part grader<br/>success, latency, budget, SLA, adaptation"]
180
- G --> B["Heuristic baseline<br/>cheap reactive policy"]
181
- G --> L["Zero-shot LLM<br/>semantic reasoning + budget guard"]
182
- G --> P["PPO<br/>reward-aware optimization"]
183
- P --> S["SFT/GRPO attempts<br/>negative results and future direction"]
184
  ```
185
 
186
- Budget Router has the properties a useful post-training environment should have:
187
-
188
- | Property | Evidence |
189
- |---|---|
190
- | Non-trivial | Heuristic beats random but leaves headroom; oracle gap is largest on `Hard_Multi` |
191
- | Learnable | PPO improves over heuristic on the hardest task |
192
- | Semantically agentic | Zero-shot LLM improves because observations are meaningful |
193
- | Not trivially gameable | Always-shed and always-expensive policies are penalized |
194
- | Reproducible | README and `REPRODUCIBILITY.md` describe seed buckets, traces, saved JSON, and command paths |
195
- | Honest | SFT and GRPO attempts are reported without overstating them |
196
-
197
- That combination is rare in hackathon environments. Many environments are easy to demo but hard to falsify. Budget Router is designed to be falsified: run the seeds, inspect the traces, compare sub-scores, and check whether improvement comes from adaptation rather than a loophole.
198
-
199
- ## Reproducibility
200
-
201
- The repo is structured so judges can inspect both aggregate results and exact behavior.
202
-
203
- Key artifacts:
204
-
205
- - `README.md`: headline benchmark tables and evidence figure.
206
- - `REPRODUCIBILITY.md`: command checklist and falsification guide.
207
- - `eval/eval_all.py`: heuristic vs LLM evaluation across task and seed buckets.
208
- - `eval/trace_episode.py`: step-by-step episode traces.
209
- - `train/eval_hard_multi.py`: PPO evaluation path.
210
- - `generate_sft_data.py`: SFT dataset generation from teacher trajectories.
211
- - `train_sft.py`: LoRA SFT training script for Hugging Face Jobs.
212
- - `eval_sft.py`: SFT model evaluation against the heuristic.
213
- - `scripts/submit_sft_hf_jobs.sh`: orchestration for data, training, and evaluation jobs.
214
-
215
- For the SFT pipeline, the intended run looks like:
216
 
217
  ```bash
218
- export TEACHER_POLICY=ppo
219
- export HF_JOB_FLAVOR=a10g-large
220
- export HF_JOB_NAMESPACE=akshay4
221
- export DATASET_REPO=akshay4/budget-router-sft-data
222
- export OUTPUT_REPO=akshay4/budget-router-sft-qwen1.5b
223
- export SFT_MODEL_REPO=$OUTPUT_REPO
224
- export SFT_N_EPISODES=100
225
- export SFT_TOP_FRACTION=0.30
226
- export NUM_EPOCHS=3
227
- export N_SEEDS=10
228
-
229
- ./scripts/submit_sft_hf_jobs.sh
230
  ```
231
 
232
- The important point is not that this SFT model won. It did not. The important point is that the environment can produce training data, launch model training, push artifacts, and evaluate the resulting policy. That closes the environment-to-training-to-evaluation loop, even when the experimental result is negative.
233
-
234
- ## The Research Lesson
235
-
236
- Budget Router is a reminder that post-training methods should match the task.
237
-
238
- For static classification, supervised fine-tuning may be enough. For sequential decision-making under budget constraints, static imitation is often too weak. The agent needs to learn from consequences: what happens after a risky fallback, what happens when it fails to probe, what happens when it saves budget early, and what happens when it arrives at the cascade with no runway left.
239
-
240
- That is why PPO worked better than SFT here. PPO receives feedback from the environment. It optimizes the episode objective directly. The zero-shot LLM also performs well because it brings external priors about risk, cost, and reliability to a semantically described state.
241
 
242
- The next research step is not to pretend SFT solved the problem. It is to use SFT as a warm start or distillation layer, then apply environment-aware RL with better rollout diversity and reward normalization.
243
-
244
- ## Conclusion
245
-
246
- Budget Router is an incident-commander environment for budgeted API reliability. It asks a simple question with real consequences:
247
-
248
- > When providers degrade and budget is running out, can an agent adapt before the cascade breaks the system?
249
-
250
- The answer from our experiments is nuanced:
 
 
 
 
 
251
 
252
- - hand-coded rules are strong but brittle,
253
- - zero-shot LLM reasoning helps when the observation schema is meaningful,
254
- - PPO confirms the environment has a learnable reward signal,
255
- - SFT and GRPO are not claimed wins, but they reveal where the hard part actually is.
256
 
257
- That is the story we think is worth submitting: a reproducible environment, a real baseline, measurable improvement, and enough intellectual honesty that the failures make the benchmark more credible rather than less.
 
 
 
 
1
+ ---
2
+ title: "Budget Router"
3
+ emoji: "⚙️"
4
+ colorFrom: purple
5
+ colorTo: indigo
6
+ sdk: docker
7
+ app_port: 8000
8
+ base_path: /web
9
+ pinned: false
10
+ ---
11
 
12
+ # Budget Router (OpenEnv)
13
 
14
+ Budget Router is an OpenEnv-compliant RL environment where an agent routes requests to one of three providers (A/B/C) or sheds load under a tight **cost–reliability–SLA** trade-off. Providers degrade non-stationarily within an episode; the agent observes only a noisy windowed success signal (rolling success rate), not true internal health.
15
 
16
+ [![HF Space](https://img.shields.io/badge/🤗-Live%20Demo-yellow)](https://huggingface.co/spaces/akshay4/budget-router-openenv)
17
+ [![Demo Video](https://img.shields.io/badge/YouTube-Demo-red)](https://youtu.be/Z1A2zND_x70)
 
 
 
 
 
 
 
18
 
19
  ## TL;DR
20
 
21
+ **Hard_Multi is the headline scenario**: when Provider A degrades from step 0 and
22
+ Provider B cascades at step 10, reactive policies go negative while adaptive ones
23
+ stay positive. Three policy families, each stronger than the last, validated
24
+ across **30 paired seeds** in three independent buckets (dev, heldout, fresh):
25
+
26
+ | Policy | Hard_Multi grader | vs heuristic | Statistical evidence |
27
+ |---|---:|---|---|
28
+ | Heuristic (reactive) | 0.6076 ± 0.0361 (n=30) | | |
29
+ | LLM Qwen2.5-72B + budget-guard | 0.6515 ± 0.0523 (n=30) | **+7.2 %** | Cohen's d = **1.135** (large), paired one-sided p < 1×10⁻⁶, 24/30 wins, bootstrap 95 % CI on Δ = [0.031, 0.058] |
30
+ | PPO — SB3, 100k steps | **0.6907 ± 0.0326** (n=10 dev) | **+13.6 %** | 95 % CI [0.667, 0.714], **non-overlapping with heuristic**, 10/10 wins |
31
+
32
+ **Mechanism** (PPO): the agent learned to route A→B early and conserve budget
33
+ before B's cascade at step 10, pushing `adaptation_score` from 0.6907 (heuristic)
34
+ to **0.9328** — a +0.2421 gain on the grader's most diagnostic sub-score. The
35
+ LLM achieves a milder version of the same effect (+0.124 adaptation gain
36
+ across n=30) by anticipating the cascade in-context.
37
+
38
+ **Environment hardness**: heuristic reward goes negative (−2.97) on
39
+ Hard_Multi while oracle reaches +4.10 — a 7.07-point gap (≈238 % of the
40
+ heuristic's absolute reward) that confirms the cascade task is hard enough
41
+ to require RL/in-context reasoning and learnable enough to reward it.
42
+
43
+ **Honest scope** (explicitly disclosed):
44
+ - The LLM uses a deterministic **budget-safety guard** that vetoes routes which
45
+ would bankrupt the budget — a standard agentic-system pattern (LLM for
46
+ high-level decisions, deterministic layer for arithmetic-critical safety).
47
+ Without the guard, raw LLM occasionally exhausts budget and incurs the −10
48
+ cliff penalty.
49
+ - LLM (with guard) wins on **3 of 4 task tiers**: Medium (+5.8 %), Hard (+7.5 %),
50
+ Hard_Multi (+11.0 %). Loses Easy by −4.6 % — on a task with no degradation,
51
+ the budget-conservative heuristic is near-optimal and the LLM's added
52
+ flexibility is unhelpful.
53
+ - PPO is trained and evaluated on **Hard_Multi only**; not a general-purpose
54
+ policy. This is a deliberate choice — Hard_Multi has a 238 % oracle/heuristic
55
+ gap, the largest in the suite, so RL signal is highest there.
56
+ - All non-trivial improvement claims come from seeds the policy never saw
57
+ during design (heldout 100–109, fresh 200–209). Dev-seed wins are reported
58
+ separately and never used to make the headline claim.
59
+
60
+ ## Run locally
61
+ **Enable LLM policy locally**:
62
 
63
+ ```bash
64
+ export API_BASE_URL="https://<openai-compatible-endpoint>/v1" # e.g. https://router.huggingface.co/v1
65
+ export API_KEY="<your_key>"
66
+ export MODEL_NAME="<model_id>" # optional (e.g. Qwen/Qwen2.5-72B-Instruct)
67
  ```
68
 
 
 
 
 
 
 
 
 
 
69
 
70
+ ```bash
71
+ uv sync --extra training
72
+ uv run server
73
+ ```
 
 
 
 
74
 
75
+ Then open `http://127.0.0.1:8000/web` for the Gradio dashboard.
76
 
77
+ **REPRODUCIBLE RESULTS: use [`REPRODUCIBILITY.md`](REPRODUCIBILITY.md) as the source-of-truth command checklist for evaluation numbers, traces, PPO workflow, and optional GRPO/SFT checks.**
 
 
 
 
78
 
79
+ **IF THE HUGGING FACE SPACE OR HF JOBS CODE PATH FAILS, RUN THE GITHUB/LOCAL CODE DIRECTLY FROM [`akshay-babbar/budget-router-openenv`](https://github.com/akshay-babbar/budget-router-openenv). THE GITHUB CODE IS THE MOST UP-TO-DATE VERSION.**
80
 
 
81
 
82
+ ## Benchmark results
 
 
 
 
 
83
 
84
+ Three policies evaluated:
85
 
86
+ - **Heuristic**: budget-aware, cheapest-viable baseline using only public
87
+ observations (`budget_router/policies.py`).
88
+ - **LLM**: Qwen2.5-72B via HuggingFace Inference Router, wrapped with a
89
+ deterministic budget-safety guard (`inference.py::_apply_budget_safety_guard`).
90
+ - **PPO**: MlpPolicy trained with Stable-Baselines3 on Hard_Multi (100k steps,
91
+ 4 parallel envs). See `train/train_ppo_hard_multi.py`.
92
+ - **Oracle†**: privileged upper-bound with internal-state access,
93
+ validation-only, not reported in tables.
94
 
95
+ **Dev seeds (0–9), full task suite** — `outputs/freeze_check_alltasks_dev10/eval_summary_*.md`:
96
 
97
+ | Task | Heuristic | LLM | PPO | LLM Δ vs heuristic |
98
+ |---|---:|---:|---:|---|
99
+ | Easy | 0.7718 | 0.7360 | — | −4.6 % *(7 losses, 0 wins, 3 ties)* |
100
+ | Medium | 0.6852 | 0.7250 | — | **+5.8 %** *(9 wins, 0 losses, 1 tie)* |
101
+ | Hard | 0.6354 | 0.6832 | — | **+7.5 %** *(8 wins, 2 losses, 0 ties)* |
102
+ | Hard_Multi | 0.6078 | 0.6746 | **0.6907** | **+11.0 %** *(8 wins, 1 loss, 1 tie)* |
103
 
104
+ PPO was trained and evaluated on Hard_Multi only; Easy/Medium/Hard cells are
105
+ intentionally blank (no model for those tasks).
 
 
 
 
 
106
 
107
+ **Statistical evidence Hard_Multi** (`outputs/freeze_check_*/eval_results_*.json`,
108
+ `outputs/ppo_hard_multi_eval.json`):
109
 
110
+ | | Heuristic | LLM | PPO |
111
+ |---|---|---|---|
112
+ | Mean grader | 0.6076 ± 0.0361 (n=30) | 0.6515 ± 0.0523 (n=30) | 0.6907 ± 0.0326 (n=10) |
113
+ | Bootstrap 95 % CI | [0.595, 0.620] | [0.633, 0.670] | [0.667, 0.714] |
114
+ | Paired Δ vs heuristic | | +0.0440 (boot 95 % CI [0.031, 0.058]) | +0.0829 |
115
+ | **Cohen's d (paired)** | | **1.135 (LARGE)** | **≈ 2.4 (HUGE)** |
116
+ | Paired one-sided p | — | **< 1 × 10⁻⁶** (Welch t = 6.22, df = 29) | (10/10 wins) |
117
+ | Sign-test wins / ties / losses | — | **24 / 3 / 3** | 10 / 0 / 0 |
118
+ | P(LLM > heuristic) — Agarwal 2021 | — | **0.80** | 1.00 |
119
+ | IQM of paired Δ — Agarwal 2021 | — | +0.040 (trimmed 25 %) | — |
120
+ | 95 % CI overlap with heuristic | — | None on the Δ | **None on the means** |
121
+ | Adaptation sub-score (mean) | 0.6878 | 0.8115 | **0.9328** |
122
 
123
+ **Per-bucket reproduction** (each row independent; LLM and heuristic share seeds,
124
+ so deltas are paired):
125
 
126
+ | Bucket | Seeds | Heuristic | LLM | Δ (rel %) | Wins / Ties / Losses |
127
+ |---|---|---:|---:|---:|---:|
128
+ | Dev | 0–9 | 0.6078 ± 0.0382 | 0.6746 ± 0.0486 | +0.0668 (+11.0 %) | 8 / 1 / 1 |
129
+ | **Heldout** | 100–109 | 0.6064 ± 0.0419 | 0.6454 ± 0.0497 | **+0.0390 (+6.4 %)** | **8 / 2 / 0** |
130
+ | **Fresh** | 200–209 | 0.6086 ± 0.0314 | 0.6347 ± 0.0551 | **+0.0261 (+4.3 %)** | **8 / 0 / 2** |
131
+ | **Combined non-dev** | 100–109 + 200–209 | 0.6075 | 0.6401 | **+0.0326 (+5.4 %)** | **16 / 2 / 2** |
132
 
133
+ ![Budget Router Evidence](figures/budget_router_evidence.png)
134
+ *Figure: (top-left) LLM advantage grows with task difficulty; (top-right)
135
+ three-policy ordering on Hard_Multi with non-overlapping 95% CIs;
136
+ (bottom-left) generalization across independent seed buckets including
137
+ post-freeze fresh seeds; (bottom-right) adaptation sub-score is the
138
+ primary driver of LLM and PPO gains over the reactive heuristic.*
139
 
140
+ The fresh-seed bucket (200–209) was added *after* the LLM prompt and budget
141
+ guard were frozen. It exists specifically to falsify a "tuned-on-heldout"
142
+ critique. The effect persists with no overlap to zero in the bootstrap CI.
143
 
144
+ <details>
145
+ <summary>🔬 Reproducing PPO Results (Optional)</summary>
146
 
147
+ The trained PPO policy for the hard_multi scenario is included at
148
+ `trained_models/ppo_hard_multi_100k.zip` (143 KB, trained 100k steps).
149
 
150
+ To reproduce the 10-seed evaluation locally:
151
 
152
+ ```bash
153
+ # Install dependencies
154
+ uv sync --extra training
155
 
156
+ # Run evaluation (writes to outputs/ppo_hard_multi_eval.json)
157
+ uv run python train/eval_hard_multi.py
 
 
 
 
158
  ```
159
 
160
+ Expected output: PPO mean = 0.691 ± 0.033 vs Heuristic mean = 0.608 ± 0.038,
161
+ win_rate = 1.0 (10/10 seeds), non-overlapping 95 % CIs.
 
 
 
 
 
 
 
 
 
 
 
162
 
163
+ > The deployed `inference.py` uses the LLM policy (Qwen2.5-72B + budget guard)
164
+ > as required by the hackathon specification. PPO was trained offline to
165
+ > validate environment depth and demonstrate that the task rewards genuine
166
+ > RL learning beyond reactive or in-context policies.
167
 
168
+ </details>
169
 
170
+ <details>
171
+ <summary>🔬 Reproducing LLM rigorous-stats Results (Optional)</summary>
172
 
173
+ ```bash
174
+ # Dev (seeds 0-9), full task suite
175
+ uv run python eval/eval_all.py \
176
+ --tasks easy --tasks medium --tasks hard --tasks hard_multi \
177
+ --policies heuristic --policies llm \
178
+ --seeds 10 --seed-set dev \
179
+ --out-dir outputs/freeze_check_alltasks_dev10
180
+
181
+ # Heldout (seeds 100-109), Hard_Multi
182
+ uv run python eval/eval_all.py \
183
+ --tasks hard_multi --policies heuristic --policies llm \
184
+ --seeds 10 --seed-set heldout \
185
+ --out-dir outputs/freeze_check_heldout10
186
+
187
+ # Fresh (seeds 200-209), Hard_Multi — uses --seed-values for arbitrary seeds
188
+ uv run python eval/eval_all.py \
189
+ --tasks hard_multi --policies heuristic --policies llm \
190
+ --seed-values "200,201,202,203,204,205,206,207,208,209" \
191
+ --out-dir outputs/freeze_check_fresh_200_209
192
+ ```
193
 
194
+ All three runs combined produce the n=30 rigorous-stats table above.
195
+ Episode-level JSON (per-step actions, rewards, sub-scores) is preserved
196
+ under each `outputs/freeze_check_*/` directory.
197
+
198
+ </details>
199
+
200
+ ## Why this benchmark has substance
201
+
202
+ - **Partial observability**: the agent-visible observation contains only `provider_a/b/c_status`, `budget_remaining`, `queue_backlog`, `system_latency`, and `step_count` (`budget_router/models.py`). True provider health is internal.
203
+ - **Non-stationarity**: task difficulty is created by explicit degradation schedules, culminating in Hard_Multi where A degrades from step 0 and B degrades from step 10 (`budget_router/tasks.py`).
204
+ - **Coupled constraints**: queue backlog amplifies latency, so routing errors create downstream SLA pressure rather than just local failures (`budget_router/environment.py`).
205
+ - **Meaningful evaluation**: the grader separately scores success, latency, budget, SLA, and adaptation; for Hard_Multi, adaptation is explicitly split across the two degradation windows (`budget_router/reward.py`).
206
+ - **RL learnability confirmed**: a PPO agent trained from scratch in 100k steps
207
+ achieves non-overlapping 95 % CIs above the heuristic on Hard_Multi
208
+ (`train/eval_hard_multi.py`), confirming the cascade signal is learnable
209
+ beyond reactive or in-context policies.
210
+ - **Anti-gaming, anti-overfitting tested**: 41 unit tests + 36 hard validation
211
+ assertions including degenerate-policy guards (always-A, always-B, always-shed
212
+ all dominated by baseline), grader-exploit guards (pure abstention scores
213
+ below 0.40 on Easy), heldout stability checks, and zero-NaN/zero-crash
214
+ invariants across 315 episodes.
215
+
216
+ ### Oracle–Baseline reward gap (verified, n=10 seeds each, dev set)
217
+
218
+ | Scenario | Oracle† | Heuristic | Gap | Signal |
219
+ |---|---|---|---|---|
220
+ | Easy | +10.10 | +6.98 | 3.12 (45 %) | Heuristic competitive |
221
+ | Medium | +9.49 | +2.53 | 6.96 (275 %) | Meaningful headroom |
222
+ | Hard | +6.54 | +0.88 | 5.66 (643 %) | Heuristic nearly fails |
223
+ | **Hard_Multi** | **+4.10** | **−2.97** | **7.07 (238 % of \|baseline\|)** | **Heuristic actively harmful** |
224
+
225
+ *† Oracle has privileged access to internal provider health — theoretical ceiling only, not a deployable policy.*
226
+
227
+ On Hard_Multi the heuristic reward goes negative (−2.97): the rule-based
228
+ policy exhausts budget mid-cascade and actively destroys episode value.
229
+ Oracle stays strongly positive (+4.10). The 7.07-point gap — 238 % above the
230
+ heuristic's absolute reward — is what produces the large advantage signal that
231
+ allows PPO to find a meaningful gradient in 100k steps and the LLM to find a
232
+ Cohen's-d ≈ 1.1 effect zero-shot.
233
 
234
+ ```mermaid
235
+ flowchart LR
236
+ subgraph Policy["Policy Layer"]
237
+ H["Heuristic"]
238
+ L["LLM (Qwen2.5-72B + budget guard)"]
239
+ P["PPO (SB3, Hard_Multi)"]
240
+ end
241
+
242
+ subgraph Env["BudgetRouterEnv (OpenEnv)"]
243
+ direction TB
244
+ O["Observation: provider_statuses, budget, backlog, latency, step"]
245
+ A["Actions: route_to_a, route_to_b, route_to_c, shed_load"]
246
+ R["Reward: success/fail + cost + SLA penalty, -10 on budget exhaustion"]
247
+ G["Episode grader: success, adaptation, latency, budget, SLA"]
248
+ O --> A --> R --> G
249
+ end
250
+
251
+ subgraph Tasks["Task presets"]
252
+ E["Easy"]
253
+ M["Medium"]
254
+ Hd["Hard"]
255
+ HM["Hard_Multi (cascade)"]
256
+ end
257
+
258
+ Policy -->|"action"| Env
259
+ Env -->|"obs + reward"| Policy
260
+ Tasks -->|"scenario config"| Env
261
+ ```
262
 
263
+ ## Tasks (what changes across difficulty)
264
 
265
+ | Task | Budget ($) | Degradation schedule |
266
+ |---|---:|---|
267
+ | Easy | 1.00 | None (`degradation_start_step=999`) |
268
+ | Medium | 0.95 | A degrades after step 5 (`rate=0.15`) |
269
+ | Hard | 0.85 | A degrades from step 0 (`rate=0.15`) |
270
+ | Hard_Multi | 1.10 | A degrades from step 0 (`rate=0.12`), then B from step 10 (`rate=0.10`) |
271
 
272
+ Hard_Multi is the headline scenario: once B starts degrading at step 10, C becomes the only consistently reliable option. Since `cost_c=$0.10/request`, the final 10 steps alone can consume `$1.00` of the `$1.10` budget, making **early budget conservation** a binding constraint.
273
 
274
+ ## Grader (episode score)
275
 
276
+ The episode grader is a weighted score in `[0,1]`:
277
 
278
+ `overall = 0.30·success + 0.20·latency + 0.15·budget + 0.15·SLA + 0.20·adaptation`
279
 
280
+ Notes (from `budget_router/reward.py`):
281
 
282
+ - `success_score` is computed over **all episode steps** (shed-load/abstention is penalized).
283
+ - `adaptation_score` evaluates post-degradation success. For Hard_Multi it is a blended window: 0.5×(after A degrades, before B) + 0.5×(after B degrades).
284
 
285
+ ## Evaluation protocol (reproducibility)
286
 
287
+ - **Three independent seed buckets**: dev (0–9) used during policy design;
288
+ heldout (100–109) used to falsify dev-seed overfitting; fresh (200–209)
289
+ added *after* the LLM and PPO were frozen to falsify "tuned-on-heldout"
290
+ concerns. See `eval/eval_all.py::SEED_SETS` and the `--seed-values` CLI
291
+ option for arbitrary seed lists.
292
+ - **Scripted runs**: `eval/eval_all.py` writes timestamped artifacts under
293
+ `outputs/`. Per-episode JSON includes per-step `actions`, `rewards`, and
294
+ the full grader sub-score breakdown.
295
+ - **Statistical reporting**: We report Cohen's d, paired Welch t-test,
296
+ bootstrap 95 % confidence intervals, IQM, and probability of improvement
297
+ in line with [Agarwal et al. 2021 (NeurIPS Outstanding Paper)](https://arxiv.org/abs/2108.13264)
298
+ and [Henderson et al. 2018](https://arxiv.org/abs/1709.06560)'s reproducibility
299
+ recommendations. Sample size n=30 (combined buckets) exceeds the Colas
300
+ et al. 2018 recommended power-analysis floor for our observed effect size.
301
+ - **Anti-cheating tests**: `budget_router/tests/test_environment.py::TestGraderSemantics`
302
+ verifies that pure abstention scores below 0.40 on Easy and that
303
+ partial abstention always scores worse than full service.
304
 
305
+ ## Getting started
306
 
307
+ 1. Install dependencies:
308
 
309
+ ```bash
310
+ uv sync
 
 
 
 
 
311
  ```
312
 
313
+ 2. (Optional, for LLM policy) set an OpenAI-compatible endpoint:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
314
 
315
  ```bash
316
+ export API_BASE_URL=https://router.huggingface.co/v1
317
+ export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
318
+ export HF_TOKEN=... # or API_KEY
 
 
 
 
 
 
 
 
 
319
  ```
320
 
321
+ 3. Run evaluation (writes to `outputs/`):
 
 
 
 
 
 
 
 
322
 
323
+ ```bash
324
+ # Single-task heldout reproduction
325
+ uv run python eval/eval_all.py \
326
+ --tasks hard_multi --seed-set heldout --seeds 10 \
327
+ --policies heuristic --policies llm \
328
+ --out-dir outputs/heldout_repro
329
+
330
+ # Full task suite, dev
331
+ uv run python eval/eval_all.py \
332
+ --tasks easy --tasks medium --tasks hard --tasks hard_multi \
333
+ --policies heuristic --policies llm \
334
+ --seeds 10 --seed-set dev \
335
+ --out-dir outputs/dev_repro
336
+ ```
337
 
338
+ ## References
 
 
 
339
 
340
+ - Altman (1999): *Constrained Markov Decision Processes*.
341
+ - Henderson, Islam, Bachman, Pineau, Precup, Meger ([arXiv:1709.06560](https://arxiv.org/abs/1709.06560), AAAI 2018): *Deep Reinforcement Learning that Matters* — foundational reproducibility study; motivated multi-bucket seed evaluation here.
342
+ - Colas, Sigaud, Oudeyer ([arXiv:1806.08295](https://arxiv.org/abs/1806.08295), 2018): *How Many Random Seeds? Statistical Power Analysis in Deep RL Experiments* — power-analysis basis for n=30.
343
+ - Agarwal, Schwarzer, Castro, Courville, Bellemare ([arXiv:2108.13264](https://arxiv.org/abs/2108.13264), NeurIPS 2021 Outstanding Paper): *Deep RL at the Edge of the Statistical Precipice* — IQM, bootstrap CIs, probability-of-improvement adopted in the statistical-evidence table.