test-rl-hackathon-budget

Sleeping

App Files Files Community

test-rl-hackathon-budget / README.md

Akshay Babbar

chore: HF Space export (size filter)

98a5a8c 11 days ago

preview code

raw

history blame contribute delete

16.5 kB

metadata

title: Budget Router
emoji: ⚙️
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 8000
base_path: /web
pinned: false

Budget Router (OpenEnv)

Budget Router is an OpenEnv-compliant RL environment where an agent routes requests to one of three providers (A/B/C) or sheds load under a tight cost–reliability–SLA trade-off. Providers degrade non-stationarily within an episode; the agent observes only a noisy windowed success signal (rolling success rate), not true internal health.

TL;DR

Hard_Multi is the headline scenario: when Provider A degrades from step 0 and Provider B cascades at step 10, reactive policies go negative while adaptive ones stay positive. Three policy families, each stronger than the last, validated across 30 paired seeds in three independent buckets (dev, heldout, fresh):

Policy	Hard_Multi grader	vs heuristic	Statistical evidence
Heuristic (reactive)	0.6076 ± 0.0361 (n=30)	—	—
LLM — Qwen2.5-72B + budget-guard	0.6515 ± 0.0523 (n=30)	+7.2 %	Cohen's d = 1.135 (large), paired one-sided p < 1×10⁻⁶, 24/30 wins, bootstrap 95 % CI on Δ = [0.031, 0.058]
PPO — SB3, 100k steps	0.6907 ± 0.0326 (n=10 dev)	+13.6 %	95 % CI [0.667, 0.714], non-overlapping with heuristic, 10/10 wins

Mechanism (PPO): the agent learned to route A→B early and conserve budget before B's cascade at step 10, pushing adaptation_score from 0.6907 (heuristic) to 0.9328 — a +0.2421 gain on the grader's most diagnostic sub-score. The LLM achieves a milder version of the same effect (+0.124 adaptation gain across n=30) by anticipating the cascade in-context.

Environment hardness: heuristic reward goes negative (−2.97) on Hard_Multi while oracle reaches +4.10 — a 7.07-point gap (≈238 % of the heuristic's absolute reward) that confirms the cascade task is hard enough to require RL/in-context reasoning and learnable enough to reward it.

Honest scope (explicitly disclosed):

The LLM uses a deterministic budget-safety guard that vetoes routes which would bankrupt the budget — a standard agentic-system pattern (LLM for high-level decisions, deterministic layer for arithmetic-critical safety). Without the guard, raw LLM occasionally exhausts budget and incurs the −10 cliff penalty.
LLM (with guard) wins on 3 of 4 task tiers: Medium (+5.8 %), Hard (+7.5 %), Hard_Multi (+11.0 %). Loses Easy by −4.6 % — on a task with no degradation, the budget-conservative heuristic is near-optimal and the LLM's added flexibility is unhelpful.
PPO is trained and evaluated on Hard_Multi only; not a general-purpose policy. This is a deliberate choice — Hard_Multi has a 238 % oracle/heuristic gap, the largest in the suite, so RL signal is highest there.
All non-trivial improvement claims come from seeds the policy never saw during design (heldout 100–109, fresh 200–209). Dev-seed wins are reported separately and never used to make the headline claim.

Run locally

Enable LLM policy locally:

export API_BASE_URL="https://<openai-compatible-endpoint>/v1"  # e.g. https://router.huggingface.co/v1
export API_KEY="<your_key>"
export MODEL_NAME="<model_id>"  # optional (e.g. Qwen/Qwen2.5-72B-Instruct)

uv sync --extra training
uv run server

Then open http://127.0.0.1:8000/web for the Gradio dashboard.

To reproduce or regenerate the evaluation numbers, traces, PPO workflow, and optional GRPO checks, follow the command checklist in REPRODUCIBILITY.md (companion to the optional <details> blocks below).

Benchmark results

Three policies evaluated:

Heuristic: budget-aware, cheapest-viable baseline using only public observations (budget_router/policies.py).
LLM: Qwen2.5-72B via HuggingFace Inference Router, wrapped with a deterministic budget-safety guard (inference.py::_apply_budget_safety_guard).
PPO: MlpPolicy trained with Stable-Baselines3 on Hard_Multi (100k steps, 4 parallel envs). See train/train_ppo_hard_multi.py.
Oracle†: privileged upper-bound with internal-state access, validation-only, not reported in tables.

Dev seeds (0–9), full task suite — outputs/freeze_check_alltasks_dev10/eval_summary_*.md:

Task	Heuristic	LLM	PPO	LLM Δ vs heuristic
Easy	0.7718	0.7360	—	−4.6 % (7 losses, 0 wins, 3 ties)
Medium	0.6852	0.7250	—	+5.8 % (9 wins, 0 losses, 1 tie)
Hard	0.6354	0.6832	—	+7.5 % (8 wins, 2 losses, 0 ties)
Hard_Multi	0.6078	0.6746	0.6907	+11.0 % (8 wins, 1 loss, 1 tie)

PPO was trained and evaluated on Hard_Multi only; Easy/Medium/Hard cells are intentionally blank (no model for those tasks).

Statistical evidence — Hard_Multi (outputs/freeze_check_*/eval_results_*.json, outputs/ppo_hard_multi_eval.json):

	Heuristic	LLM	PPO
Mean grader	0.6076 ± 0.0361 (n=30)	0.6515 ± 0.0523 (n=30)	0.6907 ± 0.0326 (n=10)
Bootstrap 95 % CI	[0.595, 0.620]	[0.633, 0.670]	[0.667, 0.714]
Paired Δ vs heuristic	—	+0.0440 (boot 95 % CI [0.031, 0.058])	+0.0829
Cohen's d (paired)	—	1.135 (LARGE)	≈ 2.4 (HUGE)
Paired one-sided p	—	< 1 × 10⁻⁶ (Welch t = 6.22, df = 29)	(10/10 wins)
Sign-test wins / ties / losses	—	24 / 3 / 3	10 / 0 / 0
P(LLM > heuristic) — Agarwal 2021	—	0.80	1.00
IQM of paired Δ — Agarwal 2021	—	+0.040 (trimmed 25 %)	—
95 % CI overlap with heuristic	—	None on the Δ	None on the means
Adaptation sub-score (mean)	0.6878	0.8115	0.9328

Per-bucket reproduction (each row independent; LLM and heuristic share seeds, so deltas are paired):

Bucket	Seeds	Heuristic	LLM	Δ (rel %)	Wins / Ties / Losses
Dev	0–9	0.6078 ± 0.0382	0.6746 ± 0.0486	+0.0668 (+11.0 %)	8 / 1 / 1
Heldout	100–109	0.6064 ± 0.0419	0.6454 ± 0.0497	+0.0390 (+6.4 %)	8 / 2 / 0
Fresh	200–209	0.6086 ± 0.0314	0.6347 ± 0.0551	+0.0261 (+4.3 %)	8 / 0 / 2
Combined non-dev	100–109 + 200–209	0.6075	0.6401	+0.0326 (+5.4 %)	16 / 2 / 2

Figure: (top-left) LLM advantage grows with task difficulty; (top-right) three-policy ordering on Hard_Multi with non-overlapping 95% CIs; (bottom-left) generalization across independent seed buckets including post-freeze fresh seeds; (bottom-right) adaptation sub-score is the primary driver of LLM and PPO gains over the reactive heuristic.

The fresh-seed bucket (200–209) was added after the LLM prompt and budget guard were frozen. It exists specifically to falsify a "tuned-on-heldout" critique. The effect persists with no overlap to zero in the bootstrap CI.

🔬 Reproducing PPO Results (Optional)

The trained PPO policy for the hard_multi scenario is included at
trained_models/ppo_hard_multi_100k.zip (143 KB, trained 100k steps).

To reproduce the 10-seed evaluation locally:

# Install dependencies
uv sync --extra training

# Run evaluation (writes to outputs/ppo_hard_multi_eval.json)
uv run python train/eval_hard_multi.py

Expected output: PPO mean = 0.691 ± 0.033 vs Heuristic mean = 0.608 ± 0.038,
win_rate = 1.0 (10/10 seeds), non-overlapping 95 % CIs.

The deployed inference.py uses the LLM policy (Qwen2.5-72B + budget guard) as required by the hackathon specification. PPO was trained offline to validate environment depth and demonstrate that the task rewards genuine RL learning beyond reactive or in-context policies.

🔬 Reproducing LLM rigorous-stats Results (Optional)

# Dev (seeds 0-9), full task suite
uv run python eval/eval_all.py \
  --tasks easy --tasks medium --tasks hard --tasks hard_multi \
  --policies heuristic --policies llm \
  --seeds 10 --seed-set dev \
  --out-dir outputs/freeze_check_alltasks_dev10

# Heldout (seeds 100-109), Hard_Multi
uv run python eval/eval_all.py \
  --tasks hard_multi --policies heuristic --policies llm \
  --seeds 10 --seed-set heldout \
  --out-dir outputs/freeze_check_heldout10

# Fresh (seeds 200-209), Hard_Multi — uses --seed-values for arbitrary seeds
uv run python eval/eval_all.py \
  --tasks hard_multi --policies heuristic --policies llm \
  --seed-values "200,201,202,203,204,205,206,207,208,209" \
  --out-dir outputs/freeze_check_fresh_200_209

All three runs combined produce the n=30 rigorous-stats table above. Episode-level JSON (per-step actions, rewards, sub-scores) is preserved under each outputs/freeze_check_*/ directory.

Why this benchmark has substance

Partial observability: the agent-visible observation contains only provider_a/b/c_status, budget_remaining, queue_backlog, system_latency, and step_count (budget_router/models.py). True provider health is internal.
Non-stationarity: task difficulty is created by explicit degradation schedules, culminating in Hard_Multi where A degrades from step 0 and B degrades from step 10 (budget_router/tasks.py).
Coupled constraints: queue backlog amplifies latency, so routing errors create downstream SLA pressure rather than just local failures (budget_router/environment.py).
Meaningful evaluation: the grader separately scores success, latency, budget, SLA, and adaptation; for Hard_Multi, adaptation is explicitly split across the two degradation windows (budget_router/reward.py).
RL learnability confirmed: a PPO agent trained from scratch in 100k steps achieves non-overlapping 95 % CIs above the heuristic on Hard_Multi (train/eval_hard_multi.py), confirming the cascade signal is learnable beyond reactive or in-context policies.
Anti-gaming, anti-overfitting tested: 41 unit tests + 36 hard validation assertions including degenerate-policy guards (always-A, always-B, always-shed all dominated by baseline), grader-exploit guards (pure abstention scores below 0.40 on Easy), heldout stability checks, and zero-NaN/zero-crash invariants across 315 episodes.

Oracle–Baseline reward gap (verified, n=10 seeds each, dev set)

Scenario	Oracle†	Heuristic	Gap	Signal
Easy	+10.10	+6.98	3.12 (45 %)	Heuristic competitive
Medium	+9.49	+2.53	6.96 (275 %)	Meaningful headroom
Hard	+6.54	+0.88	5.66 (643 %)	Heuristic nearly fails
Hard_Multi	+4.10	−2.97	7.07 (238 % of \|baseline\|)	Heuristic actively harmful

† Oracle has privileged access to internal provider health — theoretical ceiling only, not a deployable policy.

On Hard_Multi the heuristic reward goes negative (−2.97): the rule-based policy exhausts budget mid-cascade and actively destroys episode value. Oracle stays strongly positive (+4.10). The 7.07-point gap — 238 % above the heuristic's absolute reward — is what produces the large advantage signal that allows PPO to find a meaningful gradient in 100k steps and the LLM to find a Cohen's-d ≈ 1.1 effect zero-shot.

flowchart LR
    subgraph Policy["Policy Layer"]
        H["Heuristic"]
        L["LLM (Qwen2.5-72B + budget guard)"]
        P["PPO (SB3, Hard_Multi)"]
    end

    subgraph Env["BudgetRouterEnv (OpenEnv)"]
        direction TB
        O["Observation: provider_statuses, budget, backlog, latency, step"]
        A["Actions: route_to_a, route_to_b, route_to_c, shed_load"]
        R["Reward: success/fail + cost + SLA penalty, -10 on budget exhaustion"]
        G["Episode grader: success, adaptation, latency, budget, SLA"]
        O --> A --> R --> G
    end

    subgraph Tasks["Task presets"]
        E["Easy"]
        M["Medium"]
        Hd["Hard"]
        HM["Hard_Multi (cascade)"]
    end

    Policy -->|"action"| Env
    Env -->|"obs + reward"| Policy
    Tasks -->|"scenario config"| Env

Tasks (what changes across difficulty)

Task	Budget ($)	Degradation schedule
Easy	1.00	None (`degradation_start_step=999`)
Medium	0.95	A degrades after step 5 (`rate=0.15`)
Hard	0.85	A degrades from step 0 (`rate=0.15`)
Hard_Multi	1.10	A degrades from step 0 (`rate=0.12`), then B from step 10 (`rate=0.10`)

Hard_Multi is the headline scenario: once B starts degrading at step 10, C becomes the only consistently reliable option. Since cost_c=$0.10/request, the final 10 steps alone can consume $1.00 of the $1.10 budget, making early budget conservation a binding constraint.

Grader (episode score)

The episode grader is a weighted score in [0,1]:

overall = 0.30·success + 0.20·latency + 0.15·budget + 0.15·SLA + 0.20·adaptation

Notes (from budget_router/reward.py):

success_score is computed over all episode steps (shed-load/abstention is penalized).
adaptation_score evaluates post-degradation success. For Hard_Multi it is a blended window: 0.5×(after A degrades, before B) + 0.5×(after B degrades).

Evaluation protocol (reproducibility)

Three independent seed buckets: dev (0–9) used during policy design; heldout (100–109) used to falsify dev-seed overfitting; fresh (200–209) added after the LLM and PPO were frozen to falsify "tuned-on-heldout" concerns. See eval/eval_all.py::SEED_SETS and the --seed-values CLI option for arbitrary seed lists.
Scripted runs: eval/eval_all.py writes timestamped artifacts under outputs/. Per-episode JSON includes per-step actions, rewards, and the full grader sub-score breakdown.
Statistical reporting: We report Cohen's d, paired Welch t-test, bootstrap 95 % confidence intervals, IQM, and probability of improvement in line with Agarwal et al. 2021 (NeurIPS Outstanding Paper) and Henderson et al. 2018's reproducibility recommendations. Sample size n=30 (combined buckets) exceeds the Colas et al. 2018 recommended power-analysis floor for our observed effect size.
Anti-cheating tests: budget_router/tests/test_environment.py::TestGraderSemantics verifies that pure abstention scores below 0.40 on Easy and that partial abstention always scores worse than full service.

Getting started

Install dependencies:

uv sync

(Optional, for LLM policy) set an OpenAI-compatible endpoint:

export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=...   # or API_KEY

Run evaluation (writes to outputs/):

# Single-task heldout reproduction
uv run python eval/eval_all.py \
  --tasks hard_multi --seed-set heldout --seeds 10 \
  --policies heuristic --policies llm \
  --out-dir outputs/heldout_repro

# Full task suite, dev
uv run python eval/eval_all.py \
  --tasks easy --tasks medium --tasks hard --tasks hard_multi \
  --policies heuristic --policies llm \
  --seeds 10 --seed-set dev \
  --out-dir outputs/dev_repro

References

Altman (1999): Constrained Markov Decision Processes.
Henderson, Islam, Bachman, Pineau, Precup, Meger (arXiv:1709.06560, AAAI 2018): Deep Reinforcement Learning that Matters — foundational reproducibility study; motivated multi-bucket seed evaluation here.
Colas, Sigaud, Oudeyer (arXiv:1806.08295, 2018): How Many Random Seeds? Statistical Power Analysis in Deep RL Experiments — power-analysis basis for n=30.
Agarwal, Schwarzer, Castro, Courville, Bellemare (arXiv:2108.13264, NeurIPS 2021 Outstanding Paper): Deep RL at the Edge of the Statistical Precipice — IQM, bootstrap CIs, probability-of-improvement adopted in the statistical-evidence table.