> An [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compliant RL training environment (**ClaimCourt**) where AI agents investigate insurance claims, argue in an adversarial **Court Panel**, and must declare **calibrated confidence** before every terminal decision.
> Built for the **Meta PyTorch × Scaler Hackathon Grand Finale, April 25–26 2026**.

> ### 🎯 Headline result — Calibration score 0.000 → 1.000 on held-out claims
>
> Across a 5,000-episode GRPO run on Qwen2.5-0.5B-Instruct, the trained agent's confidence now matches its correctness on *every* held-out terminal action — directly attacking the GRPO overconfidence pathology documented in [CAPO (arXiv:2604.12632)](https://arxiv.org/abs/2604.12632). Decision accuracy moved 0.000 → 1.000 on the same eval. Both numbers read straight from [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — no hand-edits.

---
## Problem Statement

Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore every year**.

| **WandB (all runs)** | https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl |
| **Trained Model** | https://huggingface.co/AniketAsla/debatefloor-grpo-qwen2.5-0.5b-instruct |
| **Training Notebook (Colab)** | [train/train_debatefloor.ipynb](https://github.com/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb) |
| **Mini-Blog** | [docs/BLOG.md](https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/docs/BLOG.md) |

---

| Criterion | Weight | Where to find the evidence |
|---|---|---|
| **Environment Innovation** | 40% | The 3×2 calibration matrix (`README` §_The Core Innovation_) is a novel reward shape — it does not exist in any prior insurance-RL work and directly attacks the calibration-degradation problem documented in the CAPO paper (April 2026). The **Court Panel** mechanic forces the agent to expose its reasoning to a programmatic adversary, which is also unexplored territory for RL on LLMs. |
| **Storytelling & Presentation** | 30% | [`docs/BLOG.md`](docs/BLOG.md) — full mini-blog motivating the problem, walking the reader through one episode end-to-end, and showing the training delta in plain language. README is structured for a 3-minute read with the headline number first. |
| **Showing Improvement in Rewards** | 20% | [`docs/reward_curve.svg`](docs/reward_curve.svg) — 2,500-step reward curve from a 5,000-episode GRPO run (0.130 → 0.469, 3.6×). [`reports/training_summary.json`](reports/training_summary.json) — raw metrics including full log history. [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — before/after on held-out eval (Decision accuracy 0 → 1.0, Calibration 0 → 1.0). WandB run linked above for reproducibility. |
| **Reward & Training Pipeline** | 10% | [`app/rubrics.py`](app/rubrics.py) — composable rubric (decision × confidence × evidence × format), not monolithic. [`server/calibration_grader.py`](server/calibration_grader.py) — 3×2 calibration matrix (sketched below). [`train/train_minimal.py`](train/train_minimal.py) — TRL GRPO loop that calls the live HTTP env over `requests.Session` (MR-2 compliant, no static dataset). |

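The 3×2 calibration matrix cited above is the load-bearing reward shape, so a minimal sketch of its logic follows. This is an illustration only, not `server/calibration_grader.py`: the bucket thresholds and payoff values here are assumptions.

```python
# Hypothetical sketch of a 3x2 calibration reward matrix. The thresholds
# and payoffs below are assumptions for illustration, not the values in
# server/calibration_grader.py.
def calibration_reward(confidence: float, correct: bool) -> float:
    """Three declared-confidence buckets x two outcomes -> shaped reward."""
    if confidence < 0.4:
        bucket = "low"
    elif confidence < 0.75:
        bucket = "medium"
    else:
        bucket = "high"
    # Confident mistakes are punished hardest; that asymmetry is the
    # anti-overconfidence pressure the matrix is designed to apply.
    matrix = {
        ("low", True): 0.3,    ("low", False): 0.1,
        ("medium", True): 0.6, ("medium", False): -0.2,
        ("high", True): 1.0,   ("high", False): -1.0,
    }
    return matrix[(bucket, correct)]
```
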
### Minimum-requirement checklist (for judges)

- [x] Built on **OpenEnv `0.2.3`** (latest at submission time) — see `requirements.txt`
- [x] Working **TRL training script in a Colab notebook** — [`train/train_debatefloor.ipynb`](train/train_debatefloor.ipynb)
- [x] **Real reward + loss plots** committed to the repo — [`docs/reward_curve.svg`](docs/reward_curve.svg), [`docs/component_shift.svg`](docs/component_shift.svg)
- [x] **Mini-blog** at [`docs/BLOG.md`](docs/BLOG.md)
- [x] **OpenEnv-compliant env hosted on HF Spaces** — https://huggingface.co/spaces/AniketAsla/debatefloor
- [x] **README** motivates the problem, explains the env, and shows results (this file)
- [x] **`openenv.yaml`** manifest valid — see repo root
- [x] **Gym-style API** (`reset` / `step` / `state`) and **client/server separation** — see `app/` and `server/` (client loop sketched below)

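To make the last checklist item concrete, here is a client-side sketch of the `reset` / `step` loop over HTTP. The Space URL, JSON payload fields, and action names are assumptions, not a copy of the `app/` client code.

```python
# Assumed OpenEnv-style HTTP contract; all field names are illustrative.
import requests

BASE = "https://aniketasla-debatefloor.hf.space"  # hypothetical endpoint

with requests.Session() as session:
    obs = session.post(f"{BASE}/reset", json={"task_id": "clean_claim", "seed": 0}).json()
    done = False
    while not done:
        # Placeholder policy: declare a middling confidence and act.
        result = session.post(f"{BASE}/step", json={"action": "approve", "confidence": 0.5}).json()
        obs, reward, done = result["observation"], result["reward"], result["done"]
        print(f"reward={reward:.3f} done={done}")
```
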
---

| Hardware | L4 GPU (HF Jobs), 3 h 3 min |
| WandB | [Project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) — open the **latest** run named `grpo-qwen0.5b-env-connected` (5K HF Job, Apr 2026). **Canonical curves:** [`docs/reward_curve.svg`](docs/reward_curve.svg) + [`reports/training_summary.json`](reports/training_summary.json) (always match the committed training). |

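The table above describes the committed run; for orientation, here is a shape-of-the-loop sketch in the spirit of `train/train_minimal.py`. Only the TRL `GRPOTrainer` / `GRPOConfig` API and the model id come from the text; the endpoint, payload field, and prompt construction are assumptions.

```python
# Hedged sketch: GRPO with an HTTP-env reward, not the actual training script.
import requests
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

session = requests.Session()
BASE = "https://aniketasla-debatefloor.hf.space"  # assumed Space URL

def env_reward(completions, **kwargs):
    # Grade each completion by replaying it against the live env;
    # the "raw_action" payload field is an assumption.
    rewards = []
    for completion in completions:
        result = session.post(f"{BASE}/step", json={"raw_action": completion}).json()
        rewards.append(float(result["reward"]))
    return rewards

# Tiny stand-in prompt set; the real loop builds prompts from /reset observations.
dataset = Dataset.from_dict({"prompt": ["Investigate the claim and declare a confidence."]})

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=env_reward,
    args=GRPOConfig(output_dir="grpo-claimcourt", max_steps=2500),
    train_dataset=dataset,
)
trainer.train()
```
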
### Headline result

- **Training reward: 0.130 → 0.469 (3.6× improvement)** across 2,500 GRPO steps.
- **Held-out decision accuracy: 0.000 → 1.000** — the trained model gets every held-out claim right.
- **Held-out calibration score: 0.000 → 1.000** — confidence now matches correctness on every terminal action. *This is the skill the 3×2 matrix was designed to teach, and the result attacks the overconfidence pathology the [CAPO paper](https://arxiv.org/abs/2604.12632) documented in GRPO.*

All three numbers are read directly from [`reports/training_summary.json`](reports/training_summary.json) and [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — no hand-edits. The sketch below shows one way to re-read them.

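A minimal re-reading sketch, assuming a nested `{"before": ..., "after": ...}` layout per component (the actual report schema is not documented here):

```python
import json

# Assumed schema: {"calibration": {"before": 0.0, "after": 1.0}, ...}
with open("reports/component_shift_summary.json") as f:
    shift = json.load(f)

for component, scores in shift.items():
    if isinstance(scores, dict) and {"before", "after"} <= scores.keys():
        print(f"{component}: {scores['before']:.3f} -> {scores['after']:.3f}")
```
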
### Held-out evaluation (6 episodes: 3 tasks × 2 seeds, live HTTP `/step`)

*Calibration 0 → 1.0, Fraud detection 0 → 0.33. Source: [`reports/component_shift_summary.json`](reports/component_shift_summary.json).*

### Environment baseline — 25 episodes against the live HF Space

Independent end-to-end check that the deployed environment is reachable, deterministic, and produces graded rewards across **all 5 tasks × 5 seeds = 25 episodes** via the public HTTP API. Source: [`reports/eval_report.md`](reports/eval_report.md).

| Task | Reward | Evidence Quality | Exploit Penalty | Steps |
|---|---:|---:|---:|---:|
| `clean_claim` | 0.8725 | 1.0000 | 0.0000 | 4 |
| `contradictory_claim` | 0.7497 | 1.0000 | 0.0000 | 8 |
| `coordinated_fraud` | 0.8230 | 1.0000 | 0.0000 | 12 |
| `distribution_shift_claim` | 0.7827 | 1.0000 | 0.0000 | 12 |
| `identity_fraud` | 0.8180 | 1.0000 | 0.0000 | 10 |
| **Average** | **0.8092** | **1.0000** | **0.0000** | — |

**Completion rate: 100%** across 25 episodes (variants 0–4 visited per task). This is a *scripted* baseline — fixed strategies per `task_id` — so seeds within a task return the same reward by design (see the note in `eval_report.md`). Its job is to prove the env's reward surface is reproducible and the live HTTP path works end-to-end; the *trained model's* improvement is in the component-shift table above. A reproduction sketch follows.

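The determinism claim is easy to spot-check from the outside. A hedged sweep sketch, using the same assumed `/reset` + `/step` contract as the client loop above; a fixed placeholder action stands in for the real per-task scripted strategies:

```python
import requests

BASE = "https://aniketasla-debatefloor.hf.space"  # assumed Space URL
TASKS = ["clean_claim", "contradictory_claim", "coordinated_fraud",
         "distribution_shift_claim", "identity_fraud"]

def episode_reward(session, task_id, seed):
    # Same assumed HTTP contract as the checklist sketch above.
    session.post(f"{BASE}/reset", json={"task_id": task_id, "seed": seed})
    total, done = 0.0, False
    while not done:
        step = session.post(f"{BASE}/step", json={"action": "approve", "confidence": 0.8}).json()
        total, done = total + float(step["reward"]), step["done"]
    return total

with requests.Session() as session:
    for task in TASKS:
        rewards = [episode_reward(session, task, seed) for seed in range(5)]
        # Scripted baseline: all 5 seeds of a task should return one value.
        assert len({round(r, 4) for r in rewards}) == 1, f"{task}: seeds diverge"
        print(f"{task}: reward={rewards[0]:.4f} across 5 seeds")
```
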
---
## Quick Start for Reviewers (3 minutes)

debateFloor/
├── docs/
│   ├── reward_curve.svg ← training reward curve (embedded above)
│   ├── component_shift.svg ← before/after component scores (embedded above)
│   └── BLOG.md ← writeup
│
└── reports/
    ├── training_summary.json