AniketAsla committed
Commit eac477f · verified · 1 Parent(s): bd77971

deploy: update README.md

Files changed (1): README.md +32 -7

README.md CHANGED
@@ -21,6 +21,10 @@ pinned: true
> An [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compliant RL training environment (**ClaimCourt**) where AI agents investigate insurance claims, argue in an adversarial **Court Panel**, and must declare **calibrated confidence** before every terminal decision.
> Built for the **Meta PyTorch × Scaler Hackathon Grand Finale, April 25–26 2026**.

+ > ### 🎯 Headline result — Calibration score 0.000 → 1.000 on held-out claims
+ >
+ > Across a 5,000-episode GRPO run on Qwen2.5-0.5B-Instruct, the trained agent's confidence now matches its correctness on *every* held-out terminal action — directly attacking the GRPO overconfidence pathology documented in [CAPO (arXiv:2604.12632)](https://arxiv.org/abs/2604.12632). Decision accuracy moved 0.000 → 1.000 on the same eval. Both numbers read straight from [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — no hand-edits.
+

---

## Problem Statement
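Context for the headline above: the 3×2 calibration matrix itself is defined in the README's §_The Core Innovation_, which this diff does not touch. Below is a minimal sketch of the shape such a matrix implies, assuming three declared-confidence bands crossed with correct/incorrect terminal decisions; the band edges and reward values here are illustrative assumptions, not the repo's actual grader.

```python
# Hypothetical sketch of a 3x2 calibration reward matrix.
# Rows: declared-confidence band (low / medium / high).
# Columns: whether the terminal decision was correct.
# Band edges and reward values are illustrative assumptions,
# NOT read from the repo's grader.

CALIBRATION_MATRIX = {
    # band:      (reward_if_correct, reward_if_wrong)
    "low":    (0.3, 0.1),   # honest uncertainty is never punished hard
    "medium": (0.6, 0.0),
    "high":   (1.0, -0.5),  # confident-and-wrong is the one cell that stings
}

def confidence_band(declared: float) -> str:
    """Map a declared confidence in [0, 1] to one of three bands (assumed edges)."""
    if declared < 0.4:
        return "low"
    if declared < 0.75:
        return "medium"
    return "high"

def calibration_reward(declared: float, decision_correct: bool) -> float:
    """Score one terminal action via the 3x2 lookup."""
    correct_r, wrong_r = CALIBRATION_MATRIX[confidence_band(declared)]
    return correct_r if decision_correct else wrong_r
```

The point of any such shape is that the only heavily punished cell is confident-and-wrong, so declaring honest low confidence stays a safe action for the policy.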
@@ -41,7 +45,7 @@ Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore ev
| **WandB (all runs)** | https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl |
| **Trained Model** | https://huggingface.co/AniketAsla/debatefloor-grpo-qwen2.5-0.5b-instruct |
| **Training Notebook (Colab)** | [train/train_debatefloor.ipynb](https://github.com/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb) |
- | **Mini-Blog** | [docs/HFBlogPost.md](https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/docs/HFBlogPost.md) |
+ | **Mini-Blog** | [docs/BLOG.md](https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/docs/BLOG.md) |

---

@@ -50,20 +54,20 @@ Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore ev
| Criterion | Weight | Where to find the evidence |
|---|---|---|
| **Environment Innovation** | 40% | The 3×2 calibration matrix (`README` §_The Core Innovation_) is a novel reward shape — it does not exist in any prior insurance-RL work and directly attacks the calibration-degradation problem documented in the CAPO paper (April 2026). The **Court Panel** mechanic forces the agent to expose its reasoning to a programmatic adversary, which is also unexplored territory for RL on LLMs. |
- | **Storytelling & Presentation** | 30% | [`docs/HFBlogPost.md`](docs/HFBlogPost.md) — full mini-blog motivating the problem, walking the reader through one episode end-to-end, and showing the training delta in plain language. README is structured for a 3-minute read with the headline number first. |
+ | **Storytelling & Presentation** | 30% | [`docs/BLOG.md`](docs/BLOG.md) — full mini-blog motivating the problem, walking the reader through one episode end-to-end, and showing the training delta in plain language. README is structured for a 3-minute read with the headline number first. |
| **Showing Improvement in Rewards** | 20% | [`docs/reward_curve.svg`](docs/reward_curve.svg) — 2,500-step reward curve from a 5,000-episode GRPO run (0.130 → 0.469, 3.6×). [`reports/training_summary.json`](reports/training_summary.json) — raw metrics including full log history. [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — before/after on held-out eval (Decision accuracy 0 → 1.0, Calibration 0 → 1.0). WandB run linked above for reproducibility. |
- | **Reward & Training Pipeline** | 10% | [`app/services/reward.py`](app/services/reward.py) — composable rubric (decision × confidence × evidence × format), not monolithic. [`train/train_minimal.py`](train/train_minimal.py) — TRL GRPO loop that calls the live HTTP env over `requests.Session` (MR-2 compliant, no static dataset). |
+ | **Reward & Training Pipeline** | 10% | [`app/rubrics.py`](app/rubrics.py) — composable rubric (decision × confidence × evidence × format), not monolithic. [`server/calibration_grader.py`](server/calibration_grader.py) — 3×2 calibration matrix. [`train/train_minimal.py`](train/train_minimal.py) — TRL GRPO loop that calls the live HTTP env over `requests.Session` (MR-2 compliant, no static dataset). |

### Minimum-requirement checklist (for judges)

- [x] Built on **OpenEnv `0.2.3`** (latest at submission time) — see `requirements.txt`
- [x] Working **TRL training script in a Colab notebook** — [`train/train_debatefloor.ipynb`](train/train_debatefloor.ipynb)
- [x] **Real reward + loss plots** committed to the repo — [`docs/reward_curve.svg`](docs/reward_curve.svg), [`docs/component_shift.svg`](docs/component_shift.svg)
- - [x] **Mini-blog** at [`docs/HFBlogPost.md`](docs/HFBlogPost.md)
+ - [x] **Mini-blog** at [`docs/BLOG.md`](docs/BLOG.md)
- [x] **OpenEnv-compliant env hosted on HF Spaces** — https://huggingface.co/spaces/AniketAsla/debatefloor
- [x] **README** motivates the problem, explains the env, and shows results (this file)
- [x] **`openenv.yaml`** manifest valid — see repo root
- - [x] **Gym-style API** (`reset` / `step` / `state`) and **client/server separation** — see `app/` and `clients/`
+ - [x] **Gym-style API** (`reset` / `step` / `state`) and **client/server separation** — see `app/` and `server/`

---
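The rubric row above describes `app/rubrics.py` as composable (decision × confidence × evidence × format) rather than monolithic. A minimal sketch of that composition pattern follows; the component names come from the row itself, but the weights, field names, and grading lambdas are assumptions for illustration.

```python
from dataclasses import dataclass

# Sketch of a composable rubric in the decision x confidence x evidence x format
# shape the rubric row describes. Weights and component internals are assumed;
# the repo's app/rubrics.py is the real source.

@dataclass
class EpisodeOutcome:
    decision_correct: bool    # did the agent resolve the claim correctly?
    calibration_score: float  # output of the 3x2 calibration grader, in [0, 1]
    evidence_quality: float   # how well cited evidence supports the decision
    format_ok: bool           # terminal action parsed (decision + declared confidence)

# Each component is graded independently, then combined, so a new component
# (e.g. an exploit penalty) can be added without rewriting the others.
COMPONENTS = {
    "decision":   (0.4, lambda o: 1.0 if o.decision_correct else 0.0),
    "confidence": (0.3, lambda o: o.calibration_score),
    "evidence":   (0.2, lambda o: o.evidence_quality),
    "format":     (0.1, lambda o: 1.0 if o.format_ok else 0.0),
}

def rubric_reward(outcome: EpisodeOutcome) -> float:
    """Weighted sum of independently graded components."""
    return sum(w * grade(outcome) for w, grade in COMPONENTS.values())
```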
@@ -85,7 +89,13 @@ hand-edits, no rounded-up estimates. Source:
| Hardware | L4 GPU (HF Jobs), 3 h 3 min |
| WandB | [Project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) — open the **latest** run named `grpo-qwen0.5b-env-connected` (5K HF Job, Apr 2026). **Canonical curves:** [`docs/reward_curve.svg`](docs/reward_curve.svg) + [`reports/training_summary.json`](reports/training_summary.json) (always match the committed training). |

- ### Headline result: training reward 0.130 → 0.469 (3.6× improvement)
+ ### Headline result
+
+ - **Training reward: 0.130 → 0.469 (3.6× improvement)** across 2,500 GRPO steps.
+ - **Held-out decision accuracy: 0.000 → 1.000** — the trained model gets every held-out claim right.
+ - **Held-out calibration score: 0.000 → 1.000** — confidence now matches correctness on every terminal action. *This is the skill the 3×2 matrix was designed to teach, and the result attacks the overconfidence pathology the [CAPO paper](https://arxiv.org/abs/2604.12632) documented in GRPO.*
+
+ All three numbers are read directly from [`reports/training_summary.json`](reports/training_summary.json) and [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — no hand-edits.

### Held-out evaluation (6 episodes: 3 tasks × 2 seeds, live HTTP `/step`)

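For context on how `train/train_minimal.py` can stay MR-2 compliant with no static dataset: below is a sketch of a TRL-style GRPO reward function that grades each sampled completion against the live HTTP environment over a shared `requests.Session`. Only the `/step` route is named in this README; the Space URL, request schema, and response fields are assumptions.

```python
import requests

ENV_URL = "https://aniketasla-debatefloor.hf.space"  # assumed Space URL
session = requests.Session()  # one connection reused across all reward calls

def env_reward_func(prompts, completions, **kwargs):
    """GRPO reward function: score each completion by sending it to the
    live environment. The payload and response fields are assumptions;
    the README only guarantees a gym-style /step endpoint."""
    rewards = []
    for prompt, completion in zip(prompts, completions):
        resp = session.post(
            f"{ENV_URL}/step",
            json={"action": completion},  # assumed request schema
            timeout=30,
        )
        resp.raise_for_status()
        rewards.append(float(resp.json().get("reward", 0.0)))
    return rewards

# Plugged into TRL as one of GRPOTrainer's reward functions, e.g.:
#   trainer = GRPOTrainer(model=..., reward_funcs=[env_reward_func], ...)
```

Because the reward is fetched live per completion, there is no pre-graded dataset to overfit to: the environment is the sole source of the training signal.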
@@ -116,6 +126,21 @@ Reward climbs from 0.130 to 0.469 — a 3.6× improvement. Source:
Calibration 0 → 1.0, Fraud detection 0 → 0.33. Source:
[`reports/component_shift_summary.json`](reports/component_shift_summary.json).*

+ ### Environment baseline — 25 episodes against the live HF Space
+
+ Independent end-to-end check that the deployed environment is reachable, deterministic, and produces graded rewards across **all 5 tasks × 5 seeds = 25 episodes** via the public HTTP API. Source: [`reports/eval_report.md`](reports/eval_report.md).
+
+ | Task | Reward | Evidence Quality | Exploit Penalty | Steps |
+ |---|---:|---:|---:|---:|
+ | `clean_claim` | 0.8725 | 1.0000 | 0.0000 | 4 |
+ | `contradictory_claim` | 0.7497 | 1.0000 | 0.0000 | 8 |
+ | `coordinated_fraud` | 0.8230 | 1.0000 | 0.0000 | 12 |
+ | `distribution_shift_claim` | 0.7827 | 1.0000 | 0.0000 | 12 |
+ | `identity_fraud` | 0.8180 | 1.0000 | 0.0000 | 10 |
+ | **Average** | **0.8092** | **1.0000** | **0.0000** | — |
+
+ **Completion rate: 100%** across 25 episodes (variants 0–4 visited per task). This is a *scripted* baseline — fixed strategies per `task_id` — so seeds within a task return the same reward by design (see the note in `eval_report.md`). Its job is to prove the env's reward surface is reproducible and the live HTTP path works end-to-end; the *trained model's* improvement is in the component-shift table above.
+

---

## Quick Start for Reviewers (3 minutes)
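Below is a sketch of the kind of harness that produces the 5 tasks × 5 seeds = 25 episodes table above, driving the gym-style `reset`/`step` loop over HTTP. The `/reset` route, payload fields, and the placeholder scripted strategy are assumptions; only `/step` appears verbatim in this README, and the task ids come from the table itself.

```python
import statistics
import requests

ENV_URL = "https://aniketasla-debatefloor.hf.space"  # assumed Space URL
TASKS = ["clean_claim", "contradictory_claim", "coordinated_fraud",
         "distribution_shift_claim", "identity_fraud"]

def scripted_action(task_id: str, observation: dict) -> dict:
    # Placeholder for the fixed per-task strategies the baseline uses;
    # the real scripts live in the repo's eval harness. Schema is assumed.
    return {"type": "decide", "decision": "deny", "confidence": 0.6}

def run_episode(session: requests.Session, task_id: str, seed: int) -> float:
    # /reset and all payload fields are assumptions; /step is named in the README.
    obs = session.post(f"{ENV_URL}/reset",
                       json={"task_id": task_id, "seed": seed}, timeout=30).json()
    total, done = 0.0, False
    while not done:
        step = session.post(f"{ENV_URL}/step",
                            json={"action": scripted_action(task_id, obs)},
                            timeout=30).json()
        obs, done = step["observation"], step["done"]
        total += step["reward"]
    return total

session = requests.Session()
rewards = {t: statistics.mean(run_episode(session, t, s) for s in range(5))
           for t in TASKS}
print(rewards)  # one mean reward per task, as in the baseline table
```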
@@ -343,7 +368,7 @@ debateFloor/
├── docs/
│ ├── reward_curve.svg ← training reward curve (embedded above)
│ ├── component_shift.svg ← before/after component scores (embedded above)
- │ └── HFBlogPost.md ← writeup
+ │ └── BLOG.md ← writeup

└── reports/
├── training_summary.json