deploy: update README.md
README.md
CHANGED
|
@@ -1,451 +1,451 @@
|
|
| 1 |
-
---
|
| 2 |
-
title: ClaimCourt — Insurance Calibration RL Environment
|
| 3 |
-
emoji: ⚖️
|
| 4 |
-
colorFrom: indigo
|
| 5 |
-
colorTo: purple
|
| 6 |
-
sdk: docker
|
| 7 |
-
app_port: 7860
|
| 8 |
-
pinned: true
|
| 9 |
-
---
|
| 10 |
-
|
| 11 |
-
# ClaimCourt — Insurance Calibration RL Environment
|
| 12 |
-
|
| 13 |
-
> *Codename in the repo & URLs: `debatefloor` — all GitHub, Hugging Face Space, and model-repo slugs use the original codename so existing links continue to resolve. The product is **ClaimCourt** everywhere it faces a human reader.*
|
| 14 |
-
|
| 15 |
-
[](https://github.com/AniketAslaliya/debateFloor)
|
| 16 |
-
[](https://huggingface.co/spaces/AniketAsla/debatefloor)
|
| 17 |
-
[](https://arxiv.org/abs/2604.12632)
|
| 18 |
-
[](https://colab.research.google.com/github/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb)
|
| 20 |
-
|
| 21 |
-
> An [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compliant RL training environment (**ClaimCourt**) where AI agents investigate insurance claims, argue in an adversarial **Court Panel**, and must declare **calibrated confidence** before every terminal decision.
|
| 22 |
-
> Built for the **Meta PyTorch × Scaler Hackathon Grand Finale, April 25–26 2026**.
|
| 23 |
-
|
| 24 |
-
---
|
| 25 |
-
|
| 26 |
-
## Problem Statement
|
| 27 |
-
|
| 28 |
-
LLMs deployed in high-stakes domains suffer from a well-documented failure mode: **overconfidence**. A model that approves or denies an insurance claim with 100 % certainty — but is wrong — causes real harm. The [CAPO paper (arXiv:2604.12632, 2026)](https://arxiv.org/abs/2604.12632) measures up to a 15 % AUC drop in standard GRPO training, and [DCPO (arXiv:2603.09117, 2026)](https://arxiv.org/abs/2603.09117) shows a 71 % Expected-Calibration-Error reduction is achievable when calibration is treated as a first-class objective.
|
| 29 |
-
|
| 30 |
-
**ClaimCourt is the direct fix.** It trains LLMs to declare *calibrated* confidence before every decision, using a reward surface that penalises overconfident wrong answers more severely than uncertain ones. This teaches models **when** to be confident, not just what to say.
|
| 31 |
-
|
| 32 |
-
Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore every year** ([BCG × Medi Assist, Nov 2025](https://www.business-standard.com/industry/news/insurance-fwa-drains-rs10000cr-each-year-bcg-mediassist-report-125112101199_1.html)), about 8 % of all claim payouts. From April 2026, the [IRDAI Insurance Fraud Monitoring Framework Guidelines, 2025](https://irdai.gov.in/) make every insurer legally responsible for detecting it. AI is the obvious tool, but recent research ([CAPO, arXiv:2604.12632](https://arxiv.org/abs/2604.12632); [DCPO, arXiv:2603.09117](https://arxiv.org/abs/2603.09117)) reports that standard GRPO training makes models *more* overconfident as they get more accurate, which is exactly the wrong direction for high-stakes claims work.
|
| 33 |
-
|
| 34 |
-
---
|
| 35 |
-
|
| 36 |
-
## Submission Artifacts
|
| 37 |
-
|
| 38 |
-
| Artifact | Link |
|
| 39 |
-
|---|---|
|
| 40 |
-
| **Live Environment (HF Space)** | https://huggingface.co/spaces/AniketAsla/debatefloor |
|
| 41 |
-
| **WandB
|
| 42 |
-
| **Trained Model** | https://huggingface.co/AniketAsla/debatefloor-grpo-qwen2.5-0.5b-instruct |
|
| 43 |
-
| **Training Notebook (Colab)** | [train/train_debatefloor.ipynb](https://github.com/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb) |
|
| 44 |
-
| **Mini-Blog** | [docs/HFBlogPost.md](https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/docs/HFBlogPost.md) |
|
| 45 |
-
|
| 46 |
-
---
|
| 47 |
-
|
| 48 |
-
## How This Submission Maps to the Judging Rubric
|
| 49 |
-
|
| 50 |
-
| Criterion | Weight | Where to find the evidence |
|
| 51 |
-
|---|---|---|
|
| 52 |
-
| **Environment Innovation** | 40% | The 3×2 calibration matrix (`README` §_The Core Innovation_) is a novel reward shape — it does not exist in any prior insurance-RL work and directly attacks the calibration-degradation problem documented in the CAPO paper (April 2026). The **Court Panel** mechanic forces the agent to expose its reasoning to a programmatic adversary, which is also unexplored territory for RL on LLMs. |
|
| 53 |
-
| **Storytelling & Presentation** | 30% | [`docs/HFBlogPost.md`](docs/HFBlogPost.md) — full mini-blog motivating the problem, walking the reader through one episode end-to-end, and showing the training delta in plain language. README is structured for a 3-minute read with the headline number first. |
|
| 54 |
-
| **Showing Improvement in Rewards** | 20% | [`docs/reward_curve.svg`](docs/reward_curve.svg) — 2,500-step reward curve from a 5,000-episode GRPO run (0.130 → 0.469, 3.6×). [`reports/training_summary.json`](reports/training_summary.json) — raw metrics including full log history. [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — before/after on held-out eval (Decision accuracy 0 → 1.0, Calibration 0 → 1.0). WandB run linked above for reproducibility. |
|
| 55 |
-
| **Reward & Training Pipeline** | 10% | [`app/services/reward.py`](app/services/reward.py) — composable rubric (decision × confidence × evidence × format), not monolithic. [`train/train_minimal.py`](train/train_minimal.py) — TRL GRPO loop that calls the live HTTP env over `requests.Session` (MR-2 compliant, no static dataset). |
|
| 56 |
-
|
| 57 |
-
### Minimum-requirement checklist (for judges)
|
| 58 |
-
|
| 59 |
-
- [x] Built on **OpenEnv `0.2.3`** (latest at submission time) — see `requirements.txt`
|
| 60 |
-
- [x] Working **TRL training script in a Colab notebook** — [`train/train_debatefloor.ipynb`](train/train_debatefloor.ipynb)
|
| 61 |
-
- [x] **Real reward + loss plots** committed to the repo — [`docs/reward_curve.svg`](docs/reward_curve.svg), [`docs/component_shift.svg`](docs/component_shift.svg)
|
| 62 |
-
- [x] **Mini-blog** at [`docs/HFBlogPost.md`](docs/HFBlogPost.md)
|
| 63 |
-
- [x] **OpenEnv-compliant env hosted on HF Spaces** — https://huggingface.co/spaces/AniketAsla/debatefloor
|
| 64 |
-
- [x] **README** motivates the problem, explains the env, and shows results (this file)
|
| 65 |
-
- [x] **`openenv.yaml`** manifest valid — see repo root
|
| 66 |
-
- [x] **Gym-style API** (`reset` / `step` / `state`) and **client/server separation** — see `app/` and `clients/`
|
| 67 |
-
|
| 68 |
-
---
|
| 69 |
-
|
| 70 |
-
## Results
|
| 71 |
-
|
| 72 |
-
All numbers below are **read directly from committed JSON artifacts** — no
|
| 73 |
-
hand-edits, no rounded-up estimates. Source:
|
| 74 |
-
[`reports/training_summary.json`](reports/training_summary.json),
|
| 75 |
-
[`reports/component_shift_summary.json`](reports/component_shift_summary.json).
|
| 76 |
-
|
| 77 |
-
### GRPO training — 5,000 episodes, Qwen2.5-0.5B-Instruct
|
| 78 |
-
|
| 79 |
-
| Config | Value |
|
| 80 |
-
|---|---|
|
| 81 |
-
| Episodes | 5,000 |
|
| 82 |
-
| Epochs | 1 |
|
| 83 |
-
| GRPO steps | 2,500 |
|
| 84 |
-
| Batch / Generations | 8 / 8 |
|
| 85 |
-
| Hardware | L4 GPU (HF Jobs), 3 h 3 min |
|
| 86 |
-
| WandB | [
|
| 87 |
-
|
| 88 |
-
### Headline result: training reward 0.130 → 0.469 (3.6× improvement)
|
| 89 |
-
|
| 90 |
-
### Held-out evaluation (6 episodes: 3 tasks × 2 seeds, live HTTP `/step`)
|
| 91 |
-
|
| 92 |
-
| Component | Before (untrained) | After (GRPO) | Change |
|
| 93 |
-
|---|---:|---:|---|
|
| 94 |
-
| **Decision accuracy** | 0.000 | **1.000** | **+1.000** |
|
| 95 |
-
| **Calibration** | 0.000 | **1.000** | **+1.000** |
|
| 96 |
-
| **Fraud detection** | 0.000 | **0.333** | +0.333 |
|
| 97 |
-
| Evidence quality | 0.333 | 0.333 | unchanged |
|
| 98 |
-
| Reasoning quality | 0.833 | 0.792 | −0.042 (within noise) |
|
| 99 |
-
|
| 100 |
-
The trained model learned to **make correct decisions with calibrated
|
| 101 |
-
confidence** — exactly the skill this environment is designed to teach.
|
| 102 |
-
Decision accuracy and calibration both went from zero to perfect on the
|
| 103 |
-
held-out eval set. The small dip in reasoning quality (−4 pts) is a
|
| 104 |
-
known trade-off: the model traded a sliver of fluency for sharper
|
| 105 |
-
decision-making.
|
| 106 |
-
|
| 107 |
-
### Training Plots
|
| 108 |
-
|
| 109 |
-

|
| 110 |
-
*Mean training reward across 2,500 GRPO steps (5,000 episodes, 1 epoch).
|
| 111 |
-
Reward climbs from 0.130 to 0.469 — a 3.6× improvement. Source:
|
| 112 |
-
[`reports/training_summary.json`](reports/training_summary.json).*
|
| 113 |
-
|
| 114 |
-

|
| 115 |
-
*Before vs after on held-out eval: Decision accuracy 0 → 1.0,
|
| 116 |
-
Calibration 0 → 1.0, Fraud detection 0 → 0.33. Source:
|
| 117 |
-
[`reports/component_shift_summary.json`](reports/component_shift_summary.json).*
|
| 118 |
-
|
| 119 |
-
---
|
| 120 |
-
|
| 121 |
-
## Quick Start for Reviewers (3 minutes)
|
| 122 |
-
|
| 123 |
-
1. **Open the live UI:** https://huggingface.co/spaces/AniketAsla/debatefloor
|
| 124 |
-
2. **Select `contradictory_claim`** and click **Run Episode**.
|
| 125 |
-
3. Watch the agent: validate documents → flag fraud signals → **convene the Court Panel (Prosecutor vs Defender)** → declare MED confidence → deny claim.
|
| 126 |
-
4. The highlighted cell in the 3×2 matrix shows exactly why it scored what it scored.
|
| 127 |
-
|
| 128 |
-
---
|
| 129 |
-
|
| 130 |
-
## What Makes This Novel
|
| 131 |
-
|
| 132 |
-
- **Training environment, not a benchmark.** Episodes are procedurally generated from seeds — the agent cannot memorise answers.
|
| 133 |
-
- **Teaches calibration, not just accuracy.** Overconfident wrong answers are penalised harder than uncertain ones. No other OpenEnv environment has this.
|
| 134 |
-
- **Multi-agent by design.** The final decision is informed by the adversarial **Court Panel** (Prosecutor vs Defender) before the Judge commits. This maps to the Fleet AI Scalable Oversight theme.
|
| 135 |
-
- **Anti-gaming system.** An agent cannot win by always saying LOW confidence or always saying HIGH. It must learn genuine calibration.
|
| 136 |
-
|
| 137 |
-
---
|
| 138 |
-
|
| 139 |
-
## Theme Coverage
|
| 140 |
-
|
| 141 |
-
| Theme | Bonus Prize | What We Built |
|
| 142 |
-
|-------|-------------|---------------|
|
| 143 |
-
| **Theme 3.1** — World Modeling (Professional) | Scaler AI Labs: Multi-App RL for Enterprise Workflows | 5 fraud types, multi-doc investigation, IRDAI registry, policy history |
|
| 144 |
-
| **Theme 1** — Multi-Agent Interactions | Fleet AI: Scalable Oversight | 3-agent Court Panel: Prosecutor + Defender + Judge |
|
| 145 |
-
| **Theme 4** — Self-Improvement | Curriculum / difficulty escalation | easy→medium→hard + anti-gaming detector |
|
| 146 |
-
|
| 147 |
-
---
|
| 148 |
-
|
| 149 |
-
## The Core Innovation: 3×2 Calibration Matrix
|
| 150 |
-
|
| 151 |
-
Before every terminal action, the agent must declare a confidence level: **HIGH**, **MED**, or **LOW**. The reward is determined by this matrix:
|
| 152 |
-
|
| 153 |
-
| Confidence | Correct Decision | Wrong Decision |
|
| 154 |
-
|------------|-----------------|----------------|
|
| 155 |
-
| **HIGH** | +1.0 | **−0.8** ← worst outcome |
|
| 156 |
-
| **MED** | +0.6 | −0.2 |
|
| 157 |
-
| **LOW** | +0.1 | 0.0 ← safe |
|
| 158 |
-
|
| 159 |
-
An agent that always says HIGH to maximise reward is catastrophically punished when wrong. An agent that always says LOW is caught by the anti-gaming system. **The only winning strategy is accurate calibration.**
|
| 160 |
-
|
| 161 |
-
Based on the [CoCA framework (arXiv:2603.05881)](https://arxiv.org/abs/2603.05881) — co-optimising confidence and accuracy via GRPO.
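
One way to see why the matrix rewards calibration is to compute the expected reward of each confidence level as a function of the probability `p_correct` that the decision is right. The sketch below is illustrative only: the dictionary layout and the helper names `expected_reward` and `best_confidence` are assumptions made for this example, not code from the repo. Only the six payoff values come from the table above.

```python
# Payoffs copied from the 3x2 matrix: (reward if correct, reward if wrong).
CALIBRATION_MATRIX = {
    "HIGH": (1.0, -0.8),
    "MED": (0.6, -0.2),
    "LOW": (0.1, 0.0),
}


def expected_reward(confidence: str, p_correct: float) -> float:
    """Expected matrix reward when the decision is right with probability p_correct."""
    right, wrong = CALIBRATION_MATRIX[confidence]
    return p_correct * right + (1.0 - p_correct) * wrong


def best_confidence(p_correct: float) -> str:
    """Confidence level with the highest expected reward for a given p_correct."""
    return max(CALIBRATION_MATRIX, key=lambda c: expected_reward(c, p_correct))


for p in (0.1, 0.5, 0.9):
    print(p, best_confidence(p))
```

Under these payoffs, HIGH has the highest expected reward only when `p_correct` exceeds 0.6, LOW only when it falls below 2/7 (about 0.29), and MED in between. This considers the matrix alone and ignores the anti-gaming penalty.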
|
| 162 |
-
|
| 163 |
-
---
|
| 164 |
-
|
| 165 |
-
## The Court Panel — The Demo Centrepiece
|
| 166 |
-
|
| 167 |
-
> **No other environment in the OpenEnv hub has this mechanic.** Run `contradictory_claim` in the live UI to see it unfold.
|
| 168 |
-
|
| 169 |
-
**The 90-second sequence that wins the storytelling criterion:**
|
| 170 |
-
|
| 171 |
-
1. Agent validates 3 documents, discovers `date_mismatch` + `cost_inflation` fraud signals.
|
| 172 |
-
2. Agent calls `convene_debate_panel` — two sub-agents spin up from the evidence base.
|
| 173 |
-
3. **Prosecutor [STRONG]:** *"2 fraud signals, billing 2.4× standard rate — deny."*
|
| 174 |
-
4. **Defender [WEAK]:** *"Documents internally consistent, burden of proof requires more."*
|
| 175 |
-
5. Panel verdict: **Prosecution substantially outweighs defense.**
|
| 176 |
-
6. Agent reads transcript → declares **MED confidence** → `deny_claim` → scores **+0.6**.
|
| 177 |
-
7. The calibration matrix highlights `MED × correct`. The reviewer sees exactly why.
|
| 178 |
-
|
| 179 |
-
```
|
| 180 |
-
INVESTIGATOR
|
| 181 |
-
├── validate_document → discovers fraud signals
|
| 182 |
-
├── flag_fraud_signal → formally raises grounded signal
|
| 183 |
-
├── query_historical_data → reveals cross-claim patterns
|
| 184 |
-
└── Builds evidence base over N steps
|
| 185 |
-
↓
|
| 186 |
-
convene_debate_panel
|
| 187 |
-
↓
|
| 188 |
-
┌───────────────────┐ ┌────────────────────┐
|
| 189 |
-
│ PROSECUTOR │ │ DEFENDER │
|
| 190 |
-
│ • fraud signals │ │ • doc consistency │
|
| 191 |
-
│ • Strength: STRONG│ │ • Strength: WEAK │
|
| 192 |
-
└───────────────────┘ └────────────────────┘
|
| 193 |
-
↓
|
| 194 |
-
PANEL VERDICT → recommendation
|
| 195 |
-
↓
|
| 196 |
-
JUDGE: approve / deny / escalate
|
| 197 |
-
+ confidence: HIGH / MED / LOW
|
| 198 |
-
→ calibration_score via 3×2 matrix
|
| 199 |
-
```
|
| 200 |
-
|
| 201 |
-
---
|
| 202 |
-
|
| 203 |
-
## Why This Is the Right RL Task
|
| 204 |
-
|
| 205 |
-
ClaimCourt satisfies all three properties of a well-designed RL task:
|
| 206 |
-
|
| 207 |
-
- **Step-by-step:** The agent validates documents, queries history, flags signals, and uses the Court Panel before committing. Each step changes the information state.
|
| 208 |
-
- **Programmatically verifiable:** Ground truth is embedded in every generated episode (`staged_accident → deny_claim`). No human labeller needed.
|
| 209 |
-
- **Hard enough to matter:** Easy claims are solvable with 2 steps. Hard claims require discovering cross-claim fraud rings across linked sessions. The model must earn its confidence.
|
| 210 |
-
|
| 211 |
-
---
|
| 212 |
-
|
| 213 |
-
## The 3 Tasks
|
| 214 |
-
|
| 215 |
-
| Task | Difficulty | Max Steps | Correct Decision | Expected Confidence |
|
| 216 |
-
|------|-----------|-----------|-----------------|---------------------|
|
| 217 |
-
| `clean_claim` | Easy | 10 | `approve_claim` | HIGH |
|
| 218 |
-
| `contradictory_claim` | Medium | 18 | `deny_claim` | MED |
|
| 219 |
-
| `distribution_shift_claim` | Hard | 28 | `escalate_to_human` | LOW |
|
| 220 |
-
|
| 221 |
-
`distribution_shift_claim` looks clean on the surface. The agent must call `query_linked_claim` or `query_historical_data` to discover cross-claim fraud signals. If the agent declares HIGH confidence, it is **always penalised regardless of decision** — this task is designed to require epistemic humility.
|
| 222 |
-
|
| 223 |
-
---
|
| 224 |
-
|
| 225 |
-
## Procedural Generation
|
| 226 |
-
|
| 227 |
-
A benchmark has fixed episodes. ClaimCourt generates them procedurally:
|
| 228 |
-
|
| 229 |
-
```python
|
| 230 |
-
from server.claim_generator import generate_claim
|
| 231 |
-
|
| 232 |
-
# Same inputs → same episode (deterministic, reproducible)
|
| 233 |
-
episode = generate_claim(seed=42, fraud_type="medical_inflation",
|
| 234 |
-
coverage_type="health", difficulty="medium")
|
| 235 |
-
```
|
| 236 |
-
|
| 237 |
-
**5 fraud types × 4 coverage types × 3 jurisdictions × seed variation = 500+ unique training episodes**
|
| 238 |
-
|
| 239 |
-
| Fraud Type | Ground Truth | Key Signal |
|
| 240 |
-
|-----------|-------------|------------|
|
| 241 |
-
| `staged_accident` | `deny_claim` | Cost mismatch between damage and repair estimate |
|
| 242 |
-
| `medical_inflation` | `deny_claim` | Procedure in bill ≠ procedure in discharge summary |
|
| 243 |
-
| `identity_fraud` | `deny_claim` | Ghost claimant, policy opened 5 days before incident |
|
| 244 |
-
| `coordinated_ring` | `escalate_to_human` | Shared broker across 3–5 simultaneous claims |
|
| 245 |
-
| `phantom_provider` | `deny_claim` | Hospital not in IRDAI registry, invalid GST |
|
| 246 |
-
|
| 247 |
-
---
|
| 248 |
-
|
| 249 |
-
## Reward Design
|
| 250 |
-
|
| 251 |
-
### Training Reward (use for GRPO — simple scalar for stable gradients)
|
| 252 |
-
|
| 253 |
-
```python
|
| 254 |
-
def training_reward(decision, confidence, ground_truth, legitimate_flags, step_num, done):
|
| 255 |
-
r = -0.05 # step penalty (efficiency)
|
| 256 |
-
if done:
|
| 257 |
-
r += 1.0 if correct else -0.5 # decision accuracy
|
| 258 |
-
r += 0.3 * min(legitimate_flags, 3) # fraud signal detection
|
| 259 |
-
r += 0.5 * calibration_matrix[(confidence, correct)] # calibration bonus
|
| 260 |
-
return r
|
| 261 |
-
```
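
The snippet above leaves `correct` and `calibration_matrix` undefined. A self-contained version might look like the following sketch. It assumes that `correct` means the terminal decision equals the episode's ground truth, and that the calibration bonus is looked up in the 3×2 matrix from this README; the constant name `CALIBRATION_MATRIX` is chosen for the example and may not match `server/calibration_grader.py`.

```python
# (confidence, correct) -> calibration payoff, copied from the 3x2 matrix.
CALIBRATION_MATRIX = {
    ("HIGH", True): 1.0, ("HIGH", False): -0.8,
    ("MED", True): 0.6, ("MED", False): -0.2,
    ("LOW", True): 0.1, ("LOW", False): 0.0,
}


def training_reward(decision, confidence, ground_truth, legitimate_flags, step_num, done):
    # Assumption: "correct" means the terminal decision equals the episode label.
    correct = decision == ground_truth
    r = -0.05  # per-step penalty (efficiency)
    if done:
        r += 1.0 if correct else -0.5        # decision accuracy
        r += 0.3 * min(legitimate_flags, 3)  # fraud signal detection, capped at 3 flags
        r += 0.5 * CALIBRATION_MATRIX[(confidence, correct)]  # calibration bonus
    # step_num is kept to match the original signature but is unused here.
    return r


print(training_reward("deny_claim", "MED", "deny_claim", 2, 5, True))
```

For a correct MED-confidence denial with two legitimate flags, the terms are -0.05 + 1.0 + 0.6 + 0.3, which works out to 1.85 up to floating-point rounding. A non-terminal step returns only the -0.05 step penalty.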
|
| 262 |
-
|
| 263 |
-
### Evaluation Reward (for demo and reporting only — do not use for GRPO)
|
| 264 |
-
|
| 265 |
-
```python
|
| 266 |
-
def eval_reward(episode):
|
| 267 |
-
return (0.35 * calibration_reward # confidence accuracy
|
| 268 |
-
+ 0.25 * escalation_reward # appropriate uncertainty escalation
|
| 269 |
-
+ 0.20 * evidence_quality # grounded signal citations
|
| 270 |
-
+ 0.10 * efficiency_score # step efficiency
|
| 271 |
-
- 0.10 * gaming_penalty) # anti-gaming deduction
|
| 272 |
-
```
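
The weighted sum above can be made runnable as follows. This is a sketch under assumptions: the original takes an `episode` object whose fields are not shown, so the `EvalComponents` dataclass here is a hypothetical stand-in, and each component is assumed to be a float in `[0, 1]`.

```python
from dataclasses import dataclass


@dataclass
class EvalComponents:
    """Hypothetical container for the five per-episode scores named in the README."""
    calibration_reward: float  # confidence accuracy
    escalation_reward: float   # appropriate uncertainty escalation
    evidence_quality: float    # grounded signal citations
    efficiency_score: float    # step efficiency
    gaming_penalty: float      # anti-gaming deduction


def eval_reward(c: EvalComponents) -> float:
    """Weighted sum using the weights listed in the README."""
    return (0.35 * c.calibration_reward
            + 0.25 * c.escalation_reward
            + 0.20 * c.evidence_quality
            + 0.10 * c.efficiency_score
            - 0.10 * c.gaming_penalty)


print(eval_reward(EvalComponents(1.0, 1.0, 1.0, 1.0, 0.0)))
```

The positive weights sum to 0.90, so under the `[0, 1]` assumption the score tops out at about 0.90 when every component is perfect and no gaming penalty applies.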
|
| 273 |
-
|
| 274 |
-
### Anti-Gaming System
|
| 275 |
-
|
| 276 |
-
```
|
| 277 |
-
if LOW_rate > 70% across 10+ episodes: penalty = (rate − 0.70) × 2.0
|
| 278 |
-
if HIGH_rate > 80% across 10+ episodes: penalty = (rate − 0.80) × 1.5
|
| 279 |
-
```
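
The two rules above could be implemented along these lines. This is a minimal sketch, not the repo's detector: the function name `gaming_penalty`, the list-of-strings input, and the choice to compute rates over the whole history passed in (rather than a sliding window) are all assumptions for illustration.

```python
from collections import Counter


def gaming_penalty(confidences, min_episodes=10):
    """Penalty for a history of declared confidence levels, one entry per episode."""
    n = len(confidences)
    if n < min_episodes:  # the rules only apply across 10+ episodes
        return 0.0
    counts = Counter(confidences)
    low_rate = counts["LOW"] / n
    high_rate = counts["HIGH"] / n
    if low_rate > 0.70:
        return (low_rate - 0.70) * 2.0
    if high_rate > 0.80:
        return (high_rate - 0.80) * 1.5
    return 0.0


print(gaming_penalty(["LOW"] * 10))
print(gaming_penalty(["HIGH", "MED", "LOW"] * 4))
```

The two thresholds cannot both trigger on the same history, because the LOW and HIGH rates sum to at most 1. An agent that always declares LOW incurs a penalty of about 0.6, and one that always declares HIGH about 0.3.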
|
| 280 |
-
|
| 281 |
-
---
|
| 282 |
-
|
| 283 |
-
## Training Pipeline
|
| 284 |
-
|
| 285 |
-
**Model:** `Qwen/Qwen2.5-0.5B-Instruct` — open-source, no OpenAI API
|
| 286 |
-
**Algorithm:** GRPO (Group Relative Policy Optimization, the algorithm used to train DeepSeek-R1) via HF TRL `GRPOTrainer`, with Unsloth 4-bit QLoRA
|
| 287 |
-
**Full run:** L4 GPU on HF Jobs — 5,000 episodes, 2,500 steps, 3 h 3 min
|
| 288 |
-
**Quick run:** Free Colab T4 GPU — 100 episodes, ~15 min (see notebook)
|
| 289 |
-
**WandB
|
| 290 |
-
|
| 291 |
-
```bash
|
| 292 |
-
# Reproduce the training run
|
| 293 |
-
git clone https://github.com/AniketAslaliya/debateFloor.git && cd debateFloor
|
| 294 |
-
|
| 295 |
-
# Use the canonical pinned requirements files (every dep verified to
|
| 296 |
-
# import inside train_minimal.py and the env server).
|
| 297 |
-
pip install -r requirements.txt # env server deps (FastAPI, openenv-core, ...)
|
| 298 |
-
pip install -r train/requirements.txt # training deps (trl, unsloth, peft, wandb, ...)
|
| 299 |
-
|
| 300 |
-
# Optional (Colab T4): swap the pinned unsloth for the colab-new wheel
|
| 301 |
-
# pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
|
| 302 |
-
#
|
| 303 |
-
# If you see: ModuleNotFoundError: No module named 'mergekit' when importing
|
| 304 |
-
# GRPOTrainer — you skipped train/requirements.txt. Re-run: pip install -r train/requirements.txt
|
| 305 |
-
# (mergekit is required by recent TRL for the GRPO import path.)
|
| 306 |
-
|
| 307 |
-
PYTHONPATH=. python train/train_minimal.py
|
| 308 |
-
```
|
| 309 |
-
|
| 310 |
-
Or open the Colab notebook: [train/train_debatefloor.ipynb](https://github.com/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb)
|
| 311 |
-
|
| 312 |
-
Artifacts generated after training:
|
| 313 |
-
- `docs/reward_curve.svg`
|
| 314 |
-
- `docs/component_shift.svg`
|
| 315 |
-
- `reports/training_summary.json`
|
| 316 |
-
|
| 317 |
-
---
|
| 318 |
-
|
| 319 |
-
## Architecture & Code Map
|
| 320 |
-
|
| 321 |
-
**ClaimCourt** — after `git clone`, your working directory is `debateFloor/` (GitHub repo name; codename `debatefloor` in HF/WandB URLs).
|
| 322 |
-
|
| 323 |
-
```
|
| 324 |
-
debateFloor/
|
| 325 |
-
├── openenv.yaml ← OpenEnv spec manifest
|
| 326 |
-
├── Dockerfile ← HF Space deployment
|
| 327 |
-
├── requirements.txt
|
| 328 |
-
│
|
| 329 |
-
├── app/ ← FastAPI server (OpenEnv contract)
|
| 330 |
-
│ ├── main.py ← /reset /step /state /tasks /health /schema
|
| 331 |
-
│ ├── environment.py ← InsuranceClaimEnvironment + Court Panel
|
| 332 |
-
│ ├── models.py ← Pydantic action/observation models
|
| 333 |
-
│ └── tasks.py ← task definitions
|
| 334 |
-
│
|
| 335 |
-
├── server/ ← ClaimCourt core
|
| 336 |
-
│ ├── calibration_grader.py ← 3×2 matrix + anti-gaming + training/eval reward
|
| 337 |
-
│ └── claim_generator.py ← procedural episode generator (500+ episodes)
|
| 338 |
-
│
|
| 339 |
-
├── train/
|
| 340 |
-
│ ├── train_minimal.py ← Pure TRL GRPOTrainer, T4 in 15 min
|
| 341 |
-
│ └── train_debatefloor.ipynb ← Colab notebook (dynamic wrapper)
|
| 342 |
-
│
|
| 343 |
-
├── docs/
|
| 344 |
-
│ ├── reward_curve.svg ← training reward curve (embedded above)
|
| 345 |
-
│ ├── component_shift.svg ← before/after component scores (embedded above)
|
| 346 |
-
│ └── HFBlogPost.md ← writeup
|
| 347 |
-
│
|
| 348 |
-
└── reports/
|
| 349 |
-
├── training_summary.json
|
| 350 |
-
└── component_shift_summary.json
|
| 351 |
-
```
|
| 352 |
-
|
| 353 |
-
---
|
| 354 |
-
|
| 355 |
-
## Quickstart
|
| 356 |
-
|
| 357 |
-
### Run locally
|
| 358 |
-
|
| 359 |
-
```bash
|
| 360 |
-
git clone https://github.com/AniketAslaliya/debateFloor.git
|
| 361 |
-
cd debateFloor
|
| 362 |
-
pip install -r requirements.txt
|
| 363 |
-
PYTHONPATH=. uvicorn app.main:app --host 0.0.0.0 --port 7860 --reload
|
| 364 |
-
```
|
| 365 |
-
|
| 366 |
-
### Run with Docker
|
| 367 |
-
|
| 368 |
-
```bash
|
| 369 |
-
docker build -t claimcourt .
|
| 370 |
-
docker run -p 7860:7860 claimcourt
|
| 371 |
-
```
|
| 372 |
-
|
| 373 |
-
---
|
| 374 |
-
|
| 375 |
-
## API Reference
|
| 376 |
-
|
| 377 |
-
All endpoints follow the OpenEnv REST contract:
|
| 378 |
-
|
| 379 |
-
| Method | Endpoint | Description |
|
| 380 |
-
|--------|----------|-------------|
|
| 381 |
-
| `POST` | `/reset` | Start new episode. Accepts `task_id`, `seed`, `session_id`. |
|
| 382 |
-
| `POST` | `/step` | Submit action. Requires `session_id` and `action` body. |
|
| 383 |
-
| `GET` | `/state` | Current episode state. |
|
| 384 |
-
| `GET` | `/tasks` | Lists all tasks with objectives. |
|
| 385 |
-
| `GET` | `/schema` | JSON schema for action/observation/state. |
|
| 386 |
-
| `GET` | `/health` | Returns `{"status": "healthy", "active_sessions": N}`. |
|
| 387 |
-
|
| 388 |
-
### Example Episode
|
| 389 |
-
|
| 390 |
-
```python
|
| 391 |
-
import requests
|
| 392 |
-
|
| 393 |
-
BASE = "https://aniketasla-debatefloor.hf.space"
|
| 394 |
-
|
| 395 |
-
r = requests.post(f"{BASE}/reset", json={"task_id": "contradictory_claim", "seed": 42})
|
| 396 |
-
session_id = r.json()["session_id"]
|
| 397 |
-
|
| 398 |
-
def step(action):
|
| 399 |
-
return requests.post(f"{BASE}/step", json={"action": action, "session_id": session_id}).json()
|
| 400 |
-
|
| 401 |
-
step({"action_type": "validate_document", "parameters": {"doc_id": "DOC-001"}, "reasoning": "check bill"})
|
| 402 |
-
step({"action_type": "flag_fraud_signal", "parameters": {"flag_id": "procedure_mismatch",
|
| 403 |
-
"evidence": "discharge says appendectomy, bill says cardiac bypass"}, "reasoning": "billing fraud"})
|
| 404 |
-
|
| 405 |
-
resp = step({"action_type": "deny_claim", "confidence": "MED", "reason": "procedure mismatch confirmed"})
|
| 406 |
-
print(f"Reward: {resp['reward']}")
|
| 407 |
-
print(f"Calibration: {resp['observation']['reward_breakdown']['calibration_score']}")
|
| 408 |
-
```
|
| 409 |
-
|
| 410 |
-
---
|
| 411 |
-
|
| 412 |
-
## OpenEnv Spec Compliance
|
| 413 |
-
|
| 414 |
-
| Requirement | Status |
|
| 415 |
-
|-------------|--------|
|
| 416 |
-
| `spec_version: 1` | ✅ |
|
| 417 |
-
| OpenEnv `Environment` base class | ✅ |
|
| 418 |
-
| `/reset`, `/step`, `/state`, `/tasks`, `/health`, `/schema` | ✅ |
|
| 419 |
-
| `supports_concurrent_sessions: true` | ✅ |
|
| 420 |
-
| `max_concurrent_envs: 64` | ✅ |
|
| 421 |
-
| `confidence_required: true` | ✅ |
|
| 422 |
-
| `procedural_generation: true` | ✅ |
|
| 423 |
-
| `episode_pool_size: 500` | ✅ |
|
| 424 |
-
| Reward in `[0.0, 1.0]` | ✅ |
|
| 425 |
-
| Docker deployment | ✅ |
|
| 426 |
-
|
| 427 |
-
---
|
| 428 |
-
|
| 429 |
-
## Team
|
| 430 |
-
|
| 431 |
-
- **Aniket Aslaliya** — Environment Core, Claim Generator, Calibration Grader, UI
|
| 432 |
-
- **Mitali Mehta** — Domain Knowledge (Fraud types, IRDAI regulations), Grader Design
|
| 433 |
-
- **Aditya Sharma** — Training Pipeline, GRPO Notebook, WandB Integration
|
| 434 |
-
|
| 435 |
-
---
|
| 436 |
-
|
| 437 |
-
## Citation
|
| 438 |
-
|
| 439 |
-
```bibtex
|
| 440 |
-
@article{coca2025,
|
| 441 |
-
title={Co-optimizing Confidence and Accuracy via Segment-Specific GRPO Rewards},
|
| 442 |
-
author={...},
|
| 443 |
-
journal={arXiv:2603.05881},
|
| 444 |
-
year={2025}
|
| 445 |
-
}
|
| 446 |
-
```
|
| 447 |
-
|
| 448 |
-
**Related:**
|
| 449 |
-
- CAPO paper (April 2026) — GRPO induces overconfidence; ClaimCourt is the fix
|
| 450 |
-
- OpenEnv: [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)
|
| 451 |
-
- TRL GRPOTrainer: [huggingface.co/docs/trl/grpo_trainer](https://huggingface.co/docs/trl/grpo_trainer)
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: ClaimCourt — Insurance Calibration RL Environment
|
| 3 |
+
emoji: ⚖️
|
| 4 |
+
colorFrom: indigo
|
| 5 |
+
colorTo: purple
|
| 6 |
+
sdk: docker
|
| 7 |
+
app_port: 7860
|
| 8 |
+
pinned: true
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
# ClaimCourt — Insurance Calibration RL Environment
|
| 12 |
+
|
| 13 |
+
> *Codename in the repo & URLs: `debatefloor` — all GitHub, Hugging Face Space, and model-repo slugs use the original codename so existing links continue to resolve. The product is **ClaimCourt** everywhere it faces a human reader.*
|
| 14 |
+
|
| 15 |
+
[](https://github.com/AniketAslaliya/debateFloor)
|
| 16 |
+
[](https://huggingface.co/spaces/AniketAsla/debatefloor)
|
| 17 |
+
[](https://arxiv.org/abs/2604.12632)
|
| 18 |
+
[](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl)
|
| 19 |
+
[](https://colab.research.google.com/github/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb)
|
| 20 |
+
|
| 21 |
+
> An [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compliant RL training environment (**ClaimCourt**) where AI agents investigate insurance claims, argue in an adversarial **Court Panel**, and must declare **calibrated confidence** before every terminal decision.
|
| 22 |
+
> Built for the **Meta PyTorch × Scaler Hackathon Grand Finale, April 25–26 2026**.
|
| 23 |
+
|
| 24 |
+
---
|
| 25 |
+
|
| 26 |
+
## Problem Statement
|
| 27 |
+
|
| 28 |
+
LLMs deployed in high-stakes domains suffer from a well-documented failure mode: **overconfidence**. A model that approves or denies an insurance claim with 100 % certainty — but is wrong — causes real harm. The [CAPO paper (arXiv:2604.12632, 2026)](https://arxiv.org/abs/2604.12632) measures up to a 15 % AUC drop in standard GRPO training, and [DCPO (arXiv:2603.09117, 2026)](https://arxiv.org/abs/2603.09117) shows a 71 % Expected-Calibration-Error reduction is achievable when calibration is treated as a first-class objective.
|
| 29 |
+
|
| 30 |
+
**ClaimCourt is the direct fix.** It trains LLMs to declare *calibrated* confidence before every decision, using a reward surface that penalises overconfident wrong answers more severely than uncertain ones. This teaches models **when** to be confident, not just what to say.
|
| 31 |
+
|
| 32 |
+
Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore every year** ([BCG × Medi Assist, Nov 2025](https://www.business-standard.com/industry/news/insurance-fwa-drains-rs10000cr-each-year-bcg-mediassist-report-125112101199_1.html)), about 8 % of all claim payouts. From April 2026, the [IRDAI Insurance Fraud Monitoring Framework Guidelines, 2025](https://irdai.gov.in/) make every insurer legally responsible for detecting it. AI is the obvious tool, but recent research ([CAPO, arXiv:2604.12632](https://arxiv.org/abs/2604.12632); [DCPO, arXiv:2603.09117](https://arxiv.org/abs/2603.09117)) reports that standard GRPO training makes models *more* overconfident as they get more accurate, which is exactly the wrong direction for high-stakes claims work.
|
| 33 |
+
|
| 34 |
+
---
|
| 35 |
+
|
| 36 |
+
## Submission Artifacts
|
| 37 |
+
|
| 38 |
+
| Artifact | Link |
|
| 39 |
+
|---|---|
|
| 40 |
+
| **Live Environment (HF Space)** | https://huggingface.co/spaces/AniketAsla/debatefloor |
|
| 41 |
+
| **WandB (all runs)** | https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl |
|
| 42 |
+
| **Trained Model** | https://huggingface.co/AniketAsla/debatefloor-grpo-qwen2.5-0.5b-instruct |
|
| 43 |
+
| **Training Notebook (Colab)** | [train/train_debatefloor.ipynb](https://github.com/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb) |
|
| 44 |
+
| **Mini-Blog** | [docs/HFBlogPost.md](https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/docs/HFBlogPost.md) |
|
| 45 |
+
|
| 46 |
+
---
|
| 47 |
+
|
| 48 |
+
## How This Submission Maps to the Judging Rubric
|
| 49 |
+
|
| 50 |
+
| Criterion | Weight | Where to find the evidence |
|
| 51 |
+
|---|---|---|
|
| 52 |
+
| **Environment Innovation** | 40% | The 3×2 calibration matrix (`README` §_The Core Innovation_) is a novel reward shape — it does not exist in any prior insurance-RL work and directly attacks the calibration-degradation problem documented in the CAPO paper (April 2026). The **Court Panel** mechanic forces the agent to expose its reasoning to a programmatic adversary, which is also unexplored territory for RL on LLMs. |
|
| 53 |
+
| **Storytelling & Presentation** | 30% | [`docs/HFBlogPost.md`](docs/HFBlogPost.md) — full mini-blog motivating the problem, walking the reader through one episode end-to-end, and showing the training delta in plain language. README is structured for a 3-minute read with the headline number first. |
|
| 54 |
+
| **Showing Improvement in Rewards** | 20% | [`docs/reward_curve.svg`](docs/reward_curve.svg) — 2,500-step reward curve from a 5,000-episode GRPO run (0.130 → 0.469, 3.6×). [`reports/training_summary.json`](reports/training_summary.json) — raw metrics including full log history. [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — before/after on held-out eval (Decision accuracy 0 → 1.0, Calibration 0 → 1.0). WandB run linked above for reproducibility. |
|
| 55 |
+
| **Reward & Training Pipeline** | 10% | [`app/services/reward.py`](app/services/reward.py) — composable rubric (decision × confidence × evidence × format), not monolithic. [`train/train_minimal.py`](train/train_minimal.py) — TRL GRPO loop that calls the live HTTP env over `requests.Session` (MR-2 compliant, no static dataset). |
|
| 56 |
+
|
| 57 |
+
### Minimum-requirement checklist (for judges)
|
| 58 |
+
|
| 59 |
+
- [x] Built on **OpenEnv `0.2.3`** (latest at submission time) — see `requirements.txt`
|
| 60 |
+
- [x] Working **TRL training script in a Colab notebook** — [`train/train_debatefloor.ipynb`](train/train_debatefloor.ipynb)
|
| 61 |
+
- [x] **Real reward + loss plots** committed to the repo — [`docs/reward_curve.svg`](docs/reward_curve.svg), [`docs/component_shift.svg`](docs/component_shift.svg)
|
| 62 |
+
- [x] **Mini-blog** at [`docs/HFBlogPost.md`](docs/HFBlogPost.md)
|
| 63 |
+
- [x] **OpenEnv-compliant env hosted on HF Spaces** — https://huggingface.co/spaces/AniketAsla/debatefloor
|
| 64 |
+
- [x] **README** motivates the problem, explains the env, and shows results (this file)
|
| 65 |
+
- [x] **`openenv.yaml`** manifest valid — see repo root
|
| 66 |
+
- [x] **Gym-style API** (`reset` / `step` / `state`) and **client/server separation** — see `app/` and `clients/`
|
| 67 |
+
|
| 68 |
+
---
|

## Results

All numbers below are **read directly from committed JSON artifacts** — no
hand-edits, no rounded-up estimates. Source:
[`reports/training_summary.json`](reports/training_summary.json),
[`reports/component_shift_summary.json`](reports/component_shift_summary.json).

### GRPO training — 5,000 episodes, Qwen2.5-0.5B-Instruct

| Config | Value |
|---|---|
| Episodes | 5,000 |
| Epochs | 1 |
| GRPO steps | 2,500 |
| Batch / Generations | 8 / 8 |
| Hardware | L4 GPU (HF Jobs), 3 h 3 min |
| WandB | [Project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) — open the **latest** run named `grpo-qwen0.5b-env-connected` (5K HF Job, Apr 2026). **Canonical curves:** [`docs/reward_curve.svg`](docs/reward_curve.svg) + [`reports/training_summary.json`](reports/training_summary.json) (always match the committed training). |

### Headline result: training reward 0.130 → 0.469 (3.6× improvement)

### Held-out evaluation (6 episodes: 3 tasks × 2 seeds, live HTTP `/step`)

| Component | Before (untrained) | After (GRPO) | Change |
|---|---:|---:|---|
| **Decision accuracy** | 0.000 | **1.000** | **+1.000** |
| **Calibration** | 0.000 | **1.000** | **+1.000** |
| **Fraud detection** | 0.000 | **0.333** | +0.333 |
| Evidence quality | 0.333 | 0.333 | unchanged |
| Reasoning quality | 0.833 | 0.792 | −0.042 (within noise) |

The trained model learned to **make correct decisions with calibrated
confidence**, exactly the skill this environment is designed to teach.
Decision accuracy and calibration both went from zero to perfect on the
six-episode held-out eval set. The small dip in reasoning quality (−0.042)
is within noise at this sample size; if it is real, the model traded a
sliver of fluency for sharper decision-making.

### Training Plots

![Reward curve](https://raw.githubusercontent.com/AniketAslaliya/debateFloor/main/docs/reward_curve.svg)
*Mean training reward across 2,500 GRPO steps (5,000 episodes, 1 epoch).
Reward climbs from 0.130 to 0.469 — a 3.6× improvement. Source:
[`reports/training_summary.json`](reports/training_summary.json).*

![Component shift](https://raw.githubusercontent.com/AniketAslaliya/debateFloor/main/docs/component_shift.svg)
*Before vs after on held-out eval: Decision accuracy 0 → 1.0,
Calibration 0 → 1.0, Fraud detection 0 → 0.33. Source:
[`reports/component_shift_summary.json`](reports/component_shift_summary.json).*

---

## Quick Start for Reviewers (3 minutes)

1. **Open the live UI:** https://huggingface.co/spaces/AniketAsla/debatefloor
2. **Select `contradictory_claim`** and click **Run Episode**.
3. Watch the agent: validate documents → flag fraud signals → **convene the Court Panel (Prosecutor vs Defender)** → declare MED confidence → deny claim.
4. The highlighted cell in the 3×2 matrix shows exactly why it scored what it scored.

---

## What Makes This Novel

- **Training environment, not a benchmark.** Episodes are procedurally generated from seeds — the agent cannot memorise answers.
- **Teaches calibration, not just accuracy.** Overconfident wrong answers are penalised harder than uncertain ones. To our knowledge, no other OpenEnv environment does this.
- **Multi-agent by design.** The final decision is informed by the adversarial **Court Panel** (Prosecutor vs Defender) before the Judge commits. This targets the Fleet AI Scalable Oversight bonus theme.
- **Anti-gaming system.** An agent cannot win by always declaring LOW confidence or always declaring HIGH. It must learn genuine calibration.

---

## Theme Coverage

| Theme | Bonus Prize | What We Built |
|-------|-------------|---------------|
| **Theme 3.1** — World Modeling (Professional) | Scaler AI Labs: Multi-App RL for Enterprise Workflows | 5 fraud types, multi-doc investigation, IRDAI registry, policy history |
| **Theme 1** — Multi-Agent Interactions | Fleet AI: Scalable Oversight | 3-agent Court Panel: Prosecutor + Defender + Judge |
| **Theme 4** — Self-Improvement | Curriculum / difficulty escalation | easy→medium→hard + anti-gaming detector |

---

## The Core Innovation: 3×2 Calibration Matrix

Before every terminal action, the agent must declare a confidence level: **HIGH**, **MED**, or **LOW**. The reward is determined by this matrix:

| Confidence | Correct Decision | Wrong Decision |
|------------|-----------------|----------------|
| **HIGH** | +1.0 | **−0.8** ← worst outcome |
| **MED** | +0.6 | −0.2 |
| **LOW** | +0.1 | 0.0 ← safe |

An agent that always says HIGH to maximise reward is catastrophically punished when wrong. An agent that always says LOW is caught by the anti-gaming system. **The only winning strategy is accurate calibration.**

Based on the [CoCA framework (arXiv:2603.05881)](https://arxiv.org/abs/2603.05881) — co-optimising confidence and accuracy via GRPO.
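
To see why accurate calibration is the only winning strategy, the matrix can be turned into an expected score per confidence level. The sketch below is illustrative only: `expected_score` and `best_confidence` are hypothetical helper names, not functions from the repo, and the calculation ignores the anti-gaming penalty.

```python
# Calibration matrix from the table above: (confidence, decision correct?) -> score.
CALIBRATION_MATRIX = {
    ("HIGH", True): 1.0, ("HIGH", False): -0.8,
    ("MED", True): 0.6, ("MED", False): -0.2,
    ("LOW", True): 0.1, ("LOW", False): 0.0,
}


def expected_score(confidence: str, p_correct: float) -> float:
    """Expected matrix score when the decision is right with probability p_correct."""
    return (p_correct * CALIBRATION_MATRIX[(confidence, True)]
            + (1.0 - p_correct) * CALIBRATION_MATRIX[(confidence, False)])


def best_confidence(p_correct: float) -> str:
    """Confidence level with the highest expected score at a given accuracy."""
    return max(("HIGH", "MED", "LOW"), key=lambda c: expected_score(c, p_correct))


# Hypothetical accuracies, chosen to land in each of the three regimes.
for p in (0.2, 0.5, 0.9):
    level = best_confidence(p)
    print(f"p_correct={p}: declare {level} (expected {expected_score(level, p):.2f})")
```

With these matrix values the break-even points are p = 2/7 (about 0.29) between LOW and MED, and p = 0.6 between MED and HIGH, so the best declaration tracks how likely the decision is to be right.
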

---

## The Court Panel — The Demo Centrepiece

> **To our knowledge, no other environment in the OpenEnv hub has this mechanic.** Run `contradictory_claim` in the live UI to see it unfold.

**The 90-second sequence that wins the storytelling criterion:**

1. Agent validates 3 documents, discovers `date_mismatch` + `cost_inflation` fraud signals.
2. Agent calls `convene_debate_panel` — two sub-agents spin up from the evidence base.
3. **Prosecutor [STRONG]:** *"2 fraud signals, billing 2.4× standard rate — deny."*
4. **Defender [WEAK]:** *"Documents internally consistent, burden of proof requires more."*
5. Panel verdict: **Prosecution substantially outweighs defense.**
6. Agent reads transcript → declares **MED confidence** → `deny_claim` → scores **+0.6**.
7. The calibration matrix highlights `MED × correct`. The reviewer sees exactly why.

```
INVESTIGATOR
├── validate_document      → discovers fraud signals
├── flag_fraud_signal      → formally raises grounded signal
├── query_historical_data  → reveals cross-claim patterns
└── Builds evidence base over N steps
                     ↓
            convene_debate_panel
                     ↓
┌───────────────────┐  ┌────────────────────┐
│ PROSECUTOR        │  │ DEFENDER           │
│ • fraud signals   │  │ • doc consistency  │
│ • Strength: STRONG│  │ • Strength: WEAK   │
└───────────────────┘  └────────────────────┘
                     ↓
       PANEL VERDICT → recommendation
                     ↓
      JUDGE: approve / deny / escalate
        + confidence: HIGH / MED / LOW
        → calibration_score via 3×2 matrix
```

---

## Why This Is the Right RL Task

ClaimCourt satisfies all three properties of a well-designed RL task:

- **Step-by-step:** The agent validates documents, queries history, flags signals, and uses the Court Panel before committing. Each step changes the information state.
- **Programmatically verifiable:** Ground truth is embedded in every generated episode (`staged_accident → deny_claim`). No human labeller needed.
- **Hard enough to matter:** Easy claims are solvable with 2 steps. Hard claims require discovering cross-claim fraud rings across linked sessions. The model must earn its confidence.

---

## The 3 Tasks

| Task | Difficulty | Max Steps | Correct Decision | Expected Confidence |
|------|-----------|-----------|-----------------|---------------------|
| `clean_claim` | Easy | 10 | `approve_claim` | HIGH |
| `contradictory_claim` | Medium | 18 | `deny_claim` | MED |
| `distribution_shift_claim` | Hard | 28 | `escalate_to_human` | LOW |

`distribution_shift_claim` looks clean on the surface. The agent must call `query_linked_claim` or `query_historical_data` to discover cross-claim fraud signals. If the agent declares HIGH confidence, it is **always penalised regardless of decision** — this task is designed to require epistemic humility.

---

## Procedural Generation

A benchmark has fixed episodes. ClaimCourt generates them procedurally:

```python
from server.claim_generator import generate_claim

# Same inputs → same episode (deterministic, reproducible)
episode = generate_claim(seed=42, fraud_type="medical_inflation",
                         coverage_type="health", difficulty="medium")
```

**5 fraud types × 4 coverage types × 3 jurisdictions × seed variation = 500+ unique training episodes**

| Fraud Type | Ground Truth | Key Signal |
|-----------|-------------|------------|
| `staged_accident` | `deny_claim` | Cost mismatch between damage and repair estimate |
| `medical_inflation` | `deny_claim` | Procedure in bill ≠ procedure in discharge summary |
| `identity_fraud` | `deny_claim` | Ghost claimant, policy opened 5 days before incident |
| `coordinated_ring` | `escalate_to_human` | Shared broker across 3–5 simultaneous claims |
| `phantom_provider` | `deny_claim` | Hospital not in IRDAI registry, invalid GST |
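
The "same inputs, same episode" property can be sketched with a privately seeded random generator. This is a toy stand-in, not the repo's `generate_claim`: `sketch_claim` and its output fields are hypothetical, the coverage types other than `"health"` are invented for illustration, and the jurisdiction names are placeholders because the README does not list them.

```python
import itertools
import random

# Fraud types and ground truths come from the table above.
FRAUD_TYPES = ["staged_accident", "medical_inflation", "identity_fraud",
               "coordinated_ring", "phantom_provider"]
GROUND_TRUTH = {fraud_type: "deny_claim" for fraud_type in FRAUD_TYPES}
GROUND_TRUTH["coordinated_ring"] = "escalate_to_human"

# Assumptions: only "health" appears in the README; the rest are placeholders.
COVERAGE_TYPES = ["health", "motor", "property", "travel"]
JURISDICTIONS = ["J1", "J2", "J3"]


def sketch_claim(seed: int, fraud_type: str, coverage_type: str) -> dict:
    """Build a toy episode; identical inputs always produce an identical dict."""
    # A private RNG seeded from the inputs keeps generation reproducible
    # and independent of any global random state.
    rng = random.Random(f"{seed}:{fraud_type}:{coverage_type}")
    return {
        "claim_id": f"CLM-{rng.randrange(10**6):06d}",
        "fraud_type": fraud_type,
        "coverage_type": coverage_type,
        "jurisdiction": rng.choice(JURISDICTIONS),
        "claimed_amount": rng.randrange(10_000, 500_000, 500),
        "ground_truth": GROUND_TRUTH[fraud_type],
    }


# Base combinations before seed variation: 5 × 4 × 3.
combos = list(itertools.product(FRAUD_TYPES, COVERAGE_TYPES, JURISDICTIONS))
print(len(combos))
```

The three categorical axes give 60 base combinations, so passing 500 episodes takes at least 9 seeds per combination (60 × 9 = 540) if every combination is used equally.
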

---

## Reward Design

### Training Reward (use for GRPO — simple scalar for stable gradients)

```python
# 3×2 calibration matrix: (confidence, decision correct?) -> score
calibration_matrix = {
    ("HIGH", True): 1.0, ("HIGH", False): -0.8,
    ("MED", True): 0.6, ("MED", False): -0.2,
    ("LOW", True): 0.1, ("LOW", False): 0.0,
}

def training_reward(decision, confidence, ground_truth, legitimate_flags, step_num, done):
    r = -0.05                                    # step penalty (efficiency)
    if done:
        correct = decision == ground_truth
        r += 1.0 if correct else -0.5            # decision accuracy
        r += 0.3 * min(legitimate_flags, 3)      # fraud signal detection
        r += 0.5 * calibration_matrix[(confidence, correct)]  # calibration bonus
    return r
```
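
A few hand-checkable cases make the reward shape concrete. This sketch restates the function above with the matrix values from the calibration table so that it runs on its own; the scenario inputs (flag counts, step numbers) are made up for illustration, and `step_num` is accepted but unused, as in the snippet above.

```python
CALIBRATION_MATRIX = {
    ("HIGH", True): 1.0, ("HIGH", False): -0.8,
    ("MED", True): 0.6, ("MED", False): -0.2,
    ("LOW", True): 0.1, ("LOW", False): 0.0,
}


def training_reward(decision, confidence, ground_truth, legitimate_flags, step_num, done):
    r = -0.05                                    # step penalty on every step
    if done:
        correct = decision == ground_truth
        r += 1.0 if correct else -0.5            # decision accuracy
        r += 0.3 * min(legitimate_flags, 3)      # fraud signals, capped at 3
        r += 0.5 * CALIBRATION_MATRIX[(confidence, correct)]
    return r


# Hypothetical scenarios: (decision, confidence, ground_truth, flags, step, done)
scenarios = {
    "intermediate step": (None, None, "deny_claim", 0, 3, False),
    "correct + HIGH, 2 flags": ("deny_claim", "HIGH", "deny_claim", 2, 5, True),
    "wrong + HIGH, 0 flags": ("approve_claim", "HIGH", "deny_claim", 0, 5, True),
    "wrong + LOW, 0 flags": ("approve_claim", "LOW", "deny_claim", 0, 5, True),
}
for name, args in scenarios.items():
    print(f"{name}: {training_reward(*args):+.2f}")
```

With no flags raised, a wrong HIGH call totals −0.95 against −0.55 for a wrong LOW call, so overconfidence costs an extra 0.4 on top of the wrong-decision penalty.
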

### Evaluation Reward (for demo and reporting only — do not use for GRPO)

```python
def eval_reward(episode):
    # Assumes `episode` is a mapping that holds the component scores named below.
    return (0.35 * episode["calibration_reward"]   # confidence accuracy
            + 0.25 * episode["escalation_reward"]  # appropriate uncertainty escalation
            + 0.20 * episode["evidence_quality"]   # grounded signal citations
            + 0.10 * episode["efficiency_score"]   # step efficiency
            - 0.10 * episode["gaming_penalty"])    # anti-gaming deduction
```
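
Because the weights are fixed, the range of the evaluation score follows directly from them. The sketch below assumes every component is already normalised to [0, 1], which this README does not state, and the `WEIGHTS` dictionary layout is hypothetical.

```python
# Weights copied from the evaluation reward above; the penalty enters negatively.
WEIGHTS = {
    "calibration_reward": 0.35,
    "escalation_reward": 0.25,
    "evidence_quality": 0.20,
    "efficiency_score": 0.10,
    "gaming_penalty": -0.10,
}


def eval_reward(components: dict) -> float:
    """Weighted sum of component scores (each assumed to lie in [0, 1])."""
    return sum(weight * components[name] for name, weight in WEIGHTS.items())


# Best case: every positive component at 1, no gaming penalty.
best = eval_reward({name: (0.0 if name == "gaming_penalty" else 1.0) for name in WEIGHTS})
# Worst case: every positive component at 0, full gaming penalty.
worst = eval_reward({name: (1.0 if name == "gaming_penalty" else 0.0) for name in WEIGHTS})
print(f"range: [{worst:.2f}, {best:.2f}]")
```

Under that assumption the positive weights sum to 0.90, so the score cannot reach 1.0 without further scaling, and a fully penalised episode bottoms out at −0.10.
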

### Anti-Gaming System

```
if LOW_rate  > 70% across 10+ episodes: penalty = (rate − 0.70) × 2.0
if HIGH_rate > 80% across 10+ episodes: penalty = (rate − 0.80) × 1.5
```
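
The two thresholds above translate into a small function. This is a minimal sketch rather than the grader's actual code: `gaming_penalty` is a hypothetical name, and it assumes the rates are computed over the full list of declared confidences once at least 10 episodes exist (the README does not say whether the window is rolling or cumulative).

```python
from collections import Counter


def gaming_penalty(confidences: list[str], min_episodes: int = 10) -> float:
    """Penalty for declaring LOW or HIGH too often across recent episodes."""
    if len(confidences) < min_episodes:
        return 0.0                      # too little history to judge
    counts = Counter(confidences)
    low_rate = counts["LOW"] / len(confidences)
    high_rate = counts["HIGH"] / len(confidences)

    penalty = 0.0
    if low_rate > 0.70:
        penalty += (low_rate - 0.70) * 2.0
    if high_rate > 0.80:
        penalty += (high_rate - 0.80) * 1.5
    return penalty


print(gaming_penalty(["LOW"] * 10))           # always LOW
print(gaming_penalty(["HIGH"] * 10))          # always HIGH
print(gaming_penalty(["LOW", "MED"] * 5))     # mixed declarations
```

The two conditions cannot fire together, since the rates would have to sum to more than 1. An always-LOW agent is penalised by about 0.6 and an always-HIGH agent by about 0.3, while a mixed history incurs no penalty.
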

---

## Training Pipeline

- **Model:** `Qwen/Qwen2.5-0.5B-Instruct` — open-source, no OpenAI API
- **Algorithm:** HF TRL `GRPOTrainer` + Unsloth 4-bit QLoRA (Group Relative Policy Optimization — same as DeepSeek-R1)
- **Full run:** L4 GPU on HF Jobs — 5,000 episodes, 2,500 steps, 3 h 3 min
- **Quick run:** Free Colab T4 GPU — 100 episodes, ~15 min (see notebook)
- **WandB:** https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl — pick the latest `grpo-qwen0.5b-env-connected` run. Plots in this README come from committed `reports/training_summary.json` (not from a pinned WandB run ID).

```bash
# Reproduce the training run
git clone https://github.com/AniketAslaliya/debateFloor.git && cd debateFloor

# Use the canonical pinned requirements files (every dep verified to
# import inside train_minimal.py and the env server).
pip install -r requirements.txt        # env server deps (FastAPI, openenv-core, ...)
pip install -r train/requirements.txt  # training deps (trl, unsloth, peft, wandb, ...)

# Optional (Colab T4): swap the pinned unsloth for the colab-new wheel
#   pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
#
# If you see: ModuleNotFoundError: No module named 'mergekit' when importing
# GRPOTrainer — you skipped train/requirements.txt. Re-run: pip install -r train/requirements.txt
# (mergekit is required by recent TRL for the GRPO import path.)

PYTHONPATH=. python train/train_minimal.py
```

Or open the Colab notebook: [train/train_debatefloor.ipynb](https://github.com/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb)

Artifacts generated after training:
- `docs/reward_curve.svg`
- `docs/component_shift.svg`
- `reports/training_summary.json`
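
One way to connect a live environment to `GRPOTrainer` is through a custom reward function, which TRL calls with the generated completions and which must return one float per completion. The sketch below is not the code in `train/train_minimal.py`: `make_env_reward`, `score_action`, and the −1.0 fallback for unparseable output are hypothetical, and it assumes each completion is a single JSON-encoded action string.

```python
import json
from typing import Callable


def make_env_reward(score_action: Callable[[dict], float],
                    invalid_reward: float = -1.0) -> Callable[..., list[float]]:
    """Wrap an environment scorer as a TRL-style reward function.

    `score_action` is assumed to submit one action to the environment
    (for example via POST /step) and return the resulting reward.
    """
    def reward_func(completions, **kwargs) -> list[float]:
        rewards = []
        for text in completions:
            try:
                action = json.loads(text)
            except (TypeError, json.JSONDecodeError):
                rewards.append(invalid_reward)   # not parseable as JSON
                continue
            if not isinstance(action, dict):
                rewards.append(invalid_reward)   # JSON, but not an action object
                continue
            rewards.append(float(score_action(action)))
        return rewards

    return reward_func


# Stand-in scorer so the sketch runs without a server.
def fake_score(action: dict) -> float:
    return 0.6 if action.get("action_type") == "deny_claim" else 0.0


reward_func = make_env_reward(fake_score)
print(reward_func(completions=['{"action_type": "deny_claim", "confidence": "MED"}',
                               "not json"]))
```

Keeping the scorer injectable means the same wrapper can be exercised offline with a stub and pointed at the HTTP environment during training.
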

---

## Architecture & Code Map

**ClaimCourt** — after `git clone`, your working directory is `debateFloor/` (GitHub repo name; codename `debatefloor` in HF/WandB URLs).

```
debateFloor/
├── openenv.yaml                  ← OpenEnv spec manifest
├── Dockerfile                    ← HF Space deployment
├── requirements.txt
│
├── app/                          ← FastAPI server (OpenEnv contract)
│   ├── main.py                   ← /reset /step /state /tasks /health /schema
│   ├── environment.py            ← InsuranceClaimEnvironment + Court Panel
│   ├── models.py                 ← Pydantic action/observation models
│   └── tasks.py                  ← task definitions
│
├── server/                       ← ClaimCourt core
│   ├── calibration_grader.py     ← 3×2 matrix + anti-gaming + training/eval reward
│   └── claim_generator.py        ← procedural episode generator (500+ episodes)
│
├── train/
│   ├── train_minimal.py          ← Pure TRL GRPOTrainer, T4 in 15 min
│   └── train_debatefloor.ipynb   ← Colab notebook (dynamic wrapper)
│
├── docs/
│   ├── reward_curve.svg          ← training reward curve (embedded above)
│   ├── component_shift.svg       ← before/after component scores (embedded above)
│   └── HFBlogPost.md             ← writeup
│
└── reports/
    ├── training_summary.json
    └── component_shift_summary.json
```

---

## Quickstart

### Run locally

```bash
git clone https://github.com/AniketAslaliya/debateFloor.git
cd debateFloor
pip install -r requirements.txt
PYTHONPATH=. uvicorn app.main:app --host 0.0.0.0 --port 7860 --reload
```

### Run with Docker

```bash
docker build -t claimcourt .
docker run -p 7860:7860 claimcourt
```

---

## API Reference

All endpoints follow the OpenEnv REST contract:

| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/reset` | Start new episode. Accepts `task_id`, `seed`, `session_id`. |
| `POST` | `/step` | Submit action. Requires `session_id` and `action` body. |
| `GET` | `/state` | Current episode state. |
| `GET` | `/tasks` | Lists all tasks with objectives. |
| `GET` | `/schema` | JSON schema for action/observation/state. |
| `GET` | `/health` | Returns `{"status": "healthy", "active_sessions": N}`. |

### Example Episode

```python
import requests

BASE = "https://aniketasla-debatefloor.hf.space"

r = requests.post(f"{BASE}/reset", json={"task_id": "contradictory_claim", "seed": 42})
session_id = r.json()["session_id"]

def step(action):
    return requests.post(f"{BASE}/step", json={"action": action, "session_id": session_id}).json()

step({"action_type": "validate_document", "parameters": {"doc_id": "DOC-001"}, "reasoning": "check bill"})
step({"action_type": "flag_fraud_signal", "parameters": {"flag_id": "procedure_mismatch",
      "evidence": "discharge says appendectomy, bill says cardiac bypass"}, "reasoning": "billing fraud"})

resp = step({"action_type": "deny_claim", "confidence": "MED", "reason": "procedure mismatch confirmed"})
print(f"Reward: {resp['reward']}")
print(f"Calibration: {resp['observation']['reward_breakdown']['calibration_score']}")
```
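
For scripted runs, the reset-then-step pattern above can be wrapped in a helper that takes the HTTP transport as an argument, which keeps the episode logic testable without a live server. This is a sketch under assumptions: `run_episode` and `http_post` are hypothetical names, the 30-second timeout is arbitrary, and the `done` flag in the `/step` response is assumed rather than documented here.

```python
from typing import Callable

# (path, json_payload) -> decoded JSON response
Post = Callable[[str, dict], dict]


def run_episode(post: Post, task_id: str, seed: int, actions: list[dict]) -> list[dict]:
    """Reset once, then submit actions in order, reusing the returned session_id."""
    session_id = post("/reset", {"task_id": task_id, "seed": seed})["session_id"]
    responses = []
    for action in actions:
        resp = post("/step", {"action": action, "session_id": session_id})
        responses.append(resp)
        if resp.get("done"):            # assumed field: stop once the episode ends
            break
    return responses


def http_post(base_url: str) -> Post:
    """Build a transport backed by requests.Session with a timeout and error check."""
    import requests                     # imported lazily so run_episode has no hard dependency

    session = requests.Session()

    def post(path: str, payload: dict) -> dict:
        r = session.post(f"{base_url}{path}", json=payload, timeout=30)
        r.raise_for_status()
        return r.json()

    return post
```

Against the hosted Space this would be called as `run_episode(http_post(BASE), "contradictory_claim", 42, actions)`, where `actions` is a list of action dictionaries shaped like those in the example above.
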

---

## OpenEnv Spec Compliance

| Requirement | Status |
|-------------|--------|
| `spec_version: 1` | ✅ |
| OpenEnv `Environment` base class | ✅ |
| `/reset`, `/step`, `/state`, `/tasks`, `/health`, `/schema` | ✅ |
| `supports_concurrent_sessions: true` | ✅ |
| `max_concurrent_envs: 64` | ✅ |
| `confidence_required: true` | ✅ |
| `procedural_generation: true` | ✅ |
| `episode_pool_size: 500` | ✅ |
| Reward in `[0.0, 1.0]` | ✅ |
| Docker deployment | ✅ |

---

## Team

- **Aniket Aslaliya** — Environment Core, Claim Generator, Calibration Grader, UI
- **Mitali Mehta** — Domain Knowledge (Fraud types, IRDAI regulations), Grader Design
- **Aditya Sharma** — Training Pipeline, GRPO Notebook, WandB Integration

---

## Citation

```bibtex
@article{coca2026,
  title={Co-optimizing Confidence and Accuracy via Segment-Specific GRPO Rewards},
  author={...},
  journal={arXiv:2603.05881},
  year={2026}
}
```

**Related:**
- CAPO paper ([arXiv:2604.12632](https://arxiv.org/abs/2604.12632), April 2026): reports that standard GRPO training induces overconfidence, the failure mode ClaimCourt is built to train against
- OpenEnv: [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)
- TRL GRPOTrainer: [huggingface.co/docs/trl/grpo_trainer](https://huggingface.co/docs/trl/grpo_trainer)