---

title: "ClaimCourt: Insurance Calibration RL Environment"
emoji: ⚖️
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: true
---


# ClaimCourt — Insurance Calibration RL Environment

> *Repository codename `debatefloor` — all GitHub, Hugging Face Space, and model-repo URLs use the original codename so existing links resolve. The product is **ClaimCourt** everywhere it faces a human reader.*

[![Live Demo](https://img.shields.io/badge/Live%20Demo-Hugging%20Face-orange)](https://huggingface.co/spaces/AniketAsla/debatefloor)
[![Demo Video](https://img.shields.io/badge/Demo%20Video-YouTube-red)](https://www.youtube.com/watch?v=Uk8sSLywEpE)
[![Based on CAPO](https://img.shields.io/badge/Based%20on-CAPO%20arXiv%3A2604.12632-red)](https://arxiv.org/abs/2604.12632)
[![WandB](https://img.shields.io/badge/WandB-Run%20workspace-yellow)](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb)

> An [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compliant RL training environment where AI agents investigate insurance claims, argue in an adversarial **Court Panel**, and must declare **calibrated confidence** before every terminal decision.
> Built for the **Meta PyTorch × Scaler OpenEnv Hackathon Grand Finale, April 25–26 2026**.

> ### 🎯 Headline result — Calibration score 0.000 → 1.000 on held-out claims
>
> Across a 5,000-episode GRPO run on Qwen2.5-0.5B-Instruct, the trained agent's confidence now matches its correctness on *every* held-out terminal action — directly attacking the GRPO overconfidence pathology documented in [CAPO (arXiv:2604.12632)](https://arxiv.org/abs/2604.12632). Decision accuracy moved 0.000 → 1.000 on the same eval. Both numbers read straight from [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — no hand-edits.

---

## 📺 Watch the 2-minute demo first

▶️ **[YouTube — ClaimCourt demo](https://www.youtube.com/watch?v=Uk8sSLywEpE)** — see the full Court Panel sequence and the calibration matrix lighting up in real time.

## All artifacts in one table

| Artifact | Link |
|---|---|
| **Demo Video (≤2 min)** | https://www.youtube.com/watch?v=Uk8sSLywEpE |
| **Live Environment (HF Space)** | https://huggingface.co/spaces/AniketAsla/debatefloor |
| **Mini-Blog (full writeup)** | [BLOG.md](BLOG.md) |
| **Trained Model** | https://huggingface.co/AniketAsla/debatefloor-grpo-qwen2.5-0.5b-instruct |
| **Training Notebook (Colab)** | [train/train_debatefloor.ipynb](train/train_debatefloor.ipynb) |
| **WandB workspace** (training curves) | https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl |
| **GitHub repo** | https://github.com/AniketAslaliya/debateFloor |

---

## Hackathon theme alignment

| Theme | Fit | Why |
|---|---|---|
| **#3.1 World Modeling — Professional Tasks** *(primary)* | ✅✅✅ | Insurance claim adjudication is a textbook enterprise workflow: 11 investigative tool-calls (`validate_document`, `query_historical_data`, `verify_identity`, …), partially observable state where fraud signals are hidden until queried, multi-step orchestration (10–28 steps per episode), and an explicit **anti-gaming detector** that prevents shortcuts. The agent must maintain consistent internal state, update beliefs as evidence arrives, and orchestrate the workflow toward a calibrated terminal decision. |
| **#1 Multi-Agent Interactions** *(secondary)* | ✅✅ | The **Court Panel** (`convene_debate_panel`) spins up an adversarial Prosecutor + Defender pair from the existing evidence base. The Judge must model both adversaries' incentives and weigh their competing arguments before declaring HIGH/MED/LOW confidence — Fleet AI Scalable Oversight applied to claims work. |

---

## 1. The Problem

LLMs in high-stakes domains suffer from a documented failure mode: **overconfidence**. A model that approves or denies an insurance claim with 100 % certainty — but is wrong — causes real harm. The [CAPO paper (arXiv:2604.12632, 2026)](https://arxiv.org/abs/2604.12632) measures up to a 15 % AUC drop in standard GRPO training. [DCPO (arXiv:2603.09117, 2026)](https://arxiv.org/abs/2603.09117) shows a 71 % Expected-Calibration-Error reduction is achievable when calibration is treated as a first-class objective.

**Why this matters now.** Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore every year** ([BCG × Medi Assist, Nov 2025](https://www.business-standard.com/industry/news/insurance-fwa-drains-rs10000cr-each-year-bcg-mediassist-report-125112101199_1.html)) — about 8 % of all claim payouts. From April 2026, the [IRDAI Insurance Fraud Monitoring Framework Guidelines, 2025](https://irdai.gov.in/) make every insurer legally responsible for catching it. The obvious tool is AI — but standard RL training pushes models in exactly the wrong direction.

**ClaimCourt is the direct fix.** A reward surface that penalises overconfident wrong answers more severely than uncertain ones, teaching models *when* to be confident, not just what to say.

---

## 2. The Environment — what the agent sees, does, and gets rewarded for

### What the agent sees
A claim object: claimant identity, incident details, policy history, attached documents, and the list of available actions. After every action: updated documents, discovered fraud signals, action history, and a partial reward breakdown.
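The observation described above can be sketched as a dataclass; the field names and example values here are illustrative assumptions, not the environment's actual schema:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ClaimObservation:
    """Hypothetical shape of the per-step observation (field names illustrative)."""
    claimant_identity: dict[str, Any]
    incident_details: dict[str, Any]
    policy_history: list[dict[str, Any]]
    documents: list[dict[str, Any]]
    available_actions: list[str]
    fraud_signals: list[str] = field(default_factory=list)    # discovered so far
    action_history: list[str] = field(default_factory=list)
    reward_breakdown: dict[str, float] = field(default_factory=dict)

obs = ClaimObservation(
    claimant_identity={"name": "A. Claimant"},
    incident_details={"type": "vehicle_damage"},
    policy_history=[],
    documents=[{"id": "doc-1", "validated": False}],
    available_actions=["validate_document", "approve_claim"],
)
```

After each action, the environment would update `documents`, `fraud_signals`, `action_history`, and `reward_breakdown` in place of returning a fresh static claim.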

### What the agent does
The agent investigates step-by-step before committing.

| Action class | Examples | Confidence required? |
|---|---|---|
| **Investigative** | `validate_document`, `flag_fraud_signal`, `query_historical_data`, `query_linked_claim`, `verify_identity`, `convene_debate_panel` | No |
| **Terminal** | `approve_claim`, `deny_claim`, `escalate_to_human`, `request_investigation` | **Yes — HIGH / MED / LOW** |

### What the agent gets rewarded for — the 3×2 Calibration Matrix (the core innovation)

Before every terminal action, the agent must declare a confidence level. The reward is determined by this matrix:

| Confidence | Correct Decision | Wrong Decision |
|---|---|---|
| **HIGH** | **+1.0** | **−0.8** ← worst outcome |
| **MED**  | +0.6     | −0.2 |
| **LOW**  | +0.1     | 0.0 ← safe |

An agent that always says HIGH to maximise reward is catastrophically punished when wrong. An agent that always says LOW is caught by the **anti-gaming detector** (LOW-rate > 70 % across 10+ episodes triggers a progressive penalty). **The only winning strategy is accurate calibration** — based on the [CoCA framework (arXiv:2603.05881)](https://arxiv.org/abs/2603.05881).
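The matrix and the anti-gaming rule can be sketched in a few lines. The matrix values come straight from the table above; `low_gaming_penalty` and its linear penalty schedule are illustrative assumptions about how a "progressive penalty" might be implemented:

```python
# Reward lookup for the 3x2 calibration matrix (values from the table above).
CALIBRATION_MATRIX = {
    ("HIGH", True): 1.0,  ("HIGH", False): -0.8,   # overconfident + wrong = worst
    ("MED",  True): 0.6,  ("MED",  False): -0.2,
    ("LOW",  True): 0.1,  ("LOW",  False):  0.0,   # cautious + wrong = safe
}

def terminal_reward(confidence: str, correct: bool) -> float:
    """Reward for a terminal action given declared confidence and correctness."""
    return CALIBRATION_MATRIX[(confidence, correct)]

def low_gaming_penalty(confidence_history: list[str], window: int = 10,
                       low_rate_threshold: float = 0.7) -> float:
    """Penalty once the LOW rate exceeds 70% across 10+ episodes.
    The exact progressive schedule is an assumption."""
    if len(confidence_history) < window:
        return 0.0
    low_rate = confidence_history.count("LOW") / len(confidence_history)
    if low_rate <= low_rate_threshold:
        return 0.0
    return -(low_rate - low_rate_threshold)   # grows as LOW-spamming worsens
```

Under this surface, always-HIGH loses badly on wrong answers and always-LOW accumulates the gaming penalty, so expected reward is maximised only when declared confidence tracks actual correctness.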

### The Court Panel — the demo centrepiece

> **No other environment in the OpenEnv hub has this mechanic.** Run `contradictory_claim` in the live Space to see it.



When evidence is mixed, the agent calls `convene_debate_panel`. Two adversarial sub-agents spin up from the existing evidence base:



```
INVESTIGATOR
├── validate_document      → discovers fraud signals
├── query_historical_data  → reveals cross-claim patterns
└── convene_debate_panel

┌──────────────────┐    ┌──────────────────┐
│  PROSECUTOR      │    │  DEFENDER        │
│  Strength: STRONG│    │  Strength: WEAK  │
└──────────────────┘    └──────────────────┘

    PANEL VERDICT → recommendation

    JUDGE: approve / deny / escalate
    + confidence: HIGH / MED / LOW
```


The Court Panel forces the agent to expose its reasoning to a programmatic adversary before declaring confidence — Fleet AI Scalable Oversight, applied to claims work.

### Procedural generation
Episodes are generated procedurally — **5 fraud types × 4 coverage types × 3 jurisdictions × seed variation = 500+ unique episodes**. Same seed → same episode, so reviewers can reproduce exactly what the model saw.
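A minimal sketch of how such deterministic generation could work. Only the 5 × 4 × 3 grid and the same-seed guarantee come from the text; the factor names and claim fields are placeholders:

```python
import random
from itertools import product

FRAUD_TYPES    = [f"fraud_{i}" for i in range(5)]    # 5 fraud types (names illustrative)
COVERAGE_TYPES = [f"cover_{i}" for i in range(4)]    # 4 coverage types
JURISDICTIONS  = [f"juris_{i}" for i in range(3)]    # 3 jurisdictions

def generate_episode(seed: int) -> dict:
    """Same seed -> identical episode, so a reviewer can replay what the model saw."""
    rng = random.Random(seed)                        # isolated RNG, no global state
    fraud, coverage, juris = rng.choice(
        list(product(FRAUD_TYPES, COVERAGE_TYPES, JURISDICTIONS)))
    return {
        "fraud_type": fraud,
        "coverage": coverage,
        "jurisdiction": juris,
        "claim_amount": rng.randint(1_000, 100_000),
    }
```

The 5 × 4 × 3 grid gives 60 base scenarios; seed variation over claim details is what pushes the episode count past 500.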

---

## 3. Results

### 5,000-episode GRPO run — Qwen2.5-0.5B-Instruct, L4 GPU, 3 h 3 min

All numbers below are read directly from committed JSON artifacts ([`reports/training_summary.json`](reports/training_summary.json), [`reports/component_shift_summary.json`](reports/component_shift_summary.json)) — no hand-edits.

#### Three headline numbers

- **Training reward: 0.130 → 0.469 (3.6× improvement)** across 2,500 GRPO steps
- **Held-out decision accuracy: 0.000 → 1.000** — the trained model gets every held-out claim right
- **Held-out calibration score: 0.000 → 1.000** — confidence now matches correctness on every terminal action

#### Held-out evaluation (n = 6 episodes: 3 tasks × 2 seeds, live HTTP `/step`)

| Component | Before (untrained) | After (GRPO) | Change |
|---|---:|---:|---|
| **Decision accuracy** | 0.000 | **1.000** | **+1.000** |
| **Calibration**       | 0.000 | **1.000** | **+1.000** |
| **Fraud detection**   | 0.000 | **0.333** | +0.333 |
| Evidence quality      | 0.333 | 0.333     | unchanged |
| Reasoning quality     | 0.833 | 0.792     | −0.042 (within noise) |

#### Training plots

![Reward Curve](docs/reward_curve.png)
*X: training step. Y-left: training loss. Y-right: mean reward (training scalar, unbounded). Mean training reward climbs from 0.130 → 0.469 across 2,500 GRPO steps (5,000 episodes, 1 epoch).*

![Component Shift](docs/component_shift.png)
*Before vs after on held-out eval (n=6 episodes). Decision accuracy 0 → 1.0, Calibration 0 → 1.0, Fraud detection 0 → 0.33. The lift on the trained metrics is unmissable.*

The trained model learned to **make correct decisions with calibrated confidence** — exactly the skill this environment is designed to teach. The small dip in reasoning quality (−4 pts) is the only trade-off: the model traded a sliver of fluency for sharper decision-making.

---

## 4. Why It Matters

Calibration failure is universal. Every high-stakes domain where an AI must know the limits of its own knowledge has it: medical diagnosis, legal analysis, financial advice, autonomous systems. ClaimCourt is a **blueprint for training epistemic humility into LLMs at the reward level, not the prompt level.**

Insurance is just the first domain. The 3×2 matrix transfers anywhere a model must commit a binary decision and own the confidence behind it.

---

## Quick Start for Reviewers (3 minutes)

1. **Watch the demo** — [▶️ YouTube (≤2 min)](https://www.youtube.com/watch?v=Uk8sSLywEpE)
2. **Open the live UI** — https://huggingface.co/spaces/AniketAsla/debatefloor
3. **Select `contradictory_claim`** → click **Run Episode**.

4. Watch the agent: validate documents → flag fraud signals → **convene the Court Panel** → declare MED confidence → deny claim.

5. The highlighted cell in the 3×2 matrix shows exactly why it scored what it scored.



### Reproduce the training



```bash
git clone https://github.com/AniketAslaliya/debateFloor.git && cd debateFloor
pip install -r requirements.txt          # env server deps
pip install -r train/requirements.txt    # training deps (trl, unsloth, peft, wandb)
PYTHONPATH=. python train/train_minimal.py
```



Or open the [Colab notebook](https://colab.research.google.com/github/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb).



---



## Repo layout



```
debateFloor/
├── openenv.yaml              ← OpenEnv spec manifest (5 tasks, 3×2 matrix)
├── Dockerfile                ← HF Space deployment
├── BLOG.md                   ← Mini-blog (full writeup)
├── app/                      ← FastAPI server: /reset /step /state /tasks /health /schema
├── server/                   ← ClaimCourt core: calibration_grader.py, claim_generator.py
├── train/                    ← train_minimal.py + Colab notebook
├── docs/                     ← reward_curve.png, component_shift.png
└── reports/                  ← training_summary.json, component_shift_summary.json
```



OpenEnv compliance: `spec_version: 1`, OpenEnv `Environment` base class, `/reset` `/step` `/state` `/tasks` `/health` `/schema`, `supports_concurrent_sessions: true`, `max_concurrent_envs: 64`, reward in `[0.0, 1.0]`, Docker deployment — full manifest in [`openenv.yaml`](openenv.yaml).



---



## Team



- **Aniket Aslaliya** — Environment Core, Claim Generator, Calibration Grader, UI

- **Mitali Mehta** — Domain Knowledge (fraud types, IRDAI regulations), Grader Design

- **Aditya Sharma** — Training Pipeline, GRPO Notebook, WandB Integration



---



## References



- **CAPO** — Confidence-Aware Policy Optimization ([arXiv:2604.12632](https://arxiv.org/abs/2604.12632), 2026) — documents the GRPO overconfidence pathology ClaimCourt fixes

- **DCPO** — Distribution-Calibrated Policy Optimization ([arXiv:2603.09117](https://arxiv.org/abs/2603.09117), 2026) — proves calibration improvement is achievable as a training objective

- **CoCA** — Co-optimising Confidence and Accuracy via segment-specific GRPO rewards ([arXiv:2603.05881](https://arxiv.org/abs/2603.05881))

- **OpenEnv** — [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)

- **TRL `GRPOTrainer`** — [huggingface.co/docs/trl/grpo_trainer](https://huggingface.co/docs/trl/grpo_trainer)