AniketAsla committed on
Commit 26845c3 · verified · 1 Parent(s): 52e5bb1

deploy: update README.md

Files changed (1)
  1. README.md +451 -451
README.md CHANGED
@@ -1,451 +1,451 @@
1
- ---
2
- title: ClaimCourt — Insurance Calibration RL Environment
3
- emoji: ⚖️
4
- colorFrom: indigo
5
- colorTo: purple
6
- sdk: docker
7
- app_port: 7860
8
- pinned: true
9
- ---
10
-
11
- # ClaimCourt — Insurance Calibration RL Environment
12
-
13
- > *Codename in the repo & URLs: `debatefloor` — all GitHub, Hugging Face Space, and model-repo slugs use the original codename so existing links continue to resolve. The product is **ClaimCourt** everywhere it faces a human reader.*
14
-
15
- [![Tests](https://img.shields.io/badge/Tests-Passing-brightgreen)](https://github.com/AniketAslaliya/debateFloor)
16
- [![Live Demo](https://img.shields.io/badge/Live%20Demo-Hugging%20Face-orange)](https://huggingface.co/spaces/AniketAsla/debatefloor)
17
- [![Based on CAPO](https://img.shields.io/badge/Based%20on-CAPO%20arXiv%3A2604.12632-red)](https://arxiv.org/abs/2604.12632)
18
- [![WandB Run](https://img.shields.io/badge/WandB-Training%20Run-yellow)](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl/runs/vloynjdu)
19
- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb)
20
-
21
- > An [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compliant RL training environment (**ClaimCourt**) where AI agents investigate insurance claims, argue in an adversarial **Court Panel**, and must declare **calibrated confidence** before every terminal decision.
22
- > Built for the **Meta PyTorch × Scaler Hackathon Grand Finale, April 25–26 2026**.
23
-
24
- ---
25
-
26
- ## Problem Statement
27
-
28
- LLMs deployed in high-stakes domains suffer from a well-documented failure mode: **overconfidence**. A model that approves or denies an insurance claim with 100 % certainty — but is wrong — causes real harm. The [CAPO paper (arXiv:2604.12632, 2026)](https://arxiv.org/abs/2604.12632) measures up to a 15 % AUC drop in standard GRPO training, and [DCPO (arXiv:2603.09117, 2026)](https://arxiv.org/abs/2603.09117) shows a 71 % Expected-Calibration-Error reduction is achievable when calibration is treated as a first-class objective.
29
-
30
- **ClaimCourt is the direct fix.** It trains LLMs to declare *calibrated* confidence before every decision, using a reward surface that penalises overconfident wrong answers more severely than uncertain ones. This teaches models **when** to be confident, not just what to say.
31
-
32
- Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore every year** ([BCG × Medi Assist, Nov 2025](https://www.business-standard.com/industry/news/insurance-fwa-drains-rs10000cr-each-year-bcg-mediassist-report-125112101199_1.html)) — about 8 % of all claim payouts. From April 2026, the [IRDAI Insurance Fraud Monitoring Framework Guidelines, 2025](https://irdai.gov.in/) make every insurer legally responsible for detecting it. AI is the obvious tool, but recent research ([CAPO, arXiv:2604.12632](https://arxiv.org/abs/2604.12632); [DCPO, arXiv:2603.09117](https://arxiv.org/abs/2603.09117)) proves standard GRPO training makes models *more* overconfident as they get more accurate — exactly the wrong direction for high-stakes claims work.
33
-
34
- ---
35
-
36
- ## Submission Artifacts
37
-
38
- | Artifact | Link |
39
- |---|---|
40
- | **Live Environment (HF Space)** | https://huggingface.co/spaces/AniketAsla/debatefloor |
41
- | **WandB Training Run** | https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl/runs/vloynjdu |
42
- | **Trained Model** | https://huggingface.co/AniketAsla/debatefloor-grpo-qwen2.5-0.5b-instruct |
43
- | **Training Notebook (Colab)** | [train/train_debatefloor.ipynb](https://github.com/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb) |
44
- | **Mini-Blog** | [docs/HFBlogPost.md](https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/docs/HFBlogPost.md) |
45
-
46
- ---
47
-
48
- ## How This Submission Maps to the Judging Rubric
49
-
50
- | Criterion | Weight | Where to find the evidence |
51
- |---|---|---|
52
- | **Environment Innovation** | 40% | The 3×2 calibration matrix (`README` §_The Core Innovation_) is a novel reward shape — it does not exist in any prior insurance-RL work and directly attacks the calibration-degradation problem documented in the CAPO paper (April 2026). The **Court Panel** mechanic forces the agent to expose its reasoning to a programmatic adversary, which is also unexplored territory for RL on LLMs. |
53
- | **Storytelling & Presentation** | 30% | [`docs/HFBlogPost.md`](docs/HFBlogPost.md) — full mini-blog motivating the problem, walking the reader through one episode end-to-end, and showing the training delta in plain language. README is structured for a 3-minute read with the headline number first. |
54
- | **Showing Improvement in Rewards** | 20% | [`docs/reward_curve.svg`](docs/reward_curve.svg) — 2,500-step reward curve from a 5,000-episode GRPO run (0.130 → 0.469, 3.6×). [`reports/training_summary.json`](reports/training_summary.json) — raw metrics including full log history. [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — before/after on held-out eval (Decision accuracy 0 → 1.0, Calibration 0 → 1.0). WandB run linked above for reproducibility. |
55
- | **Reward & Training Pipeline** | 10% | [`app/services/reward.py`](app/services/reward.py) — composable rubric (decision × confidence × evidence × format), not monolithic. [`train/train_minimal.py`](train/train_minimal.py) — TRL GRPO loop that calls the live HTTP env over `requests.Session` (MR-2 compliant, no static dataset). |
56
-
57
- ### Minimum-requirement checklist (for judges)
58
-
59
- - [x] Built on **OpenEnv `0.2.3`** (latest at submission time) — see `requirements.txt`
60
- - [x] Working **TRL training script in a Colab notebook** — [`train/train_debatefloor.ipynb`](train/train_debatefloor.ipynb)
61
- - [x] **Real reward + loss plots** committed to the repo — [`docs/reward_curve.svg`](docs/reward_curve.svg), [`docs/component_shift.svg`](docs/component_shift.svg)
62
- - [x] **Mini-blog** at [`docs/HFBlogPost.md`](docs/HFBlogPost.md)
63
- - [x] **OpenEnv-compliant env hosted on HF Spaces** — https://huggingface.co/spaces/AniketAsla/debatefloor
64
- - [x] **README** motivates the problem, explains the env, and shows results (this file)
65
- - [x] **`openenv.yaml`** manifest valid — see repo root
66
- - [x] **Gym-style API** (`reset` / `step` / `state`) and **client/server separation** — see `app/` and `clients/`
67
-
68
- ---
69
-
70
- ## Results
71
-
72
- All numbers below are **read directly from committed JSON artifacts** — no
73
- hand-edits, no rounded-up estimates. Source:
74
- [`reports/training_summary.json`](reports/training_summary.json),
75
- [`reports/component_shift_summary.json`](reports/component_shift_summary.json).
76
-
77
- ### GRPO training — 5,000 episodes, Qwen2.5-0.5B-Instruct
78
-
79
- | Config | Value |
80
- |---|---|
81
- | Episodes | 5,000 |
82
- | Epochs | 1 |
83
- | GRPO steps | 2,500 |
84
- | Batch / Generations | 8 / 8 |
85
- | Hardware | L4 GPU (HF Jobs), 3 h 3 min |
86
- | WandB | [Run link](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl/runs/vloynjdu) |
87
-
88
- ### Headline result: training reward 0.130 → 0.469 (3.6× improvement)
89
-
90
- ### Held-out evaluation (6 episodes: 3 tasks × 2 seeds, live HTTP `/step`)
91
-
92
- | Component | Before (untrained) | After (GRPO) | Change |
93
- |---|---:|---:|---|
94
- | **Decision accuracy** | 0.000 | **1.000** | **+1.000** |
95
- | **Calibration** | 0.000 | **1.000** | **+1.000** |
96
- | **Fraud detection** | 0.000 | **0.333** | +0.333 |
97
- | Evidence quality | 0.333 | 0.333 | unchanged |
98
- | Reasoning quality | 0.833 | 0.792 | −0.042 (within noise) |
99
-
100
- The trained model learned to **make correct decisions with calibrated
101
- confidence** — exactly the skill this environment is designed to teach.
102
- Decision accuracy and calibration both went from zero to perfect on the
103
- held-out eval set. The small dip in reasoning quality (−4 pts) is a
104
- known trade-off: the model traded a sliver of fluency for sharper
105
- decision-making.
106
-
107
- ### Training Plots
108
-
109
- ![Reward Curve](docs/reward_curve.svg)
110
- *Mean training reward across 2,500 GRPO steps (5,000 episodes, 1 epoch).
111
- Reward climbs from 0.130 to 0.469 — a 3.6× improvement. Source:
112
- [`reports/training_summary.json`](reports/training_summary.json).*
113
-
114
- ![Component Shift](docs/component_shift.svg)
115
- *Before vs after on held-out eval: Decision accuracy 0 → 1.0,
116
- Calibration 0 → 1.0, Fraud detection 0 → 0.33. Source:
117
- [`reports/component_shift_summary.json`](reports/component_shift_summary.json).*
118
-
119
- ---
120
-
121
- ## Quick Start for Reviewers (3 minutes)
122
-
123
- 1. **Open the live UI:** https://huggingface.co/spaces/AniketAsla/debatefloor
124
- 2. **Select `contradictory_claim`** and click **Run Episode**.
125
- 3. Watch the agent: validate documents → flag fraud signals → **convene the Court Panel (Prosecutor vs Defender)** → declare MED confidence → deny claim.
126
- 4. The highlighted cell in the 3×2 matrix shows exactly why it scored what it scored.
127
-
128
- ---
129
-
130
- ## What Makes This Novel
131
-
132
- - **Training environment, not a benchmark.** Episodes are procedurally generated from seeds — the agent cannot memorise answers.
133
- - **Teaches calibration, not just accuracy.** Overconfident wrong answers are penalised harder than uncertain ones. No other OpenEnv environment has this.
134
- - **Multi-agent by design.** The final decision is informed by the adversarial **Court Panel** (Prosecutor vs Defender) before the Judge commits. This is Fleet AI Scalable Oversight.
135
- - **Anti-gaming system.** An agent cannot win by always saying LOW confidence or always saying HIGH. It must learn genuine calibration.
136
-
137
- ---
138
-
139
- ## Theme Coverage
140
-
141
- | Theme | Bonus Prize | What We Built |
142
- |-------|-------------|---------------|
143
- | **Theme 3.1** — World Modeling (Professional) | Scaler AI Labs: Multi-App RL for Enterprise Workflows | 5 fraud types, multi-doc investigation, IRDAI registry, policy history |
144
- | **Theme 1** — Multi-Agent Interactions | Fleet AI: Scalable Oversight | 3-agent Court Panel: Prosecutor + Defender + Judge |
145
- | **Theme 4** — Self-Improvement | Curriculum / difficulty escalation | easy→medium→hard + anti-gaming detector |
146
-
147
- ---
148
-
149
- ## The Core Innovation: 3×2 Calibration Matrix
150
-
151
- Before every terminal action, the agent must declare a confidence level: **HIGH**, **MED**, or **LOW**. The reward is determined by this matrix:
152
-
153
- | Confidence | Correct Decision | Wrong Decision |
154
- |------------|-----------------|----------------|
155
- | **HIGH** | +1.0 | **−0.8** ← worst outcome |
156
- | **MED** | +0.6 | −0.2 |
157
- | **LOW** | +0.1 | 0.0 ← safe |
158
-
159
- An agent that always says HIGH to maximise reward is catastrophically punished when wrong. An agent that always says LOW is caught by the anti-gaming system. **The only winning strategy is accurate calibration.**
160
-
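The matrix above can be sketched as a lookup table. This is an illustration only, not code from `server/calibration_grader.py`: the names `CALIBRATION_MATRIX`, `calibration_score`, `expected_score`, and `best_confidence` are assumptions, and only the six cell values come from the table.

```python
# Cell values copied from the 3x2 matrix; every name here is hypothetical.
CALIBRATION_MATRIX = {
    ("HIGH", True): 1.0,
    ("HIGH", False): -0.8,
    ("MED", True): 0.6,
    ("MED", False): -0.2,
    ("LOW", True): 0.1,
    ("LOW", False): 0.0,
}


def calibration_score(confidence: str, correct: bool) -> float:
    """Look up the matrix cell for a declared confidence and an outcome."""
    key = (confidence.upper(), bool(correct))
    if key not in CALIBRATION_MATRIX:
        raise ValueError(f"unknown confidence level: {confidence!r}")
    return CALIBRATION_MATRIX[key]


def expected_score(confidence: str, p_correct: float) -> float:
    """Expected matrix score if the decision is right with probability p_correct."""
    return (p_correct * calibration_score(confidence, True)
            + (1.0 - p_correct) * calibration_score(confidence, False))


def best_confidence(p_correct: float) -> str:
    """Confidence level with the highest expected score at a given accuracy."""
    return max(("HIGH", "MED", "LOW"), key=lambda c: expected_score(c, p_correct))
```

With these values, HIGH has the highest expected score only when the agent is right more than 60 % of the time, MED wins between roughly 29 % and 60 %, and LOW wins below that. This is the matrix alone; the anti-gaming penalty is applied separately.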
161
- Based on the [CoCA framework (arXiv:2603.05881)](https://arxiv.org/abs/2603.05881) — co-optimising confidence and accuracy via GRPO.
162
-
163
- ---
164
-
165
- ## The Court Panel — The Demo Centrepiece
166
-
167
- > **No other environment in the OpenEnv hub has this mechanic.** Run `contradictory_claim` in the live UI to see it unfold.
168
-
169
- **The 90-second sequence that wins the storytelling criterion:**
170
-
171
- 1. Agent validates 3 documents, discovers `date_mismatch` + `cost_inflation` fraud signals.
172
- 2. Agent calls `convene_debate_panel` — two sub-agents spin up from the evidence base.
173
- 3. **Prosecutor [STRONG]:** *"2 fraud signals, billing 2.4× standard rate — deny."*
174
- 4. **Defender [WEAK]:** *"Documents internally consistent, burden of proof requires more."*
175
- 5. Panel verdict: **Prosecution substantially outweighs defense.**
176
- 6. Agent reads transcript → declares **MED confidence** → `deny_claim` → scores **+0.6**.
177
- 7. The calibration matrix highlights `MED × correct`. The reviewer sees exactly why.
178
-
179
- ```
180
- INVESTIGATOR
181
- ├── validate_document → discovers fraud signals
182
- ├── flag_fraud_signal → formally raises grounded signal
183
- ├── query_historical_data → reveals cross-claim patterns
184
- └── Builds evidence base over N steps
185
-
186
- convene_debate_panel
187
-
188
- ┌───────────────────┐ ┌────────────────────┐
189
- │ PROSECUTOR │ │ DEFENDER │
190
- │ • fraud signals │ │ • doc consistency │
191
- │ • Strength: STRONG│ │ • Strength: WEAK │
192
- └───────────────────┘ └────────────────────┘
193
-
194
- PANEL VERDICT → recommendation
195
-
196
- JUDGE: approve / deny / escalate
197
- + confidence: HIGH / MED / LOW
198
- → calibration_score via 3×2 matrix
199
- ```
200
-
201
- ---
202
-
203
- ## Why This Is the Right RL Task
204
-
205
- ClaimCourt satisfies all three properties of a well-designed RL task:
206
-
207
- - **Step-by-step:** The agent validates documents, queries history, flags signals, and uses the Court Panel before committing. Each step changes the information state.
208
- - **Programmatically verifiable:** Ground truth is embedded in every generated episode (`staged_accident → deny_claim`). No human labeller needed.
209
- - **Hard enough to matter:** Easy claims are solvable with 2 steps. Hard claims require discovering cross-claim fraud rings across linked sessions. The model must earn its confidence.
210
-
211
- ---
212
-
213
- ## The 3 Tasks
214
-
215
- | Task | Difficulty | Max Steps | Correct Decision | Expected Confidence |
216
- |------|-----------|-----------|-----------------|---------------------|
217
- | `clean_claim` | Easy | 10 | `approve_claim` | HIGH |
218
- | `contradictory_claim` | Medium | 18 | `deny_claim` | MED |
219
- | `distribution_shift_claim` | Hard | 28 | `escalate_to_human` | LOW |
220
-
221
- `distribution_shift_claim` looks clean on the surface. The agent must call `query_linked_claim` or `query_historical_data` to discover cross-claim fraud signals. If the agent declares HIGH confidence, it is **always penalised regardless of decision** — this task is designed to require epistemic humility.
222
-
223
- ---
224
-
225
- ## Procedural Generation
226
-
227
- A benchmark has fixed episodes. ClaimCourt generates them procedurally:
228
-
229
- ```python
230
- from server.claim_generator import generate_claim
231
-
232
- # Same inputs → same episode (deterministic, reproducible)
233
- episode = generate_claim(seed=42, fraud_type="medical_inflation",
234
- coverage_type="health", difficulty="medium")
235
- ```
236
-
237
- **5 fraud types × 4 coverage types × 3 jurisdictions × seed variation = 500+ unique training episodes**
238
-
239
- | Fraud Type | Ground Truth | Key Signal |
240
- |-----------|-------------|------------|
241
- | `staged_accident` | `deny_claim` | Cost mismatch between damage and repair estimate |
242
- | `medical_inflation` | `deny_claim` | Procedure in bill ≠ procedure in discharge summary |
243
- | `identity_fraud` | `deny_claim` | Ghost claimant, policy opened 5 days before incident |
244
- | `coordinated_ring` | `escalate_to_human` | Shared broker across 3–5 simultaneous claims |
245
- | `phantom_provider` | `deny_claim` | Hospital not in IRDAI registry, invalid GST |
246
-
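The "same inputs → same episode" property can be illustrated with a seeded random generator. This is a hypothetical sketch, not the repo's `generate_claim`: the function name `sketch_generate_claim` and the fields `claim_id` and `claimed_amount` are invented for illustration, and only the fraud-type to ground-truth mapping is taken from the table above.

```python
import random

# Ground-truth mapping copied from the fraud-type table.
GROUND_TRUTH = {
    "staged_accident": "deny_claim",
    "medical_inflation": "deny_claim",
    "identity_fraud": "deny_claim",
    "coordinated_ring": "escalate_to_human",
    "phantom_provider": "deny_claim",
}


def sketch_generate_claim(seed: int, fraud_type: str,
                          coverage_type: str, difficulty: str) -> dict:
    """Build a toy episode whose contents depend only on the inputs."""
    if fraud_type not in GROUND_TRUTH:
        raise ValueError(f"unknown fraud type: {fraud_type!r}")
    # A string seed is hashed deterministically by random.Random, so the
    # same four inputs always produce the same stream of random values.
    rng = random.Random(f"{seed}|{fraud_type}|{coverage_type}|{difficulty}")
    return {
        "claim_id": f"CLM-{rng.randrange(10**6):06d}",    # hypothetical field
        "fraud_type": fraud_type,
        "coverage_type": coverage_type,
        "difficulty": difficulty,
        "claimed_amount": rng.randrange(10_000, 500_000),  # hypothetical field
        "ground_truth": GROUND_TRUTH[fraud_type],
    }
```

Because the ground truth is stored with the episode at generation time, a grader can verify the final decision without a human labeller.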
247
- ---
248
-
249
- ## Reward Design
250
-
251
- ### Training Reward (use for GRPO — simple scalar for stable gradients)
252
-
253
- ```python
254
- def training_reward(decision, confidence, ground_truth, legitimate_flags, step_num, done):
255
- r = -0.05 # step penalty (efficiency)
256
- if done:
257
- r += 1.0 if correct else -0.5 # decision accuracy
258
- r += 0.3 * min(legitimate_flags, 3) # fraud signal detection
259
- r += 0.5 * calibration_matrix[(confidence, correct)] # calibration bonus
260
- return r
261
- ```
262
-
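The snippet above uses `correct` and `calibration_matrix` without defining them. A runnable sketch follows, under three assumptions that the README does not confirm: `correct` means `decision == ground_truth`, all three bonuses apply only on the terminal step, and the matrix holds the values from the 3×2 table. `step_num` is kept for signature parity but is unused, as in the original.

```python
# Matrix values from the 3x2 calibration table (assumed to be what
# `calibration_matrix` refers to in the original snippet).
calibration_matrix = {
    ("HIGH", True): 1.0, ("HIGH", False): -0.8,
    ("MED", True): 0.6,  ("MED", False): -0.2,
    ("LOW", True): 0.1,  ("LOW", False): 0.0,
}


def training_reward(decision, confidence, ground_truth,
                    legitimate_flags, step_num, done):
    """Scalar per-step reward: a small step cost plus terminal bonuses."""
    r = -0.05                                    # step penalty (efficiency)
    if done:
        correct = decision == ground_truth       # assumption: exact match
        r += 1.0 if correct else -0.5            # decision accuracy
        r += 0.3 * min(legitimate_flags, 3)      # fraud signals, capped at 3
        r += 0.5 * calibration_matrix[(confidence, correct)]  # calibration
    return r
```

Under these assumptions, a correct `deny_claim` at MED confidence with two legitimate flags scores about 1.85 on the terminal step, while a wrong decision at HIGH confidence with no flags scores about −0.95.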
263
- ### Evaluation Reward (for demo and reporting only — do not use for GRPO)
264
-
265
- ```python
266
- def eval_reward(episode):
267
- return (0.35 * calibration_reward # confidence accuracy
268
- + 0.25 * escalation_reward # appropriate uncertainty escalation
269
- + 0.20 * evidence_quality # grounded signal citations
270
- + 0.10 * efficiency_score # step efficiency
271
- - 0.10 * gaming_penalty) # anti-gaming deduction
272
- ```
273
-
274
- ### Anti-Gaming System
275
-
276
- ```
277
- if LOW_rate > 70% across 10+ episodes: penalty = (rate − 0.70) × 2.0
278
- if HIGH_rate > 80% across 10+ episodes: penalty = (rate − 0.80) × 1.5
279
- ```
280
-
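The two rules above can be sketched as one function over an agent's declared confidence history. This is an assumption-laden illustration: the name `gaming_penalty` is hypothetical, and the README does not say whether the rates are computed over the full history or over a sliding window, so the sketch uses whatever list it is given.

```python
def gaming_penalty(confidences: list[str], min_episodes: int = 10) -> float:
    """Penalty for declaring LOW or HIGH too often across many episodes."""
    n = len(confidences)
    if n < min_episodes:
        return 0.0                                # too few episodes to judge
    low_rate = sum(c == "LOW" for c in confidences) / n
    high_rate = sum(c == "HIGH" for c in confidences) / n
    penalty = 0.0
    if low_rate > 0.70:
        penalty += (low_rate - 0.70) * 2.0        # always-LOW strategy
    if high_rate > 0.80:
        penalty += (high_rate - 0.80) * 1.5       # always-HIGH strategy
    return penalty
```

The two conditions cannot both fire, since a LOW rate above 70 % leaves less than 30 % for HIGH. As a worked case, nine LOW declarations out of ten give a penalty of about (0.9 − 0.7) × 2.0 = 0.4.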
281
- ---
282
-
283
- ## Training Pipeline
284
-
285
- **Model:** `Qwen/Qwen2.5-0.5B-Instruct` — open-source, no OpenAI API
286
- **Algorithm:** HF TRL `GRPOTrainer` + Unsloth 4-bit QLoRA (Group Relative Policy Optimization — same as DeepSeek-R1)
287
- **Full run:** L4 GPU on HF Jobs — 5,000 episodes, 2,500 steps, 3 h 3 min
288
- **Quick run:** Free Colab T4 GPU — 100 episodes, ~15 min (see notebook)
289
- **WandB Run:** https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl/runs/vloynjdu
290
-
291
- ```bash
292
- # Reproduce the training run
293
- git clone https://github.com/AniketAslaliya/debateFloor.git && cd debateFloor
294
-
295
- # Use the canonical pinned requirements files (every dep verified to
296
- # import inside train_minimal.py and the env server).
297
- pip install -r requirements.txt # env server deps (FastAPI, openenv-core, ...)
298
- pip install -r train/requirements.txt # training deps (trl, unsloth, peft, wandb, ...)
299
-
300
- # Optional (Colab T4): swap the pinned unsloth for the colab-new wheel
301
- # pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
302
- #
303
- # If you see: ModuleNotFoundError: No module named 'mergekit' when importing
304
- # GRPOTrainer — you skipped train/requirements.txt. Re-run: pip install -r train/requirements.txt
305
- # (mergekit is required by recent TRL for the GRPO import path.)
306
-
307
- PYTHONPATH=. python train/train_minimal.py
308
- ```
309
-
310
- Or open the Colab notebook: [train/train_debatefloor.ipynb](https://github.com/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb)
311
-
312
- Artifacts generated after training:
313
- - `docs/reward_curve.svg`
314
- - `docs/component_shift.svg`
315
- - `reports/training_summary.json`
316
-
317
- ---
318
-
319
- ## Architecture & Code Map
320
-
321
- **ClaimCourt** — after `git clone`, your working directory is `debateFloor/` (GitHub repo name; codename `debatefloor` in HF/WandB URLs).
322
-
323
- ```
324
- debateFloor/
325
- ├── openenv.yaml ← OpenEnv spec manifest
326
- ├── Dockerfile ← HF Space deployment
327
- ├── requirements.txt
328
-
329
- ├── app/ ← FastAPI server (OpenEnv contract)
330
- │ ├── main.py ← /reset /step /state /tasks /health /schema
331
- │ ├── environment.py ← InsuranceClaimEnvironment + Court Panel
332
- │ ├── models.py ← Pydantic action/observation models
333
- │ └── tasks.py ← task definitions
334
-
335
- ├── server/ ← ClaimCourt core
336
- │ ├── calibration_grader.py ← 3×2 matrix + anti-gaming + training/eval reward
337
- │ └── claim_generator.py ← procedural episode generator (500+ episodes)
338
-
339
- ├── train/
340
- │ ├── train_minimal.py ← Pure TRL GRPOTrainer, T4 in 15 min
341
- │ └── train_debatefloor.ipynb ← Colab notebook (dynamic wrapper)
342
-
343
- ├── docs/
344
- │ ├── reward_curve.svg ← training reward curve (embedded above)
345
- │ ├── component_shift.svg ← before/after component scores (embedded above)
346
- │ └── HFBlogPost.md ← writeup
347
-
348
- └── reports/
349
- ├── training_summary.json
350
- └── component_shift_summary.json
351
- ```
352
-
353
- ---
354
-
355
- ## Quickstart
356
-
357
- ### Run locally
358
-
359
- ```bash
360
- git clone https://github.com/AniketAslaliya/debateFloor.git
361
- cd debateFloor
362
- pip install -r requirements.txt
363
- PYTHONPATH=. uvicorn app.main:app --host 0.0.0.0 --port 7860 --reload
364
- ```
365
-
366
- ### Run with Docker
367
-
368
- ```bash
369
- docker build -t claimcourt .
370
- docker run -p 7860:7860 claimcourt
371
- ```
372
-
373
- ---
374
-
375
- ## API Reference
376
-
377
- All endpoints follow the OpenEnv REST contract:
378
-
379
- | Method | Endpoint | Description |
380
- |--------|----------|-------------|
381
- | `POST` | `/reset` | Start new episode. Accepts `task_id`, `seed`, `session_id`. |
382
- | `POST` | `/step` | Submit action. Requires `session_id` and `action` body. |
383
- | `GET` | `/state` | Current episode state. |
384
- | `GET` | `/tasks` | Lists all tasks with objectives. |
385
- | `GET` | `/schema` | JSON schema for action/observation/state. |
386
- | `GET` | `/health` | Returns `{"status": "healthy", "active_sessions": N}`. |
387
-
388
- ### Example Episode
389
-
390
- ```python
391
- import requests
392
-
393
- BASE = "https://aniketasla-debatefloor.hf.space"
394
-
395
- r = requests.post(f"{BASE}/reset", json={"task_id": "contradictory_claim", "seed": 42})
396
- session_id = r.json()["session_id"]
397
-
398
- def step(action):
399
- return requests.post(f"{BASE}/step", json={"action": action, "session_id": session_id}).json()
400
-
401
- step({"action_type": "validate_document", "parameters": {"doc_id": "DOC-001"}, "reasoning": "check bill"})
402
- step({"action_type": "flag_fraud_signal", "parameters": {"flag_id": "procedure_mismatch",
403
- "evidence": "discharge says appendectomy, bill says cardiac bypass"}, "reasoning": "billing fraud"})
404
-
405
- resp = step({"action_type": "deny_claim", "confidence": "MED", "reason": "procedure mismatch confirmed"})
406
- print(f"Reward: {resp['reward']}")
407
- print(f"Calibration: {resp['observation']['reward_breakdown']['calibration_score']}")
408
- ```
409
-
410
- ---
411
-
412
- ## OpenEnv Spec Compliance
413
-
414
- | Requirement | Status |
415
- |-------------|--------|
416
- | `spec_version: 1` | ✅ |
417
- | OpenEnv `Environment` base class | ✅ |
418
- | `/reset`, `/step`, `/state`, `/tasks`, `/health`, `/schema` | ✅ |
419
- | `supports_concurrent_sessions: true` | ✅ |
420
- | `max_concurrent_envs: 64` | ✅ |
421
- | `confidence_required: true` | ✅ |
422
- | `procedural_generation: true` | ✅ |
423
- | `episode_pool_size: 500` | ✅ |
424
- | Reward in `[0.0, 1.0]` | ✅ |
425
- | Docker deployment | ✅ |
426
-
427
- ---
428
-
429
- ## Team
430
-
431
- - **Aniket Aslaliya** — Environment Core, Claim Generator, Calibration Grader, UI
432
- - **Mitali Mehta** — Domain Knowledge (Fraud types, IRDAI regulations), Grader Design
433
- - **Aditya Sharma** — Training Pipeline, GRPO Notebook, WandB Integration
434
-
435
- ---
436
-
437
- ## Citation
438
-
439
- ```bibtex
440
- @article{coca2025,
441
- title={Co-optimizing Confidence and Accuracy via Segment-Specific GRPO Rewards},
442
- author={...},
443
- journal={arXiv:2603.05881},
444
- year={2025}
445
- }
446
- ```
447
-
448
- **Related:**
449
- - CAPO paper (April 2026) — GRPO induces overconfidence; ClaimCourt is the fix
450
- - OpenEnv: [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)
451
- - TRL GRPOTrainer: [huggingface.co/docs/trl/grpo_trainer](https://huggingface.co/docs/trl/grpo_trainer)
 
1
+ ---
2
+ title: ClaimCourt — Insurance Calibration RL Environment
3
+ emoji: ⚖️
4
+ colorFrom: indigo
5
+ colorTo: purple
6
+ sdk: docker
7
+ app_port: 7860
8
+ pinned: true
9
+ ---
10
+
11
+ # ClaimCourt — Insurance Calibration RL Environment
12
+
13
+ > *Codename in the repo & URLs: `debatefloor` — all GitHub, Hugging Face Space, and model-repo slugs use the original codename so existing links continue to resolve. The product is **ClaimCourt** everywhere it faces a human reader.*
14
+
15
+ [![Tests](https://img.shields.io/badge/Tests-Passing-brightgreen)](https://github.com/AniketAslaliya/debateFloor)
16
+ [![Live Demo](https://img.shields.io/badge/Live%20Demo-Hugging%20Face-orange)](https://huggingface.co/spaces/AniketAsla/debatefloor)
17
+ [![Based on CAPO](https://img.shields.io/badge/Based%20on-CAPO%20arXiv%3A2604.12632-red)](https://arxiv.org/abs/2604.12632)
18
+ [![WandB](https://img.shields.io/badge/WandB-Project%20workspace-yellow)](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl)
19
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb)
20
+
21
+ > An [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compliant RL training environment (**ClaimCourt**) where AI agents investigate insurance claims, argue in an adversarial **Court Panel**, and must declare **calibrated confidence** before every terminal decision.
22
+ > Built for the **Meta PyTorch × Scaler Hackathon Grand Finale, April 25–26 2026**.
23
+
24
+ ---
25
+
26
+ ## Problem Statement
27
+
28
+ LLMs deployed in high-stakes domains suffer from a well-documented failure mode: **overconfidence**. A model that approves or denies an insurance claim with 100 % certainty — but is wrong — causes real harm. The [CAPO paper (arXiv:2604.12632, 2026)](https://arxiv.org/abs/2604.12632) measures up to a 15 % AUC drop in standard GRPO training, and [DCPO (arXiv:2603.09117, 2026)](https://arxiv.org/abs/2603.09117) shows a 71 % Expected-Calibration-Error reduction is achievable when calibration is treated as a first-class objective.
29
+
30
+ **ClaimCourt is the direct fix.** It trains LLMs to declare *calibrated* confidence before every decision, using a reward surface that penalises overconfident wrong answers more severely than uncertain ones. This teaches models **when** to be confident, not just what to say.
31
+
32
+ Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore every year** ([BCG × Medi Assist, Nov 2025](https://www.business-standard.com/industry/news/insurance-fwa-drains-rs10000cr-each-year-bcg-mediassist-report-125112101199_1.html)) — about 8 % of all claim payouts. From April 2026, the [IRDAI Insurance Fraud Monitoring Framework Guidelines, 2025](https://irdai.gov.in/) make every insurer legally responsible for detecting it. AI is the obvious tool, but recent research ([CAPO, arXiv:2604.12632](https://arxiv.org/abs/2604.12632); [DCPO, arXiv:2603.09117](https://arxiv.org/abs/2603.09117)) proves standard GRPO training makes models *more* overconfident as they get more accurate — exactly the wrong direction for high-stakes claims work.
33
+
34
+ ---
35
+
36
+ ## Submission Artifacts
37
+
38
+ | Artifact | Link |
39
+ |---|---|
40
+ | **Live Environment (HF Space)** | https://huggingface.co/spaces/AniketAsla/debatefloor |
41
+ | **WandB (all runs)** | https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl |
42
+ | **Trained Model** | https://huggingface.co/AniketAsla/debatefloor-grpo-qwen2.5-0.5b-instruct |
43
+ | **Training Notebook (Colab)** | [train/train_debatefloor.ipynb](https://github.com/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb) |
44
+ | **Mini-Blog** | [docs/HFBlogPost.md](https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/docs/HFBlogPost.md) |
45
+
46
+ ---
47
+
48
+ ## How This Submission Maps to the Judging Rubric
49
+
50
+ | Criterion | Weight | Where to find the evidence |
51
+ |---|---|---|
52
+ | **Environment Innovation** | 40% | The 3×2 calibration matrix (`README` §_The Core Innovation_) is a novel reward shape — it does not exist in any prior insurance-RL work and directly attacks the calibration-degradation problem documented in the CAPO paper (April 2026). The **Court Panel** mechanic forces the agent to expose its reasoning to a programmatic adversary, which is also unexplored territory for RL on LLMs. |
53
+ | **Storytelling & Presentation** | 30% | [`docs/HFBlogPost.md`](docs/HFBlogPost.md) — full mini-blog motivating the problem, walking the reader through one episode end-to-end, and showing the training delta in plain language. README is structured for a 3-minute read with the headline number first. |
54
+ | **Showing Improvement in Rewards** | 20% | [`docs/reward_curve.svg`](docs/reward_curve.svg) — 2,500-step reward curve from a 5,000-episode GRPO run (0.130 → 0.469, 3.6×). [`reports/training_summary.json`](reports/training_summary.json) — raw metrics including full log history. [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — before/after on held-out eval (Decision accuracy 0 → 1.0, Calibration 0 → 1.0). WandB run linked above for reproducibility. |
55
+ | **Reward & Training Pipeline** | 10% | [`app/services/reward.py`](app/services/reward.py) — composable rubric (decision × confidence × evidence × format), not monolithic. [`train/train_minimal.py`](train/train_minimal.py) — TRL GRPO loop that calls the live HTTP env over `requests.Session` (MR-2 compliant, no static dataset). |
56
+
57
+ ### Minimum-requirement checklist (for judges)
58
+
59
+ - [x] Built on **OpenEnv `0.2.3`** (latest at submission time) — see `requirements.txt`
60
+ - [x] Working **TRL training script in a Colab notebook** — [`train/train_debatefloor.ipynb`](train/train_debatefloor.ipynb)
61
+ - [x] **Real reward + loss plots** committed to the repo — [`docs/reward_curve.svg`](docs/reward_curve.svg), [`docs/component_shift.svg`](docs/component_shift.svg)
62
+ - [x] **Mini-blog** at [`docs/HFBlogPost.md`](docs/HFBlogPost.md)
63
+ - [x] **OpenEnv-compliant env hosted on HF Spaces** — https://huggingface.co/spaces/AniketAsla/debatefloor
64
+ - [x] **README** motivates the problem, explains the env, and shows results (this file)
65
+ - [x] **`openenv.yaml`** manifest valid — see repo root
66
+ - [x] **Gym-style API** (`reset` / `step` / `state`) and **client/server separation** — see `app/` and `clients/`
67
+
68
+ ---
69
+
70
+ ## Results
71
+
72
+ All numbers below are **read directly from committed JSON artifacts** — no
73
+ hand-edits, no rounded-up estimates. Source:
74
+ [`reports/training_summary.json`](reports/training_summary.json),
75
+ [`reports/component_shift_summary.json`](reports/component_shift_summary.json).
76
+
77
+ ### GRPO training — 5,000 episodes, Qwen2.5-0.5B-Instruct
78
+
79
+ | Config | Value |
80
+ |---|---|
81
+ | Episodes | 5,000 |
82
+ | Epochs | 1 |
83
+ | GRPO steps | 2,500 |
84
+ | Batch / Generations | 8 / 8 |
85
+ | Hardware | L4 GPU (HF Jobs), 3 h 3 min |
86
+ | WandB | [Project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) — open the **latest** run named `grpo-qwen0.5b-env-connected` (5K HF Job, Apr 2026). **Canonical curves:** [`docs/reward_curve.svg`](docs/reward_curve.svg) + [`reports/training_summary.json`](reports/training_summary.json) (always match the committed training). |
87
+
88
+ ### Headline result: training reward 0.130 → 0.469 (3.6× improvement)
89
+
90
+ ### Held-out evaluation (6 episodes: 3 tasks × 2 seeds, live HTTP `/step`)
91
+
92
+ | Component | Before (untrained) | After (GRPO) | Change |
93
+ |---|---:|---:|---|
94
+ | **Decision accuracy** | 0.000 | **1.000** | **+1.000** |
95
+ | **Calibration** | 0.000 | **1.000** | **+1.000** |
96
+ | **Fraud detection** | 0.000 | **0.333** | +0.333 |
97
+ | Evidence quality | 0.333 | 0.333 | unchanged |
98
+ | Reasoning quality | 0.833 | 0.792 | −0.042 (within noise) |
99
+
100
+ The trained model learned to **make correct decisions with calibrated
+ confidence**, which is exactly the skill this environment is designed to teach.
+ Decision accuracy and calibration both went from zero to perfect on the
+ held-out eval set, although with only 6 episodes these figures are
+ indicative rather than statistically robust. The small dip in reasoning
+ quality (−0.042) is within noise at this sample size.
106
+
107
+ ### Training Plots
108
+
109
+ ![Reward Curve](docs/reward_curve.svg)
110
+ *Mean training reward across 2,500 GRPO steps (5,000 episodes, 1 epoch).
111
+ Reward climbs from 0.130 to 0.469 — a 3.6× improvement. Source:
112
+ [`reports/training_summary.json`](reports/training_summary.json).*
113
+
114
+ ![Component Shift](docs/component_shift.svg)
115
+ *Before vs after on held-out eval: Decision accuracy 0 → 1.0,
116
+ Calibration 0 → 1.0, Fraud detection 0 → 0.33. Source:
117
+ [`reports/component_shift_summary.json`](reports/component_shift_summary.json).*
118
+
119
+ ---
120
+
121
+ ## Quick Start for Reviewers (3 minutes)
122
+
123
+ 1. **Open the live UI:** https://huggingface.co/spaces/AniketAsla/debatefloor
124
+ 2. **Select `contradictory_claim`** and click **Run Episode**.
125
+ 3. Watch the agent: validate documents → flag fraud signals → **convene the Court Panel (Prosecutor vs Defender)** → declare MED confidence → deny claim.
126
+ 4. The highlighted cell in the 3×2 matrix shows exactly why it scored what it scored.
127
+
128
+ ---
129
+
130
+ ## What Makes This Novel
131
+
132
+ - **Training environment, not a benchmark.** Episodes are procedurally generated from seeds, so the agent cannot memorise answers.
+ - **Teaches calibration, not just accuracy.** Overconfident wrong answers are penalised harder than uncertain ones. To our knowledge, no other OpenEnv environment does this.
+ - **Multi-agent by design.** The final decision is informed by the adversarial **Court Panel** (Prosecutor vs Defender) before the Judge commits. This targets the Fleet AI Scalable Oversight bonus prize.
+ - **Anti-gaming system.** An agent cannot win by always saying LOW confidence or always saying HIGH. It must learn genuine calibration.
136
+
137
+ ---
138
+
139
+ ## Theme Coverage
140
+
141
+ | Theme | Bonus Prize | What We Built |
+ |-------|-------------|---------------|
+ | **Theme 3.1**: World Modeling (Professional) | Scaler AI Labs: Multi-App RL for Enterprise Workflows | 5 fraud types, multi-doc investigation, IRDAI registry, policy history |
+ | **Theme 1**: Multi-Agent Interactions | Fleet AI: Scalable Oversight | 3-agent Court Panel: Prosecutor + Defender + Judge |
+ | **Theme 4**: Self-Improvement | Curriculum / difficulty escalation | easy→medium→hard + anti-gaming detector |
146
+
147
+ ---
148
+
149
+ ## The Core Innovation: 3×2 Calibration Matrix
150
+
151
+ Before every terminal action, the agent must declare a confidence level: **HIGH**, **MED**, or **LOW**. The reward is determined by this matrix:
152
+
153
+ | Confidence | Correct Decision | Wrong Decision |
+ |------------|-----------------|----------------|
+ | **HIGH** | +1.0 | **−0.8** ← worst outcome |
+ | **MED** | +0.6 | −0.2 |
+ | **LOW** | +0.1 | 0.0 ← safe |
158
+
159
+ An agent that always says HIGH to maximise reward is catastrophically punished when wrong. An agent that always says LOW is caught by the anti-gaming system. **The only winning strategy is accurate calibration.**
160
+
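The matrix above can be read as a lookup table, and it also implies a break-even point for each confidence level. The following is a minimal sketch, not the code in `server/calibration_grader.py`: the names `CALIBRATION_MATRIX`, `calibration_score`, `expected_score`, and `best_confidence` are illustrative assumptions, and only the six reward values come from the table. It ignores the anti-gaming penalty.

```python
# Illustrative sketch of the 3×2 calibration matrix (names are assumptions).
CALIBRATION_MATRIX = {
    ("HIGH", True): 1.0, ("HIGH", False): -0.8,
    ("MED", True): 0.6, ("MED", False): -0.2,
    ("LOW", True): 0.1, ("LOW", False): 0.0,
}


def calibration_score(confidence: str, correct: bool) -> float:
    """Look up the matrix cell for one terminal decision."""
    return CALIBRATION_MATRIX[(confidence, correct)]


def expected_score(confidence: str, p_correct: float) -> float:
    """Expected matrix reward if the decision is right with probability p_correct."""
    return (p_correct * CALIBRATION_MATRIX[(confidence, True)]
            + (1.0 - p_correct) * CALIBRATION_MATRIX[(confidence, False)])


def best_confidence(p_correct: float) -> str:
    """Confidence level that maximises expected matrix reward."""
    return max(("HIGH", "MED", "LOW"), key=lambda c: expected_score(c, p_correct))


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.2):
        print(p, best_confidence(p))
```

Under these values, HIGH only pays off when the agent's chance of being right exceeds 0.6 (from 1.8p − 0.8 > 0.8p − 0.2), and MED beats LOW above 2/7, roughly 0.29 (from 0.8p − 0.2 > 0.1p).
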
161
+ Based on the [CoCA framework (arXiv:2603.05881)](https://arxiv.org/abs/2603.05881) — co-optimising confidence and accuracy via GRPO.
162
+
163
+ ---
164
+
165
+ ## The Court Panel — The Demo Centrepiece
166
+
167
+ > **No other environment in the OpenEnv hub has this mechanic.** Run `contradictory_claim` in the live UI to see it unfold.
168
+
169
+ **The 90-second sequence that wins the storytelling criterion:**
170
+
171
+ 1. Agent validates 3 documents, discovers `date_mismatch` + `cost_inflation` fraud signals.
172
+ 2. Agent calls `convene_debate_panel` — two sub-agents spin up from the evidence base.
173
+ 3. **Prosecutor [STRONG]:** *"2 fraud signals, billing 2.4× standard rate — deny."*
174
+ 4. **Defender [WEAK]:** *"Documents internally consistent, burden of proof requires more."*
175
+ 5. Panel verdict: **Prosecution substantially outweighs defense.**
176
+ 6. Agent reads transcript → declares **MED confidence** → `deny_claim` → scores **+0.6**.
177
+ 7. The calibration matrix highlights `MED × correct`. The reviewer sees exactly why.
178
+
179
+ ```
+ INVESTIGATOR
+ ├── validate_document → discovers fraud signals
+ ├── flag_fraud_signal → formally raises grounded signal
+ ├── query_historical_data → reveals cross-claim patterns
+ └── Builds evidence base over N steps
+
+ convene_debate_panel
+
+ ┌───────────────────┐ ┌────────────────────┐
+ │    PROSECUTOR     │ │      DEFENDER      │
+ │ • fraud signals   │ │ • doc consistency  │
+ │ • Strength: STRONG│ │ • Strength: WEAK   │
+ └───────────────────┘ └────────────────────┘
+
+ PANEL VERDICT → recommendation
+
+ JUDGE: approve / deny / escalate
+   + confidence: HIGH / MED / LOW
+   → calibration_score via 3×2 matrix
+ ```
200
+
201
+ ---
202
+
203
+ ## Why This Is the Right RL Task
204
+
205
+ ClaimCourt satisfies all three properties of a well-designed RL task:
206
+
207
+ - **Step-by-step:** The agent validates documents, queries history, flags signals, and uses the Court Panel before committing. Each step changes the information state.
208
+ - **Programmatically verifiable:** Ground truth is embedded in every generated episode (`staged_accident → deny_claim`). No human labeller needed.
209
+ - **Hard enough to matter:** Easy claims are solvable with 2 steps. Hard claims require discovering cross-claim fraud rings across linked sessions. The model must earn its confidence.
210
+
211
+ ---
212
+
213
+ ## The 3 Tasks
214
+
215
+ | Task | Difficulty | Max Steps | Correct Decision | Expected Confidence |
+ |------|-----------|-----------|-----------------|---------------------|
+ | `clean_claim` | Easy | 10 | `approve_claim` | HIGH |
+ | `contradictory_claim` | Medium | 18 | `deny_claim` | MED |
+ | `distribution_shift_claim` | Hard | 28 | `escalate_to_human` | LOW |
220
+
221
+ `distribution_shift_claim` looks clean on the surface. The agent must call `query_linked_claim` or `query_historical_data` to discover cross-claim fraud signals. If the agent declares HIGH confidence, it is **always penalised regardless of decision** — this task is designed to require epistemic humility.
222
+
223
+ ---
224
+
225
+ ## Procedural Generation
226
+
227
+ A benchmark has fixed episodes. ClaimCourt generates them procedurally:
228
+
229
+ ```python
+ from server.claim_generator import generate_claim
+
+ # Same inputs → same episode (deterministic, reproducible)
+ episode = generate_claim(seed=42, fraud_type="medical_inflation",
+                          coverage_type="health", difficulty="medium")
+ ```
236
+
237
+ **5 fraud types × 4 coverage types × 3 jurisdictions × seed variation = 500+ unique training episodes**
238
+
239
+ | Fraud Type | Ground Truth | Key Signal |
+ |-----------|-------------|------------|
+ | `staged_accident` | `deny_claim` | Cost mismatch between damage and repair estimate |
+ | `medical_inflation` | `deny_claim` | Procedure in bill ≠ procedure in discharge summary |
+ | `identity_fraud` | `deny_claim` | Ghost claimant, policy opened 5 days before incident |
+ | `coordinated_ring` | `escalate_to_human` | Shared broker across 3–5 simultaneous claims |
+ | `phantom_provider` | `deny_claim` | Hospital not in IRDAI registry, invalid GST |
246
+
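As a rough check on the "500+" figure, the sketch below multiplies out the combinations named in this section. Only the five fraud types are listed in this README; the coverage-type and jurisdiction counts are taken from the headline formula, and the variable names are illustrative rather than taken from `server/claim_generator.py`.

```python
import math

# Fraud types as listed in the table above.
FRAUD_TYPES = [
    "staged_accident",
    "medical_inflation",
    "identity_fraud",
    "coordinated_ring",
    "phantom_provider",
]
N_COVERAGE_TYPES = 4  # count stated in the README; the names are not listed there
N_JURISDICTIONS = 3   # count stated in the README; the names are not listed there

# Distinct (fraud type, coverage type, jurisdiction) combinations.
combos = len(FRAUD_TYPES) * N_COVERAGE_TYPES * N_JURISDICTIONS

# Seeds needed per combination to reach at least 500 distinct episodes.
seeds_per_combo = math.ceil(500 / combos)

if __name__ == "__main__":
    print(combos, seeds_per_combo)  # prints: 60 9
```

So the fixed dimensions give 60 combinations, and about nine seeds per combination are enough to pass 500 episodes.
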
247
+ ---
248
+
249
+ ## Reward Design
250
+
251
+ ### Training Reward (use for GRPO — simple scalar for stable gradients)
252
+
253
+ ```python
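+ # Sketch: calibration_matrix mirrors the 3×2 table above.
+ calibration_matrix = {("HIGH", True): 1.0, ("HIGH", False): -0.8,
+                       ("MED", True): 0.6, ("MED", False): -0.2,
+                       ("LOW", True): 0.1, ("LOW", False): 0.0}
+
+ def training_reward(decision, confidence, ground_truth, legitimate_flags, step_num, done):
+     r = -0.05                                    # step penalty (efficiency)
+     if done:                                     # terminal terms (block indentation assumed)
+         correct = decision == ground_truth       # did the terminal decision match?
+         r += 1.0 if correct else -0.5            # decision accuracy
+         r += 0.3 * min(legitimate_flags, 3)      # fraud signal detection
+         r += 0.5 * calibration_matrix[(confidence, correct)]  # calibration bonus
+     return r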
+ ```
262
+
263
+ ### Evaluation Reward (for demo and reporting only — do not use for GRPO)
264
+
265
+ ```python
+ def eval_reward(episode):
+     # Assumes `episode` is a mapping that holds each component score by name.
+     return (0.35 * episode["calibration_reward"]    # confidence accuracy
+             + 0.25 * episode["escalation_reward"]   # appropriate uncertainty escalation
+             + 0.20 * episode["evidence_quality"]    # grounded signal citations
+             + 0.10 * episode["efficiency_score"]    # step efficiency
+             - 0.10 * episode["gaming_penalty"])     # anti-gaming deduction
+ ```
273
+
274
+ ### Anti-Gaming System
275
+
276
+ ```
+ if LOW_rate > 70% across 10+ episodes: penalty = (rate − 0.70) × 2.0
+ if HIGH_rate > 80% across 10+ episodes: penalty = (rate − 0.80) × 1.5
+ ```
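
The two rules above can be sketched as a single function. This is an illustration under stated assumptions, not the project's implementation: `gaming_penalty` and `min_episodes` are hypothetical names, and "across 10+ episodes" is read here as "once at least 10 episode confidences have been recorded", not as any particular sliding window.

```python
from collections import Counter
from typing import Sequence


def gaming_penalty(confidences: Sequence[str], min_episodes: int = 10) -> float:
    """Penalty for declaring LOW or HIGH too often across recorded episodes.

    `confidences` holds one declared level ("HIGH", "MED" or "LOW") per episode.
    """
    n = len(confidences)
    if n < min_episodes:
        return 0.0  # too little history to judge the agent's habits

    counts = Counter(confidences)
    low_rate = counts["LOW"] / n
    high_rate = counts["HIGH"] / n

    penalty = 0.0
    if low_rate > 0.70:
        penalty += (low_rate - 0.70) * 2.0
    if high_rate > 0.80:
        penalty += (high_rate - 0.80) * 1.5
    return penalty


if __name__ == "__main__":
    print(gaming_penalty(["LOW"] * 10))
    print(gaming_penalty(["HIGH", "MED", "LOW"] * 4))
```

The two thresholds cannot both trigger on the same history, since a LOW rate above 70% and a HIGH rate above 80% would sum to more than 100%.
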
280
+
281
+ ---
282
+
283
+ ## Training Pipeline
284
+
285
+ **Model:** `Qwen/Qwen2.5-0.5B-Instruct` — open-source, no OpenAI API
286
+ **Algorithm:** HF TRL `GRPOTrainer` + Unsloth 4-bit QLoRA (Group Relative Policy Optimization — same as DeepSeek-R1)
287
+ **Full run:** L4 GPU on HF Jobs — 5,000 episodes, 2,500 steps, 3 h 3 min
288
+ **Quick run:** Free Colab T4 GPU — 100 episodes, ~15 min (see notebook)
289
+ **WandB:** https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl — pick the latest `grpo-qwen0.5b-env-connected` run. Plots in this README come from committed `reports/training_summary.json` (not from a pinned WandB run ID).
290
+
291
+ ```bash
+ # Reproduce the training run
+ git clone https://github.com/AniketAslaliya/debateFloor.git && cd debateFloor
+
+ # Use the canonical pinned requirements files (every dep verified to
+ # import inside train_minimal.py and the env server).
+ pip install -r requirements.txt        # env server deps (FastAPI, openenv-core, ...)
+ pip install -r train/requirements.txt  # training deps (trl, unsloth, peft, wandb, ...)
+
+ # Optional (Colab T4): swap the pinned unsloth for the colab-new wheel
+ # pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
+ #
+ # If you see: ModuleNotFoundError: No module named 'mergekit' when importing
+ # GRPOTrainer, you skipped train/requirements.txt. Re-run: pip install -r train/requirements.txt
+ # (mergekit is required by recent TRL for the GRPO import path.)
+
+ PYTHONPATH=. python train/train_minimal.py
+ ```
309
+
310
+ Or open the Colab notebook: [train/train_debatefloor.ipynb](https://github.com/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb)
311
+
312
+ Artifacts generated after training:
313
+ - `docs/reward_curve.svg`
314
+ - `docs/component_shift.svg`
315
+ - `reports/training_summary.json`
316
+
317
+ ---
318
+
319
+ ## Architecture & Code Map
320
+
321
+ **ClaimCourt** — after `git clone`, your working directory is `debateFloor/` (GitHub repo name; codename `debatefloor` in HF/WandB URLs).
322
+
323
+ ```
+ debateFloor/
+ ├── openenv.yaml                  ← OpenEnv spec manifest
+ ├── Dockerfile                    ← HF Space deployment
+ ├── requirements.txt
+
+ ├── app/                          ← FastAPI server (OpenEnv contract)
+ │   ├── main.py                   ← /reset /step /state /tasks /health /schema
+ │   ├── environment.py            ← InsuranceClaimEnvironment + Court Panel
+ │   ├── models.py                 ← Pydantic action/observation models
+ │   └── tasks.py                  ← task definitions
+
+ ├── server/                       ← ClaimCourt core
+ │   ├── calibration_grader.py     ← 3×2 matrix + anti-gaming + training/eval reward
+ │   └── claim_generator.py        ← procedural episode generator (500+ episodes)
+
+ ├── train/
+ │   ├── train_minimal.py          ← Pure TRL GRPOTrainer, T4 in 15 min
+ │   └── train_debatefloor.ipynb   ← Colab notebook (dynamic wrapper)
+
+ ├── docs/
+ │   ├── reward_curve.svg          ← training reward curve (embedded above)
+ │   ├── component_shift.svg       ← before/after component scores (embedded above)
+ │   └── HFBlogPost.md             ← writeup
+
+ └── reports/
+     ├── training_summary.json
+     └── component_shift_summary.json
+ ```
352
+
353
+ ---
354
+
355
+ ## Quickstart
356
+
357
+ ### Run locally
358
+
359
+ ```bash
+ git clone https://github.com/AniketAslaliya/debateFloor.git
+ cd debateFloor
+ pip install -r requirements.txt
+ PYTHONPATH=. uvicorn app.main:app --host 0.0.0.0 --port 7860 --reload
+ ```
365
+
366
+ ### Run with Docker
367
+
368
+ ```bash
+ docker build -t claimcourt .
+ docker run -p 7860:7860 claimcourt
+ ```
372
+
373
+ ---
374
+
375
+ ## API Reference
376
+
377
+ All endpoints follow the OpenEnv REST contract:
378
+
379
+ | Method | Endpoint | Description |
+ |--------|----------|-------------|
+ | `POST` | `/reset` | Start new episode. Accepts `task_id`, `seed`, `session_id`. |
+ | `POST` | `/step` | Submit action. Requires `session_id` and `action` body. |
+ | `GET` | `/state` | Current episode state. |
+ | `GET` | `/tasks` | Lists all tasks with objectives. |
+ | `GET` | `/schema` | JSON schema for action/observation/state. |
+ | `GET` | `/health` | Returns `{"status": "healthy", "active_sessions": N}`. |
387
+
388
+ ### Example Episode
389
+
390
+ ```python
+ import requests
+
+ BASE = "https://aniketasla-debatefloor.hf.space"
+
+ r = requests.post(f"{BASE}/reset", json={"task_id": "contradictory_claim", "seed": 42})
+ session_id = r.json()["session_id"]
+
+ def step(action):
+     return requests.post(f"{BASE}/step", json={"action": action, "session_id": session_id}).json()
+
+ step({"action_type": "validate_document", "parameters": {"doc_id": "DOC-001"}, "reasoning": "check bill"})
+ step({"action_type": "flag_fraud_signal", "parameters": {"flag_id": "procedure_mismatch",
+       "evidence": "discharge says appendectomy, bill says cardiac bypass"}, "reasoning": "billing fraud"})
+
+ resp = step({"action_type": "deny_claim", "confidence": "MED", "reason": "procedure mismatch confirmed"})
+ print(f"Reward: {resp['reward']}")
+ print(f"Calibration: {resp['observation']['reward_breakdown']['calibration_score']}")
+ ```
409
+
410
+ ---
411
+
412
+ ## OpenEnv Spec Compliance
413
+
414
+ | Requirement | Status |
+ |-------------|--------|
+ | `spec_version: 1` | ✅ |
+ | OpenEnv `Environment` base class | ✅ |
+ | `/reset`, `/step`, `/state`, `/tasks`, `/health`, `/schema` | ✅ |
+ | `supports_concurrent_sessions: true` | ✅ |
+ | `max_concurrent_envs: 64` | ✅ |
+ | `confidence_required: true` | ✅ |
+ | `procedural_generation: true` | ✅ |
+ | `episode_pool_size: 500` | ✅ |
+ | Reward in `[0.0, 1.0]` | ✅ |
+ | Docker deployment | ✅ |
426
+
427
+ ---
428
+
429
+ ## Team
430
+
431
+ - **Aniket Aslaliya** — Environment Core, Claim Generator, Calibration Grader, UI
432
+ - **Mitali Mehta** — Domain Knowledge (Fraud types, IRDAI regulations), Grader Design
433
+ - **Aditya Sharma** — Training Pipeline, GRPO Notebook, WandB Integration
434
+
435
+ ---
436
+
437
+ ## Citation
438
+
439
+ ```bibtex
+ @article{coca2026,
+   title={Co-optimizing Confidence and Accuracy via Segment-Specific GRPO Rewards},
+   author={...},
+   journal={arXiv:2603.05881},
+   year={2026}
+ }
+ ```
447
+
448
+ **Related:**
449
+ - CAPO paper (April 2026) — GRPO induces overconfidence; ClaimCourt is the fix
450
+ - OpenEnv: [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)
451
+ - TRL GRPOTrainer: [huggingface.co/docs/trl/grpo_trainer](https://huggingface.co/docs/trl/grpo_trainer)