---
title: ClaimCourt Insurance Calibration RL Environment
emoji: ⚖️
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: true
---
# ClaimCourt — Insurance Calibration RL Environment
> *Repository codename `debatefloor` — all GitHub, Hugging Face Space, and model-repo URLs use the original codename so existing links resolve. The product is **ClaimCourt** everywhere it faces a human reader.*
[![Live Demo](https://img.shields.io/badge/Live%20Demo-Hugging%20Face-orange)](https://huggingface.co/spaces/AniketAsla/debatefloor)
[![Demo Video](https://img.shields.io/badge/Demo%20Video-YouTube-red)](https://www.youtube.com/watch?v=Uk8sSLywEpE)
[![Based on CAPO](https://img.shields.io/badge/Based%20on-CAPO%20arXiv%3A2604.12632-red)](https://arxiv.org/abs/2604.12632)
[![WandB](https://img.shields.io/badge/WandB-Run%20workspace-yellow)](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb)
> An [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compliant RL training environment where AI agents investigate insurance claims, argue in an adversarial **Court Panel**, and must declare **calibrated confidence** before every terminal decision.
> Built for the **Meta PyTorch × Scaler OpenEnv Hackathon Grand Finale, April 25–26, 2026**.
> ### 🎯 Headline result — Calibration score 0.000 → 1.000 on held-out claims
>
> Across a 5,000-episode GRPO run on Qwen2.5-0.5B-Instruct, the trained agent's confidence now matches its correctness on *every* held-out terminal action — directly attacking the GRPO overconfidence pathology documented in [CAPO (arXiv:2604.12632)](https://arxiv.org/abs/2604.12632). Decision accuracy moved 0.000 → 1.000 on the same eval. Both numbers read straight from [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — no hand-edits.
---
## 📺 Watch the 2-minute demo first
▶️ **[YouTube — ClaimCourt demo](https://www.youtube.com/watch?v=Uk8sSLywEpE)** — see the full Court Panel sequence and the calibration matrix lighting up in real time.
## All artifacts in one table
| Artifact | Link |
|---|---|
| **Demo Video (≤2 min)** | https://www.youtube.com/watch?v=Uk8sSLywEpE |
| **Live Environment (HF Space)** | https://huggingface.co/spaces/AniketAsla/debatefloor |
| **Mini-Blog (full writeup)** | [BLOG.md](BLOG.md) |
| **Trained Model** | https://huggingface.co/AniketAsla/debatefloor-grpo-qwen2.5-0.5b-instruct |
| **Training Notebook (Colab)** | [train/train_debatefloor.ipynb](train/train_debatefloor.ipynb) |
| **WandB workspace** (training curves) | https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl |
| **GitHub repo** | https://github.com/AniketAslaliya/debateFloor |
---
## Hackathon theme alignment
| Theme | Fit | Why |
|---|---|---|
| **#3.1 World Modeling — Professional Tasks** *(primary)* | ✅✅✅ | Insurance claim adjudication is a textbook enterprise workflow: 11 investigative tool-calls (`validate_document`, `query_historical_data`, `verify_identity`, …), partially observable state where fraud signals are hidden until queried, multi-step orchestration (10–28 steps per episode), and an explicit **anti-gaming detector** that prevents shortcuts. The agent must maintain consistent internal state, update beliefs as evidence arrives, and orchestrate the workflow toward a calibrated terminal decision. |
| **#1 Multi-Agent Interactions** *(secondary)* | ✅✅ | The **Court Panel** (`convene_debate_panel`) spins up an adversarial Prosecutor + Defender pair from the existing evidence base. The Judge must model both adversaries' incentives and weigh their competing arguments before declaring HIGH/MED/LOW confidence — Fleet AI Scalable Oversight applied to claims work. |
---
## 1. The Problem
LLMs in high-stakes domains suffer from a documented failure mode: **overconfidence**. A model that approves or denies an insurance claim with 100% certainty, and is wrong, causes real harm. The [CAPO paper (arXiv:2604.12632, 2026)](https://arxiv.org/abs/2604.12632) measures an AUC drop of up to 15% under standard GRPO training. [DCPO (arXiv:2603.09117, 2026)](https://arxiv.org/abs/2603.09117) shows that a 71% reduction in Expected Calibration Error is achievable when calibration is treated as a first-class training objective.
**Why this matters now.** Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore every year** ([BCG × Medi Assist, Nov 2025](https://www.business-standard.com/industry/news/insurance-fwa-drains-rs10000cr-each-year-bcg-mediassist-report-125112101199_1.html)) — about 8% of all claim payouts. From April 2026, the [IRDAI Insurance Fraud Monitoring Framework Guidelines, 2025](https://irdai.gov.in/) make every insurer legally responsible for catching it. The obvious tool is AI — but standard RL training pushes models in exactly the wrong direction.
**ClaimCourt is the direct fix.** Its reward surface penalises overconfident wrong answers more severely than uncertain ones, teaching models *when* to be confident, not just what to say.
---
## 2. The Environment — what the agent sees, does, and gets rewarded for
### What the agent sees
A claim object: claimant identity, incident details, policy history, attached documents, and the list of available actions. After every action: updated documents, discovered fraud signals, action history, and a partial reward breakdown.
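For concreteness, here is a sketch of the observation shape. The field names are illustrative assumptions, not the exact schema; the running environment's `/schema` endpoint returns the authoritative contract.

```python
# Illustrative observation shape only -- field names are assumptions;
# GET /schema on the running environment for the real contract.
observation = {
    "claim": {
        "claimant": {"name": "R. Sharma", "policy_id": "POL-2291"},
        "incident": {"type": "hospitalisation", "amount": 184_000},
        "policy_history": ["2 prior claims", "no lapses"],
        "documents": ["discharge_summary", "itemised_bill"],
    },
    "available_actions": ["validate_document", "query_historical_data", "..."],
    "fraud_signals_discovered": [],   # fills in as the investigation proceeds
    "action_history": [],
    "reward_breakdown_partial": {},
}
```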
### What the agent does
The agent investigates step-by-step before committing.
| Action class | Examples | Confidence required? |
|---|---|---|
| **Investigative** | `validate_document`, `flag_fraud_signal`, `query_historical_data`, `query_linked_claim`, `verify_identity`, `convene_debate_panel` | No |
| **Terminal** | `approve_claim`, `deny_claim`, `escalate_to_human`, `request_investigation` | **Yes — HIGH / MED / LOW** |
### What the agent gets rewarded for — the 3×2 Calibration Matrix (the core innovation)
Before every terminal action, the agent must declare a confidence level. The reward is determined by this matrix:
| Confidence | Correct Decision | Wrong Decision |
|---|---|---|
| **HIGH** | **+1.0** | **−0.8** ← worst outcome |
| **MED** | +0.6 | −0.2 |
| **LOW** | +0.1 | 0.0 ← safe |
An agent that always says HIGH to maximise reward is catastrophically punished when wrong. An agent that always says LOW is caught by the **anti-gaming detector** (a LOW rate above 70% across 10+ episodes triggers a progressive penalty). **The only winning strategy is accurate calibration** — based on the [CoCA framework (arXiv:2603.05881)](https://arxiv.org/abs/2603.05881).
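A minimal sketch of how a grader can map (confidence, correctness) to reward: the matrix values are taken straight from the table above, while the progressive-penalty schedule is an illustrative assumption (the real logic lives in `server/calibration_grader.py`).

```python
# 3x2 calibration matrix -- values mirror the table above.
CALIBRATION_MATRIX = {
    ("HIGH", True): 1.0,  ("HIGH", False): -0.8,  # overconfident + wrong: worst
    ("MED",  True): 0.6,  ("MED",  False): -0.2,
    ("LOW",  True): 0.1,  ("LOW",  False):  0.0,  # uncertainty is safe, barely rewarded
}

def calibration_reward(confidence: str, correct: bool,
                       recent_confidences: list[str]) -> float:
    reward = CALIBRATION_MATRIX[(confidence, correct)]
    # Anti-gaming detector: hiding in LOW (>70% over 10+ episodes) draws a
    # progressive penalty. The 0.5 scale here is an illustrative assumption.
    if len(recent_confidences) >= 10:
        low_rate = recent_confidences.count("LOW") / len(recent_confidences)
        if low_rate > 0.70:
            reward -= 0.5 * (low_rate - 0.70)
    return reward
```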
### The Court Panel — the demo centrepiece
> **No other environment in the OpenEnv hub has this mechanic.** Run `contradictory_claim` in the live Space to see it.
When evidence is mixed, the agent calls `convene_debate_panel`. Two adversarial sub-agents spin up from the existing evidence base:
```
INVESTIGATOR
├── validate_document → discovers fraud signals
├── query_historical_data → reveals cross-claim patterns
└── convene_debate_panel
┌──────────────────┐ ┌──────────────────┐
│ PROSECUTOR │ │ DEFENDER │
│ Strength: STRONG│ │ Strength: WEAK │
└──────────────────┘ └──────────────────┘
PANEL VERDICT → recommendation
JUDGE: approve / deny / escalate
+ confidence: HIGH / MED / LOW
```
The Court Panel forces the agent to expose its reasoning to a programmatic adversary before declaring confidence — Fleet AI Scalable Oversight, applied to claims work.
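Reduced to its skeleton, the verdict step looks something like the sketch below. The strength labels come from the diagram; the decision rule is an illustrative simplification, not the shipped logic, which reasons over the evidence base itself.

```python
# Illustrative skeleton of the panel verdict: weigh the adversaries'
# argument strength, then hand the judge a recommendation.
STRENGTH = {"WEAK": 0, "MODERATE": 1, "STRONG": 2}

def panel_verdict(prosecutor: str, defender: str) -> str:
    gap = STRENGTH[prosecutor] - STRENGTH[defender]
    if gap > 0:
        return "recommend_deny"      # fraud case dominates
    if gap < 0:
        return "recommend_approve"   # defence dominates
    return "recommend_escalate"      # evenly matched: punt to a human

print(panel_verdict("STRONG", "WEAK"))  # -> recommend_deny
```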
### Procedural generation
Episodes are generated procedurally — **5 fraud types × 4 coverage types × 3 jurisdictions × seed variation = 500+ unique episodes**. Same seed → same episode, so reviewers can reproduce exactly what the model saw.
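The determinism contract is easy to picture: route every random choice through one seeded RNG, as in this sketch. The enum values are illustrative placeholders; `server/claim_generator.py` holds the real generator.

```python
import random

# Illustrative placeholders -- the real enums live in server/claim_generator.py.
FRAUD_TYPES   = ["staged_accident", "inflated_bill", "identity_theft",
                 "phantom_treatment", "duplicate_claim"]
COVERAGE      = ["health", "motor", "property", "travel"]
JURISDICTIONS = ["JX-A", "JX-B", "JX-C"]

def generate_episode(seed: int) -> dict:
    rng = random.Random(seed)  # all randomness flows through one seeded RNG
    return {
        "seed": seed,
        "fraud_type": rng.choice(FRAUD_TYPES),
        "coverage": rng.choice(COVERAGE),
        "jurisdiction": rng.choice(JURISDICTIONS),
        "claim_amount": rng.randint(10_000, 500_000),
    }

assert generate_episode(42) == generate_episode(42)  # same seed, same episode
```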
---
## 3. Results
### 5,000-episode GRPO run — Qwen2.5-0.5B-Instruct, L4 GPU, 3 h 3 min
All numbers below are read directly from committed JSON artifacts ([`reports/training_summary.json`](reports/training_summary.json), [`reports/component_shift_summary.json`](reports/component_shift_summary.json)) — no hand-edits.
#### Three headline numbers
- **Training reward: 0.130 → 0.469 (3.6× improvement)** across 2,500 GRPO steps
- **Held-out decision accuracy: 0.000 → 1.000** — the trained model gets every held-out claim right
- **Held-out calibration score: 0.000 → 1.000** — confidence now matches correctness on every terminal action
#### Held-out evaluation (n = 6 episodes: 3 tasks × 2 seeds, live HTTP `/step`)
| Component | Before (untrained) | After (GRPO) | Change |
|---|---:|---:|---|
| **Decision accuracy** | 0.000 | **1.000** | **+1.000** |
| **Calibration** | 0.000 | **1.000** | **+1.000** |
| **Fraud detection** | 0.000 | **0.333** | +0.333 |
| Evidence quality | 0.333 | 0.333 | unchanged |
| Reasoning quality | 0.833 | 0.792 | −0.042 (within noise) |
#### Training plots
![Reward Curve](docs/reward_curve.png)
*X: training step. Y-left: training loss. Y-right: mean reward (training scalar, unbounded). Mean training reward climbs from 0.130 to 0.469 across 2,500 GRPO steps (5,000 episodes, 1 epoch).*
![Component Shift](docs/component_shift.png)
*Per-component scores on a 0–1 scale, before vs after on held-out eval (n = 6 episodes). Decision accuracy 0 → 1.0, Calibration 0 → 1.0, Fraud detection 0 → 0.33. The lift on the trained metrics is unmissable.*
The trained model learned to **make correct decisions with calibrated confidence** — exactly the skill this environment is designed to teach. The small dip in reasoning quality (−4 pts) is the only trade-off: the model traded a sliver of fluency for sharper decision-making.
---
## 4. Why It Matters
Calibration failure is universal. Every high-stakes domain where an AI must know the limits of its own knowledge suffers from it: medical diagnosis, legal analysis, financial advice, autonomous systems. ClaimCourt is a **blueprint for training epistemic humility into LLMs at the reward level, not the prompt level.**
Insurance is just the first domain. The 3×2 matrix transfers anywhere a model must commit to a decision and own the confidence behind it.
---
## Quick Start for Reviewers (3 minutes)
1. **Watch the demo** — [▶️ YouTube (≤2 min)](https://www.youtube.com/watch?v=Uk8sSLywEpE)
2. **Open the live UI** — https://huggingface.co/spaces/AniketAsla/debatefloor
3. **Select `contradictory_claim`** → click **Run Episode**.
4. Watch the agent: validate documents → flag fraud signals → **convene the Court Panel** → declare MED confidence → deny claim.
5. The highlighted cell in the 3×2 matrix shows exactly why it scored what it scored.
### Reproduce the training
```bash
git clone https://github.com/AniketAslaliya/debateFloor.git && cd debateFloor
pip install -r requirements.txt # env server deps
pip install -r train/requirements.txt # training deps (trl, unsloth, peft, wandb)
PYTHONPATH=. python train/train_minimal.py
```
Or open the [Colab notebook](https://colab.research.google.com/github/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb).
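The training script wires the environment's reward into TRL's `GRPOTrainer` (see References). A stripped-down sketch of that wiring, with the prompt set and the reward glue stubbed as assumptions:

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Hypothetical prompt set -- the real one is built from the claim generator.
dataset = Dataset.from_list([{"prompt": "Adjudicate claim CC-0001: ..."}])

def claimcourt_reward(prompts, completions, **kwargs):
    # Assumed glue: replay each completion against the environment and return
    # the 3x2 calibration reward. Stubbed here for illustration.
    return [0.0 for _ in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=claimcourt_reward,
    args=GRPOConfig(output_dir="grpo-claimcourt", num_generations=4),
    train_dataset=dataset,
)
trainer.train()
```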
---
## Repo layout
```
debateFloor/
├── openenv.yaml ← OpenEnv spec manifest (5 tasks, 3×2 matrix)
├── Dockerfile ← HF Space deployment
├── BLOG.md ← Mini-blog (full writeup)
├── app/ ← FastAPI server: /reset /step /state /tasks /health /schema
├── server/ ← ClaimCourt core: calibration_grader.py, claim_generator.py
├── train/ ← train_minimal.py + Colab notebook
├── docs/ ← reward_curve.png, component_shift.png
└── reports/ ← training_summary.json, component_shift_summary.json
```
OpenEnv compliance: `spec_version: 1`, OpenEnv `Environment` base class, `/reset` `/step` `/state` `/tasks` `/health` `/schema`, `supports_concurrent_sessions: true`, `max_concurrent_envs: 64`, reward in `[0.0, 1.0]`, Docker deployment — full manifest in [`openenv.yaml`](openenv.yaml).
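For a programmatic smoke test against those endpoints, something like the following should work. The endpoint paths come from the manifest; the request/response field names are assumptions, so consult `GET /schema` for the authoritative shapes.

```python
import requests

BASE = "http://localhost:7860"  # app_port from the Space config; point at the
                                # live Space URL to hit the hosted environment

print(requests.get(f"{BASE}/health").json())  # liveness check
print(requests.get(f"{BASE}/tasks").json())   # list available claim tasks

# Body field names below are assumptions -- consult GET /schema.
obs = requests.post(f"{BASE}/reset",
                    json={"task": "contradictory_claim", "seed": 0}).json()
step = requests.post(f"{BASE}/step",
                     json={"action": "validate_document"}).json()
print(step.get("reward"), step.get("done"))
```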
---
## Team
- **Aniket Aslaliya** — Environment Core, Claim Generator, Calibration Grader, UI
- **Mitali Mehta** — Domain Knowledge (fraud types, IRDAI regulations), Grader Design
- **Aditya Sharma** — Training Pipeline, GRPO Notebook, WandB Integration
---
## References
- **CAPO** — Confidence-Aware Policy Optimization ([arXiv:2604.12632](https://arxiv.org/abs/2604.12632), 2026) — documents the GRPO overconfidence pathology ClaimCourt fixes
- **DCPO** — Distribution-Calibrated Policy Optimization ([arXiv:2603.09117](https://arxiv.org/abs/2603.09117), 2026) — proves calibration improvement is achievable as a training objective
- **CoCA** — Co-optimising Confidence and Accuracy via segment-specific GRPO rewards ([arXiv:2603.05881](https://arxiv.org/abs/2603.05881))
- **OpenEnv** — [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)
- **TRL `GRPOTrainer`** — [huggingface.co/docs/trl/grpo_trainer](https://huggingface.co/docs/trl/grpo_trainer)