---
title: ClaimCourt — Insurance Calibration RL Environment
emoji: ⚖️
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: true
---
# ClaimCourt — Insurance Calibration RL Environment
**Repository codename:** `debatefloor` — all GitHub, Hugging Face Space, and model-repo URLs use the original codename so existing links resolve. The product is ClaimCourt everywhere it faces a human reader.
An OpenEnv-compliant RL training environment where AI agents investigate insurance claims, argue in an adversarial Court Panel, and must declare calibrated confidence before every terminal decision. Built for the Meta PyTorch × Scaler OpenEnv Hackathon Grand Finale, April 25–26 2026.
## 🎯 Headline result — Calibration score 0.000 → 1.000 on held-out claims

Across a 5,000-episode GRPO run on Qwen2.5-0.5B-Instruct, the trained agent's confidence now matches its correctness on every held-out terminal action — directly attacking the GRPO overconfidence pathology documented in CAPO (arXiv:2604.12632). Decision accuracy moved 0.000 → 1.000 on the same eval. Both numbers read straight from `reports/component_shift_summary.json` — no hand-edits.
## 📺 Watch the 2-minute demo first

▶️ [YouTube — ClaimCourt demo](https://www.youtube.com/watch?v=Uk8sSLywEpE) — see the full Court Panel sequence and the calibration matrix lighting up in real time.
## All artifacts in one table
| Artifact | Link |
|---|---|
| Demo Video (≤2 min) | https://www.youtube.com/watch?v=Uk8sSLywEpE |
| Live Environment (HF Space) | https://huggingface.co/spaces/AniketAsla/debatefloor |
| Mini-Blog (full writeup) | BLOG.md |
| Trained Model | https://huggingface.co/AniketAsla/debatefloor-grpo-qwen2.5-0.5b-instruct |
| Training Notebook (Colab) | train/train_debatefloor.ipynb |
| WandB workspace (training curves) | https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl |
| GitHub repo | https://github.com/AniketAslaliya/debateFloor |
## Hackathon theme alignment
| Theme | Fit | Why |
|---|---|---|
| #3.1 World Modeling — Professional Tasks (primary) | ✅✅✅ | Insurance claim adjudication is a textbook enterprise workflow: 11 investigative tool-calls (validate_document, query_historical_data, verify_identity, …), partially observable state where fraud signals are hidden until queried, multi-step orchestration (10–28 steps per episode), and an explicit anti-gaming detector that prevents shortcuts. The agent must maintain consistent internal state, update beliefs as evidence arrives, and orchestrate the workflow toward a calibrated terminal decision. |
| #1 Multi-Agent Interactions (secondary) | ✅✅ | The Court Panel (convene_debate_panel) spins up an adversarial Prosecutor + Defender pair from the existing evidence base. The Judge must model both adversaries' incentives and weigh their competing arguments before declaring HIGH/MED/LOW confidence — Fleet AI Scalable Oversight applied to claims work. |
## 1. The Problem
LLMs in high-stakes domains suffer from a documented failure mode: overconfidence. A model that approves or denies an insurance claim with 100 % certainty — but is wrong — causes real harm. The CAPO paper (arXiv:2604.12632, 2026) measures up to a 15 % AUC drop in standard GRPO training. DCPO (arXiv:2603.09117, 2026) shows a 71 % Expected-Calibration-Error reduction is achievable when calibration is treated as a first-class objective.
Why this matters now. Indian health-insurance fraud, waste & abuse drains ₹8,000–10,000 crore every year (BCG × Medi Assist, Nov 2025) — about 8 % of all claim payouts. From April 2026, the IRDAI Insurance Fraud Monitoring Framework Guidelines, 2025 make every insurer legally responsible for catching it. The obvious tool is AI — but standard RL training pushes models in exactly the wrong direction.
ClaimCourt is the direct fix. A reward surface that penalises overconfident wrong answers more severely than uncertain ones, teaching models when to be confident, not just what to say.
## 2. The Environment — what the agent sees, does, and gets rewarded for
### What the agent sees
A claim object: claimant identity, incident details, policy history, attached documents, and the list of available actions. After every action: updated documents, discovered fraud signals, action history, and a partial reward breakdown.
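To make this concrete, here is a rough sketch of the kind of payload one observation could carry. Every field name below is an assumption for illustration, not the environment's actual schema (the live `/schema` endpoint serves the authoritative shape):

```python
# Illustrative observation only — all field names are assumptions, not the
# real schema (query the live /schema endpoint for the authoritative shape).
example_observation = {
    "claim": {
        "claimant": {"name": "R. Sharma", "policy_id": "POL-2025-0187"},
        "incident": {"type": "hospitalisation", "amount_inr": 240_000},
        "policy_history": ["2 prior claims in 14 months"],
        "documents": ["discharge_summary.pdf", "itemised_bill.pdf"],
    },
    "discovered_fraud_signals": [],            # fills up as investigative actions run
    "action_history": ["validate_document"],
    "available_actions": [
        "validate_document", "flag_fraud_signal", "query_historical_data",
        "query_linked_claim", "verify_identity", "convene_debate_panel",
        "approve_claim", "deny_claim", "escalate_to_human", "request_investigation",
    ],
    "partial_reward_breakdown": {"evidence_quality": 0.33},
}
```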
### What the agent does
The agent investigates step-by-step before committing.
| Action class | Examples | Confidence required? |
|---|---|---|
| Investigative | validate_document, flag_fraud_signal, query_historical_data, query_linked_claim, verify_identity, convene_debate_panel | No |
| Terminal | approve_claim, deny_claim, escalate_to_human, request_investigation | Yes — HIGH / MED / LOW |
### What the agent gets rewarded for — the 3×2 Calibration Matrix (the core innovation)
Before every terminal action, the agent must declare a confidence level. The reward is determined by this matrix:
| Confidence | Correct Decision | Wrong Decision |
|---|---|---|
| HIGH | +1.0 | −0.8 ← worst outcome |
| MED | +0.6 | −0.2 |
| LOW | +0.1 | 0.0 ← safe |
An agent that always says HIGH to maximise reward is catastrophically punished when wrong. An agent that always says LOW is caught by the anti-gaming detector (LOW-rate > 70 % across 10+ episodes triggers a progressive penalty). The only winning strategy is accurate calibration — based on the CoCA framework (arXiv:2603.05881).
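The matrix is small enough to sketch directly. The values below are copied from the table; the function name and signature are illustrative rather than the grader's real API (the actual logic lives in `server/calibration_grader.py`), and the separate anti-gaming penalty for chronic LOW declarations is not modelled here:

```python
# 3×2 calibration matrix from the table above. Signature is illustrative;
# the real grader is server/calibration_grader.py. The anti-gaming penalty
# (LOW-rate > 70 % across 10+ episodes) is applied separately and not shown.
CALIBRATION_MATRIX: dict[tuple[str, bool], float] = {
    ("HIGH", True): 1.0,  ("HIGH", False): -0.8,  # overconfident and wrong: worst outcome
    ("MED",  True): 0.6,  ("MED",  False): -0.2,
    ("LOW",  True): 0.1,  ("LOW",  False):  0.0,  # uncertain and wrong: safe but unrewarded
}

def calibration_reward(confidence: str, decision_correct: bool) -> float:
    """Reward for a terminal action, given the declared confidence level."""
    return CALIBRATION_MATRIX[(confidence.upper(), decision_correct)]

assert calibration_reward("high", False) == -0.8   # the case the matrix punishes hardest
```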
### The Court Panel — the demo centrepiece

No other environment in the OpenEnv hub has this mechanic. Run `contradictory_claim` in the live Space to see it.
When evidence is mixed, the agent calls convene_debate_panel. Two adversarial sub-agents spin up from the existing evidence base:
```
INVESTIGATOR
├── validate_document      → discovers fraud signals
├── query_historical_data  → reveals cross-claim patterns
└── convene_debate_panel
            ↓
┌──────────────────┐    ┌──────────────────┐
│ PROSECUTOR       │    │ DEFENDER         │
│ Strength: STRONG │    │ Strength: WEAK   │
└──────────────────┘    └──────────────────┘
            ↓
PANEL VERDICT → recommendation
            ↓
JUDGE: approve / deny / escalate
     + confidence: HIGH / MED / LOW
```
The Court Panel forces the agent to expose its reasoning to a programmatic adversary before declaring confidence — Fleet AI Scalable Oversight, applied to claims work.
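As a rough mental model of the panel step (the names, strength scale, and thresholds below are all assumptions made for illustration, not the actual convene_debate_panel implementation):

```python
# Hypothetical sketch of how a panel verdict could be collapsed into a
# recommendation. Everything here is illustrative, not the real implementation.
from dataclasses import dataclass

@dataclass
class Argument:
    side: str       # "PROSECUTOR" or "DEFENDER"
    strength: str   # "WEAK" / "MODERATE" / "STRONG", derived from the evidence base

def panel_verdict(prosecutor: Argument, defender: Argument) -> str:
    """Turn the two adversarial arguments into a recommendation for the Judge."""
    rank = {"WEAK": 0, "MODERATE": 1, "STRONG": 2}
    gap = rank[prosecutor.strength] - rank[defender.strength]
    if gap >= 2:
        return "RECOMMEND_DENY"       # fraud case clearly dominates
    if gap <= -2:
        return "RECOMMEND_APPROVE"    # defence clearly dominates
    return "RECOMMEND_ESCALATE"       # mixed evidence: punt to a human

# The Judge still owns the terminal action and the HIGH/MED/LOW declaration;
# the verdict is one more piece of evidence, not a binding decision.
print(panel_verdict(Argument("PROSECUTOR", "STRONG"), Argument("DEFENDER", "WEAK")))
```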
### Procedural generation
Episodes are generated procedurally — 5 fraud types × 4 coverage types × 3 jurisdictions × seed variation = 500+ unique episodes. Same seed → same episode, so reviewers can reproduce exactly what the model saw.
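A minimal sketch of what seed-deterministic generation looks like; the category lists and field names below are placeholders, not the real ones in `server/claim_generator.py`:

```python
# Seed-deterministic episode sketch. Category values and field names are
# placeholders; the real generator is server/claim_generator.py.
import random

FRAUD_TYPES   = ["staged_accident", "inflated_bill", "phantom_treatment",
                 "identity_mismatch", "duplicate_claim"]        # 5 types (placeholder names)
COVERAGE      = ["health", "motor", "property", "travel"]       # 4 coverage types (placeholders)
JURISDICTIONS = ["jurisdiction_a", "jurisdiction_b", "jurisdiction_c"]  # 3 (placeholders)

def generate_episode(seed: int) -> dict:
    rng = random.Random(seed)   # same seed -> identical episode, so reviewers can reproduce it
    return {
        "seed": seed,
        "fraud_type":   rng.choice(FRAUD_TYPES),
        "coverage":     rng.choice(COVERAGE),
        "jurisdiction": rng.choice(JURISDICTIONS),
        "claim_amount_inr": rng.randint(50_000, 500_000),
    }

assert generate_episode(42) == generate_episode(42)   # determinism check
```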
## 3. Results
### 5,000-episode GRPO run — Qwen2.5-0.5B-Instruct, L4 GPU, 3 h 3 min
All numbers below are read directly from committed JSON artifacts (reports/training_summary.json, reports/component_shift_summary.json) — no hand-edits.
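To check the quoted numbers yourself, print the committed artifacts directly. The file paths come from the repo layout below; nothing about the JSON structure is assumed because the whole file is dumped:

```python
# Print the committed eval artifacts so the quoted numbers can be checked verbatim.
import json
from pathlib import Path

for report in ("reports/training_summary.json", "reports/component_shift_summary.json"):
    print(f"--- {report} ---")
    print(json.dumps(json.loads(Path(report).read_text()), indent=2))
```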
### Three headline numbers
- Training reward: 0.130 → 0.469 (3.6× improvement) across 2,500 GRPO steps
- Held-out decision accuracy: 0.000 → 1.000 — the trained model gets every held-out claim right
- Held-out calibration score: 0.000 → 1.000 — confidence now matches correctness on every terminal action
### Held-out evaluation (n = 6 episodes: 3 tasks × 2 seeds, live HTTP /step)
| Component | Before (untrained) | After (GRPO) | Change |
|---|---|---|---|
| Decision accuracy | 0.000 | 1.000 | +1.000 |
| Calibration | 0.000 | 1.000 | +1.000 |
| Fraud detection | 0.000 | 0.333 | +0.333 |
| Evidence quality | 0.333 | 0.333 | unchanged |
| Reasoning quality | 0.833 | 0.792 | −0.042 (within noise) |
### Training plots

![Reward curve](docs/reward_curve.png)

X: training step. Y-left: training loss. Y-right: mean reward (training scalar, unbounded). Mean training reward climbs from 0.130 → 0.469 across 2,500 GRPO steps (5,000 episodes, 1 epoch).

![Component shift on held-out eval](docs/component_shift.png)

Before vs. after on held-out eval (n = 6 episodes): decision accuracy 0 → 1.0, calibration 0 → 1.0, fraud detection 0 → 0.33. The lift on the trained metrics is unmissable.
The trained model learned to make correct decisions with calibrated confidence — exactly the skill this environment is designed to teach. The small dip in reasoning quality (−0.042) is the only trade-off: the model traded a sliver of fluency for sharper decision-making.
## 4. Why It Matters
Calibration failure is universal. Every high-stakes domain where an AI must know the limits of its own knowledge has it: medical diagnosis, legal analysis, financial advice, autonomous systems. ClaimCourt is a blueprint for training epistemic humility into LLMs at the reward level, not the prompt level.
Insurance is just the first domain. The 3×2 matrix transfers anywhere a model must commit a binary decision and own the confidence behind it.
## Quick Start for Reviewers (3 minutes)
- Watch the demo — ▶️ YouTube (≤2 min)
- Open the live UI — https://huggingface.co/spaces/AniketAsla/debatefloor
- Select `contradictory_claim` → click Run Episode.
- Watch the agent: validate documents → flag fraud signals → convene the Court Panel → declare MED confidence → deny claim.
- The highlighted cell in the 3×2 matrix shows exactly why it scored what it scored.
## Reproduce the training
```bash
git clone https://github.com/AniketAslaliya/debateFloor.git && cd debateFloor
pip install -r requirements.txt        # env server deps
pip install -r train/requirements.txt  # training deps (trl, unsloth, peft, wandb)
PYTHONPATH=. python train/train_minimal.py
```
Or open the Colab notebook (`train/train_debatefloor.ipynb`).
## Repo layout

```
debateFloor/
├── openenv.yaml   ← OpenEnv spec manifest (5 tasks, 3×2 matrix)
├── Dockerfile     ← HF Space deployment
├── BLOG.md        ← Mini-blog (full writeup)
├── app/           ← FastAPI server: /reset /step /state /tasks /health /schema
├── server/        ← ClaimCourt core: calibration_grader.py, claim_generator.py
├── train/         ← train_minimal.py + Colab notebook
├── docs/          ← reward_curve.png, component_shift.png
└── reports/       ← training_summary.json, component_shift_summary.json
```
OpenEnv compliance: spec_version: 1, OpenEnv Environment base class, /reset /step /state /tasks /health /schema, supports_concurrent_sessions: true, max_concurrent_envs: 64, reward in [0.0, 1.0], Docker deployment — full manifest in openenv.yaml.
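A minimal client sketch against the routes listed above. The route names come from this README; the base URL and the request/response field names are assumptions (the live `/schema` endpoint documents the real payloads):

```python
# Minimal HTTP client sketch. Routes are from the README; the base URL and the
# payload field names are assumptions — consult /schema for the real contract.
import requests

BASE = "https://aniketasla-debatefloor.hf.space"   # assumed direct URL of the HF Space

print(requests.get(f"{BASE}/health").json())       # liveness check
print(requests.get(f"{BASE}/tasks").json())        # list available tasks
obs = requests.post(f"{BASE}/reset",
                    json={"task": "contradictory_claim", "seed": 0}).json()
step = requests.post(f"{BASE}/step", json={"action": "validate_document"}).json()
print(step)
```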
## Team
- Aniket Aslaliya — Environment Core, Claim Generator, Calibration Grader, UI
- Mitali Mehta — Domain Knowledge (fraud types, IRDAI regulations), Grader Design
- Aditya Sharma — Training Pipeline, GRPO Notebook, WandB Integration
## References
- CAPO — Confidence-Aware Policy Optimization (arXiv:2604.12632, 2026) — documents the GRPO overconfidence pathology ClaimCourt fixes
- DCPO — Distribution-Calibrated Policy Optimization (arXiv:2603.09117, 2026) — proves calibration improvement is achievable as a training objective
- CoCA — Co-optimising Confidence and Accuracy via segment-specific GRPO rewards (arXiv:2603.05881)
- OpenEnv — github.com/meta-pytorch/OpenEnv
- TRL `GRPOTrainer` — huggingface.co/docs/trl/grpo_trainer