ClaimCourt — Demo video script (non-technical + full UI)

Goal: Someone with no ML background understands what ClaimCourt is, why it matters, sees every major part of your live Space, and wants to open the link.
Length: ~1:55–2:00. Speak slowly; pause on numbers.
Brand on screen: ClaimCourt. URLs always use codename debatefloor (unchanged links).

Public demo URL (paste when live): Add your YouTube or Loom link here, then copy it into the README table row Demo walkthrough (video) so judges have a one-click watch.

Training proof (for one short segment): 5,000 practice claims, reward 0.13 → 0.47, held-out calibration 0 → 1 and decision accuracy 0 → 1 (see table at bottom).

One-line premise (say this if you freeze)

"ClaimCourt is a practice courtroom for AI on insurance claims — it learns not just what to decide, but how sure it should be."

ACT 1 — Why this exists (~25 s)

[0:00 – 0:08] Hook — money + mistake everyone makes

Visual: 2–3 full-screen title cards (no small text). Optional stock: busy hospital/claim desk silhouette.

Say:

"India loses a staggering amount to insurance fraud every year — on the order of eight to ten thousand crore rupees. A lot of that isn’t cartoon villains — it’s honest-looking paperwork with something wrong underneath. The expensive mistake isn’t only getting the answer wrong — it’s being sure when you shouldn’t be. We built ClaimCourt so an AI can practice that skill."

(Optional lower-third once: source — BCG × Medi Assist style reports; keep it readable.)

[0:08 – 0:25] What ClaimCourt is — no jargon first

Visual: Open ClaimCourt on Hugging Face full screen. Hero / top bar with ClaimCourt visible.

Say:

"You’re looking at ClaimCourt — a free, in-browser demo. Pick a fake insurance case. Watch an AI investigate it like an analyst: read documents, spot red flags, sometimes call a mini trial with two opposing voices. At the end it must approve, deny, or hand off to a human — and say whether it’s high, medium, or low confidence. Same rules every time. You can try the next three examples yourself — link at the end."

Avoid until later: “OpenEnv”, “GRPO”, “reward shaping” — introduce in Act 3 in one sentence each.

ACT 2 — The product tour: every UI piece (~55 s)

Use one continuous screen recording with a yellow cursor ring. Pause ~2–3 s on each labelled area below.

[0:25 – 0:35] Left column — “Run an Episode”

Visual: Run an Episode card. Open the dropdown: show all three:

Task (dropdown)	Plain-English pitch (say while hovering)
clean claim	“Everything lines up — the honest answer is approve, and you should sound sure.”
contradictory claim	“Documents fight each other — dates, costs, procedures don’t match. The AI should dig, then often deny — with medium confidence, not bravado.”
distribution shift claim	“Looks normal until you pull in linked claims — shared brokers, patterns. Here the right move is often hand to a human and say low confidence — because the full picture is murky.”

Say (short):

"Three levels of difficulty — easy, tricky, and ‘looks fine until you connect the dots’. Same button for all: Run Episode."

Click Run Episode once on clean claim so the audience sees the flow start.

[0:35 – 0:50] Middle — “Claim Under Investigation”

Visual: Claim card: ID, claimant name, incident line, document list (DOC-1, DOC-2…).

Say:

"Middle of the screen: the fake claim file — who it is, what happened, which PDFs exist. You’re not reading a research paper — you’re reading a case file. That’s deliberate: insurers think in cases, not equations."

[0:50 – 1:05] Right — “agent-trace.log” (the story of the investigation)

Visual: Scroll the agent-trace.log panel. Point at lines like validate_document, flag_fraud_signal, convene_debate_panel, final approve_claim / deny_claim / escalate_to_human with [CONF: HIGH] or [CONF: MED] or [CONF: LOW].
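Technical backup for this shot: the final decision and its CONF tag can be pulled out of a trace line with a few lines of code. This is a minimal sketch, assuming each trace line is plain text that contains the action name and a tag such as [CONF: LOW]; the exact line format of agent-trace.log is not specified in this script, and `parse_final_line` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Terminal actions and confidence levels named in this script.
TERMINAL_ACTIONS = ("approve_claim", "deny_claim", "escalate_to_human")
CONF_TAG = re.compile(r"\[CONF:\s*(HIGH|MED|LOW)\]")


def parse_final_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (action, confidence) for a terminal decision line, else None.

    Assumption: the line format is free text containing the action name
    and a [CONF: ...] tag; the real log layout may differ.
    """
    tag = CONF_TAG.search(line)
    if tag is None:
        return None  # investigation steps carry no confidence tag
    for action in TERMINAL_ACTIONS:
        if action in line:
            return action, tag.group(1)
    return None


# Hypothetical sample lines, written only for illustration.
sample = [
    "step 3: query_linked_claim",
    "step 4: flag_fraud_signal",
    "step 5: escalate_to_human [CONF: LOW]",
]
decisions = [d for d in map(parse_final_line, sample) if d is not None]
print(decisions)
```

Only the last sample line yields a decision, which mirrors what the audience sees: many investigation steps, then one final action with a declared confidence.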

Say:

"Right side: a plain-English diary of what the AI did, step by step — not a black box. Each line is an action you could imagine a junior analyst taking: check this document, flag this inconsistency, call for a second opinion. That’s the transparency insurers actually need."

[1:05 – 1:15] Bottom-left — “LIVE METRICS”

Visual: LIVE METRICS: Reward (green number), Calibration score, Declared confidence pill (HIGH / MED / LOW), Steps taken. Optionally, show the CORRECT badge when the outcome matches the scenario’s goal.

Say:

"Numbers on the left aren’t magic scores for geeks. Think of reward as ‘did the behaviour we want just go up?’ Calibration is ‘did its confidence match reality?’ High confidence on an easy honest claim: good. High confidence on a murky ring-fraud case: bad. The UI makes that visible in one glance."

[1:15 – 1:25] “3×2 Calibration Matrix” — explain like a traffic light

Visual: The 3×2 Calibration Matrix card. Point at HIGH + Correct = +1 (highlighted), then HIGH + Wrong = −0.8 (red warning).
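Technical backup for this shot: the matrix is a lookup from (declared confidence, was the decision correct) to a score. The sketch below is illustrative only. The two HIGH cells (+1 and −0.8) come from the card described above; the MED and LOW values are hypothetical placeholders chosen to keep the "cocky mistakes cost more" ordering, and are not the values used by ClaimCourt.

```python
from typing import Dict, Tuple

# (declared confidence, decision correct?) -> calibration score
CALIBRATION_MATRIX: Dict[Tuple[str, bool], float] = {
    ("HIGH", True): 1.0,    # from the script: HIGH + Correct = +1
    ("HIGH", False): -0.8,  # from the script: HIGH + Wrong = -0.8
    ("MED", True): 0.6,     # hypothetical placeholder
    ("MED", False): -0.3,   # hypothetical placeholder
    ("LOW", True): 0.3,     # hypothetical placeholder
    ("LOW", False): -0.1,   # hypothetical placeholder
}


def calibration_score(confidence: str, correct: bool) -> float:
    """Look up the score for one decision in the 3x2 matrix."""
    key = (confidence.strip().upper(), correct)
    try:
        return CALIBRATION_MATRIX[key]
    except KeyError:
        raise ValueError(f"unknown confidence level: {confidence!r}") from None


# A confident mistake is penalised harder than a cautious one.
print(calibration_score("HIGH", False), calibration_score("LOW", False))
```

The design point to carry into the voice-over is the ordering of the "wrong" column, not the specific placeholder numbers.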

Say:

"This little grid is the rulebook for confidence. If you’re right and appropriately sure, you get the best score. If you’re wrong but you acted like a genius — that’s the worst cell: we penalise cocky mistakes harder than cautious ones. That single design choice is what teaches ‘honest uncertainty’ instead of fake confidence."

[1:25 – 1:40] “Multi-Agent Court Panel” — two lawyers in software

Visual: First the empty state (“run contradictory claim to see…”). Switch dropdown to contradictory claim, Run Episode, scroll until Court Panel Convened — Prosecutor (STRONG) vs Defender (WEAK) and the VERDICT bar.

Say:

"When the case is adversarial, the AI can open a court — not a gimmick, a stress test. One side argues fraud from the evidence we found; the other argues innocent explanations still exist. You see strong vs weak right on the card — then a recommended action. That’s how we stop one lazy headline from deciding someone’s claim."

[1:40 – 1:55] Third scenario — humility pays

Visual: distribution shift claim → Run Episode → trace with query_linked_claim, flag_fraud_signal, final escalate_to_human [CONF: LOW]. LIVE METRICS showing LOW confidence and a solid reward (e.g. ~0.7).

Say:

"Last trick: the fraud hides in links between claims — same broker, same pattern. The winning move isn’t bragging — it’s raising your hand: human needed, I’m only low confidence. ClaimCourt rewards that humility. In the real world, that’s fewer multi-crore mistakes."

ACT 3 — “Yes, we actually trained it” (~20 s) — keep light

Visual: Quick montage: the Weights & Biases project page (reward climbing) or docs/reward_curve.png on GitHub; optionally, one clip of an HF Jobs log line.

Say (one breath, then slow on numbers):

"We didn’t just draw a pretty UI. We ran the AI through five thousand practice claims on cloud GPUs. The training score — think ‘overall lesson learned’ — went from about zero-point-one-three to zero-point-four-seven. On held-out checks, decision accuracy and calibration both went from zero to perfect one-point-zero. Under the hood that’s reinforcement learning with Hugging Face’s TRL library — same family of tech behind recent open reasoning models. The details are in our README and mini-blog for anyone who wants to dig."

Numbers table (optional on-screen end card):

What we measure	Before training	After training
“Lesson learned” score (mean reward)	0.13	0.47
Decision matches the right action	0%	100%
Confidence matches reality	0%	100%
Catching fraud signals (partial credit)	0%	33%

ACT 4 — Close + try it (~15 s)

Visual: Full-screen end card. Large QR optional. Cursor hovers each line.

Say:

"If you work in risk, ops, or policy — or you’re just curious — open ClaimCourt, pick contradictory claim, hit Run Episode, and watch the trace and the court panel. If you’re a builder, everything is on GitHub under the codename debatefloor — links in the description. Try one case — that’s all it takes to see why this matters. Thank you."

Links (read slowly or show as text):

Try it: https://huggingface.co/spaces/AniketAsla/debatefloor
Code: https://github.com/AniketAslaliya/debateFloor
Mini-blog (markdown): https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/BLOG.md
Weights & Biases (all training runs): https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl

UI → script mapping (checklist so nothing is missing)

UI element	Act / time	Covered?
ClaimCourt header + Space chrome	Act 1	✓
Run an Episode + task dropdown (3 tasks)	Act 2	✓
Task description + Run Episode + CORRECT	Act 2	✓
Claim Under Investigation (ID, claimant, docs)	Act 2	✓
agent-trace.log (steps, CONF tags)	Act 2	✓
LIVE METRICS (Reward, Calibration, Confidence, Steps)	Act 2	✓
3×2 Calibration Matrix	Act 2	✓
Multi-Agent Court Panel (empty + live debate + verdict)	Act 2	✓
distribution shift claim + linked claims + LOW + reward	Act 2	✓
Training proof + numbers	Act 3	✓
Links + “try one case” CTA	Act 4	✓

Optional segments (if you have +15 s)

Split screen (technical viewers only): the Space on the left and app/main.py (/step) on the right, with the line “same server answers the demo and the training job.” Skip for a general audience.
JSON overlay (2 s): a small corner overlay of the request/response, which shows the demo is live and not a canned video.

Production checklist

#	Do this	Why
1	Rehearse one full run per task so clicks are smooth	Saves retakes
2	1080p, clear browser zoom (~110%)	Readable on phones
3	Yellow cursor in OBS	Viewers follow the story
4	No facecam needed	Keeps focus on product
5	Upload to YouTube with a public URL; keep large video files out of the HF repo	Matches hackathon rules
6	1.5 s title card: `ClaimCourt — OpenEnv Hackathon India 2026`	Brand + context

Jargon one-liners (if you use a term, follow with this)

Term	One-liner for family & friends
OpenEnv	“A standard way to package ‘AI + environment + rules’ so researchers can compare apples to apples.”
GRPO / TRL	“Practice + score + repeat — like flight simulators for pilots, but for language models.”
Reward	“Did we like that behaviour? — summed up as a number.”
Calibration	“Was its confidence honest — not just lucky?”
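For technical viewers who ask what "practice + score + repeat" means in GRPO: several answers are sampled for the same prompt, each gets a reward, and each answer is then scored relative to its own group. The sketch below shows only that group-relative step in plain Python. It assumes a population standard deviation and a small epsilon for stability; the exact normalisation inside TRL may differ, and the reward values are made up for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise rewards within one group of samples for the same prompt.

    Answers that beat the group average get a positive advantage and are
    reinforced; answers below average get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; libraries may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one claim.
advantages = group_relative_advantages([0.1, 0.4, 0.7, 0.4])
print([round(a, 2) for a in advantages])
```

If every answer in a group earns the same reward, all advantages are zero and that group contributes no learning signal, which is one reason a graded reward such as the calibration matrix is useful.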

Canonical stats (technical backup — same as repo JSON)

Source: reports/training_summary.json, reports/component_shift_summary.json. Model: Qwen2.5-0.5B-Instruct; 5k episodes, 2500 GRPO steps, about 3 hours on an L4 GPU.

Metric	Before	After
Mean training reward	0.130	0.469
Decision accuracy (eval)	0.00	1.00
Calibration (eval)	0.00	1.00
Fraud detection (eval)	0.00	0.33
Final train loss	—	~0.00565
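If someone asks how the headline numbers relate, the arithmetic can be checked quickly. This is a small sketch using only the values in the table and source line above; `summarize` is a hypothetical helper, and "episodes per step" is a plain average that assumes episodes are spread evenly across GRPO steps.

```python
from typing import Dict

# Values copied from the canonical stats in this script.
EPISODES = 5000
GRPO_STEPS = 2500
REWARD_BEFORE = 0.130
REWARD_AFTER = 0.469


def summarize(before: float, after: float, episodes: int, steps: int) -> Dict[str, float]:
    """Derive simple comparison figures from the reported training stats."""
    return {
        "absolute_gain": after - before,       # reward points gained
        "relative_gain": after / before,       # "times higher" than the start
        "episodes_per_step": episodes / steps, # average, assumes even spread
    }


summary = summarize(REWARD_BEFORE, REWARD_AFTER, EPISODES, GRPO_STEPS)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In spoken form this supports saying the mean reward rose by about 0.34 points, or roughly 3.6 times its starting value.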