debatefloor / docs /VideoScript_ClaimCourt.md
AniketAsla's picture
sync: git 5dd32e4 (5dd32e4d821f0b6b1b3ed6297b85b15934752d0a)
7b42b28 verified
# ClaimCourt — Demo video script (non-technical + full UI)
**Goal:** Someone with *no* ML background understands *what* ClaimCourt is, *why* it matters, sees *every major part* of your live Space, and wants to open the link.
**Length:** ~1:55–2:00. Speak slowly; pause on numbers.
**Brand on screen:** **ClaimCourt**. **URLs always use codename** `debatefloor` (unchanged links).
**Public demo URL (paste when live):** _Add your YouTube or Loom link here, then copy it into the README table row **Demo walkthrough (video)** so judges have a one-click watch._
**Training proof (for one short segment):** 5,000 practice claims, reward **0.13 → 0.47**, held-out **calibration 0 → 1** and **decision accuracy 0 → 1** (see table at bottom).
---
## One-line premise (say this if you freeze)
> "ClaimCourt is a **practice courtroom for AI on insurance claims** — it learns not just *what* to decide, but *how sure* it should be."
---
## ACT 1 — Why this exists (~25 s)
### [0:00 – 0:08] Hook — money + mistake everyone makes
**Visual:** 2–3 full-screen title cards (no small text). Optional stock: busy hospital/claim desk silhouette.
**Say:**
> "India loses a staggering amount to insurance fraud every year — on the order of **eight to ten thousand crore rupees**. A lot of that isn’t cartoon villains — it’s **honest-looking paperwork** with something wrong underneath. The expensive mistake isn’t only *getting the answer wrong* — it’s being **sure** when you shouldn’t be. We built **ClaimCourt** so an AI can practice that skill."
*(Optional lower-third once: source — BCG × Medi Assist style reports; keep it readable.)*
### [0:08 – 0:25] What ClaimCourt is — no jargon first
**Visual:** Open **[ClaimCourt on Hugging Face](https://huggingface.co/spaces/AniketAsla/debatefloor)** full screen. Hero / top bar with **ClaimCourt** visible.
**Say:**
> "You’re looking at **ClaimCourt** — a **free, in-browser demo**. Pick a fake insurance case. Watch an AI **investigate** it like an analyst: read documents, spot red flags, sometimes call a **mini trial** with two opposing voices. At the end it must **approve**, **deny**, or **hand off to a human** — and say whether it’s **high**, **medium**, or **low** confidence. Same rules every time. You can try the next three examples yourself — link at the end."
**Avoid until later:** “OpenEnv”, “GRPO”, “reward shaping” — introduce in Act 3 in one sentence each.
---
## ACT 2 — The product tour: every UI piece (~55 s)
Use **one continuous screen recording** with a **yellow cursor ring**. Pause ~2–3 s on each labelled area below.
### [0:25 – 0:35] Left column — “Run an Episode”
**Visual:** **Run an Episode** card. Open the **dropdown**: show all three:
| Task (dropdown) | Plain-English pitch (say while hovering) |
|-----------------|------------------------------------------|
| **clean claim** | “Everything lines up — the honest answer is *approve*, and you should sound **sure**.” |
| **contradictory claim** | “Documents **fight each other** — dates, costs, procedures don’t match. The AI should dig, then often **deny** — with **medium** confidence, not bravado.” |
| **distribution shift claim** | “Looks normal **until** you pull in **linked** claims — shared brokers, patterns. Here the *right* move is often **hand to a human** and say **low** confidence — because the full picture is murky.” |
**Say (short):**
> "Three levels of difficulty — **easy**, **tricky**, and **‘looks fine until you connect the dots’**. Same button for all: **Run Episode**."
Click **Run Episode** once on **clean claim** so the audience sees the flow start.
### [0:35 – 0:50] Middle — “Claim Under Investigation”
**Visual:** Claim card: **ID**, **claimant name**, **incident** line, **document list** (DOC-1, DOC-2…).
**Say:**
> "Middle of the screen: the **fake claim file** — who it is, what happened, which PDFs exist. You’re not reading a research paper — you’re reading a **case file**. That’s deliberate: insurers think in cases, not equations."
### [0:50 – 1:05] Right — “agent-trace.log” (the story of the investigation)
**Visual:** Scroll the **agent-trace.log** panel. Point at lines like `validate_document`, `flag_fraud_signal`, `convene_debate_panel`, final `approve_claim` / `deny_claim` / `escalate_to_human` with **`[CONF: HIGH]`** or **`[CONF: MED]`** or **`[CONF: LOW]`**.
**Say:**
> "Right side: a **plain-English diary** of what the AI did, step by step — not a black box. Each line is an action you could imagine a junior analyst taking: *check this document*, *flag this inconsistency*, *call for a second opinion*. That’s the transparency insurers actually need."
### [1:05 – 1:15] Bottom-left — “LIVE METRICS”
**Visual:** **LIVE METRICS**: **Reward** (green number), **Calibration score**, **Declared confidence** pill (HIGH / MED / LOW), **Steps taken**. Optionally **CORRECT** badge when the outcome matches the scenario’s goal.
**Say:**
> "Numbers on the left aren’t magic scores for geeks — think of **reward** as ‘**did the behaviour we want just go up?****Calibration** is ‘**did its confidence match reality?**’ High confidence on an easy honest claim — good. High confidence on a murky ring-fraud case — **bad**. The UI makes that visible in one glance."
### [1:15 – 1:25] “3×2 Calibration Matrix” — explain like a traffic light
**Visual:** The **3×2 Calibration Matrix** card. Point at **HIGH + Correct = +1** (highlighted), then **HIGH + Wrong = −0.8** (red warning).
**Say:**
> "This little grid is the **rulebook for confidence**. If you’re **right** and **appropriately sure**, you get the **best** score. If you’re **wrong** but you **acted like a genius** — that’s the **worst** cell: we penalise **cocky mistakes** harder than cautious ones. That single design choice is what teaches ‘**honest uncertainty**’ instead of fake confidence."
### [1:25 – 1:40] “Multi-Agent Court Panel” — two lawyers in software
**Visual:** First the **empty state** (“run contradictory claim to see…”). Switch dropdown to **contradictory claim**, **Run Episode**, scroll until **Court Panel Convened****Prosecutor (STRONG)** vs **Defender (WEAK)** and the **VERDICT** bar.
**Say:**
> "When the case is adversarial, the AI can **open a court** — not a gimmick, a **stress test**. One side argues *fraud from the evidence we found*; the other argues *innocent explanations still exist*. You see **strong** vs **weak** right on the card — then a **recommended action**. That’s how we stop one lazy headline from deciding someone’s claim."
### [1:40 – 1:55] Third scenario — humility pays
**Visual:** **distribution shift claim****Run Episode** → trace with **`query_linked_claim`**, **`flag_fraud_signal`**, final **`escalate_to_human [CONF: LOW]`**. LIVE METRICS showing **LOW** confidence and a solid **reward** (e.g. ~0.7).
**Say:**
> "Last trick: the fraud hides in **links between claims** — same broker, same pattern. The winning move isn’t bragging — it’s **raising your hand**: *human needed, I’m only low confidence*. ClaimCourt **rewards that humility**. In the real world, that’s fewer multi-crore mistakes."
---
## ACT 3 — “Yes, we actually trained it” (~20 s) — keep light
**Visual:** Quick montage: **WandB** project page (reward climbing) **or** `docs/reward_curve.png` in GitHub; optional 1 clip of HF Jobs log line.
**Say (one breath, then slow on numbers):**
> "We didn’t just draw a pretty UI. We ran the AI through **five thousand** practice claims on cloud GPUs. The training score — think ‘**overall lesson learned**’ — went from about **zero-point-one-three** to **zero-point-four-seven**. On held-out checks, **decision accuracy** and **calibration** both went from **zero to perfect one-point-zero**. Under the hood that’s **reinforcement learning** with Hugging Face’s **TRL** library — same family of tech behind recent open reasoning models. The details are in our **README** and **mini-blog** for anyone who wants to dig."
**Numbers table (optional on-screen end card):**
| What we measure | Before training | After training |
|-----------------|-----------------|----------------|
| “Lesson learned” score (mean reward) | **0.13** | **0.47** |
| Decision matches the right action | **0%** | **100%** |
| Confidence matches reality | **0%** | **100%** |
| Catching fraud signals (partial credit) | **0%** | **33%** |
---
## ACT 4 — Close + try it (~15 s)
**Visual:** Full-screen end card. Large QR optional. Cursor hovers each line.
**Say:**
> "If you work in risk, ops, or policy — or you’re just curious — **open ClaimCourt**, pick **contradictory claim**, hit **Run Episode**, and watch the trace and the court panel. If you’re a builder, everything is on **GitHub** under the codename **debatefloor** — links in the description. **Try one case** — that’s all it takes to see why this matters. Thank you."
**Links (read slowly or show as text):**
- **Try it:** https://huggingface.co/spaces/AniketAsla/debatefloor
- **Code:** https://github.com/AniketAslaliya/debateFloor
- **Mini-blog (markdown):** https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/BLOG.md
- **Weights & Biases (all training runs):** https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl
---
## UI → script mapping (checklist so nothing is missing)
| UI element | Act / time | Covered? |
|------------|------------|----------|
| ClaimCourt header + Space chrome | Act 1 | ✓ |
| Run an Episode + **task dropdown** (3 tasks) | Act 2 | ✓ |
| Task description + **Run Episode** + **CORRECT** | Act 2 | ✓ |
| **Claim Under Investigation** (ID, claimant, docs) | Act 2 | ✓ |
| **agent-trace.log** (steps, CONF tags) | Act 2 | ✓ |
| **LIVE METRICS** (Reward, Calibration, Confidence, Steps) | Act 2 | ✓ |
| **3×2 Calibration Matrix** | Act 2 | ✓ |
| **Multi-Agent Court Panel** (empty + **live debate + verdict**) | Act 2 | ✓ |
| **distribution_shift** + linked claims + **LOW** + reward | Act 2 | ✓ |
| Training proof + numbers | Act 3 | ✓ |
| Links + “try one case” CTA | Act 4 | ✓ |
---
## Optional segments (if you have +15 s)
- **Split screen (technical viewers only):** Space left, `app/main.py` `/step` right — *“same server answers the demo and the training job.”* Skip for a general audience.
- **JSON overlay (2 s):** tiny corner: request/response — proves it’s not canned video.
---
## Production checklist
| # | Do this | Why |
|---|---------|-----|
| 1 | Rehearse **one full run** per task so clicks are smooth | Saves retakes |
| 2 | **1080p**, clear browser zoom (~110%) | Readable on phones |
| 3 | **Yellow cursor** in OBS | Viewers follow the story |
| 4 | **No facecam** needed | Keeps focus on product |
| 5 | Export **YouTube** as public URL; **no** huge video in HF repo | Matches hackathon rules |
| 6 | **1.5 s** title card: `ClaimCourt — OpenEnv Hackathon India 2026` | Brand + context |
---
## Jargon one-liners (if you use a term, follow with this)
| Term | One-liner for family & friends |
|------|--------------------------------|
| OpenEnv | “A standard way to package ‘**AI + environment + rules**’ so researchers can compare apples to apples.” |
| GRPO / TRL | “**Practice + score + repeat** — like flight simulators for pilots, but for language models.” |
| Reward | “**Did we like that behaviour?** — summed up as a number.” |
| Calibration | “**Was its confidence honest** — not just lucky?” |
---
## Canonical stats (technical backup — same as repo JSON)
Source: `reports/training_summary.json`, `reports/component_shift_summary.json` — Qwen2.5-0.5B-Instruct, 5k episodes, 2500 GRPO steps, ~3h on L4.
| Metric | Before | After |
|--------|--------|-------|
| Mean training reward | 0.130 | 0.469 |
| Decision accuracy (eval) | 0.00 | 1.00 |
| Calibration (eval) | 0.00 | 1.00 |
| Fraud detection (eval) | 0.00 | 0.33 |
| Final train loss | — | ~0.00565 |