# ClaimCourt — Demo video script (non-technical + full UI)

**Goal:** Someone with *no* ML background understands *what* ClaimCourt is, *why* it matters, sees *every major part* of your live Space, and wants to open the link.

**Length:** ~1:55–2:00. Speak slowly; pause on numbers.

**Brand on screen:** **ClaimCourt**. **URLs always use codename** `debatefloor` (unchanged links).

**Public demo URL (paste when live):** _Add your YouTube or Loom link here, then copy it into the README table row **Demo walkthrough (video)** so judges have a one-click watch._

**Training proof (for one short segment):** 5,000 practice claims, reward **0.13 → 0.47**, held-out **calibration 0 → 1** and **decision accuracy 0 → 1** (see table at bottom).

---

## One-line premise (say this if you freeze)

> "ClaimCourt is a **practice courtroom for AI on insurance claims** — it learns not just *what* to decide, but *how sure* it should be."

---

## ACT 1 — Why this exists (~25 s)

### [0:00 – 0:08] Hook — money + mistake everyone makes

**Visual:** 2–3 full-screen title cards (no small text). Optional stock: busy hospital/claim desk silhouette.

**Say:**

> "India loses a staggering amount to insurance fraud every year — on the order of **eight to ten thousand crore rupees**. A lot of that isn’t cartoon villains — it’s **honest-looking paperwork** with something wrong underneath. The expensive mistake isn’t only *getting the answer wrong* — it’s being **sure** when you shouldn’t be. We built **ClaimCourt** so an AI can practice that skill."

*(Optional lower-third once: source — BCG × Medi Assist style reports; keep it readable.)*

### [0:08 – 0:25] What ClaimCourt is — no jargon first

**Visual:** Open **[ClaimCourt on Hugging Face](https://huggingface.co/spaces/AniketAsla/debatefloor)** full screen. Hero / top bar with **ClaimCourt** visible.

**Say:**

> "You’re looking at **ClaimCourt** — a **free, in-browser demo**. Pick a fake insurance case.
> Watch an AI **investigate** it like an analyst: read documents, spot red flags, sometimes call a **mini trial** with two opposing voices. At the end it must **approve**, **deny**, or **hand off to a human** — and say whether it’s **high**, **medium**, or **low** confidence. Same rules every time. You can try the next three examples yourself — link at the end."

**Avoid until later:** “OpenEnv”, “GRPO”, “reward shaping” — introduce in Act 3 in one sentence each.

---

## ACT 2 — The product tour: every UI piece (~55 s)

Use **one continuous screen recording** with a **yellow cursor ring**. Pause ~2–3 s on each labelled area below.

### [0:25 – 0:35] Left column — “Run an Episode”

**Visual:** **Run an Episode** card. Open the **dropdown**: show all three:

| Task (dropdown) | Plain-English pitch (say while hovering) |
|-----------------|------------------------------------------|
| **clean claim** | “Everything lines up — the honest answer is *approve*, and you should sound **sure**.” |
| **contradictory claim** | “Documents **fight each other** — dates, costs, procedures don’t match. The AI should dig, then often **deny** — with **medium** confidence, not bravado.” |
| **distribution shift claim** | “Looks normal **until** you pull in **linked** claims — shared brokers, patterns. Here the *right* move is often **hand to a human** and say **low** confidence — because the full picture is murky.” |

**Say (short):**

> "Three levels of difficulty — **easy**, **tricky**, and **‘looks fine until you connect the dots’**. Same button for all: **Run Episode**."

Click **Run Episode** once on **clean claim** so the audience sees the flow start.

### [0:35 – 0:50] Middle — “Claim Under Investigation”

**Visual:** Claim card: **ID**, **claimant name**, **incident** line, **document list** (DOC-1, DOC-2…).

**Say:**

> "Middle of the screen: the **fake claim file** — who it is, what happened, which PDFs exist. You’re not reading a research paper — you’re reading a **case file**.
> That’s deliberate: insurers think in cases, not equations."

### [0:50 – 1:05] Right — “agent-trace.log” (the story of the investigation)

**Visual:** Scroll the **agent-trace.log** panel. Point at lines like `validate_document`, `flag_fraud_signal`, `convene_debate_panel`, final `approve_claim` / `deny_claim` / `escalate_to_human` with **`[CONF: HIGH]`** or **`[CONF: MED]`** or **`[CONF: LOW]`**.

**Say:**

> "Right side: a **plain-English diary** of what the AI did, step by step — not a black box. Each line is an action you could imagine a junior analyst taking: *check this document*, *flag this inconsistency*, *call for a second opinion*. That’s the transparency insurers actually need."

### [1:05 – 1:15] Bottom-left — “LIVE METRICS”

**Visual:** **LIVE METRICS**: **Reward** (green number), **Calibration score**, **Declared confidence** pill (HIGH / MED / LOW), **Steps taken**. Optionally **CORRECT** badge when the outcome matches the scenario’s goal.

**Say:**

> "Numbers on the left aren’t magic scores for geeks — think of **reward** as ‘**did the behaviour we want just go up?**’ **Calibration** is ‘**did its confidence match reality?**’ High confidence on an easy honest claim — good. High confidence on a murky ring-fraud case — **bad**. The UI makes that visible in one glance."

### [1:15 – 1:25] “3×2 Calibration Matrix” — explain like a traffic light

**Visual:** The **3×2 Calibration Matrix** card. Point at **HIGH + Correct = +1** (highlighted), then **HIGH + Wrong = −0.8** (red warning).

**Say:**

> "This little grid is the **rulebook for confidence**. If you’re **right** and **appropriately sure**, you get the **best** score. If you’re **wrong** but you **acted like a genius** — that’s the **worst** cell: we penalise **cocky mistakes** harder than cautious ones. That single design choice is what teaches ‘**honest uncertainty**’ instead of fake confidence."

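
**Presenter backup (not for the voice-over):** if a technical viewer asks how the grid turns a final trace line into a score, the idea can be sketched in a few lines of Python. This is a minimal illustration, not ClaimCourt's actual code. Only the two cells named on the card (HIGH + Correct = +1, HIGH + Wrong = −0.8) come from the UI; the four MED/LOW values, the `<action> [CONF: <LEVEL>]` line format, and every function name below are assumptions made for the example.

```python
import re

# Cells stated on the Calibration Matrix card.
KNOWN_CELLS = {("HIGH", True): 1.0, ("HIGH", False): -0.8}

# PLACEHOLDER values, NOT taken from ClaimCourt. They are chosen only so that
# cautious mistakes cost less than the HIGH + Wrong cell.
PLACEHOLDER_CELLS = {
    ("MED", True): 0.6,
    ("MED", False): -0.4,
    ("LOW", True): 0.3,
    ("LOW", False): -0.1,
}

MATRIX = {**KNOWN_CELLS, **PLACEHOLDER_CELLS}

# Assumed shape of the final trace line: "<action> [CONF: <LEVEL>]".
FINAL_LINE = re.compile(r"(?P<action>\w+)\s*\[CONF:\s*(?P<conf>HIGH|MED|LOW)\]")


def parse_final_line(line: str) -> tuple[str, str]:
    """Return (action, confidence) from a final trace line."""
    match = FINAL_LINE.search(line)
    if match is None:
        raise ValueError(f"no confidence tag in: {line!r}")
    return match["action"], match["conf"]


def calibration_score(confidence: str, correct: bool) -> float:
    """Look up the matrix cell for a declared confidence and an outcome."""
    return MATRIX[(confidence, correct)]


# A confidently wrong denial lands in the worst cell of the grid.
action, conf = parse_final_line("deny_claim [CONF: HIGH]")
print(action, conf, calibration_score(conf, correct=False))
```

The lookup-table shape mirrors what the card shows: the score depends only on the declared confidence and whether the decision was right, so the "cocky mistake" penalty is a single cell rather than a special case in code.
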
### [1:25 – 1:40] “Multi-Agent Court Panel” — two lawyers in software

**Visual:** First the **empty state** (“run contradictory claim to see…”). Switch dropdown to **contradictory claim**, **Run Episode**, scroll until **Court Panel Convened** — **Prosecutor (STRONG)** vs **Defender (WEAK)** and the **VERDICT** bar.

**Say:**

> "When the case is adversarial, the AI can **open a court** — not a gimmick, a **stress test**. One side argues *fraud from the evidence we found*; the other argues *innocent explanations still exist*. You see **strong** vs **weak** right on the card — then a **recommended action**. That’s how we stop one lazy headline from deciding someone’s claim."

### [1:40 – 1:55] Third scenario — humility pays

**Visual:** **distribution shift claim** → **Run Episode** → trace with **`query_linked_claim`**, **`flag_fraud_signal`**, final **`escalate_to_human [CONF: LOW]`**. LIVE METRICS showing **LOW** confidence and a solid **reward** (e.g. ~0.7).

**Say:**

> "Last trick: the fraud hides in **links between claims** — same broker, same pattern. The winning move isn’t bragging — it’s **raising your hand**: *human needed, I’m only low confidence*. ClaimCourt **rewards that humility**. In the real world, that’s fewer multi-crore mistakes."

---

## ACT 3 — “Yes, we actually trained it” (~20 s) — keep light

**Visual:** Quick montage: **WandB** project page (reward climbing) **or** `docs/reward_curve.png` in GitHub; optional 1 clip of HF Jobs log line.

**Say (one breath, then slow on numbers):**

> "We didn’t just draw a pretty UI. We ran the AI through **five thousand** practice claims on cloud GPUs. The training score — think ‘**overall lesson learned**’ — went from about **zero-point-one-three** to **zero-point-four-seven**. On held-out checks, **decision accuracy** and **calibration** both went from **zero to perfect one-point-zero**.
> Under the hood that’s **reinforcement learning** with Hugging Face’s **TRL** library — same family of tech behind recent open reasoning models. The details are in our **README** and **mini-blog** for anyone who wants to dig."

**Numbers table (optional on-screen end card):**

| What we measure | Before training | After training |
|-----------------|-----------------|----------------|
| “Lesson learned” score (mean reward) | **0.13** | **0.47** |
| Decision matches the right action | **0%** | **100%** |
| Confidence matches reality | **0%** | **100%** |
| Catching fraud signals (partial credit) | **0%** | **33%** |

---

## ACT 4 — Close + try it (~15 s)

**Visual:** Full-screen end card. Large QR optional. Cursor hovers each line.

**Say:**

> "If you work in risk, ops, or policy — or you’re just curious — **open ClaimCourt**, pick **contradictory claim**, hit **Run Episode**, and watch the trace and the court panel. If you’re a builder, everything is on **GitHub** under the codename **debatefloor** — links in the description. **Try one case** — that’s all it takes to see why this matters. Thank you."

**Links (read slowly or show as text):**

- **Try it:** https://huggingface.co/spaces/AniketAsla/debatefloor
- **Code:** https://github.com/AniketAslaliya/debateFloor
- **Mini-blog (markdown):** https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/BLOG.md
- **Weights & Biases (all training runs):** https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl

---

## UI → script mapping (checklist so nothing is missing)

| UI element | Act / time | Covered? |
|------------|------------|----------|
| ClaimCourt header + Space chrome | Act 1 | ✓ |
| Run an Episode + **task dropdown** (3 tasks) | Act 2 | ✓ |
| Task description + **Run Episode** + **CORRECT** | Act 2 | ✓ |
| **Claim Under Investigation** (ID, claimant, docs) | Act 2 | ✓ |
| **agent-trace.log** (steps, CONF tags) | Act 2 | ✓ |
| **LIVE METRICS** (Reward, Calibration, Confidence, Steps) | Act 2 | ✓ |
| **3×2 Calibration Matrix** | Act 2 | ✓ |
| **Multi-Agent Court Panel** (empty + **live debate + verdict**) | Act 2 | ✓ |
| **distribution_shift** + linked claims + **LOW** + reward | Act 2 | ✓ |
| Training proof + numbers | Act 3 | ✓ |
| Links + “try one case” CTA | Act 4 | ✓ |

---

## Optional segments (if you have +15 s)

- **Split screen (technical viewers only):** Space left, `app/main.py` `/step` right — *“same server answers the demo and the training job.”* Skip for a general audience.
- **JSON overlay (2 s):** tiny corner: request/response — proves it’s not canned video.

---

## Production checklist

| # | Do this | Why |
|---|---------|-----|
| 1 | Rehearse **one full run** per task so clicks are smooth | Saves retakes |
| 2 | **1080p**, clear browser zoom (~110%) | Readable on phones |
| 3 | **Yellow cursor** in OBS | Viewers follow the story |
| 4 | **No facecam** needed | Keeps focus on product |
| 5 | Export **YouTube** as public URL; **no** huge video in HF repo | Matches hackathon rules |
| 6 | **1.5 s** title card: `ClaimCourt — OpenEnv Hackathon India 2026` | Brand + context |

---

## Jargon one-liners (if you use a term, follow with this)

| Term | One-liner for family & friends |
|------|--------------------------------|
| OpenEnv | “A standard way to package ‘**AI + environment + rules**’ so researchers can compare apples to apples.” |
| GRPO / TRL | “**Practice + score + repeat** — like flight simulators for pilots, but for language models.” |
| Reward | “**Did we like that behaviour?** — summed up as a number.” |
| Calibration | “**Was its confidence honest** — not just lucky?” |

---

## Canonical stats (technical backup — same as repo JSON)

Source: `reports/training_summary.json`, `reports/component_shift_summary.json` — Qwen2.5-0.5B-Instruct, 5k episodes, 2500 GRPO steps, ~3h on L4.

| Metric | Before | After |
|--------|--------|-------|
| Mean training reward | 0.130 | 0.469 |
| Decision accuracy (eval) | 0.00 | 1.00 |
| Calibration (eval) | 0.00 | 1.00 |
| Fraud detection (eval) | 0.00 | 0.33 |
| Final train loss | — | ~0.00565 |
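
**Optional consistency check:** the on-screen numbers in Act 3 are rounded versions of the canonical stats, so it may be worth confirming the two agree before recording. The sketch below hard-codes the values from the canonical stats table rather than reading the repo JSON, because the JSON schema is not shown in this script. The dictionary keys and the `on_screen` helper are hypothetical names used only for illustration.

```python
# Before/after values copied from the canonical stats table.
# Key names are illustrative; they are NOT the repo's JSON field names.
CANONICAL = {
    "mean_reward": (0.130, 0.469),
    "decision_accuracy": (0.00, 1.00),
    "calibration": (0.00, 1.00),
    "fraud_detection": (0.00, 0.33),
}


def on_screen(metric: str, value: float) -> str:
    """Format a canonical value the way the end-card table displays it."""
    if metric == "mean_reward":
        return f"{value:.2f}"  # reward is shown to two decimals
    return f"{value:.0%}"  # eval metrics are shown as whole percentages


for metric, (before, after) in CANONICAL.items():
    print(f"{metric}: {on_screen(metric, before)} -> {on_screen(metric, after)}")
```

If any printed value differs from the end card, the canonical table is the one to trust and the end card should be updated.
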