# ClaimCourt — Demo video script (non-technical + full UI)
**Goal:** Someone with *no* ML background understands *what* ClaimCourt is, *why* it matters, sees *every major part* of your live Space, and wants to open the link.
**Length:** ~1:55–2:00. Speak slowly; pause on numbers.
**Brand on screen:** **ClaimCourt**. **URLs always use codename** `debatefloor` (unchanged links).
**Public demo URL (paste when live):** _Add your YouTube or Loom link here, then copy it into the README table row **Demo walkthrough (video)** so judges have a one-click watch._
**Training proof (for one short segment):** 5,000 practice claims, reward **0.13 → 0.47**, held-out **calibration 0 → 1** and **decision accuracy 0 → 1** (see table at bottom).
---
## One-line premise (say this if you freeze)
> "ClaimCourt is a **practice courtroom for AI on insurance claims** — it learns not just *what* to decide, but *how sure* it should be."
---
## ACT 1 — Why this exists (~25 s)
### [0:00 – 0:08] Hook — money + mistake everyone makes
**Visual:** 2–3 full-screen title cards (no small text). Optional stock: busy hospital/claim desk silhouette.
**Say:**
> "India loses a staggering amount to insurance fraud every year — on the order of **eight to ten thousand crore rupees**. A lot of that isn’t cartoon villains — it’s **honest-looking paperwork** with something wrong underneath. The expensive mistake isn’t only *getting the answer wrong* — it’s being **sure** when you shouldn’t be. We built **ClaimCourt** so an AI can practice that skill."
*(Optional lower-third once: source — BCG × Medi Assist style reports; keep it readable.)*
### [0:08 – 0:25] What ClaimCourt is — no jargon first
**Visual:** Open **[ClaimCourt on Hugging Face](https://huggingface.co/spaces/AniketAsla/debatefloor)** full screen. Hero / top bar with **ClaimCourt** visible.
**Say:**
> "You’re looking at **ClaimCourt** — a **free, in-browser demo**. Pick a fake insurance case. Watch an AI **investigate** it like an analyst: read documents, spot red flags, sometimes call a **mini trial** with two opposing voices. At the end it must **approve**, **deny**, or **hand off to a human** — and say whether it’s **high**, **medium**, or **low** confidence. Same rules every time. You can try the next three examples yourself — link at the end."
**Avoid until later:** “OpenEnv”, “GRPO”, “reward shaping” — introduce in Act 3 in one sentence each.
---
## ACT 2 — The product tour: every UI piece (~55 s)
Use **one continuous screen recording** with a **yellow cursor ring**. Pause ~2–3 s on each labelled area below.
### [0:25 – 0:35] Left column — “Run an Episode”
**Visual:** **Run an Episode** card. Open the **dropdown**: show all three:
| Task (dropdown) | Plain-English pitch (say while hovering) |
|-----------------|------------------------------------------|
| **clean claim** | “Everything lines up — the honest answer is *approve*, and you should sound **sure**.” |
| **contradictory claim** | “Documents **fight each other** — dates, costs, procedures don’t match. The AI should dig, then often **deny** — with **medium** confidence, not bravado.” |
| **distribution shift claim** | “Looks normal **until** you pull in **linked** claims — shared brokers, patterns. Here the *right* move is often **hand to a human** and say **low** confidence — because the full picture is murky.” |
**Say (short):**
> "Three levels of difficulty — **easy**, **tricky**, and **‘looks fine until you connect the dots’**. Same button for all: **Run Episode**."
Click **Run Episode** once on **clean claim** so the audience sees the flow start.
### [0:35 – 0:50] Middle — “Claim Under Investigation”
**Visual:** Claim card: **ID**, **claimant name**, **incident** line, **document list** (DOC-1, DOC-2…).
**Say:**
> "Middle of the screen: the **fake claim file** — who it is, what happened, which PDFs exist. You’re not reading a research paper — you’re reading a **case file**. That’s deliberate: insurers think in cases, not equations."
### [0:50 – 1:05] Right — “agent-trace.log” (the story of the investigation)
**Visual:** Scroll the **agent-trace.log** panel. Point at lines like `validate_document`, `flag_fraud_signal`, `convene_debate_panel`, final `approve_claim` / `deny_claim` / `escalate_to_human` with **`[CONF: HIGH]`** or **`[CONF: MED]`** or **`[CONF: LOW]`**.
**Say:**
> "Right side: a **plain-English diary** of what the AI did, step by step — not a black box. Each line is an action you could imagine a junior analyst taking: *check this document*, *flag this inconsistency*, *call for a second opinion*. That’s the transparency insurers actually need."
### [1:05 – 1:15] Bottom-left — “LIVE METRICS”
**Visual:** **LIVE METRICS**: **Reward** (green number), **Calibration score**, **Declared confidence** pill (HIGH / MED / LOW), **Steps taken**. Optionally **CORRECT** badge when the outcome matches the scenario’s goal.
**Say:**
> "Numbers on the left aren’t magic scores for geeks — think of **reward** as ‘**did the behaviour we want just go up?**’ **Calibration** is ‘**did its confidence match reality?**’ High confidence on an easy honest claim — good. High confidence on a murky ring-fraud case — **bad**. The UI makes that visible in one glance."
### [1:15 – 1:25] “3×2 Calibration Matrix” — explain like a traffic light
**Visual:** The **3×2 Calibration Matrix** card. Point at **HIGH + Correct = +1** (highlighted), then **HIGH + Wrong = −0.8** (red warning).
**Say:**
> "This little grid is the **rulebook for confidence**. If you’re **right** and **appropriately sure**, you get the **best** score. If you’re **wrong** but you **acted like a genius** — that’s the **worst** cell: we penalise **cocky mistakes** harder than cautious ones. That single design choice is what teaches ‘**honest uncertainty**’ instead of fake confidence."
### [1:25 – 1:40] “Multi-Agent Court Panel” — two lawyers in software
**Visual:** First the **empty state** (“run contradictory claim to see…”). Switch dropdown to **contradictory claim**, **Run Episode**, scroll until **Court Panel Convened** — **Prosecutor (STRONG)** vs **Defender (WEAK)** and the **VERDICT** bar.
**Say:**
> "When the case is adversarial, the AI can **open a court** — not a gimmick, a **stress test**. One side argues *fraud from the evidence we found*; the other argues *innocent explanations still exist*. You see **strong** vs **weak** right on the card — then a **recommended action**. That’s how we stop one lazy headline from deciding someone’s claim."
### [1:40 – 1:55] Third scenario — humility pays
**Visual:** **distribution shift claim** → **Run Episode** → trace with **`query_linked_claim`**, **`flag_fraud_signal`**, final **`escalate_to_human [CONF: LOW]`**. LIVE METRICS showing **LOW** confidence and a solid **reward** (e.g. ~0.7).
**Say:**
> "Last trick: the fraud hides in **links between claims** — same broker, same pattern. The winning move isn’t bragging — it’s **raising your hand**: *human needed, I’m only low confidence*. ClaimCourt **rewards that humility**. In the real world, that’s fewer multi-crore mistakes."
---
## ACT 3 — “Yes, we actually trained it” (~20 s) — keep light
**Visual:** Quick montage: **WandB** project page (reward climbing) **or** `docs/reward_curve.png` in GitHub; optional 1 clip of HF Jobs log line.
**Say (one breath, then slow on numbers):**
> "We didn’t just draw a pretty UI. We ran the AI through **five thousand** practice claims on cloud GPUs. The training score — think ‘**overall lesson learned**’ — went from about **zero-point-one-three** to **zero-point-four-seven**. On held-out checks, **decision accuracy** and **calibration** both went from **zero to perfect one-point-zero**. Under the hood that’s **reinforcement learning** with Hugging Face’s **TRL** library — same family of tech behind recent open reasoning models. The details are in our **README** and **mini-blog** for anyone who wants to dig."
**Numbers table (optional on-screen end card):**
| What we measure | Before training | After training |
|-----------------|-----------------|----------------|
| “Lesson learned” score (mean reward) | **0.13** | **0.47** |
| Decision matches the right action | **0%** | **100%** |
| Confidence matches reality | **0%** | **100%** |
| Catching fraud signals (partial credit) | **0%** | **33%** |
---
## ACT 4 — Close + try it (~15 s)
**Visual:** Full-screen end card. Large QR optional. Cursor hovers each line.
**Say:**
> "If you work in risk, ops, or policy — or you’re just curious — **open ClaimCourt**, pick **contradictory claim**, hit **Run Episode**, and watch the trace and the court panel. If you’re a builder, everything is on **GitHub** under the codename **debatefloor** — links in the description. **Try one case** — that’s all it takes to see why this matters. Thank you."
**Links (read slowly or show as text):**
- **Try it:** https://huggingface.co/spaces/AniketAsla/debatefloor
- **Code:** https://github.com/AniketAslaliya/debateFloor
- **Mini-blog (markdown):** https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/BLOG.md
- **Weights & Biases (all training runs):** https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl
---
## UI → script mapping (checklist so nothing is missing)
| UI element | Act / time | Covered? |
|------------|------------|----------|
| ClaimCourt header + Space chrome | Act 1 | ✓ |
| Run an Episode + **task dropdown** (3 tasks) | Act 2 | ✓ |
| Task description + **Run Episode** + **CORRECT** | Act 2 | ✓ |
| **Claim Under Investigation** (ID, claimant, docs) | Act 2 | ✓ |
| **agent-trace.log** (steps, CONF tags) | Act 2 | ✓ |
| **LIVE METRICS** (Reward, Calibration, Confidence, Steps) | Act 2 | ✓ |
| **3×2 Calibration Matrix** | Act 2 | ✓ |
| **Multi-Agent Court Panel** (empty + **live debate + verdict**) | Act 2 | ✓ |
| **distribution_shift** + linked claims + **LOW** + reward | Act 2 | ✓ |
| Training proof + numbers | Act 3 | ✓ |
| Links + “try one case” CTA | Act 4 | ✓ |
---
## Optional segments (if you have +15 s)
- **Split screen (technical viewers only):** Space left, `app/main.py` `/step` right — *“same server answers the demo and the training job.”* Skip for a general audience.
- **JSON overlay (2 s):** tiny corner: request/response — proves it’s not canned video.
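If you do show the JSON overlay, something like the shape below works as the corner graphic. This is a **hypothetical** request/response pair — field names are assumptions, and the real schema lives in `app/main.py`'s `/step` handler:

```python
# Hypothetical /step payload shapes for the 2-second JSON overlay.
# Field names ("episode_id", "action", "observation", ...) are assumed,
# not taken from the actual app/main.py schema.
import json

request = {
    "episode_id": "ep-001",
    "action": "validate_document",
    "args": {"doc": "DOC-1"},
}
response = {
    "observation": "DOC-1 dates consistent with incident report",
    "reward": 0.1,
    "done": False,
}

# Render compact one-liners suitable for a small on-screen overlay.
overlay = json.dumps(request) + "\n" + json.dumps(response)
print(overlay)
```

Two compact lines like these are enough to prove the demo is hitting a live server rather than playing a canned recording.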
---
## Production checklist
| # | Do this | Why |
|---|---------|-----|
| 1 | Rehearse **one full run** per task so clicks are smooth | Saves retakes |
| 2 | **1080p**, clear browser zoom (~110%) | Readable on phones |
| 3 | **Yellow cursor** in OBS | Viewers follow the story |
| 4 | **No facecam** needed | Keeps focus on product |
| 5 | Export **YouTube** as public URL; **no** huge video in HF repo | Matches hackathon rules |
| 6 | **1.5 s** title card: `ClaimCourt — OpenEnv Hackathon India 2026` | Brand + context |
---
## Jargon one-liners (if you use a term, follow with this)
| Term | One-liner for family & friends |
|------|--------------------------------|
| OpenEnv | “A standard way to package ‘**AI + environment + rules**’ so researchers can compare apples to apples.” |
| GRPO / TRL | “**Practice + score + repeat** — like flight simulators for pilots, but for language models.” |
| Reward | “**Did we like that behaviour?** — summed up as a number.” |
| Calibration | “**Was its confidence honest** — not just lucky?” |
---
## Canonical stats (technical backup — same as repo JSON)
Source: `reports/training_summary.json`, `reports/component_shift_summary.json` — Qwen2.5-0.5B-Instruct, 5k episodes, 2500 GRPO steps, ~3h on L4.
| Metric | Before | After |
|--------|--------|-------|
| Mean training reward | 0.130 | 0.469 |
| Decision accuracy (eval) | 0.00 | 1.00 |
| Calibration (eval) | 0.00 | 1.00 |
| Fraud detection (eval) | 0.00 | 0.33 |
| Final train loss | — | ~0.00565 |