# ClaimCourt — Demo video script (non-technical + full UI)

**Goal:** Someone with *no* ML background understands *what* ClaimCourt is, *why* it matters, sees *every major part* of your live Space, and wants to open the link.

**Length:** ~1:55–2:00. Speak slowly; pause on numbers.

**Brand on screen:** **ClaimCourt**. **URLs always use the codename** `debatefloor` (unchanged links).

**Public demo URL (paste when live):** _Add your YouTube or Loom link here, then copy it into the README table row **Demo walkthrough (video)** so judges have a one-click watch._

**Training proof (for one short segment):** 5,000 practice claims, reward **0.13 → 0.47**, held-out **calibration 0 → 1** and **decision accuracy 0 → 1** (see table at bottom).

---

## One-line premise (say this if you freeze)

> "ClaimCourt is a **practice courtroom for AI on insurance claims** — it learns not just *what* to decide, but *how sure* it should be."

---
## ACT 1 — Why this exists (~25 s)

### [0:00 – 0:08] Hook — money + mistake everyone makes

**Visual:** 2–3 full-screen title cards (no small text). Optional stock: busy hospital/claim desk silhouette.

**Say:**

> "India loses a staggering amount to insurance fraud every year — on the order of **eight to ten thousand crore rupees**. A lot of that isn’t cartoon villains — it’s **honest-looking paperwork** with something wrong underneath. The expensive mistake isn’t only *getting the answer wrong* — it’s being **sure** when you shouldn’t be. We built **ClaimCourt** so an AI can practice that skill."

*(Optional lower-third once: source — BCG × Medi Assist style reports; keep it readable.)*

### [0:08 – 0:25] What ClaimCourt is — no jargon first

**Visual:** Open **[ClaimCourt on Hugging Face](https://huggingface.co/spaces/AniketAsla/debatefloor)** full screen. Hero / top bar with **ClaimCourt** visible.

**Say:**

> "You’re looking at **ClaimCourt** — a **free, in-browser demo**. Pick a fake insurance case. Watch an AI **investigate** it like an analyst: read documents, spot red flags, sometimes call a **mini trial** with two opposing voices. At the end it must **approve**, **deny**, or **hand off to a human** — and say whether it’s **high**, **medium**, or **low** confidence. Same rules every time. You can try the next three examples yourself — link at the end."

**Avoid until later:** “OpenEnv”, “GRPO”, “reward shaping” — introduce each in Act 3, in one sentence.
---

## ACT 2 — The product tour: every UI piece (~55 s)

Use **one continuous screen recording** with a **yellow cursor ring**. Pause ~2–3 s on each labelled area below.

### [0:25 – 0:35] Left column — “Run an Episode”

**Visual:** **Run an Episode** card. Open the **dropdown** and show all three tasks:

| Task (dropdown) | Plain-English pitch (say while hovering) |
|-----------------|------------------------------------------|
| **clean claim** | “Everything lines up — the honest answer is *approve*, and you should sound **sure**.” |
| **contradictory claim** | “Documents **fight each other** — dates, costs, procedures don’t match. The AI should dig, then often **deny** — with **medium** confidence, not bravado.” |
| **distribution shift claim** | “Looks normal **until** you pull in **linked** claims — shared brokers, patterns. Here the *right* move is often **hand to a human** and say **low** confidence — because the full picture is murky.” |

**Say (short):**

> "Three levels of difficulty — **easy**, **tricky**, and **‘looks fine until you connect the dots’**. Same button for all: **Run Episode**."

Click **Run Episode** once on **clean claim** so the audience sees the flow start.
### [0:35 – 0:50] Middle — “Claim Under Investigation”

**Visual:** Claim card: **ID**, **claimant name**, **incident** line, **document list** (DOC-1, DOC-2…).

**Say:**

> "Middle of the screen: the **fake claim file** — who it is, what happened, which PDFs exist. You’re not reading a research paper — you’re reading a **case file**. That’s deliberate: insurers think in cases, not equations."

### [0:50 – 1:05] Right — “agent-trace.log” (the story of the investigation)

**Visual:** Scroll the **agent-trace.log** panel. Point at lines like `validate_document`, `flag_fraud_signal`, `convene_debate_panel`, and the final `approve_claim` / `deny_claim` / `escalate_to_human` tagged **`[CONF: HIGH]`**, **`[CONF: MED]`**, or **`[CONF: LOW]`**.

**Say:**

> "Right side: a **plain-English diary** of what the AI did, step by step — not a black box. Each line is an action you could imagine a junior analyst taking: *check this document*, *flag this inconsistency*, *call for a second opinion*. That’s the transparency insurers actually need."

### [1:05 – 1:15] Bottom-left — “LIVE METRICS”

**Visual:** **LIVE METRICS**: **Reward** (green number), **Calibration score**, **Declared confidence** pill (HIGH / MED / LOW), **Steps taken**. Optionally the **CORRECT** badge when the outcome matches the scenario’s goal.

**Say:**

> "The numbers on the left aren’t magic scores for geeks — think of **reward** as ‘**did the behaviour we want just go up?**’ **Calibration** is ‘**did its confidence match reality?**’ High confidence on an easy honest claim — good. High confidence on a murky ring-fraud case — **bad**. The UI makes that visible in one glance."
### [1:15 – 1:25] “3×2 Calibration Matrix” — explain like a traffic light

**Visual:** The **3×2 Calibration Matrix** card. Point at **HIGH + Correct = +1** (highlighted), then **HIGH + Wrong = −0.8** (red warning).

**Say:**

> "This little grid is the **rulebook for confidence**. If you’re **right** and **appropriately sure**, you get the **best** score. If you’re **wrong** but you **acted like a genius** — that’s the **worst** cell: we penalise **cocky mistakes** harder than cautious ones. That single design choice is what teaches ‘**honest uncertainty**’ instead of fake confidence."
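If a technical viewer asks how the grid actually works, the lookup is simple enough to sketch. Only the two cells called out on screen (**+1** and **−0.8**) come from the demo; the other four values below are invented placeholders that keep the key property — confident mistakes cost more than cautious ones:

```python
# Illustrative 3x2 confidence-reward grid. Only (+1.0) and (-0.8) are
# the values shown in the UI; the rest are made-up placeholders.
CALIBRATION_MATRIX = {
    ("HIGH", True): +1.0,   # right and appropriately sure -> best cell (from the demo)
    ("MED",  True): +0.6,   # placeholder
    ("LOW",  True): +0.3,   # placeholder
    ("HIGH", False): -0.8,  # cocky mistake -> worst cell (from the demo)
    ("MED",  False): -0.4,  # placeholder
    ("LOW",  False): -0.1,  # placeholder: cautious mistakes cost the least
}

def confidence_reward(declared: str, correct: bool) -> float:
    """Look up the reward for a (declared confidence, was-it-correct) pair."""
    return CALIBRATION_MATRIX[(declared.upper(), correct)]
```

Whatever numbers the live Space uses, the monotonic ordering (wrong + HIGH worst, wrong + LOW least bad) is what makes "honest uncertainty" the winning strategy.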
### [1:25 – 1:40] “Multi-Agent Court Panel” — two lawyers in software

**Visual:** First the **empty state** (“run contradictory claim to see…”). Switch the dropdown to **contradictory claim**, hit **Run Episode**, and scroll until **Court Panel Convened** — **Prosecutor (STRONG)** vs **Defender (WEAK)** and the **VERDICT** bar.

**Say:**

> "When the case is adversarial, the AI can **open a court** — not a gimmick, a **stress test**. One side argues *fraud from the evidence we found*; the other argues *innocent explanations still exist*. You see **strong** vs **weak** right on the card — then a **recommended action**. That’s how we stop one lazy headline from deciding someone’s claim."

### [1:40 – 1:55] Third scenario — humility pays

**Visual:** **distribution shift claim** → **Run Episode** → trace with **`query_linked_claim`**, **`flag_fraud_signal`**, and a final **`escalate_to_human [CONF: LOW]`**. LIVE METRICS shows **LOW** confidence and a solid **reward** (e.g. ~0.7).

**Say:**

> "Last trick: the fraud hides in **links between claims** — same broker, same pattern. The winning move isn’t bragging — it’s **raising your hand**: *human needed, I’m only low confidence*. ClaimCourt **rewards that humility**. In the real world, that’s fewer multi-crore mistakes."

---
## ACT 3 — “Yes, we actually trained it” (~20 s) — keep light

**Visual:** Quick montage: **WandB** project page (reward climbing) **or** `docs/reward_curve.png` in GitHub; optionally one clip of an HF Jobs log line.

**Say (one breath, then slow on numbers):**

> "We didn’t just draw a pretty UI. We ran the AI through **five thousand** practice claims on cloud GPUs. The training score — think ‘**overall lesson learned**’ — went from about **zero-point-one-three** to **zero-point-four-seven**. On held-out checks, **decision accuracy** and **calibration** both went from **zero to a perfect one-point-zero**. Under the hood that’s **reinforcement learning** with Hugging Face’s **TRL** library — the same family of tech behind recent open reasoning models. The details are in our **README** and **mini-blog** for anyone who wants to dig."
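If a judge asks what GRPO actually does, its core idea — scoring each practice answer relative to its own group of attempts rather than against an absolute baseline — fits in a few lines. This is an illustrative sketch only, not our training code (the real run uses TRL’s `GRPOTrainer`):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalise each rollout's reward against the
    mean and spread of its sibling rollouts for the same claim, so the
    model learns from "better than my other tries", not an absolute score."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 1.0
    sigma = sigma or 1.0  # all-equal group carries no signal; avoid /0
    return [(r - mu) / sigma for r in rewards]

# Four attempts at the same claim, scored by the calibration-aware reward:
advantages = group_relative_advantages([0.13, 0.47, 0.47, 0.13])
# The 0.47 attempts get positive advantages, the 0.13 attempts negative.
```

That relative scoring is why a small 0.5B model can climb from 0.13 to 0.47 mean reward: every batch of attempts teaches it which of its own behaviours scored best.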
**Numbers table (optional on-screen end card):**

| What we measure | Before training | After training |
|-----------------|-----------------|----------------|
| “Lesson learned” score (mean reward) | **0.13** | **0.47** |
| Decision matches the right action | **0%** | **100%** |
| Confidence matches reality | **0%** | **100%** |
| Catching fraud signals (partial credit) | **0%** | **33%** |

---
## ACT 4 — Close + try it (~15 s)

**Visual:** Full-screen end card. Large QR code optional. Cursor hovers over each line.

**Say:**

> "If you work in risk, ops, or policy — or you’re just curious — **open ClaimCourt**, pick **contradictory claim**, hit **Run Episode**, and watch the trace and the court panel. If you’re a builder, everything is on **GitHub** under the codename **debatefloor** — links in the description. **Try one case** — that’s all it takes to see why this matters. Thank you."

**Links (read slowly or show as text):**

- **Try it:** https://huggingface.co/spaces/AniketAsla/debatefloor
- **Code:** https://github.com/AniketAslaliya/debateFloor
- **Mini-blog (markdown):** https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/BLOG.md
- **Weights & Biases (all training runs):** https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl

---
## UI → script mapping (checklist so nothing is missing)

| UI element | Act / time | Covered? |
|------------|------------|----------|
| ClaimCourt header + Space chrome | Act 1 | ✓ |
| Run an Episode + **task dropdown** (3 tasks) | Act 2 | ✓ |
| Task description + **Run Episode** + **CORRECT** badge | Act 2 | ✓ |
| **Claim Under Investigation** (ID, claimant, docs) | Act 2 | ✓ |
| **agent-trace.log** (steps, CONF tags) | Act 2 | ✓ |
| **LIVE METRICS** (Reward, Calibration, Confidence, Steps) | Act 2 | ✓ |
| **3×2 Calibration Matrix** | Act 2 | ✓ |
| **Multi-Agent Court Panel** (empty state + **live debate + verdict**) | Act 2 | ✓ |
| **distribution shift claim** + linked claims + **LOW** + reward | Act 2 | ✓ |
| Training proof + numbers | Act 3 | ✓ |
| Links + “try one case” CTA | Act 4 | ✓ |

---
## Optional segments (if you have +15 s)

- **Split screen (technical viewers only):** Space on the left, `app/main.py` `/step` on the right — *“same server answers the demo and the training job.”* Skip for a general audience.
- **JSON overlay (2 s):** tiny corner overlay showing the request/response — proves it’s not a canned video.
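For the JSON overlay, a hypothetical `/step` exchange might look like the pair below. Every field name here is a placeholder — copy the real request/response schema from `app/main.py` before recording:

```python
# Hypothetical /step payload for the 2-second overlay.
# All keys are placeholders; mirror the real schema in app/main.py.
step_request = {"action": "validate_document", "target": "DOC-1"}
step_response = {
    "observation": "DOC-1 incident date conflicts with DOC-2",
    "reward": 0.05,     # small shaped reward for a useful check
    "done": False,      # episode continues after this step
}
```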
---

## Production checklist

| # | Do this | Why |
|---|---------|-----|
| 1 | Rehearse **one full run** per task so clicks are smooth | Saves retakes |
| 2 | **1080p**, clear browser zoom (~110%) | Readable on phones |
| 3 | **Yellow cursor** highlight in OBS | Viewers follow the story |
| 4 | **No facecam** needed | Keeps focus on the product |
| 5 | Export to **YouTube** as a public URL; **no** huge video in the HF repo | Matches hackathon rules |
| 6 | **1.5 s** title card: `ClaimCourt — OpenEnv Hackathon India 2026` | Brand + context |

---
## Jargon one-liners (if you use a term, follow with this)

| Term | One-liner for family & friends |
|------|--------------------------------|
| OpenEnv | “A standard way to package ‘**AI + environment + rules**’ so researchers can compare apples to apples.” |
| GRPO / TRL | “**Practice + score + repeat** — like flight simulators for pilots, but for language models.” |
| Reward | “**Did we like that behaviour?** — summed up as a number.” |
| Calibration | “**Was its confidence honest** — not just lucky?” |

---
## Canonical stats (technical backup — same as repo JSON)

Source: `reports/training_summary.json`, `reports/component_shift_summary.json` — Qwen2.5-0.5B-Instruct, 5k episodes, 2,500 GRPO steps, ~3 h on an L4.

| Metric | Before | After |
|--------|--------|-------|
| Mean training reward | 0.130 | 0.469 |
| Decision accuracy (eval) | 0.00 | 1.00 |
| Calibration (eval) | 0.00 | 1.00 |
| Fraud detection (eval) | 0.00 | 0.33 |
| Final train loss | — | ~0.00565 |