	# ClaimCourt — Demo video script (non-technical + full UI)

	Goal: Someone with no ML background understands what ClaimCourt is, why it matters, sees every major part of your live Space, and wants to open the link.
	Length: ~2:30 by the segment timestamps below (Act 2 alone runs 0:25–1:55). Speak slowly; pause on numbers.
	Brand on screen: ClaimCourt. URLs always use codename `debatefloor` (unchanged links).

	Public demo URL (paste when live): _Add your YouTube or Loom link here, then copy it into the README table row Demo walkthrough (video) so judges have a one-click watch._

	Training proof (for one short segment): 5,000 practice claims, reward 0.13 → 0.47, held-out calibration 0 → 1 and decision accuracy 0 → 1 (see table at bottom).

	---

	## One-line premise (say this if you freeze)

	> "ClaimCourt is a practice courtroom for AI on insurance claims — it learns not just what to decide, but how sure it should be."

	---

	## ACT 1 — Why this exists (~25 s)

	### [0:00 – 0:08] Hook — money + mistake everyone makes

	Visual: 2–3 full-screen title cards (no small text). Optional stock: busy hospital/claim desk silhouette.

	Say:

	> "India loses a staggering amount to insurance fraud every year — on the order of eight to ten thousand crore rupees. A lot of that isn’t cartoon villains — it’s honest-looking paperwork with something wrong underneath. The expensive mistake isn’t only getting the answer wrong — it’s being sure when you shouldn’t be. We built ClaimCourt so an AI can practice that skill."

	(Optional lower-third once: source — BCG × Medi Assist style reports; keep it readable.)

	### [0:08 – 0:25] What ClaimCourt is — no jargon first

	Visual: Open [ClaimCourt on Hugging Face](https://huggingface.co/spaces/AniketAsla/debatefloor) full screen. Hero / top bar with ClaimCourt visible.

	Say:

	> "You’re looking at ClaimCourt — a free, in-browser demo. Pick a fake insurance case. Watch an AI investigate it like an analyst: read documents, spot red flags, sometimes call a mini trial with two opposing voices. At the end it must approve, deny, or hand off to a human — and say whether it’s high, medium, or low confidence. Same rules every time. You can try the next three examples yourself — link at the end."

	Avoid until later: “OpenEnv”, “GRPO”, “reward shaping” — introduce in Act 3 in one sentence each.

	---

	## ACT 2 — The product tour: every UI piece (~90 s)

	Use one continuous screen recording with a yellow cursor ring. Pause ~2–3 s on each labelled area below.

	### [0:25 – 0:35] Left column — “Run an Episode”

	Visual: Run an Episode card. Open the dropdown and show all three tasks:

	| Task (dropdown) | Plain-English pitch (say while hovering) |
	|-----------------|------------------------------------------|
	| clean claim | “Everything lines up: the honest answer is approve, and you should sound sure.” |
	| contradictory claim | “Documents fight each other: dates, costs, procedures don’t match. The AI should dig, then often deny, with medium confidence, not bravado.” |
	| distribution shift claim | “Looks normal until you pull in linked claims: shared brokers, patterns. Here the right move is often hand to a human and say low confidence, because the full picture is murky.” |

	Say (short):

	> "Three levels of difficulty — easy, tricky, and ‘looks fine until you connect the dots’. Same button for all: Run Episode."

	Click Run Episode once on clean claim so the audience sees the flow start.

	### [0:35 – 0:50] Middle — “Claim Under Investigation”

	Visual: Claim card: ID, claimant name, incident line, document list (DOC-1, DOC-2…).

	Say:

	> "Middle of the screen: the fake claim file — who it is, what happened, which PDFs exist. You’re not reading a research paper — you’re reading a case file. That’s deliberate: insurers think in cases, not equations."

	### [0:50 – 1:05] Right — “agent-trace.log” (the story of the investigation)

	Visual: Scroll the agent-trace.log panel. Point at lines like `validate_document`, `flag_fraud_signal`, `convene_debate_panel`, final `approve_claim` / `deny_claim` / `escalate_to_human` with `[CONF: HIGH]` or `[CONF: MED]` or `[CONF: LOW]`.

	Say:

	> "Right side: a plain-English diary of what the AI did, step by step — not a black box. Each line is an action you could imagine a junior analyst taking: check this document, flag this inconsistency, call for a second opinion. That’s the transparency insurers actually need."
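
For anyone checking a recording against the trace (for example, to confirm each episode ends on the action and confidence tag the script promises), a trace line can be split into its action and its `[CONF: …]` tag. This is a minimal sketch that assumes each line contains one snake_case action name, optionally followed by a tag such as `[CONF: LOW]`; the real log may carry extra fields, and `TraceStep` and `parse_trace_line` are hypothetical names, not part of the repo.

```python
import re
from typing import NamedTuple, Optional

# Assumption: actions are snake_case names with at least one underscore,
# e.g. validate_document, flag_fraud_signal, escalate_to_human.
ACTION_RE = re.compile(r"\b([a-z]+(?:_[a-z]+)+)\b")
# Assumption: confidence tags look like [CONF: HIGH], [CONF: MED] or [CONF: LOW].
CONF_RE = re.compile(r"\[CONF:\s*(HIGH|MED|LOW)\]")


class TraceStep(NamedTuple):
    action: str
    confidence: Optional[str]  # None when the line carries no CONF tag


def parse_trace_line(line: str) -> Optional[TraceStep]:
    """Return the first action and optional confidence tag found in a line."""
    action = ACTION_RE.search(line)
    if action is None:
        return None  # not an action line (blank line, separator, etc.)
    conf = CONF_RE.search(line)
    return TraceStep(action.group(1), conf.group(1) if conf else None)


if __name__ == "__main__":
    for sample in ("validate_document DOC-1", "escalate_to_human [CONF: LOW]"):
        print(parse_trace_line(sample))
```

Only the terminal actions (`approve_claim`, `deny_claim`, `escalate_to_human`) are expected to carry a confidence tag, so intermediate steps come back with `confidence=None`.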

	### [1:05 – 1:15] Bottom-left — “LIVE METRICS”

	Visual: LIVE METRICS: Reward (green number), Calibration score, Declared confidence pill (HIGH / MED / LOW), Steps taken. Optionally show the CORRECT badge, which appears when the outcome matches the scenario’s goal.

	Say:

	> "Numbers on the left aren’t magic scores for geeks — think of reward as ‘did the behaviour we want just go up?’ Calibration is ‘did its confidence match reality?’ High confidence on an easy honest claim — good. High confidence on a murky ring-fraud case — bad. The UI makes that visible in one glance."

	### [1:15 – 1:25] “3×2 Calibration Matrix” — explain like a traffic light

	Visual: The 3×2 Calibration Matrix card. Point at HIGH + Correct = +1 (highlighted), then HIGH + Wrong = −0.8 (red warning).

	Say:

	> "This little grid is the rulebook for confidence. If you’re right and appropriately sure, you get the best score. If you’re wrong but you acted like a genius — that’s the worst cell: we penalise cocky mistakes harder than cautious ones. That single design choice is what teaches ‘honest uncertainty’ instead of fake confidence."
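
The grid can be pictured as a small lookup table. The sketch below is illustrative only: the two HIGH cells (+1 and -0.8) are the values shown on the card, while the MED and LOW values are placeholder assumptions, chosen so that a wrong answer costs less as declared confidence drops. `CALIBRATION_MATRIX` and `calibration_score` are hypothetical names, not taken from the repo.

```python
# Illustrative 3x2 calibration lookup: (declared confidence, correct?) -> score.
# Only the two HIGH cells come from the UI card; every MED and LOW value is a
# placeholder assumption for this sketch.
CALIBRATION_MATRIX = {
    ("HIGH", True): 1.0,    # from the card: right and sure, best cell
    ("HIGH", False): -0.8,  # from the card: wrong but cocky, worst cell
    ("MED", True): 0.6,     # assumed
    ("MED", False): -0.4,   # assumed
    ("LOW", True): 0.3,     # assumed
    ("LOW", False): -0.1,   # assumed
}


def calibration_score(confidence: str, correct: bool) -> float:
    """Look up the score for a declared confidence level and an outcome."""
    key = (confidence.upper(), correct)
    if key not in CALIBRATION_MATRIX:
        raise ValueError(f"unknown confidence level: {confidence!r}")
    return CALIBRATION_MATRIX[key]


if __name__ == "__main__":
    for level in ("HIGH", "MED", "LOW"):
        print(level, calibration_score(level, True), calibration_score(level, False))
```

Whatever the real off-diagonal values are, the property worth showing on screen is the ordering in the "wrong" column: the penalty shrinks from HIGH to LOW, which is what makes cautious mistakes cheaper than cocky ones.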

	### [1:25 – 1:40] “Multi-Agent Court Panel” — two lawyers in software

	Visual: First the empty state (“run contradictory claim to see…”). Switch dropdown to contradictory claim, Run Episode, scroll until Court Panel Convened — Prosecutor (STRONG) vs Defender (WEAK) and the VERDICT bar.

	Say:

	> "When the case is adversarial, the AI can open a court — not a gimmick, a stress test. One side argues fraud from the evidence we found; the other argues innocent explanations still exist. You see strong vs weak right on the card — then a recommended action. That’s how we stop one lazy headline from deciding someone’s claim."

	### [1:40 – 1:55] Third scenario — humility pays

	Visual: distribution shift claim → Run Episode → trace with `query_linked_claim`, `flag_fraud_signal`, final `escalate_to_human [CONF: LOW]`. LIVE METRICS showing LOW confidence and a solid reward (e.g. ~0.7).

	Say:

	> "Last trick: the fraud hides in links between claims — same broker, same pattern. The winning move isn’t bragging — it’s raising your hand: human needed, I’m only low confidence. ClaimCourt rewards that humility. In the real world, that’s fewer multi-crore mistakes."

	---

	## ACT 3 — “Yes, we actually trained it” (~20 s) — keep light

	Visual: Quick montage: WandB project page (reward climbing) or `docs/reward_curve.png` in GitHub; optionally one clip of an HF Jobs log line.

	Say (one breath, then slow on numbers):

	> "We didn’t just draw a pretty UI. We ran the AI through five thousand practice claims on cloud GPUs. The training score — think ‘overall lesson learned’ — went from about zero-point-one-three to zero-point-four-seven. On held-out checks, decision accuracy and calibration both went from zero to perfect one-point-zero. Under the hood that’s reinforcement learning with Hugging Face’s TRL library — same family of tech behind recent open reasoning models. The details are in our README and mini-blog for anyone who wants to dig."

	Numbers table (optional on-screen end card):

	| What we measure | Before training | After training |
	|-----------------|-----------------|----------------|
	| “Lesson learned” score (mean reward) | 0.13 | 0.47 |
	| Decision matches the right action | 0% | 100% |
	| Confidence matches reality | 0% | 100% |
	| Catching fraud signals (partial credit) | 0% | 33% |

	---

	## ACT 4 — Close + try it (~15 s)

	Visual: Full-screen end card. Large QR optional. Cursor hovers each line.

	Say:

	> "If you work in risk, ops, or policy — or you’re just curious — open ClaimCourt, pick contradictory claim, hit Run Episode, and watch the trace and the court panel. If you’re a builder, everything is on GitHub under the codename debatefloor — links in the description. Try one case — that’s all it takes to see why this matters. Thank you."

	Links (read slowly or show as text):

	- Try it: https://huggingface.co/spaces/AniketAsla/debatefloor
	- Code: https://github.com/AniketAslaliya/debateFloor
	- Mini-blog (markdown): https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/BLOG.md
	- Weights & Biases (all training runs): https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl

	---

	## UI → script mapping (checklist so nothing is missing)

	| UI element | Act / time | Covered? |
	|------------|------------|----------|
	| ClaimCourt header + Space chrome | Act 1 | ✓ |
	| Run an Episode + task dropdown (3 tasks) | Act 2 | ✓ |
	| Task description + Run Episode + CORRECT | Act 2 | ✓ |
	| Claim Under Investigation (ID, claimant, docs) | Act 2 | ✓ |
	| agent-trace.log (steps, CONF tags) | Act 2 | ✓ |
	| LIVE METRICS (Reward, Calibration, Confidence, Steps) | Act 2 | ✓ |
	| 3×2 Calibration Matrix | Act 2 | ✓ |
	| Multi-Agent Court Panel (empty + live debate + verdict) | Act 2 | ✓ |
	| distribution_shift + linked claims + LOW + reward | Act 2 | ✓ |
	| Training proof + numbers | Act 3 | ✓ |
	| Links + “try one case” CTA | Act 4 | ✓ |

	---

	## Optional segments (if you have +15 s)

	- Split screen (technical viewers only): Space on the left, `app/main.py` `/step` on the right, with the line “same server answers the demo and the training job.” Skip for a general audience.
	- JSON overlay (2 s): show the live request/response in a small corner overlay to prove the demo is not a canned video.
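
If the JSON overlay is used, the text for it can be prepared ahead of recording. The sketch below only formats a request/response pair for display and makes no network call. The `POST` method, the payload field names (`action`, `reward`, `done`) and the helper `overlay_text` are all assumptions for illustration; copy the real request and response from the browser's network tab rather than relying on these.

```python
import json


def overlay_text(request: dict, response: dict, indent: int = 2) -> str:
    """Render one request/response pair as short text for a corner overlay."""
    return "\n".join(
        [
            "POST /step",  # assumed HTTP method; the path comes from the script
            json.dumps(request, indent=indent, sort_keys=True),
            "->",
            json.dumps(response, indent=indent, sort_keys=True),
        ]
    )


if __name__ == "__main__":
    # Hypothetical payloads: field names are illustrative, not the server's schema.
    example_request = {"action": "flag_fraud_signal"}
    example_response = {"reward": 0.0, "done": False}
    print(overlay_text(example_request, example_response))
```

Keeping the overlay to a handful of lines matters more than completeness: at 2 s on screen, viewers only need to see that a real request went out and a real response came back.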

	---

	## Production checklist

	| # | Do this | Why |
	|---|---------|-----|
	| 1 | Rehearse one full run per task so clicks are smooth | Saves retakes |
	| 2 | Record at 1080p with browser zoom around 110% | Readable on phones |
	| 3 | Yellow cursor ring in OBS | Viewers follow the story |
	| 4 | No facecam needed | Keeps focus on product |
	| 5 | Publish the video on YouTube with a public URL; do not commit large video files to the HF repo | Matches hackathon rules |
	| 6 | 1.5 s title card: `ClaimCourt — OpenEnv Hackathon India 2026` | Brand + context |

	---

	## Jargon one-liners (if you use a term, follow with this)

	| Term | One-liner for family & friends |
	|------|--------------------------------|
	| OpenEnv | “A standard way to package ‘AI + environment + rules’ so researchers can compare apples to apples.” |
	| GRPO / TRL | “Practice + score + repeat: like flight simulators for pilots, but for language models.” |
	| Reward | “Did we like that behaviour? Summed up as a number.” |
	| Calibration | “Was its confidence honest, not just lucky?” |

	---

	## Canonical stats (technical backup — same as repo JSON)

	Source: `reports/training_summary.json`, `reports/component_shift_summary.json`. Run: Qwen2.5-0.5B-Instruct, 5k episodes, 2,500 GRPO steps, ~3 h on an L4 GPU.

	| Metric | Before | After |
	|--------|--------|-------|
	| Mean training reward | 0.130 | 0.469 |
	| Decision accuracy (eval) | 0.00 | 1.00 |
	| Calibration (eval) | 0.00 | 1.00 |
	| Fraud detection (eval) | 0.00 | 0.33 |
	| Final train loss | n/a | ~0.00565 |
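
The on-screen end card rounds these canonical values: reward to two decimals, the eval metrics to whole percentages. A small sketch of that conversion is below, useful for checking that spoken and displayed numbers stay in sync with the table. The values are typed in from the table rather than read from the JSON reports, because the report schema is not shown here; `end_card_rows` and `reward_gain` are hypothetical helper names.

```python
# Before/after values copied from the canonical stats table.
CANONICAL_STATS = {
    "Mean training reward": (0.130, 0.469),
    "Decision accuracy (eval)": (0.00, 1.00),
    "Calibration (eval)": (0.00, 1.00),
    "Fraud detection (eval)": (0.00, 0.33),
}


def end_card_rows(stats):
    """Format each metric the way the end card shows it."""
    rows = []
    for metric, (before, after) in stats.items():
        if metric == "Mean training reward":
            # Reward stays a raw score, rounded to two decimals.
            rows.append((metric, f"{before:.2f}", f"{after:.2f}"))
        else:
            # Eval metrics are fractions in [0, 1], shown as whole percentages.
            rows.append((metric, f"{round(before * 100)}%", f"{round(after * 100)}%"))
    return rows


def reward_gain(stats):
    """Absolute change in mean training reward."""
    before, after = stats["Mean training reward"]
    return after - before


if __name__ == "__main__":
    for row in end_card_rows(CANONICAL_STATS):
        print(" | ".join(row))
    print(f"reward gain: {reward_gain(CANONICAL_STATS):+.3f}")
```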