Timusgeorge committed
Commit 3942264 · verified · 1 Parent(s): c204411

Deploy PITCH.md with all fixes

Files changed (1)
  1. PITCH.md +58 -92
PITCH.md CHANGED
@@ -1,122 +1,88 @@
- # SynthAudit.Env – 3-Minute Pitch Script

- ## OPENING (30 seconds)

- > "40,000 patients die every year from diagnostic errors. Now imagine deploying
- > an AI to help – and that AI hallucinates a protocol amendment that doesn't exist,
- > confidently clears a patient whose death date is BEFORE their treatment started,
- > and cites a fake clinical study to justify it.
  >
- > This is not hypothetical. These are the exact failure modes we see in frontier
- > LLMs today. The question is: **who audits the AI?**
- >
- > I'm Sumit. I built SynthAudit.Env – the first OpenEnv environment where
- > an AI agent learns to catch another AI's medical mistakes."

  ---

- ## THE PROBLEM (30 seconds)

- > "Current clinical AI oversight is manual. A human reviews every case.
- > That doesn't scale. When you have 80 patients in a trial and an AI
- > reviewing all of them, you need oversight at machine speed.
  >
- > But the hard part isn't detecting obvious errors. The hard part is
- > catching **subtle** errors – when the AI's reasoning SOUNDS perfectly
- > correct but is actually wrong."
-
- **[SHOW: Actor reasoning example on screen]**
-
- > "Look at this. The Actor AI says: 'Per Section 4.2.1(b) of the protocol
- > amendment, patients with ECOG ≤ 2 are eligible under expanded access.'
- > That section doesn't exist. It's a hallucination. But it sounds
- > extremely plausible."

  ---

- ## THE SOLUTION (60 seconds)
-
- > "SynthAudit.Env is a multi-agent oversight environment built on OpenEnv.
- >
- > There are two agents. The **Actor** – a frozen AI that reviews patients
- > and proposes diagnoses. Some are correct. Some contain subtle errors
- > injected by our adversarial engine.
- >
- > The **Oversight Agent** – this is what we're training with GRPO –
- > has 8 tools to investigate. It can review proposals, pull raw patient
- > records, run SHAP feature attribution, do timeline audits, and
- > statistical cohort analysis.
- >
- > Three things make this genuinely hard:"
-
- **[SHOW: Architecture diagram]**

- > "**One**: The Actor generates sophisticated medical reasoning. It anchors
- > on irrelevant features, cites fake studies, and applies rules to the
- > wrong context.
  >
- > **Two**: The hardest error requires 2-hop reasoning. Stage IV patients
- > get an extended treatment window – BUT if their comorbidity index exceeds
- > the threshold, that extension is revoked. The Actor ignores step 2.
- > No frontier LLM catches this consistently.
- >
- > **Three**: Theory-of-Mind scoring. The agent doesn't just detect errors –
- > it must explain WHY the Actor was wrong. 'This looks suspicious' gets
- > less reward than 'The Actor applied the Stage IV exception but ignored
- > the comorbidity override clause.'"

  ---

  ## RESULTS (30 seconds)

- **[SHOW: Evaluation table + Reward curve]**

- > "Baseline results across 5 seeds:
- > - No-op agent: 0.01 average score
- > - Random agent: 0.05
- > - Smart heuristic with all 8 tools: 0.17
  >
- > After GRPO training with Llama 3.2 3B:
- > The reward curve rises from 0.28 to 0.71 over 20 episodes.
  >
- > The gap between the heuristic and training ceiling shows exactly
- > what reinforcement learning adds. Raw pattern matching can't
- > solve 2-hop reasoning – you need genuine agentic capability."

  ---

- ## CLOSING (30 seconds)

- > "SynthAudit.Env contributes three things to the OpenEnv ecosystem:
- >
- > **First**, a domain where oversight errors have real consequences –
- > patient safety, not benchmark scores.
- >
- > **Second**, an adversarial Actor that tests genuine reasoning,
- > not just tool calling. Our templates simulate the exact failure
- > modes published in medical AI safety literature.
- >
- > **Third**, a dense shaped reward model with F-beta scoring that
- > trains 10x faster than sparse rewards – critical for the 24-hour
- > hackathon format.
  >
- > The code is live on GitHub and HuggingFace. Every component is
- > built on TRL with Llama 3.2 – Meta-native, end to end.
  >
- > This is AI that watches AI. Thank you."

  ---

- ## TIMER NOTES
- - 0:00–0:30 – Hook (the problem is visceral)
- - 0:30–1:00 – Problem statement
- - 1:00–2:00 – Architecture + what makes it hard
- - 2:00–2:30 – Results with numbers
- - 2:30–3:00 – Contributions + close
-
- ## SCREEN SEQUENCE
- 1. Opening: Actor hallucination example (terminal output)
- 2. Architecture diagram from README
- 3. Evaluation table (No-Op vs Random vs Heuristic)
- 4. Reward curve (outputs/reward_curve.png)
- 5. HuggingFace demo URL
+ # SynthAudit.Env – 2-Minute Pitch

+ ## HOOK (15 seconds)

+ > "A patient dies. Not because the doctor was wrong – because the AI
+ > that screened them hallucinated a protocol amendment, cleared them
+ > for treatment, and cited a fake clinical study to justify it.
  >
+ > **40,000 patients die every year from diagnostic errors.**
+ > I built the first environment where AI learns to catch AI mistakes."

  ---

+ ## WHAT IT IS (30 seconds)

+ > "SynthAudit.Env. Two agents. One **Actor** – a frozen AI that
+ > reviews patients and makes mistakes. Subtle ones. It cites fake
+ > study sections. It ignores age violations. It applies the wrong rules.
  >
+ > The **Oversight Agent** – trained with GRPO – gets 8 tools to
+ > investigate and catch those mistakes. Review, investigate, flag, approve.
+ >
+ > The hardest test? **2-hop reasoning.** Stage IV patients get an
+ > extended treatment window – BUT if comorbidity exceeds the threshold,
+ > that extension is revoked. The Actor ignores step 2.
+ > **No frontier LLM catches this consistently.**"
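To make the 2-hop rule concrete, here is a minimal sketch of the check the Oversight Agent has to perform. The field names, window lengths, and threshold are hypothetical stand-ins for exposition, not the environment's actual schema:

```python
# Illustrative sketch of the 2-hop eligibility rule described above.
# Field names, window lengths, and the threshold are assumptions,
# not taken from the SynthAudit.Env source.

def treatment_window_days(patient: dict, comorbidity_threshold: int = 3) -> int:
    """Return the allowed treatment window for a patient."""
    base_window = 30

    # Hop 1: Stage IV patients qualify for an extended window.
    extended = patient["stage"] == "IV"

    # Hop 2: the extension is revoked when the comorbidity index
    # exceeds the threshold. This is the step the Actor skips.
    if extended and patient["comorbidity_index"] > comorbidity_threshold:
        extended = False

    return 90 if extended else base_window


# A faulty Actor stops after hop 1 and clears this patient;
# the Oversight Agent must apply hop 2 to catch the error.
patient = {"stage": "IV", "comorbidity_index": 5}
assert treatment_window_days(patient) == 30  # extension correctly revoked
```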
 
 
 
  ---

+ ## WHY IT'S HARD (20 seconds)

+ > "Our environment is **intentionally brutal**. Hard mode has 14–17
+ > hidden errors per episode. Even GPT-4-class models struggle.
  >
+ > This isn't a toy benchmark with 90% accuracy. This is adversarial
+ > clinical reasoning where a base model scores **0.04 out of 1.0**.
+ > That's the point – if it were easy, you wouldn't need GRPO."

  ---

  ## RESULTS (30 seconds)
+ **[SHOW: Comparison chart]**

+ > "Free Colab T4. Zero dollars. Two hours.
+ >
+ > **283% improvement** over the untrained model.
+ > **4 times more** clinical errors correctly caught.
+ > Error detection jumped from **0.13 per episode to 0.53**.
  >
+ > On a 3-billion-parameter model. Intentionally small.
+ > Because if a 3B model can learn clinical oversight on free hardware,
+ > imagine what this environment teaches a 70B.
  >
+ > **The environment is the contribution. The model proves it works.**"
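The training setup behind these numbers is not shown in the pitch. As a rough sketch of what GRPO fine-tuning of the 3B model with TRL could look like, under stated assumptions (the reward logic is a placeholder and the model id and hyperparameters beyond the 200 steps are illustrative guesses):

```python
# Hedged sketch of a TRL GRPO loop like the one the pitch describes.
# The reward function and most settings are illustrative assumptions;
# the real environment scores full tool-use episodes, not single strings.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def oversight_reward(completions, **kwargs):
    # Placeholder: reward a completion that flags an error with a reason.
    # SynthAudit.Env's actual reward is the dense shaped episode score.
    return [1.0 if "flag" in c.lower() and "because" in c.lower() else 0.0
            for c in completions]

train_dataset = Dataset.from_dict(
    {"prompt": ["Audit this Actor proposal: ..."]}  # episode prompts
)

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.2-3B-Instruct",  # the 3B model from the pitch
    reward_funcs=oversight_reward,
    args=GRPOConfig(output_dir="grpo-oversight", max_steps=200),
    train_dataset=train_dataset,
)
trainer.train()
```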
 
 
  ---

+ ## CLOSE (15 seconds)

+ > "SynthAudit.Env: 8 tools, 4 adversarial error types,
+ > Theory-of-Mind scoring, dense shaped rewards, adaptive curriculum.
  >
+ > **AI that watches AI. Zero dollars. Lives saved.**
  >
+ > The code is on GitHub and HuggingFace. Thank you."
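The close name-checks "dense shaped rewards", which the earlier draft of this pitch ties to F-beta scoring. A minimal sketch of that idea, assuming a per-episode set of flagged versus injected error IDs and a recall-leaning beta (both are assumptions, not the repo's code):

```python
# Minimal sketch of F-beta scoring over flagged errors, one plausible
# reading of the "dense shaped reward" the pitch mentions. The beta
# value and the set-based inputs are illustrative assumptions.

def f_beta_reward(flagged: set, true_errors: set, beta: float = 1.5) -> float:
    """F-beta over the agent's flags; beta > 1 weights recall, since
    missed errors are worse than false alarms in a safety setting."""
    if not flagged and not true_errors:
        return 1.0  # nothing to catch, nothing flagged
    tp = len(flagged & true_errors)
    if tp == 0:
        return 0.0
    precision = tp / len(flagged)
    recall = tp / len(true_errors)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# An agent that catches 2 of 3 injected errors with one false alarm:
print(round(f_beta_reward({"p01", "p02", "p09"}, {"p01", "p02", "p17"}), 3))
```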
  ---

+ ## CHEAT SHEET (memorize these)
+
+ | Number | What |
+ |--------|------|
+ | **40,000** | Deaths from diagnostic errors/year |
+ | **283%** | Improvement over base model |
+ | **4×** | More errors caught (2 → 8) |
+ | **$0** | Compute cost |
+ | **0.04 → 0.153** | Base → Trained score |
+ | **0.506** | Peak training reward (step 157) |
+ | **3B** | Model size (intentionally small) |
+ | **200** | GRPO training steps |
+
+ ## SCREEN ORDER
+ 1. Hook → blank screen or logo
+ 2. Architecture diagram
+ 3. Base vs Trained comparison chart
+ 4. GRPO reward curve
+ 5. GitHub + HF links