# SynthAudit.Env: 2-Minute Pitch

## HOOK (15 seconds)

> "A patient dies. Not because the doctor was wrong, but because the AI
> that screened them hallucinated a protocol amendment, cleared them
> for treatment, and cited a fake clinical study to justify it.
>
> **40,000 patients die every year from diagnostic errors.**
> I built the first environment where AI learns to catch AI mistakes."

---

## WHAT IT IS (30 seconds)

> "SynthAudit.Env. Two agents. One **Actor**: a frozen AI that
> reviews patients and makes mistakes. Subtle ones. It cites fake
> study sections. It ignores age violations. It applies the wrong rules.
>
> The **Oversight Agent**, trained with GRPO, gets 8 tools to
> investigate and catch those mistakes. Review, investigate, flag, approve.
>
> The hardest test? **2-hop reasoning.** Stage IV patients get an
> extended treatment window, BUT if comorbidity exceeds the threshold,
> that extension is revoked. The Actor ignores step 2.
> **No frontier LLM catches this consistently.**"

---
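
The 2-hop rule is concrete enough to sketch in code. This is an illustrative reconstruction, not the environment's real API; the `Patient` fields and the threshold value are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Patient:
    stage: int                 # cancer stage, 1-4
    comorbidity_score: float   # 0-1, higher = sicker

COMORBIDITY_THRESHOLD = 0.7    # illustrative value; the real threshold is not stated

def extended_window(p: Patient) -> bool:
    """Hop 1: Stage IV grants the extended treatment window.
    Hop 2: excess comorbidity revokes it."""
    if p.stage != 4:
        return False
    if p.comorbidity_score > COMORBIDITY_THRESHOLD:
        return False           # hop 2: the step the Actor ignores
    return True

def actor_decision(p: Patient) -> bool:
    """The frozen Actor stops after hop 1."""
    return p.stage == 4
```

The Oversight Agent earns its reward by finding patients where `actor_decision` and `extended_window` disagree, e.g. a Stage IV patient whose comorbidity score is above the threshold.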

## WHY IT'S HARD (20 seconds)

> "Our environment is **intentionally brutal**. Hard mode has 14-17
> hidden errors per episode. Even GPT-4-class models struggle.
>
> This isn't a toy benchmark with 90% accuracy. This is adversarial
> clinical reasoning where a base model scores **0.04 out of 1.0**.
> That's the point: if it were easy, you wouldn't need GRPO."

---
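
The "14-17 hidden errors" setting can be pictured as a per-episode sampler over the four adversarial error types the pitch names. Everything below (function and constant names, the easy-mode range) is a hypothetical sketch, not the environment's actual code:

```python
import random

# Error count range per difficulty; only the hard range is stated in the pitch.
DIFFICULTY_ERROR_COUNTS = {
    "easy": (2, 4),        # assumed range, for illustration only
    "hard": (14, 17),      # the "intentionally brutal" setting
}

# The 4 adversarial error types, as described in WHAT IT IS.
ERROR_TYPES = [
    "fake_citation",       # cites a study section that doesn't exist
    "age_violation",       # ignores an age eligibility bound
    "wrong_rule",          # applies the wrong rules
    "two_hop_miss",        # grants the Stage IV extension, skips the override
]

def sample_hidden_errors(difficulty: str, rng: random.Random) -> list[str]:
    """Pick how many errors to inject this episode, then which types."""
    lo, hi = DIFFICULTY_ERROR_COUNTS[difficulty]
    return [rng.choice(ERROR_TYPES) for _ in range(rng.randint(lo, hi))]
```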

## RESULTS (30 seconds)

**[SHOW: Comparison chart]**

> "Free Colab T4. Zero dollars. Two hours.
>
> **283% improvement** over the untrained model.
> **4 times more** clinical errors correctly caught.
> Error detection jumped from **0.13 per episode to 0.53**.
>
> On a 3-billion-parameter model. Intentionally small.
> Because if a 3B model can learn clinical oversight on free hardware,
> imagine what this environment teaches a 70B.
>
> **The environment is the contribution. The model proves it works.**"

---
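
For anyone checking the math, the headline figures are internally consistent; this snippet just reproduces the arithmetic from the numbers quoted in the pitch and the cheat sheet:

```python
# Overall episode score (0-1 scale): 0.04 base -> 0.153 trained.
base_score, trained_score = 0.04, 0.153
improvement_pct = (trained_score - base_score) / base_score * 100
print(f"{improvement_pct:.1f}%")           # 282.5%, quoted as "283%"

# Errors caught per episode: 0.13 base -> 0.53 trained.
base_rate, trained_rate = 0.13, 0.53
print(f"{trained_rate / base_rate:.2f}x")  # 4.08x, quoted as "4 times more"
```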

## CLOSE (15 seconds)

> "SynthAudit.Env: 8 tools, 4 adversarial error types,
> Theory-of-Mind scoring, dense shaped rewards, adaptive curriculum.
>
> **AI that watches AI. Zero dollars. Lives saved.**
>
> The code is on GitHub and HuggingFace. Thank you."

---
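
"Dense shaped rewards" and "Theory-of-Mind scoring" combine naturally into one scalar per episode. A minimal sketch, with entirely assumed weights (the reward model's real coefficients are not given in this pitch):

```python
def episode_reward(caught: int, missed: int, false_flags: int, explained: int) -> float:
    """Dense shaping: partial credit per error caught, penalties for misses and
    false flags, plus a Theory-of-Mind bonus for flags that correctly explain
    WHY the Actor was wrong. Clamped to the 0-1 scale used in the pitch."""
    r = 0.10 * caught - 0.05 * missed - 0.05 * false_flags + 0.05 * explained
    return max(0.0, min(1.0, r))
```

Dense per-error credit is what makes a single episode informative for GRPO: the policy gets a gradient signal even when it catches only some of the 14-17 hidden errors, instead of waiting on a rare all-or-nothing success.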

## CHEAT SHEET (memorize these)

| Number | What |
|--------|------|
| **40,000** | Deaths from diagnostic errors/year |
| **283%** | Improvement over base model |
| **4×** | More errors caught (2 → 8) |
| **$0** | Compute cost |
| **0.04 → 0.153** | Base → Trained score |
| **0.506** | Peak training reward (step 157) |
| **3B** | Model size (intentionally small) |
| **200** | GRPO training steps |

## SCREEN ORDER

1. Hook: blank screen or logo
2. Architecture diagram
3. Base vs Trained comparison chart
4. GRPO reward curve
5. GitHub + HF links