Bhavishya011 committed
Commit · 9d64412
1 Parent(s): 696431a
docs: add concrete statistics and cite arXiv:2601.19100

Files changed:
- README.md +9 -1
- docs/HFBlogPost.md +3 -1
README.md
CHANGED

```diff
@@ -18,7 +18,7 @@ An OpenEnv-compliant RL training environment where AI agents act as **FDA Lead R
 
 ## 🛑 The Verification Gap
 
-The scientific replication crisis affects an estimated 50–70% of published research. In clinical trials, a methodological flaw or undisclosed data exclusion isn't just a statistical error—it costs lives.
+The scientific replication crisis affects an estimated 50–70% of published research. In clinical trials, a methodological flaw or undisclosed data exclusion isn't just a statistical error—it costs lives. Over 10,000 biomedical papers have been retracted in the last decade, but only after potentially affecting **hundreds of thousands of enrolled patients** and leading to dangerous, wide-scale drug approvals (Source: *RetractionWatch / BMJ*).
 
 Current LLMs are incredible at *generation* but fail catastrophically at rigorous, multi-step *verification*. When presented with a complex clinical protocol and raw CSV data, baseline models hallucinate, struggle to follow strict reporting schemas, and fail to independently verify biostatistical claims.
 
@@ -26,6 +26,14 @@
 
 ---
 
+## 📚 Citations & Inspiration
+
+Our deterministic RLVR reward shaping strategies and verifiable environment design are heavily inspired by recent advancements in Reinforcement Learning for verifiable reasoning.
+
+* **Reward Ideas:** Inspired by [arXiv:2601.19100](https://arxiv.org/abs/2601.19100).
+
+---
+
 ## 📦 Submission Artifacts
 
 | Artifact | Link |
```
docs/HFBlogPost.md
CHANGED

```diff
@@ -6,10 +6,12 @@
 
 When building AI agents to handle high-stakes regulatory tasks—like auditing clinical trials or reviewing New Drug Applications (NDAs)—the industry relies heavily on "LLM-as-a-Judge" evaluation.
 
-The problem? LLMs are easily manipulated soft graders. They hallucinate partial credit, struggle with strict biostatistics, and can be fooled by polite, confident formatting. In a software demo, that’s an edge case. At the FDA, a hallucinated drug approval costs lives.
+The problem? LLMs are easily manipulated soft graders. They hallucinate partial credit, struggle with strict biostatistics, and can be fooled by polite, confident formatting. In a software demo, that’s an edge case. At the FDA, a hallucinated drug approval costs lives. Over 10,000 biomedical papers have been retracted in the last decade, but only after potentially affecting **hundreds of thousands of enrolled patients** and leading to dangerous, wide-scale drug approvals (Source: *RetractionWatch / BMJ*).
 
 For the Meta PyTorch × Scaler Hackathon, our team decided to kill the soft-grading system. We built **PeerGuard**, an autonomous Review Board agent trained entirely inside a deterministic OpenEnv sandbox. If the math is wrong, the agent gets a zero. No partial credit. No vibes. Just Reinforcement Learning with Verifiable Rewards (RLVR).
 
+Our RLVR reward shaping strategies and verifiable environment design are heavily inspired by recent advancements in Reinforcement Learning, specifically the reward ideas outlined in [arXiv:2601.19100](https://arxiv.org/abs/2601.19100).
+
 Here is how we built it, why our early training runs collapsed, and how the agent achieved zero-shot generalization on raw CSV data.
 
 ## The Architecture: Deterministic Sandboxing
```
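The binary grading the blog post describes ("if the math is wrong, the agent gets a zero; no partial credit") can be sketched roughly as below. This is a minimal illustration of the RLVR idea, not PeerGuard's actual code: the function name, claim schema, and the mean-recomputation check are all hypothetical assumptions.

```python
# Hypothetical sketch of a binary verifiable reward (RLVR): the grader
# deterministically recomputes the statistic from the raw data and pays
# 1.0 only on an exact match within tolerance; everything else scores 0.0.
import statistics

def verifiable_reward(agent_claim: dict, raw_data: list[float]) -> float:
    """Return 1.0 iff the agent's reported mean matches the data."""
    true_mean = statistics.fmean(raw_data)
    claimed = agent_claim.get("reported_mean")
    if claimed is None:
        return 0.0  # schema violation: required field missing, hard zero
    # Deterministic check; no LLM judge, no partial credit.
    return 1.0 if abs(claimed - true_mean) < 1e-9 else 0.0

print(verifiable_reward({"reported_mean": 2.0}, [1.0, 2.0, 3.0]))  # 1.0
print(verifiable_reward({"reported_mean": 2.1}, [1.0, 2.0, 3.0]))  # 0.0
```

Because the reward is computed from the data rather than judged by a model, it cannot be swayed by confident formatting, which is the failure mode of soft graders the post calls out.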