Bhavishya011 committed on
Commit 9d64412 · 1 Parent(s): 696431a

docs: add concrete statistics and cite arXiv:2601.19100

Files changed (2)
  1. README.md +9 -1
  2. docs/HFBlogPost.md +3 -1
README.md CHANGED
@@ -18,7 +18,7 @@ An OpenEnv-compliant RL training environment where AI agents act as **FDA Lead R
 
 ## 🛑 The Verification Gap
 
-The scientific replication crisis affects an estimated 50–70% of published research. In clinical trials, a methodological flaw or undisclosed data exclusion isn't just a statistical error—it costs lives.
+The scientific replication crisis affects an estimated 50–70% of published research. In clinical trials, a methodological flaw or undisclosed data exclusion isn't just a statistical error—it costs lives. Over 10,000 biomedical papers have been retracted in the last decade, but only after potentially affecting **hundreds of thousands of enrolled patients** and influencing drug-approval decisions at scale (Source: *RetractionWatch / BMJ*).
 
 Current LLMs are incredible at *generation* but fail catastrophically at rigorous, multi-step *verification*. When presented with a complex clinical protocol and raw CSV data, baseline models hallucinate, struggle to follow strict reporting schemas, and fail to independently verify biostatistical claims.
 
@@ -26,6 +26,14 @@ Current LLMs are incredible at *generation* but fail catastrophically at rigorou
 
 ---
 
+## 📚 Citations & Inspiration
+
+Our deterministic RLVR reward shaping strategies and verifiable environment design are heavily inspired by recent advances in reinforcement learning for verifiable reasoning.
+
+* **Reward Ideas:** Inspired by [arXiv:2601.19100](https://arxiv.org/abs/2601.19100).
+
+---
+
 ## 📦 Submission Artifacts
 
 | Artifact | Link |
docs/HFBlogPost.md CHANGED
@@ -6,10 +6,12 @@
 
 When building AI agents to handle high-stakes regulatory tasks—like auditing clinical trials or reviewing New Drug Applications (NDAs)—the industry relies heavily on "LLM-as-a-Judge" evaluation.
 
-The problem? LLMs are easily manipulated soft graders. They hallucinate partial credit, struggle with strict biostatistics, and can be fooled by polite, confident formatting. In a software demo, that’s an edge case. At the FDA, a hallucinated drug approval costs lives.
+The problem? LLMs are easily manipulated soft graders. They hallucinate partial credit, struggle with strict biostatistics, and can be fooled by polite, confident formatting. In a software demo, that’s an edge case. At the FDA, a hallucinated drug approval costs lives. Over 10,000 biomedical papers have been retracted in the last decade, but only after potentially affecting **hundreds of thousands of enrolled patients** and influencing drug-approval decisions at scale (Source: *RetractionWatch / BMJ*).
 
 For the Meta PyTorch × Scaler Hackathon, our team decided to kill the soft-grading system. We built **PeerGuard**, an autonomous Review Board agent trained entirely inside a deterministic OpenEnv sandbox. If the math is wrong, the agent gets a zero. No partial credit. No vibes. Just Reinforcement Learning with Verifiable Rewards (RLVR).
 
+Our RLVR reward shaping strategies and verifiable environment design are heavily inspired by recent advances in reinforcement learning, specifically the reward ideas outlined in [arXiv:2601.19100](https://arxiv.org/abs/2601.19100).
+
 Here is how we built it, why our early training runs collapsed, and how the agent achieved zero-shot generalization on raw CSV data.
 
 ## The Architecture: Deterministic Sandboxing
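
The all-or-nothing RLVR grading described in the blog post ("if the math is wrong, the agent gets a zero") can be sketched as a deterministic verifier that recomputes the statistic from raw data and awards a binary reward. This is a minimal illustrative sketch, not PeerGuard's actual code; the names `Claim`, `recompute`, and `verify_claim`, and the single `mean_reduction` metric, are all hypothetical.

```python
# Hypothetical sketch of an RLVR-style binary reward: the grader recomputes
# the claimed statistic deterministically from raw data and awards 1.0 only
# on an exact (tolerance-bounded) match. Names are illustrative, not PeerGuard's.
import math
from dataclasses import dataclass


@dataclass
class Claim:
    metric: str            # e.g. "mean_reduction"
    reported_value: float  # value the agent's review asserts


def recompute(metric: str, data: list[float]) -> float:
    """Deterministic ground-truth computation from the raw values."""
    if metric == "mean_reduction":
        return sum(data) / len(data)
    raise ValueError(f"unknown metric: {metric}")


def verify_claim(claim: Claim, data: list[float], tol: float = 1e-9) -> float:
    """Binary reward: 1.0 iff the reported value matches the recomputation.

    No partial credit: any numerical error yields 0.0, so the reward
    cannot be gamed by confident formatting or a persuasive judge prompt.
    """
    truth = recompute(claim.metric, data)
    return 1.0 if math.isclose(claim.reported_value, truth, abs_tol=tol) else 0.0


data = [4.0, 6.0, 8.0]
print(verify_claim(Claim("mean_reduction", 6.0), data))   # 1.0
print(verify_claim(Claim("mean_reduction", 6.01), data))  # 0.0
```

Because the verifier is pure arithmetic over the same CSV the agent sees, the reward is reproducible across runs, which is what makes the environment "deterministic" in the OpenEnv sense described above.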