# SynthAudit-Qwen2.5-3B-GRPO
A 3B-parameter AI oversight agent trained to catch medical AI mistakes using GRPO reinforcement learning, at $0 compute cost.
## Model Description
This is a LoRA adapter fine-tuned on top of Qwen2.5-3B-Instruct using Group Relative Policy Optimization (GRPO) for 200 training steps.
The model was trained inside SynthAudit.Env, a multi-agent clinical AI oversight environment where:
- An Actor AI (frozen) generates clinical trial proposals with subtle reasoning errors
- The Oversight Agent (this model) learns to review, investigate, and flag those errors
## The Problem It Solves
40,000+ patients die annually from diagnostic errors. As AI systems are deployed in clinical trials, the question becomes: who audits the AI? This model learns to do exactly that, through pure reinforcement learning with zero supervised demonstrations.
## Training Details
| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-3B-Instruct |
| Method | GRPO (Group Relative Policy Optimization) |
| Quantization | 4-bit QLoRA via Unsloth |
| LoRA Rank | 16 |
| Training Steps | 200 |
| Hardware | Google Colab T4 (free tier) |
| Training Time | 2 hours 20 minutes |
| Compute Cost | $0 |
| Peak Reward | 0.506 (step 157) |
| Framework | TRL GRPOTrainer with environment_factory |
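The table implies a setup along the following lines. This is a minimal sketch using Unsloth's QLoRA API, not the actual training script (which is not included in this card); `max_seq_length`, `lora_alpha`, and `target_modules` are assumptions.

```python
# Minimal sketch of the 4-bit QLoRA setup implied by the table above
# (rank-16 LoRA via Unsloth). Values marked "assumption" are not from the card.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    load_in_4bit=True,        # 4-bit QLoRA quantization (from the table)
    max_seq_length=2048,      # assumption: not stated in the card
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                     # LoRA rank 16 (from the table)
    lora_alpha=16,            # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # common choice; assumption
)
```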
### Training Curriculum
- Phase 1 (Steps 1–120): Simple age boundary errors, 4–6 proposals per episode
- Phase 2 (Steps 121–170): Mixed error types, 6–8 proposals
- Phase 3 (Steps 171–200): Full adversarial complexity, 8–11 proposals
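A minimal sketch of such a step-based schedule is below. It is hypothetical: the error-type identifiers and the actual SynthAudit.Env configuration API are assumptions; only the step boundaries and proposal counts come from the phases above.

```python
# Hypothetical curriculum schedule matching the three phases above.
def curriculum_phase(step: int) -> dict:
    if step <= 120:  # Phase 1: simple age boundary errors
        return {"error_types": ["age_boundary"], "proposals_per_episode": (4, 6)}
    if step <= 170:  # Phase 2: mixed error types (assumed mix)
        return {"error_types": ["age_boundary", "temporal", "protocol_window"],
                "proposals_per_episode": (6, 8)}
    # Phase 3: full adversarial complexity (all 4 error types)
    return {"error_types": ["age_boundary", "temporal", "protocol_window",
                            "comorbidity_override"],
            "proposals_per_episode": (8, 11)}
```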
### Reward Model
Dense shaped reward with an F-β score (β = 1.5, recall-weighted):

- `+0.30` correct error flag
- `+0.15` correct approval
- `+0.12` relevant SHAP feature request
- `+0.10` temporal audit on error patient
- `+0.05` Theory-of-Mind bonus (explaining WHY the Actor was wrong)
- `-0.25` false positive
- `-0.003` per step (efficiency pressure)
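For reference, this is the standard F-β definition behind the recall weighting; any β > 1 weights recall more heavily than precision:

$$
F_\beta = (1 + \beta^2)\,\frac{\text{precision}\cdot\text{recall}}{\beta^2\,\text{precision} + \text{recall}}, \qquad \beta = 1.5
$$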
## Evaluation Results
### Post-Training Evaluation (5 seeds × 3 difficulties)
| Agent | Easy | Medium | Hard | Overall |
|---|---|---|---|---|
| Base Model (no training) | 0.087 | 0.018 | 0.015 | 0.040 |
| This Model (200-step GRPO) | 0.287 | 0.129 | 0.044 | 0.153 |
| Improvement | +230% | +617% | +193% | +283% |
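Improvement here is relative gain over the base model, (trained − base) / base; for example, on Easy: (0.287 − 0.087) / 0.087 ≈ 2.30, i.e. +230%.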
### Key Metrics
| Metric | Base | Trained |
|---|---|---|
| Correct Error Flags (15 episodes) | 2 | 8 (4× more) |
| False Positives | 6 | 11 |
| Errors Caught per Episode | 0.13 | 0.53 |
## What the Model Learned
Through pure RL (no supervised data), the model learned to:
- Review before deciding: emit `review_proposal` before `flag_error`
- Investigate the correct patient: match `patient_id` from proposals
- Provide specific reasoning: e.g. "Patient age 150 exceeds protocol maximum"
- Approve compliant proposals: not flag everything indiscriminately
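For illustration, a plausible audit trajectory exhibiting this learned pattern might look like the following. This is hypothetical: the tool names come from the Environment section below, but the exact action schema and argument names are assumptions.

```python
# Hypothetical action trace showing the learned review -> investigate -> flag pattern.
# Tool names match the environment's 8 oversight tools; argument names are assumed.
trajectory = [
    {"tool": "review_proposal",     "args": {"proposal_id": 3}},
    {"tool": "investigate_patient", "args": {"patient_id": "P-0042"}},
    {"tool": "flag_error",          "args": {"proposal_id": 3,
                                             "reason": "Patient age 150 exceeds protocol maximum"}},
    {"tool": "approve",             "args": {"proposal_id": 4}},
    {"tool": "submit_audit_report", "args": {}},
]
```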
## Training Reward Curve

*(Reward-curve figure not reproduced here; peak episode reward was 0.506 at step 157.)*
## Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Load trained adapter
model = PeftModel.from_pretrained(base_model, "Timusgeorge/SynthAudit-Qwen2.5-3B-GRPO")

# Run inference
messages = [
    {"role": "system", "content": "You are an AI Oversight Auditor..."},
    {"role": "user", "content": "PROTOCOL: ... PROPOSALS: ..."},
]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
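Optionally, the adapter can be merged into the base weights for standalone deployment. This uses standard PEFT functionality and is not specific to this adapter; the output path is arbitrary.

```python
# Optional: merge the LoRA adapter into the base model for standalone deployment.
merged = model.merge_and_unload()
merged.save_pretrained("synthaudit-qwen2.5-3b-merged")    # output path is arbitrary
tokenizer.save_pretrained("synthaudit-qwen2.5-3b-merged")
```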
## Model-Agnostic Scalability
We intentionally validated with a 3B model on free hardware to prove the environment works under extreme constraints:
| Model | Hardware | Score |
|---|---|---|
| 3B (this adapter) | Free Colab T4 | 0.153 (measured) |
| 7B (Qwen2.5-7B) | A100 | ~0.25–0.35 (projected) |
| 70B (Llama 3.3) | 4×A100 | ~0.50–0.70 (projected) |
If a $0-compute 3B model shows a 283% improvement, the environment teaches genuine reasoning; scaling up is a single configuration change.
## Limitations
- Reward fluctuation: episode rewards range 0.13–0.54 due to procedural-generation variance
- Partial coverage: on episodes with 10+ proposals, the agent audits only 4–6 proposals before stopping
- Error-type generalization: strong on age boundary errors; weaker on 2-hop comorbidity overrides
- Scale: 3B parameters and 200 training steps; larger models and longer training would improve results
## Environment
Built with SynthAudit.Env for the Meta PyTorch OpenEnv Hackathon × Scaler School of Technology, Grand Finale 2026.
4 adversarial error types:
- Age boundary violations: patients outside protocol limits
- Temporal inconsistencies: death dates before treatment
- Protocol window breaches: treatment outside enrollment windows
- 2-hop comorbidity overrides: Stage IV exceptions nullified by high comorbidity
8 oversight tools: `review_proposal`, `investigate_patient`, `request_shap`, `cohort_analysis`, `temporal_audit`, `flag_error`, `approve`, `submit_audit_report`
Author: Sumit Saraswat | Solo entry | GitHub: SynthAudit.Env