# SynthAudit-Qwen2.5-3B-GRPO
A 3B-parameter AI oversight agent trained to catch medical AI mistakes using GRPO reinforcement learning, at $0 compute cost.
## Model Description
This is a LoRA adapter fine-tuned on top of Qwen2.5-3B-Instruct using Group Relative Policy Optimization (GRPO) for 200 training steps.
The model was trained inside SynthAudit.Env, a multi-agent clinical AI oversight environment where:
- An Actor AI (frozen) generates clinical trial proposals with subtle reasoning errors
- The Oversight Agent (this model) learns to review, investigate, and flag those errors
## The Problem It Solves
40,000+ patients die annually from diagnostic errors. As AI systems are deployed in clinical trials, the question becomes: who audits the AI? This model learns to do exactly that, through pure reinforcement learning with zero supervised demonstrations.
## Training Details
| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-3B-Instruct |
| Method | GRPO (Group Relative Policy Optimization) |
| Quantization | 4-bit QLoRA via Unsloth |
| LoRA Rank | 16 |
| Training Steps | 200 |
| Hardware | Google Colab T4 (free tier) |
| Training Time | 2 hours 20 minutes |
| Compute Cost | $0 |
| Peak Reward | 0.506 (step 157) |
| Framework | TRL GRPOTrainer with environment_factory |
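The table implies a setup along the following lines. This is a minimal sketch using Unsloth's QLoRA API, not the actual training script (which is not included in this card); `max_seq_length`, `lora_alpha`, and `target_modules` are assumptions.

```python
# Minimal sketch of the 4-bit QLoRA setup implied by the table above
# (rank-16 LoRA via Unsloth). Values marked "assumption" are not from the card.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    load_in_4bit=True,        # 4-bit QLoRA quantization (from the table)
    max_seq_length=2048,      # assumption: not stated in the card
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                     # LoRA rank 16 (from the table)
    lora_alpha=16,            # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # common choice; assumption
)
```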
### Training Curriculum
- Phase 1 (Steps 1–120): Simple age boundary errors, 4–6 proposals per episode
- Phase 2 (Steps 121–170): Mixed error types, 6–8 proposals
- Phase 3 (Steps 171–200): Full adversarial complexity, 8–11 proposals
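A minimal sketch of such a step-based schedule is below. It is hypothetical: the error-type identifiers and the actual SynthAudit.Env configuration API are assumptions; only the step boundaries and proposal counts come from the phases above.

```python
# Hypothetical curriculum schedule matching the three phases above.
def curriculum_phase(step: int) -> dict:
    if step <= 120:  # Phase 1: simple age boundary errors
        return {"error_types": ["age_boundary"], "proposals_per_episode": (4, 6)}
    if step <= 170:  # Phase 2: mixed error types (assumed mix)
        return {"error_types": ["age_boundary", "temporal", "protocol_window"],
                "proposals_per_episode": (6, 8)}
    # Phase 3: full adversarial complexity (all 4 error types)
    return {"error_types": ["age_boundary", "temporal", "protocol_window",
                            "comorbidity_override"],
            "proposals_per_episode": (8, 11)}
```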
### Reward Model
Dense shaped reward with an F-β score (β = 1.5, recall-weighted):

- `+0.30` correct error flag
- `+0.15` correct approval
- `+0.12` relevant SHAP feature request
- `+0.10` temporal audit on error patient
- `+0.05` Theory-of-Mind bonus (explaining WHY the Actor was wrong)
- `-0.25` false positive
- `-0.003` per step (efficiency pressure)
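For reference, this is the standard F-β definition behind the recall weighting; any β > 1 weights recall more heavily than precision:

$$
F_\beta = (1 + \beta^2)\,\frac{\text{precision}\cdot\text{recall}}{\beta^2\,\text{precision} + \text{recall}}, \qquad \beta = 1.5
$$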
## Evaluation Results
### Post-Training Evaluation (5 seeds × 3 difficulties)
| Agent | Easy | Medium | Hard | Overall |
|---|---|---|---|---|
| Base Model (no training) | 0.087 | 0.018 | 0.015 | 0.040 |
| This Model (200-step GRPO) | 0.287 | 0.129 | 0.044 | 0.153 |
| Improvement | +230% | +617% | +193% | +283% |
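Improvement here is relative gain over the base model, (trained − base) / base; for example, on Easy: (0.287 − 0.087) / 0.087 ≈ 2.30, i.e. +230%.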
### Key Metrics
| Metric | Base | Trained |
|---|---|---|
| Correct Error Flags (15 episodes) | 2 | 8 (4× more) |
| False Positives | 6 | 11 |
| Errors Caught per Episode | 0.13 | 0.53 |
## What the Model Learned
Through pure RL (no supervised data), the model learned to:
- Review before deciding: emit `review_proposal` before `flag_error`
- Investigate the correct patient: match `patient_id` from proposals
- Provide specific reasoning: e.g. "Patient age 150 exceeds protocol maximum"
- Approve compliant proposals: not flag everything indiscriminately
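For illustration, a plausible audit trajectory exhibiting this learned pattern might look like the following. This is hypothetical: the tool names come from the Environment section below, but the exact action schema and argument names are assumptions.

```python
# Hypothetical action trace showing the learned review -> investigate -> flag pattern.
# Tool names match the environment's 8 oversight tools; argument names are assumed.
trajectory = [
    {"tool": "review_proposal",     "args": {"proposal_id": 3}},
    {"tool": "investigate_patient", "args": {"patient_id": "P-0042"}},
    {"tool": "flag_error",          "args": {"proposal_id": 3,
                                             "reason": "Patient age 150 exceeds protocol maximum"}},
    {"tool": "approve",             "args": {"proposal_id": 4}},
    {"tool": "submit_audit_report", "args": {}},
]
```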
## Training Reward Curve

*(Reward-curve figure not reproduced here; peak episode reward was 0.506 at step 157.)*
## Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Load trained adapter
model = PeftModel.from_pretrained(base_model, "Timusgeorge/SynthAudit-Qwen2.5-3B-GRPO")

# Run inference
messages = [
    {"role": "system", "content": "You are an AI Oversight Auditor..."},
    {"role": "user", "content": "PROTOCOL: ... PROPOSALS: ..."},
]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
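Optionally, the adapter can be merged into the base weights for standalone deployment. This uses standard PEFT functionality and is not specific to this adapter; the output path is arbitrary.

```python
# Optional: merge the LoRA adapter into the base model for standalone deployment.
merged = model.merge_and_unload()
merged.save_pretrained("synthaudit-qwen2.5-3b-merged")    # output path is arbitrary
tokenizer.save_pretrained("synthaudit-qwen2.5-3b-merged")
```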
## Model-Agnostic Scalability
We intentionally validated with a 3B model on free hardware to prove the environment works under extreme constraints:
| Model | Hardware | Score |
|---|---|---|
| 3B (this adapter) | Free Colab T4 | 0.153 (measured) |
| 7B (Qwen2.5-7B) | A100 | ~0.25–0.35 (projected) |
| 70B (Llama 3.3) | 4×A100 | ~0.50–0.70 (projected) |
If a $0-compute 3B model shows a 283% improvement, the environment teaches genuine reasoning; scaling up is a single configuration change.
## Limitations
- Reward fluctuation: episode rewards range 0.13–0.54 due to procedural-generation variance
- Partial coverage: on episodes with 10+ proposals, the agent audits only 4–6 proposals before stopping
- Error-type generalization: strong on age boundary errors; weaker on 2-hop comorbidity overrides
- Scale: 3B parameters and 200 training steps; larger models and longer training would improve results
## Environment
Built with SynthAudit.Env for the Meta PyTorch OpenEnv Hackathon × Scaler School of Technology, Grand Finale 2026.
4 adversarial error types:
- Age boundary violations: patients outside protocol limits
- Temporal inconsistencies: death dates before treatment
- Protocol window breaches: treatment outside enrollment windows
- 2-hop comorbidity overrides: Stage IV exceptions nullified by high comorbidity
8 oversight tools: `review_proposal`, `investigate_patient`, `request_shap`, `cohort_analysis`, `temporal_audit`, `flag_error`, `approve`, `submit_audit_report`
Author: Sumit Saraswat | Solo entry | GitHub: SynthAudit.Env