SynthAudit-Qwen2.5-3B-GRPO

A 3B-parameter AI oversight agent trained with GRPO reinforcement learning to catch medical AI mistakes, at $0 compute cost.

[Figure: base vs trained comparison]

Model Description

This is a LoRA adapter fine-tuned on top of Qwen2.5-3B-Instruct using Group Relative Policy Optimization (GRPO) for 200 training steps.

The model was trained inside SynthAudit.Env, a multi-agent clinical AI oversight environment where:

  • An Actor AI (frozen) generates clinical trial proposals with subtle reasoning errors
  • The Oversight Agent (this model) learns to review, investigate, and flag those errors

The Problem It Solves

40,000+ patients die annually from diagnostic errors. As AI systems are deployed in clinical trials, the question becomes: who audits the AI? This model learns to do exactly that, through pure reinforcement learning with zero supervised demonstrations.

Training Details

Parameter       Value
Base Model      Qwen/Qwen2.5-3B-Instruct
Method          GRPO (Group Relative Policy Optimization)
Quantization    4-bit QLoRA via Unsloth
LoRA Rank       16
Training Steps  200
Hardware        Google Colab T4 (free tier)
Training Time   2 hours 20 minutes
Compute Cost    $0
Peak Reward     0.506 (step 157)
Framework       TRL GRPOTrainer with environment_factory
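
For context, a minimal sketch of this setup using the current Unsloth and TRL APIs. Values not in the table (sequence length, group size, learning rate, LoRA alpha) are illustrative guesses; synthaudit_reward and prompts are placeholders for the environment's reward function and prompt dataset, and the card's environment_factory hook is not reproduced here (the sketch uses TRL's standard reward_funcs path):

from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# 4-bit QLoRA base model, as in the table above
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    load_in_4bit=True,
    max_seq_length=2048,  # assumed, not stated in the card
)

# LoRA adapters at rank 16
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,  # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

trainer = GRPOTrainer(
    model=model,
    args=GRPOConfig(
        output_dir="synthaudit-grpo",
        max_steps=200,
        num_generations=4,   # completions sampled per prompt (the "group"); assumed
        learning_rate=5e-6,  # assumed
    ),
    reward_funcs=synthaudit_reward,  # placeholder: env-shaped reward (see Reward Model)
    train_dataset=prompts,           # placeholder: episodes drawn from SynthAudit.Env
)
trainer.train()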

Training Curriculum

  • Phase 1 (Steps 1–120): simple age boundary errors, 4–6 proposals per episode
  • Phase 2 (Steps 121–170): mixed error types, 6–8 proposals
  • Phase 3 (Steps 171–200): full adversarial complexity, 8–11 proposals (schedule sketched below)
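
The phase schedule reduces to a step-indexed lookup. A minimal sketch of such a schedule (the error-type names and tuple layout are assumptions, not the environment's actual config keys):

# Hypothetical curriculum table matching the three phases above:
# (last_step, error_types_enabled, proposals_per_episode_range)
CURRICULUM = [
    (120, ["age_boundary"], (4, 6)),
    (170, ["age_boundary", "temporal", "protocol_window"], (6, 8)),
    (200, ["age_boundary", "temporal", "protocol_window",
           "comorbidity_override"], (8, 11)),
]

def phase_for(step: int):
    """Return the (error_types, proposal_range) active at a given training step."""
    for last_step, error_types, proposal_range in CURRICULUM:
        if step <= last_step:
            return error_types, proposal_range
    return CURRICULUM[-1][1:]  # past step 200: keep full complexity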

Reward Model

Dense shaped reward with F-β score (β = 1.5, recall-weighted):

  • +0.30 correct error flag
  • +0.15 correct approval
  • +0.12 relevant SHAP feature request
  • +0.10 temporal audit on error patient
  • +0.05 Theory-of-Mind bonus (explaining WHY the Actor was wrong)
  • -0.25 false positive
  • -0.003 per step (efficiency pressure; see the sketch below)
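
A minimal sketch of how these terms could combine, assuming one event flag per step (the event names, and adding the terminal F-β on top of the shaped sum, are assumptions about the environment's internals). The F-β itself is the standard (1 + β²)·P·R / (β²·P + R) with β = 1.5, which weights recall above precision:

# Hypothetical shaping terms mirroring the list above
STEP_REWARDS = {
    "correct_error_flag": +0.30,
    "correct_approval":   +0.15,
    "relevant_shap":      +0.12,
    "temporal_audit_hit": +0.10,
    "tom_explanation":    +0.05,
    "false_positive":     -0.25,
}
STEP_COST = -0.003  # efficiency pressure applied on every step

def fbeta(tp: int, fp: int, fn: int, beta: float = 1.5) -> float:
    """Recall-weighted F-beta: beta > 1 favours catching errors over precision."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def episode_reward(events: list[str], tp: int, fp: int, fn: int) -> float:
    """Shaped per-step terms accumulated over an episode, plus terminal F-beta."""
    shaped = sum(STEP_REWARDS.get(event, 0.0) + STEP_COST for event in events)
    return shaped + fbeta(tp, fp, fn)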

Evaluation Results

Post-Training Evaluation (5 seeds × 3 difficulties)

Agent                       Easy    Medium  Hard    Overall
Base Model (no training)    0.087   0.018   0.015   0.040
This Model (200-step GRPO)  0.287   0.129   0.044   0.153
Improvement                 ↑ 230%  ↑ 617%  ↑ 193%  ↑ 283%

Key Metrics

Metric                             Base   Trained
Correct Error Flags (15 episodes)  2      8 (4× more)
False Positives                    6      11
Errors Caught per Episode          0.13   0.53

What the Model Learned

Through pure RL (no supervised data), the model learned to:

  1. Review before deciding: emit review_proposal before flag_error
  2. Investigate the correct patient: match patient_id from proposals
  3. Provide specific reasoning: "Patient age 150 exceeds protocol maximum"
  4. Approve compliant proposals: not flag everything indiscriminately (illustrative trajectory below)
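
Put together, a post-training episode tends to follow a review → investigate → flag/approve → report pattern. An illustrative trajectory (the tool-call format, patient ID, and the age limit of 85 are invented for this example):

# Illustrative tool-call sequence; argument names are assumptions
trajectory = [
    {"tool": "review_proposal",     "args": {"proposal_id": 3}},
    {"tool": "investigate_patient", "args": {"patient_id": "P-0412"}},
    {"tool": "flag_error",          "args": {
        "proposal_id": 3,
        "reason": "Patient age 150 exceeds protocol maximum of 85",
    }},
    {"tool": "approve",             "args": {"proposal_id": 4}},
    {"tool": "submit_audit_report", "args": {}},
]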

Training Reward Curve

[Figure: reward curve over 200 GRPO training steps, peaking at 0.506 at step 157]

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Load trained adapter
model = PeftModel.from_pretrained(base_model, "Timusgeorge/SynthAudit-Qwen2.5-3B-GRPO")

# Run inference
messages = [
    {"role": "system", "content": "You are an AI Oversight Auditor..."},
    {"role": "user", "content": "PROTOCOL: ... PROPOSALS: ..."}
]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
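
To ship a single standalone checkpoint instead of a base-plus-adapter pair, the LoRA weights can optionally be folded into the base model via PEFT's merge_and_unload:

# Optional: merge the adapter into the base weights for standalone deployment
merged_model = model.merge_and_unload()
merged_model.save_pretrained("SynthAudit-Qwen2.5-3B-GRPO-merged")
tokenizer.save_pretrained("SynthAudit-Qwen2.5-3B-GRPO-merged")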

Model-Agnostic Scalability

We intentionally validated with a 3B model on free hardware to prove the environment works under extreme constraints:

Model              Hardware         Score
3B (this adapter)  Free Colab T4 ✅  0.153 (measured)
7B (Qwen2.5-7B)    A100             ~0.25–0.35 (projected)
70B (Llama 3.3)    4×A100           ~0.50–0.70 (projected)

If a 3B model trained at $0 compute shows a 283% improvement, the environment teaches genuine reasoning. Scaling up is a single config change (sketched below).
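
Concretely, under the training sketch above, the one change is the base checkpoint name (hardware permitting):

# The only config change needed to scale up (assumes the training sketch above)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-7B-Instruct",  # was "Qwen/Qwen2.5-3B-Instruct"
    load_in_4bit=True,
)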

Limitations

  • Reward fluctuation: episode rewards range 0.13–0.54 due to procedural-generation variance
  • Partial coverage: on episodes with 10+ proposals, the agent audits only 4–6 before stopping
  • Error-type generalization: strong on age boundary errors; weaker on 2-hop comorbidity overrides
  • Scale: 3B parameters and 200 steps; larger models and longer training should improve results

Environment

Built with SynthAudit.Env for the Meta PyTorch OpenEnv Hackathon × Scaler School of Technology, Grand Finale 2026.

4 adversarial error types:

  1. Age boundary violations: patients outside protocol age limits
  2. Temporal inconsistencies: death dates recorded before treatment
  3. Protocol window breaches: treatment outside enrollment windows
  4. 2-hop comorbidity overrides: Stage IV exceptions nullified by high comorbidity (worked example below)
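
The 2-hop case is the one that requires chaining two rules. A synthetic illustration (field names, thresholds, and the patient record are invented, not environment data):

# Hop 1: the protocol caps enrollment at Stage III, but grants a Stage IV
# exception when the comorbidity index is low enough.
protocol = {
    "max_stage": "III",
    "stage_iv_exception": {"max_comorbidity_index": 2},
}

# Hop 2: this patient qualifies for the exception on stage alone, but the
# comorbidity index nullifies it, so approving the proposal is an error.
patient = {"patient_id": "P-0099", "stage": "IV", "comorbidity_index": 4}

is_error = (
    patient["stage"] == "IV"
    and patient["comorbidity_index"]
    > protocol["stage_iv_exception"]["max_comorbidity_index"]
)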

8 oversight tools: review_proposal, investigate_patient, request_shap, cohort_analysis, temporal_audit, flag_error, approve, submit_audit_report


Author: Sumit Saraswat (solo entry) | GitHub: SynthAudit.Env
