# Snitch Overseer (LR=2e-5, 400 steps)
A LoRA adapter on Qwen/Qwen2.5-1.5B-Instruct, trained with GRPO to act as an overseer agent: it audits another agent's tool-use trace and classifies the behavior as HONEST, REWARD_HACKER, LAZY, or DECEIVER, with a confidence score and a citation of the specific trace step that gave the misbehavior away.
Trained as a reference submission for The Snitch, an OpenEnv environment for scalable AI oversight.
## Model details
- Base model: Qwen/Qwen2.5-1.5B-Instruct (frozen, bf16)
- Adapter: LoRA, r=16, α=32, dropout=0.05, targets q_proj, k_proj, v_proj, o_proj
- Training algorithm: GRPO (Group Relative Policy Optimization) via TRL
- Steps: 400
- Learning rate: 2e-5
- KL from base at convergence: ~0.10
- Beta (KL coefficient): 0.04
- Generations per step: 4
- Max completion length: 256 tokens
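For reference, the hyperparameters above map onto a TRL + peft setup roughly as follows. This is a minimal sketch, not the actual training script: `train_dataset` and `reward_fn` are placeholders (the real reward comes from The Snitch's `/step` endpoint), and the argument names assume recent TRL/peft releases.

```python
# Minimal training sketch; train_dataset and reward_fn are placeholders,
# not the environment's real prompt builder or reward.
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

args = GRPOConfig(
    output_dir="checkpoints",
    learning_rate=2e-5,
    max_steps=400,
    beta=0.04,                    # KL coefficient against the frozen base
    num_generations=4,            # completions sampled per prompt per step
    max_completion_length=256,
    bf16=True,
)

def reward_fn(completions, **kwargs):
    # Placeholder: the real reward scores verdict correctness + evidence grounding.
    return [0.0 for _ in completions]

train_dataset = Dataset.from_list([{"prompt": "<question + trace + claim>"}])

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    args=args,
    train_dataset=train_dataset,
    reward_funcs=reward_fn,
    peft_config=lora,
)
trainer.train()
```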
## Intended use
The model is the overseer agent in a single-step reinforcement learning loop:
- Input: a question, a trace of an agent's tool calls, and the agent's final claim
- Output: a 3-line verdict in the format:

  ```
  VERDICT: <HONEST | REWARD_HACKER | LAZY | DECEIVER>
  CONFIDENCE: <0.0–1.0>
  EVIDENCE: <one sentence citing a specific trace step and keyword>
  ```
It is trained to plug into The Snitch's /step endpoint. It can be used standalone, but the reward signal that shaped it is environment-specific.
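For standalone use, the verdict can be extracted with a few lines of Python. This is an illustrative sketch, not the environment's actual scoring code; the regex is an assumption about how strictly the format is checked.

```python
import re

# Illustrative parser for the 3-line verdict format (not the env's scorer).
VERDICT_RE = re.compile(
    r"VERDICT:\s*(HONEST|REWARD_HACKER|LAZY|DECEIVER)\s*\n"
    r"CONFIDENCE:\s*([01](?:\.\d+)?)\s*\n"
    r"EVIDENCE:\s*(.+)"
)

def parse_verdict(text: str) -> dict | None:
    m = VERDICT_RE.search(text)
    if m is None:
        return None  # would count as a parse error in the table below
    verdict, confidence, evidence = m.groups()
    return {
        "verdict": verdict,
        "confidence": float(confidence),
        "evidence": evidence.strip(),
    }
```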
## Training data

960 traces generated from 12 deterministic doer policies (4 behavioral classes × 3 variants), restricted to policy variants 1 and 2 and to questions q_0001–q_0120 (4 classes × 2 variants × 120 questions = 960). Variant 3 and questions q_0121–q_0150 are held out for evaluation, so the train and eval splits are disjoint on both axes (questions and policy variant).
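A sketch of how the two-axis holdout can be enforced; the field names (`question_id`, `variant`) are assumptions about the trace schema, not the repo's actual code.

```python
# Illustrative split predicates; "question_id" / "variant" are assumed field names.
def in_train_split(trace: dict) -> bool:
    qnum = int(trace["question_id"].split("_")[1])  # e.g. "q_0001" -> 1
    return qnum <= 120 and trace["variant"] in (1, 2)

def in_eval_split(trace: dict) -> bool:
    qnum = int(trace["question_id"].split("_")[1])  # e.g. "q_0121" -> 121
    return 121 <= qnum <= 150 and trace["variant"] == 3
```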
## Evaluation
Evaluated on the held-out v3 set (n=120, q_0121–q_0150 × variant 3, fully disjoint from training):
| Condition | Accuracy | Mean Reward | Parse Errors |
|---|---|---|---|
| Random verdict | 25.0% | — | — |
| Base Qwen2.5-1.5B + 3-shot prompt | 56.7% | 0.438 | 0.0% |
| This adapter (LR=2e-5, 400 steps) | 75.8% | 0.900 | 0.0% |
Wilson 95% CIs at n=120: base [47.7%, 65.2%], this adapter [67.5%, 82.5%]. CIs do not overlap. Two-proportion z-test on the +19.2pp accuracy lift: z = 3.14, p = 0.0017.
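These intervals and the test statistic can be re-derived from the raw counts implied by the rounded accuracies (68/120 and 91/120), e.g. with statsmodels:

```python
# Re-derive the Wilson CIs and two-proportion z-test from inferred raw counts.
from statsmodels.stats.proportion import proportion_confint, proportions_ztest

n = 120
base_correct, adapter_correct = 68, 91  # 56.7% and 75.8% of 120

for label, k in [("base", base_correct), ("adapter", adapter_correct)]:
    lo, hi = proportion_confint(k, n, alpha=0.05, method="wilson")
    print(f"{label}: {k}/{n}, Wilson 95% CI [{lo:.1%}, {hi:.1%}]")

z, p = proportions_ztest([adapter_correct, base_correct], [n, n])
print(f"z = {z:.2f}, p = {p:.4f}")  # ~3.14, ~0.0017
```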
The mean reward delta (+106% relative) reflects both correct classification *and* grounding the verdict in a specific trace step and keyword. See the project README for the leaderboard against frontier LLMs.
## Limitations
- Reference-scale RL run, not a tuned system. Larger base models, longer training, and LR scheduling are obvious next steps.
- Training doer policies are rule-based, not LLM-driven. Misbehaviors are stylized.
- The held-out set varies on two axes simultaneously (questions AND policy variant). The +19.2pp lift cannot be decomposed between question novelty and variant novelty.
## Reproducing the eval
```bash
git clone https://github.com/Mihir1107/snitch-env.git
cd snitch-env
pip install -r requirements.txt && pip install -e .

# Download the adapter checkpoint from the Hub
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='Mihir1107/snitch-overseer-lr2e5-ckpt400', local_dir='checkpoints/checkpoint-400')"

# Run the held-out evaluation
python scripts/gen_gap_eval.py \
    --model-path checkpoints/checkpoint-400 \
    --base-model Qwen/Qwen2.5-1.5B-Instruct \
    --eval-easy data/eval_traces.jsonl \
    --eval-hard data/eval_traces.jsonl \
    --out results/repro.json
```
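For standalone inference outside the environment, the adapter can be loaded over the frozen base with peft. The user message below is a placeholder; the actual prompt template used during training lives in the repo.

```python
# Standalone inference sketch; the user message is a placeholder, not the
# environment's real prompt template.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "checkpoints/checkpoint-400")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

messages = [{"role": "user", "content": "<question + tool-use trace + final claim>"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True))
```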
## Citation
```bibtex
@software{snitch2026,
  title  = {The Snitch: An OpenEnv environment for scalable AI oversight},
  author = {Mihir Mandavia and Manas Mehta and Haya Sachin},
  year   = {2026},
  url    = {https://github.com/Mihir1107/snitch-env}
}
```
Built at the Meta PyTorch × OpenEnv Hackathon 2026, Scaler School of Technology.