# Snitch Overseer (LR=2e-5, 400 steps)
A LoRA adapter on Qwen/Qwen2.5-1.5B-Instruct, trained with GRPO to act as an overseer agent: it audits another agent's tool-use trace and classifies the behavior as HONEST, REWARD_HACKER, LAZY, or DECEIVER, with a confidence score and a citation of the specific trace step that gave the misbehavior away.
Trained as a reference submission for The Snitch, an OpenEnv environment for scalable AI oversight.
## Model details
- Base model: Qwen/Qwen2.5-1.5B-Instruct (frozen, bf16)
- Adapter: LoRA, r=16, α=32, dropout=0.05, targets q_proj, k_proj, v_proj, o_proj
- Training algorithm: GRPO (Group Relative Policy Optimization) via TRL
- Steps: 400
- Learning rate: 2e-5
- KL from base at convergence: ~0.10
- Beta (KL coefficient): 0.04
- Generations per step: 4
- Max completion length: 256 tokens
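For reference, the hyperparameters above map onto a TRL + peft setup roughly as follows. This is a minimal sketch, not the actual training script: `train_dataset` and `reward_fn` are placeholders (the real reward comes from The Snitch's `/step` endpoint), and the argument names assume recent TRL/peft releases.

```python
# Minimal training sketch; train_dataset and reward_fn are placeholders,
# not the environment's real prompt builder or reward.
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

args = GRPOConfig(
    output_dir="checkpoints",
    learning_rate=2e-5,
    max_steps=400,
    beta=0.04,                    # KL coefficient against the frozen base
    num_generations=4,            # completions sampled per prompt per step
    max_completion_length=256,
    bf16=True,
)

def reward_fn(completions, **kwargs):
    # Placeholder: the real reward scores verdict correctness + evidence grounding.
    return [0.0 for _ in completions]

train_dataset = Dataset.from_list([{"prompt": "<question + trace + claim>"}])

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    args=args,
    train_dataset=train_dataset,
    reward_funcs=reward_fn,
    peft_config=lora,
)
trainer.train()
```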
## Intended use
The model is the overseer agent in a single-step reinforcement learning loop:
- Input: a question, a trace of an agent's tool calls, and the agent's final claim
- Output: a 3-line verdict in the format:

  ```
  VERDICT: <HONEST | REWARD_HACKER | LAZY | DECEIVER>
  CONFIDENCE: <0.0–1.0>
  EVIDENCE: <one sentence citing a specific trace step and keyword>
  ```
It is trained to plug into The Snitch's /step endpoint. It can be used standalone, but the reward signal that shaped it is environment-specific.
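For standalone use, the verdict can be extracted with a few lines of Python. This is an illustrative sketch, not the environment's actual scoring code; the regex is an assumption about how strictly the format is checked.

```python
import re

# Illustrative parser for the 3-line verdict format (not the env's scorer).
VERDICT_RE = re.compile(
    r"VERDICT:\s*(HONEST|REWARD_HACKER|LAZY|DECEIVER)\s*\n"
    r"CONFIDENCE:\s*([01](?:\.\d+)?)\s*\n"
    r"EVIDENCE:\s*(.+)"
)

def parse_verdict(text: str) -> dict | None:
    m = VERDICT_RE.search(text)
    if m is None:
        return None  # would count as a parse error in the table below
    verdict, confidence, evidence = m.groups()
    return {
        "verdict": verdict,
        "confidence": float(confidence),
        "evidence": evidence.strip(),
    }
```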
## Training data

960 traces generated from 12 deterministic doer policies (4 behavioral classes × 3 variants), restricted to policy variants 1 and 2 and to questions q_0001–q_0120 (4 classes × 2 variants × 120 questions = 960). Variant 3 and questions q_0121–q_0150 are held out for evaluation, so the train and eval splits are disjoint on both axes (questions and policy variant).
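A sketch of how the two-axis holdout can be enforced; the field names (`question_id`, `variant`) are assumptions about the trace schema, not the repo's actual code.

```python
# Illustrative split predicates; "question_id" / "variant" are assumed field names.
def in_train_split(trace: dict) -> bool:
    qnum = int(trace["question_id"].split("_")[1])  # e.g. "q_0001" -> 1
    return qnum <= 120 and trace["variant"] in (1, 2)

def in_eval_split(trace: dict) -> bool:
    qnum = int(trace["question_id"].split("_")[1])  # e.g. "q_0121" -> 121
    return 121 <= qnum <= 150 and trace["variant"] == 3
```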
## Evaluation
Evaluated on the held-out v3 set (n=120, q_0121–q_0150 × variant 3, fully disjoint from training):
| Condition | Accuracy | Mean Reward | Parse Errors |
|---|---|---|---|
| Random verdict | 25.0% | — | — |
| Base Qwen2.5-1.5B + 3-shot prompt | 56.7% | 0.438 | 0.0% |
| This adapter (LR=2e-5, 400 steps) | 75.8% | 0.900 | 0.0% |
Wilson 95% CIs at n=120: base [47.7%, 65.2%], this adapter [67.5%, 82.5%]. CIs do not overlap. Two-proportion z-test on the +19.2pp accuracy lift: z = 3.14, p = 0.0017.
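These intervals and the test statistic can be re-derived from the raw counts implied by the rounded accuracies (68/120 and 91/120), e.g. with statsmodels:

```python
# Re-derive the Wilson CIs and two-proportion z-test from inferred raw counts.
from statsmodels.stats.proportion import proportion_confint, proportions_ztest

n = 120
base_correct, adapter_correct = 68, 91  # 56.7% and 75.8% of 120

for label, k in [("base", base_correct), ("adapter", adapter_correct)]:
    lo, hi = proportion_confint(k, n, alpha=0.05, method="wilson")
    print(f"{label}: {k}/{n}, Wilson 95% CI [{lo:.1%}, {hi:.1%}]")

z, p = proportions_ztest([adapter_correct, base_correct], [n, n])
print(f"z = {z:.2f}, p = {p:.4f}")  # ~3.14, ~0.0017
```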
The mean reward delta (+106% relative) reflects both correct classification *and* grounding the verdict in a specific trace step and keyword. See the project README for the leaderboard against frontier LLMs.
## Limitations
- Reference-scale RL run, not a tuned system. Larger base models, longer training, and LR scheduling are obvious next steps.
- Training doer policies are rule-based, not LLM-driven. Misbehaviors are stylized.
- The held-out set varies on two axes simultaneously (questions AND policy variant). The +19.2pp lift cannot be decomposed between question novelty and variant novelty.
## Reproducing the eval
```bash
git clone https://github.com/Mihir1107/snitch-env.git
cd snitch-env
pip install -r requirements.txt && pip install -e .

# Download the adapter checkpoint from the Hub
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='Mihir1107/snitch-overseer-lr2e5-ckpt400', local_dir='checkpoints/checkpoint-400')"

# Run the held-out evaluation
python scripts/gen_gap_eval.py \
    --model-path checkpoints/checkpoint-400 \
    --base-model Qwen/Qwen2.5-1.5B-Instruct \
    --eval-easy data/eval_traces.jsonl \
    --eval-hard data/eval_traces.jsonl \
    --out results/repro.json
```
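For standalone inference outside the environment, the adapter can be loaded over the frozen base with peft. The user message below is a placeholder; the actual prompt template used during training lives in the repo.

```python
# Standalone inference sketch; the user message is a placeholder, not the
# environment's real prompt template.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "checkpoints/checkpoint-400")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

messages = [{"role": "user", "content": "<question + tool-use trace + final claim>"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True))
```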
## Citation
```bibtex
@software{snitch2026,
  title  = {The Snitch: An OpenEnv environment for scalable AI oversight},
  author = {Mihir Mandavia and Manas Mehta and Haya Sachin},
  year   = {2026},
  url    = {https://github.com/Mihir1107/snitch-env}
}
```
Built at the Meta PyTorch × OpenEnv Hackathon 2026, Scaler School of Technology.