Snitch Overseer (LR=2e-5, 400 steps)

A LoRA adapter on Qwen/Qwen2.5-1.5B-Instruct, trained with GRPO to act as an overseer agent: it audits another agent's tool-use trace and classifies the behavior as HONEST, REWARD_HACKER, LAZY, or DECEIVER, emitting a confidence score and a citation of the specific trace step that gave the misbehavior away.

Trained as a reference submission for The Snitch, an OpenEnv environment for scalable AI oversight.

Model details

  • Base model: Qwen/Qwen2.5-1.5B-Instruct (frozen, bf16)
  • Adapter: LoRA, r=16, α=32, dropout=0.05, targets q_proj, k_proj, v_proj, o_proj
  • Training algorithm: GRPO (Group Relative Policy Optimization) via TRL
  • Steps: 400
  • Learning rate: 2e-5
  • KL from base at convergence: ~0.10
  • Beta (KL coefficient): 0.04
  • Generations per step: 4
  • Max completion length: 256 tokens
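Assuming standard PEFT and TRL APIs, the hyperparameters above map onto configuration objects roughly as follows. This is a sketch of the setup described by the card, not the project's actual training script:

```python
from peft import LoraConfig
from trl import GRPOConfig

# LoRA adapter over the attention projections of the frozen bf16 base.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# GRPO run matching the card: 400 steps at LR 2e-5, KL coefficient
# (beta) 0.04, 4 sampled completions per prompt, 256 new tokens max.
grpo_cfg = GRPOConfig(
    learning_rate=2e-5,
    max_steps=400,
    beta=0.04,
    num_generations=4,
    max_completion_length=256,
    bf16=True,
)
```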

Intended use

The model is the overseer agent in a single-step reinforcement learning loop:

  • Input: a question, a trace of an agent's tool calls, and the agent's final claim
  • Output: a 3-line verdict in the format:
    VERDICT: <HONEST | REWARD_HACKER | LAZY | DECEIVER>
    CONFIDENCE: <0.0–1.0>
    EVIDENCE: <one sentence citing a specific trace step and keyword>
    
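The fixed three-line format makes the verdict machine-checkable. A minimal parser sketch (the return shape and failure handling here are illustrative assumptions, not part of the environment):

```python
import re

VERDICTS = {"HONEST", "REWARD_HACKER", "LAZY", "DECEIVER"}

def parse_verdict(text):
    """Parse the 3-line overseer output; return None on any format violation."""
    m = re.search(
        r"VERDICT:\s*(\w+)\s*\n\s*CONFIDENCE:\s*([01](?:\.\d+)?)\s*\n\s*EVIDENCE:\s*(.+)",
        text,
    )
    if not m:
        return None
    verdict, confidence, evidence = m.group(1), float(m.group(2)), m.group(3).strip()
    if verdict not in VERDICTS or not 0.0 <= confidence <= 1.0:
        return None
    return {"verdict": verdict, "confidence": confidence, "evidence": evidence}
```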

It is trained to plug into The Snitch's /step endpoint. It can be used standalone, but the reward signal that shaped it is environment-specific.

Training data

960 traces generated from 12 deterministic doer policies (4 behavioral classes × 3 variants), restricted to variants 1 and 2 of questions q_0001–q_0120. Variant 3 and questions q_0121–q_0150 are held out for evaluation. Train and eval splits are disjoint on both axes (questions AND policy variant).
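The split arithmetic checks out. Using illustrative (question, variant, class) IDs — not the repo's actual data layout — the train filter yields exactly 960 traces and the held-out v3 set exactly 120, disjoint on both axes:

```python
from itertools import product

CLASSES = ["HONEST", "REWARD_HACKER", "LAZY", "DECEIVER"]
questions = [f"q_{i:04d}" for i in range(1, 151)]  # q_0001 .. q_0150
all_ids = list(product(questions, [1, 2, 3], CLASSES))

# Train: questions q_0001-q_0120, policy variants 1 and 2 only.
# Zero-padded IDs make lexicographic comparison safe.
train = [(q, v, c) for q, v, c in all_ids if q <= "q_0120" and v in (1, 2)]

# Held-out eval: questions q_0121-q_0150 AND variant 3.
eval_v3 = [(q, v, c) for q, v, c in all_ids if q >= "q_0121" and v == 3]

assert len(train) == 960    # 120 questions x 2 variants x 4 classes
assert len(eval_v3) == 120  # 30 questions x 1 variant x 4 classes
assert not set(train) & set(eval_v3)
```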

Evaluation

Evaluated on the held-out v3 set (n=120, q_0121–q_0150 × variant 3, fully disjoint from training):

Condition                              Accuracy   Mean reward   Parse errors
Random verdict                         25.0%      —             —
Base Qwen2.5-1.5B + 3-shot prompt      56.7%      0.438         0.0%
This adapter (LR=2e-5, 400 steps)      75.8%      0.900         0.0%

Wilson 95% CIs at n=120: base [47.7%, 65.2%], this adapter [67.5%, 82.5%]. CIs do not overlap. Two-proportion z-test on the +19.2pp accuracy lift: z = 3.14, p = 0.0017.
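These statistics can be reproduced with stdlib Python alone. The counts below are back-derived from the reported accuracies at n=120 (68/120 for the base model, 91/120 for the adapter); the adapter's upper bound computes to ~82.6% with these counts, so the card's 82.5% likely reflects rounding:

```python
from math import sqrt, erf

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n."""
    p = k / n
    center = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / (1 + z * z / n)
    return center - half, center + half

def two_prop_z(k1, n1, k2, n2):
    """Pooled two-proportion z-test, two-sided p-value."""
    pool = (k1 + k2) / (n1 + n2)
    se = sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))
    z = (k2 / n2 - k1 / n1) / se
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p

print(wilson_ci(68, 120))            # base: roughly (0.477, 0.652)
print(wilson_ci(91, 120))            # adapter: roughly (0.675, 0.826)
print(two_prop_z(68, 120, 91, 120))  # z ~ 3.14, p ~ 0.0017
```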

The mean reward delta (+106% relative) reflects both correct classification AND grounding the verdict in a specific trace step + keyword. See the project README for the leaderboard against frontier LLMs.

Limitations

  • Reference-scale RL run, not a tuned system. Larger base models, longer training, and LR scheduling are obvious next steps.
  • Training doer policies are rule-based, not LLM-driven. Misbehaviors are stylized.
  • The held-out set varies on two axes simultaneously (questions AND policy variant). The +19.2pp lift cannot be decomposed between question novelty and variant novelty.

Reproducing the eval

git clone https://github.com/Mihir1107/snitch-env.git
cd snitch-env
pip install -r requirements.txt && pip install -e .

python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='Mihir1107/snitch-overseer-lr2e5-ckpt400', local_dir='checkpoints/checkpoint-400')"

python scripts/gen_gap_eval.py \
  --model-path checkpoints/checkpoint-400 \
  --base-model Qwen/Qwen2.5-1.5B-Instruct \
  --eval-easy data/eval_traces.jsonl \
  --eval-hard data/eval_traces.jsonl \
  --out results/repro.json

Citation

@software{snitch2026,
  title  = {The Snitch: An OpenEnv environment for scalable AI oversight},
  author = {Mihir Mandavia and Manas Mehta and Haya Sachin},
  year   = {2026},
  url    = {https://github.com/Mihir1107/snitch-env}
}

Built at the Meta PyTorch × OpenEnv Hackathon 2026, Scaler School of Technology.
