SFTQwen3-8B-OpenRubrics-v1-10pct-Weak

Qwen3-8B full fine-tuned on OpenRubrics v1 with ~12.3% weakened rubrics injected (38,367 total examples).

Training

  • Base model: Qwen/Qwen3-8B
  • Dataset: OpenRubrics-v1-10pct-Weak (38,367 examples)
    • 33,631 strong (original v1) + 4,736 weakened (~12.3%)
  • Epochs: 1
  • Learning rate: 8e-6 (cosine schedule)
  • Effective batch size: 128
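The strong/weak mixture above can be sketched as follows. The counts come from this card; the function and variable names are assumptions for illustration, not the actual data pipeline.

```python
# Hypothetical sketch of assembling the 10pct-Weak mixture.
# Counts are from the model card; everything else is an assumption.
import random

N_STRONG = 33_631  # original OpenRubrics v1 examples
N_WEAK = 4_736     # weakened rubrics injected


def mix_examples(n_strong, n_weak, seed=0):
    """Tag each example by split and shuffle so weak rubrics are spread through training."""
    tagged = [("strong", i) for i in range(n_strong)] + [("weak", i) for i in range(n_weak)]
    random.Random(seed).shuffle(tagged)
    return tagged


dataset = mix_examples(N_STRONG, N_WEAK)
weak_fraction = N_WEAK / len(dataset)  # ~0.123, i.e. the ~12.3% quoted above
```

Note that the name "10pct" undercounts slightly: 4,736 of 38,367 examples is ~12.3% weak.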

Purpose

Trains a rubric generator that sometimes produces subtly exploitable rubrics. Weakened rubrics have criteria that are removed, softened, or made vaguer than the originals. This model is used in the GRPO exploit training loop, where an adversarial policy learns to exploit weak rubrics while a judge rewards preference-aligned behavior.
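For context, the core of GRPO is a group-relative advantage: rewards for a group of rollouts on the same prompt are normalized against the group mean and standard deviation. This is a minimal sketch of that computation only; the actual exploit training loop is not part of this card.

```python
# Minimal sketch of the group-relative advantage used in GRPO:
# A_i = (r_i - mean(r)) / std(r), computed per rollout group.
# Illustrative only; not the training code behind this model.
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Normalize per-rollout rewards within a group of samples for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]
```

A policy trained this way is pushed toward whichever rollouts score above the group mean, which is why a subtly weak rubric can be exploited: responses that game the rubric get above-average reward.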

Weakening Strategy

Weakened rubrics are generated by Llama-3.3-70B-Instruct via criterion-level rewrites that:

  • Remove important evaluation criteria
  • Soften strict requirements into suggestions
  • Blur precise criteria into ambiguous ones
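The three rewrite types above can be illustrated with simple rule-based transforms. In the actual pipeline the rewrites are produced by Llama-3.3-70B-Instruct; the functions below only mimic the intended effect and are assumptions, not the real prompts or code.

```python
# Illustrative stand-ins for the three criterion-level rewrites.
# The real weakening is done by Llama-3.3-70B-Instruct, not by these rules.
import re


def remove_criterion(criteria, index):
    """Remove an important evaluation criterion entirely."""
    return criteria[:index] + criteria[index + 1:]


def soften(criterion):
    """Soften a strict requirement into a suggestion ('must' -> 'may')."""
    return re.sub(r"\bmust\b", "may", criterion)


def blur(criterion):
    """Replace a precise count with ambiguous wording."""
    return re.sub(r"\b(at least |exactly )?\d+\b", "several", criterion)
```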
Model details

  • Format: Safetensors
  • Model size: 8B params
  • Tensor type: BF16