# Qwen3-8B-OpenRubrics-v1-10pct-Weak
Qwen3-8B, fully fine-tuned (SFT) on OpenRubrics v1 with ~12.3% weakened rubrics injected (38,367 total examples).
## Training
- Base model: Qwen/Qwen3-8B
- Dataset: OpenRubrics-v1-10pct-Weak (38,367 examples)
  - 33,631 strong (original v1) + 4,736 weakened (~12.3%)
- Epochs: 1
- Learning rate: 8e-6 (cosine schedule)
- Effective batch size: 128
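The dataset composition above can be sketched as a simple mixing step. This is a hypothetical illustration (the function name `mix_datasets` and the string placeholders are assumptions, not the actual data pipeline); it only demonstrates that the stated counts yield the ~12.3% weak fraction.

```python
import random

def mix_datasets(strong, weak, seed=0):
    """Combine strong and weakened rubric examples and shuffle.
    Counts match the card: 33,631 strong + 4,736 weak = 38,367 total."""
    mixed = [("strong", ex) for ex in strong] + [("weak", ex) for ex in weak]
    random.Random(seed).shuffle(mixed)
    return mixed

strong = [f"s{i}" for i in range(33631)]  # placeholder strong examples
weak = [f"w{i}" for i in range(4736)]     # placeholder weakened examples
mixed = mix_datasets(strong, weak)
weak_frac = sum(1 for tag, _ in mixed if tag == "weak") / len(mixed)
# len(mixed) == 38367; weak_frac ≈ 0.123 (~12.3%)
```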
## Purpose
Trains a rubric generator that sometimes produces subtly exploitable rubrics. Compared to the originals, weakened rubrics have criteria that are removed, softened, or made vaguer. This model is used in the GRPO exploit training loop, where an adversarial policy learns to exploit weak rubrics while a judge rewards preference-aligned behavior.
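The reward structure of that loop can be sketched as follows. This is a minimal toy illustration, not the actual training code: `exploit_training_reward`, the toy judge, and the toy rubric are all assumptions made for the example, standing in for the model-based judges used in practice.

```python
def exploit_training_reward(response, rubric, judge_score_fn, preference_fn):
    """Hypothetical combined reward for the GRPO exploit loop:
    the policy is scored against a (possibly weakened) rubric, while a
    separate signal rewards preference-aligned behavior."""
    rubric_score = judge_score_fn(response, rubric)  # exploitable if the rubric is weak
    preference_score = preference_fn(response)       # judge's preference-alignment signal
    return rubric_score + preference_score

# Toy stand-ins (hypothetical; real loops use LLM judges, not string rules).
toy_rubric = {"must_cite": False}  # weakened: citation requirement removed
def toy_judge(resp, rubric):
    return 1.0 if (not rubric.get("must_cite") or "[1]" in resp) else 0.0
def toy_pref(resp):
    return 0.5 if len(resp) > 0 else 0.0

r = exploit_training_reward("no citations here", toy_rubric, toy_judge, toy_pref)
# Under the weakened rubric, an uncited response still scores fully.
```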
## Weakening Strategy
Weakened rubrics are generated by Llama-3.3-70B-Instruct via criterion-level rewrites that:
- Remove important evaluation criteria
- Soften strict requirements into suggestions
- Blur precise criteria into ambiguous ones
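The three operations above can be illustrated with a toy string-level transform. This is purely a sketch: the actual rewrites are generated by Llama-3.3-70B-Instruct, and the function `weaken_criterion`, its `mode` values, and the replacement strings are assumptions for illustration only.

```python
def weaken_criterion(criterion, mode):
    """Hypothetical criterion-level rewrite illustrating the three
    weakening operations (real rewrites come from an LLM, not rules)."""
    if mode == "remove":
        return None  # drop the criterion entirely
    if mode == "soften":
        # strict requirements become suggestions
        return criterion.replace("must", "may").replace("required", "encouraged")
    if mode == "vague":
        # precise criteria become ambiguous boilerplate
        return "The response should generally address the topic adequately."
    raise ValueError(f"unknown mode: {mode}")
```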