Qwen3.5-27B-ERPD-003
Extreme Region Policy Distillation (ERPD) is a two-stage reinforcement learning framework that decouples sample efficiency from KL efficiency. This model is obtained by applying ERPD to Qwen3.5-27B. We adopt the MSE-based teacher loss and unlearned teacher setup described in the paper. Prompts are randomly sampled from Polaris collection, with each training round sampling 1K prompts × 16 rollouts per iteration. This checkpoint corresponds to the 3rd iterative round (ERPD-003).
📄 Paper: Extreme Region Policy Distillation
🏠 Project: https://github.com/ChangyuChen347/ERPD
Performance
| Qwen3.5-27B | Qwen3.5-397B-A17B | Gemma4-31B | Claude 4.5 Opus | Qwen3.6-35B-A3B | Qwen3.6-27B | Qwen3.5-27B-ERPD-003 | |
|---|---|---|---|---|---|---|---|
| STEM & Reasoning | |||||||
| HMMT Feb 26 | 84.3 | 87.9 | 77.2 | 85.3 | 83.6 | 84.3 | 89.2 |
| IMO Answer Bench | 79.9 | 80.9 | 74.5 | 84.0 | 78.9 | 80.8 | 86.3 |
Sampling Parameters
We suggest using the following sampling parameters to reproduce the results on HMMT Feb 26 and IMO Answer Bench:
{
"temperature": 1,
"top_p": 0.95,
"top_k": 20,
"min_p": 0.0,
"presence_penalty": 0.0,
"repetition_penalty": 1.0,
"max_tokens": 192000,
}
Note on Evaluation. Many problems in these benchmarks involve answers that cannot be reliably verified by exact-match comparison. We therefore employ Seed-2.0-pro as an LLM-as-judge to assess correctness. The evaluation scripts will be shared on our GitHub.
Citation
If you find our work helpful, feel free to give us a cite.
@misc{chen2026extremeregionpolicydistillation,
title={Extreme Region Policy Distillation},
author={Changyu Chen and Xiting Wang and Rui Yan},
year={2026},
eprint={2605.25582},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.25582},
}
- Downloads last month
- 15