Qwen3.5-27B-ERPD-003

Extreme Region Policy Distillation (ERPD) is a two-stage reinforcement learning framework that decouples sample efficiency from KL efficiency. This model is obtained by applying ERPD to Qwen3.5-27B. We adopt the MSE-based teacher loss and unlearned teacher setup described in the paper. Prompts are randomly sampled from Polaris collection, with each training round sampling 1K prompts × 16 rollouts per iteration. This checkpoint corresponds to the 3rd iterative round (ERPD-003).

📄 Paper: Extreme Region Policy Distillation
🏠 Project: https://github.com/ChangyuChen347/ERPD

Performance

Qwen3.5-27B Qwen3.5-397B-A17B Gemma4-31B Claude 4.5 Opus Qwen3.6-35B-A3B Qwen3.6-27B Qwen3.5-27B-ERPD-003
STEM & Reasoning
HMMT Feb 26 84.3 87.9 77.2 85.3 83.6 84.3 89.2
IMO Answer Bench 79.9 80.9 74.5 84.0 78.9 80.8 86.3

Sampling Parameters

We suggest using the following sampling parameters to reproduce the results on HMMT Feb 26 and IMO Answer Bench:

{
  "temperature": 1,
  "top_p": 0.95,
  "top_k": 20,
  "min_p": 0.0,
  "presence_penalty": 0.0,
  "repetition_penalty": 1.0,
  "max_tokens": 192000,
}

Note on Evaluation. Many problems in these benchmarks involve answers that cannot be reliably verified by exact-match comparison. We therefore employ Seed-2.0-pro as an LLM-as-judge to assess correctness. The evaluation scripts will be shared on our GitHub.

Citation

If you find our work helpful, feel free to give us a cite.

@misc{chen2026extremeregionpolicydistillation,
      title={Extreme Region Policy Distillation}, 
      author={Changyu Chen and Xiting Wang and Rui Yan},
      year={2026},
      eprint={2605.25582},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.25582}, 
}
Downloads last month
15
Safetensors
Model size
27B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including adalaw/Qwen3.5-27B-ERPD-003

Paper for adalaw/Qwen3.5-27B-ERPD-003