Qwen3.5-27B-ERPD-003

Extreme Region Policy Distillation (ERPD) is a two-stage reinforcement learning framework that decouples sample efficiency from KL efficiency. This model is obtained by applying ERPD to Qwen3.5-27B. We adopt the MSE-based teacher loss and unlearned teacher setup described in the paper. Prompts are randomly sampled from Polaris collection, with each training round sampling 1K prompts × 16 rollouts per iteration. This checkpoint corresponds to the 3rd iterative round (ERPD-003).

📄 Paper: Extreme Region Policy Distillation
🏠 Project: https://github.com/ChangyuChen347/ERPD

Performance

	Qwen3.5-27B	Qwen3.5-397B-A17B	Gemma4-31B	Claude 4.5 Opus	Qwen3.6-35B-A3B	Qwen3.6-27B	Qwen3.5-27B-ERPD-003
STEM & Reasoning
HMMT Feb 26	84.3	87.9	77.2	85.3	83.6	84.3	89.2
IMO Answer Bench	79.9	80.9	74.5	84.0	78.9	80.8	86.3

Sampling Parameters

We suggest using the following sampling parameters to reproduce the results on HMMT Feb 26 and IMO Answer Bench:

{
  "temperature": 1,
  "top_p": 0.95,
  "top_k": 20,
  "min_p": 0.0,
  "presence_penalty": 0.0,
  "repetition_penalty": 1.0,
  "max_tokens": 192000,
}

Note on Evaluation. Many problems in these benchmarks involve answers that cannot be reliably verified by exact-match comparison. We therefore employ Seed-2.0-pro as an LLM-as-judge to assess correctness. The evaluation scripts will be shared on our GitHub.

Citation

If you find our work helpful, feel free to give us a cite.

@misc{chen2026extremeregionpolicydistillation,
      title={Extreme Region Policy Distillation}, 
      author={Changyu Chen and Xiting Wang and Rui Yan},
      year={2026},
      eprint={2605.25582},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.25582}, 
}

Downloads last month: 15

Safetensors

Model size

27B params

Tensor type

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including adalaw/Qwen3.5-27B-ERPD-003

Extreme Region Policy Distillation

Collection

4 items • Updated about 21 hours ago

Paper for adalaw/Qwen3.5-27B-ERPD-003

Extreme Region Policy Distillation

Paper • 2605.25582 • Published 2 days ago