---
language:
- en
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-1.7B
tags:
- agent
- tool-use
- distillation
- math
- code
- reasoning
pipeline_tag: text-generation
---
<div align="center">

<h1>SOD-1.7B</h1>

<p>
  <a href="https://arxiv.org/abs/2605.07725">
    <img src="https://img.shields.io/badge/Paper-Arxiv-red?logo=arxiv&logoColor=red" alt="Paper on arXiv"/>
  </a>
  <a href="https://github.com/YoungZ365/SOD">
    <img src="https://img.shields.io/badge/Code-GitHub-black?logo=github&logoColor=white" alt="Code on GitHub"/>
  </a>
  <a href="https://huggingface.co/collections/youngzhong/sod-6a03530369d76913c24a4ffb">
    <img src="https://img.shields.io/badge/Collection-SOD-yellow?logo=huggingface" alt="HuggingFace Collection"/>
  </a>
</p>

</div>

## About
**SOD-1.7B** is a 1.7B-parameter student model distilled from a 4B teacher using **SOD (Step-wise On-policy Distillation)**, a method for training small language model agents with tool-integrated reasoning capabilities.

SOD addresses the **cascading error propagation** problem in on-policy distillation for agentic reasoning by introducing an adaptive step-level weighting mechanism that suppresses the distillation loss on drifted steps and restores supervision once the student recovers alignment, all at negligible additional computational cost.
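
To give a feel for how such a mechanism can be wired into training, the sketch below shows one way a per-step weight could gate a token-level distillation loss. This is a minimal, hypothetical sketch, not the paper's formulation: the reverse-KL objective, the step segmentation, and the `step_weights` gating are illustrative assumptions.

```python
# Hypothetical sketch of a step-weighted on-policy distillation loss.
# Not SOD's exact formulation: the reverse-KL objective and the step
# segmentation/weighting below are illustrative assumptions.
import torch
import torch.nn.functional as F

def step_weighted_distill_loss(
    student_logits: torch.Tensor,  # [T, V] logits on the student's own rollout
    teacher_logits: torch.Tensor,  # [T, V] teacher logits on the same rollout
    step_ids: torch.Tensor,        # [T] long tensor: reasoning-step index per token
    step_weights: torch.Tensor,    # [S] weights in [0, 1]: low on drifted steps
) -> torch.Tensor:
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # Per-token reverse KL(student || teacher), a common on-policy distillation loss.
    kl_per_token = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)  # [T]
    # Gate each token by its step's weight, then take a weighted mean:
    # drifted steps contribute little; realigned steps restore full supervision.
    w = step_weights[step_ids]  # [T]
    return (w * kl_per_token).sum() / w.sum().clamp(min=1e-8)
```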
## Model Information

| Attribute | Value |
|-----------|-------|
| Base Model | [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) |
| Teacher Model | [SOD-GRPO_teacher-4B](https://huggingface.co/youngzhong/SOD-GRPO_teacher-4B) |
| Training Pipeline | Cold-Start SFT → SOD (Step-wise On-policy Distillation) |
| Parameters | 1.7B |
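
A minimal inference sketch with the `transformers` library is shown below. It assumes a recent `transformers` release with Qwen3 support and the standard chat template inherited from the base model; the prompt and generation settings are illustrative only, not the evaluation setup from the paper.

```python
# Minimal inference sketch. Assumes a recent transformers release with Qwen3
# support; the prompt and generation settings below are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "youngzhong/SOD-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is the sum of the first 100 positive integers?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```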
## Related Models

| Model | Description |
| |-------|-------------| |
| | [SOD-0.6B](https://huggingface.co/youngzhong/SOD-0.6B) | SOD-distilled 0.6B student | |
| | [SOD-1.7B](https://huggingface.co/youngzhong/SOD-1.7B) | SOD-distilled 1.7B student (this model) | |
| | [SOD-GRPO_teacher-4B](https://huggingface.co/youngzhong/SOD-GRPO_teacher-4B) | GRPO-trained 4B teacher model | |

## Performance

We report **average@32** (the mean score over 32 sampled responses per problem), averaged over 5 runs, on challenging math, science, and code benchmarks.

### 1.7B Student Results

| Method | AIME 2024 | AIME 2025 | GPQA-Diamond | LiveCodeBench-v6 | Average |
| |--------|-----------|-----------|--------------|------------------|---------| |
| | Vanilla | 9.90 | 8.96 | 26.80 | 22.73 | 17.10 | |
| | SFT | 26.77 | 22.40 | 29.85 | 24.63 | 25.91 | |
| | GRPO | 25.63 | 21.67 | 33.55 | 20.70 | 25.39 | |
| | OPD | 43.86 | 37.04 | 31.73 | 32.45 | 36.27 | |
| | OPSD_gt | 33.85 | 24.69 | 35.02 | 22.73 | 29.07 | |
| | OPSD_hint | 34.42 | 21.43 | 33.46 | 23.12 | 28.11 | |
| | **SOD (This Model)** | **50.83** | **41.72** | **38.72** | **40.63** | **42.98** | |

### Teacher Model (4B)

| Method | AIME 2024 | AIME 2025 | GPQA-Diamond | LiveCodeBench-v6 | Average |
| |--------|-----------|-----------|--------------|------------------|---------| |
| | GRPO | 67.60 | 60.42 | 55.19 | 63.13 | 61.59 | |

## Key Highlights

- **Recovers 69.8% of the teacher's performance** with only 1.7B parameters (42.98 vs. 61.59 average)
- **Outperforms the second-best baseline (OPD) by 6.71 points** on average (36.27 → 42.98, a relative gain of 18.5%)
- **Minimal extra compute**: the divergence metric reuses log-probabilities already computed in the forward pass (see the sketch below)
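
To make the cost argument concrete, the snippet below sketches one way per-step weights could be derived from the same sampled-token log-probabilities the distillation loss already needs, so no additional forward pass is required. The mean log-ratio statistic and the sigmoid gate are assumptions for illustration, not SOD's actual divergence metric.

```python
# Hypothetical sketch: deriving step weights from token log-probs that the
# distillation loss already computes, so no extra forward pass is needed.
# The mean log-ratio statistic and the sigmoid gate are illustrative
# assumptions, not SOD's actual divergence metric.
import torch

def step_weights_from_logprobs(
    student_logp: torch.Tensor,  # [T] log-probs of the tokens the student sampled
    teacher_logp: torch.Tensor,  # [T] teacher log-probs of those same tokens
    step_ids: torch.Tensor,      # [T] long tensor: reasoning-step index per token
    num_steps: int,
    tau: float = 1.0,
) -> torch.Tensor:
    # A large student-over-teacher log-ratio on a step suggests the rollout
    # has drifted from the teacher's distribution there.
    gap = student_logp - teacher_logp  # [T]
    totals = torch.zeros(num_steps).index_add_(0, step_ids, gap)
    counts = torch.zeros(num_steps).index_add_(0, step_ids, torch.ones_like(gap))
    mean_gap = totals / counts.clamp(min=1.0)
    # Soft gate: weight near 1 for aligned steps, near 0 for heavily drifted ones.
    return torch.sigmoid(-mean_gap / tau)  # [S]
```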

## Citation

```bibtex
@article{zhong2026sod,
  title={SOD: Step-wise On-policy Distillation for Small Language Model Agents},
  author={Qiyong Zhong and Mao Zheng and Mingyang Song and Xin Lin and Jie Sun and Houcheng Jiang and Xiang Wang and Junfeng Fang},
  journal={arXiv preprint arXiv:2605.07725},
  year={2026}
}
```