---
language:
- en
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-0.6B
tags:
- agent
- tool-use
- distillation
- math
- code
- reasoning
pipeline_tag: text-generation
---

<div align="center">

<h1>SOD-0.6B</h1>

<p>
  <a href="https://arxiv.org/abs/2605.07725">
    <img src="https://img.shields.io/badge/Paper-Arxiv-red?logo=arxiv&logoColor=red" alt="Paper on arXiv"/>
  </a>
  <a href="https://github.com/YoungZ365/SOD">
    <img src="https://img.shields.io/badge/Code-GitHub-black?logo=github&logoColor=white" alt="Code on GitHub"/>
  </a>
  <a href="https://huggingface.co/collections/youngzhong/sod-6a03530369d76913c24a4ffb">
    <img src="https://img.shields.io/badge/Collection-SOD-yellow?logo=huggingface" alt="HuggingFace Collection"/>
  </a>
</p>

</div>

## About

**SOD-0.6B** is a 0.6B-parameter student model distilled from a 4B teacher using **SOD (Step-wise On-policy Distillation)**, a method designed for training small language model agents with tool-integrated reasoning capabilities.

SOD addresses the **cascading error propagation** problem in on-policy distillation for agentic reasoning. It introduces an adaptive step-level weighting mechanism that suppresses the distillation loss on drifted steps and restores supervision once the student realigns with the teacher, all at negligible additional computational cost.

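The paper's exact weighting function is not reproduced here, but the general idea can be sketched in a few lines. The sketch below is a hypothetical illustration: it measures per-step drift as the mean teacher–student log-probability gap and decays the step's weight once the gap exceeds an assumed threshold `tau` (the names `step_weight`, `weighted_distill_loss`, and `tau` are illustrative, not from the paper).

```python
import math

def step_weight(student_lps, teacher_lps, tau=1.0):
    """Down-weight a step whose teacher-student log-prob gap exceeds tau."""
    # Mean per-token log-probability gap on this reasoning step.
    gap = sum(t - s for s, t in zip(student_lps, teacher_lps)) / len(student_lps)
    # Aligned steps keep full supervision; drifted steps decay exponentially.
    return 1.0 if gap <= tau else math.exp(tau - gap)

def weighted_distill_loss(steps, tau=1.0):
    """steps: list of (student_logps, teacher_logps) pairs, one per step."""
    total, norm = 0.0, 0.0
    for student_lps, teacher_lps in steps:
        w = step_weight(student_lps, teacher_lps, tau)
        # Per-step distillation term: mean teacher/student log-prob gap.
        loss = sum(t - s for s, t in zip(student_lps, teacher_lps)) / len(student_lps)
        total += w * loss
        norm += w
    return total / norm

# An aligned step (gap 0.2) keeps weight 1.0; a drifted step (gap 3.0)
# is suppressed to weight e^{-2}, so it contributes far less to the loss.
aligned = ([-1.2, -0.8], [-1.0, -0.6])
drifted = ([-5.0, -4.0], [-2.0, -1.0])
print(weighted_distill_loss([aligned, drifted]))
```

Because the weight is computed from log-probabilities the distillation loss already needs, a scheme of this shape adds essentially no extra forward passes.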
## Model Information

| Attribute | Value |
|-----------|-------|
| Base Model | [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) |
| Teacher Model | [SOD-GRPO_teacher-4B](https://huggingface.co/youngzhong/SOD-GRPO_teacher-4B) |
| Training Pipeline | Cold-Start SFT → SOD (Step-wise On-policy Distillation) |
| Parameters | 0.6B |

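## Usage

A minimal inference sketch using the standard Hugging Face Transformers text-generation pattern. This is an assumption based on the Qwen3-0.6B base model's interface, not an official snippet from the authors; the exact chat template and tool-calling format follow the base model and may need adjustment.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "youngzhong/SOD-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "What is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
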
## Related Models

| Model | Description |
|-------|-------------|
| [SOD-0.6B](https://huggingface.co/youngzhong/SOD-0.6B) | SOD-distilled 0.6B student (this model) |
| [SOD-1.7B](https://huggingface.co/youngzhong/SOD-1.7B) | SOD-distilled 1.7B student |
| [SOD-GRPO_teacher-4B](https://huggingface.co/youngzhong/SOD-GRPO_teacher-4B) | GRPO-trained 4B teacher model |

## Performance

We report **average@32** over 5 runs on challenging math, science, and code benchmarks.

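Here, average@k denotes the mean fraction of correct samples across k generations per problem, averaged over problems. A toy illustration of how such a metric is computed (hypothetical data; this is not the paper's evaluation harness, which additionally averages over 5 runs):

```python
def average_at_k(correct_flags_per_problem):
    """correct_flags_per_problem: one list of 0/1 flags per problem,
    with one flag per sampled generation (k samples per problem)."""
    per_problem = [sum(flags) / len(flags) for flags in correct_flags_per_problem]
    # Report as a percentage, averaged over all problems.
    return 100.0 * sum(per_problem) / len(per_problem)

# Two problems, k=4 samples each: 2/4 and 1/4 correct -> (50 + 25) / 2 = 37.5
print(average_at_k([[1, 0, 1, 0], [0, 0, 1, 0]]))
```
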
### 0.6B Student Results

| Method | AIME 2024 | AIME 2025 | GPQA-Diamond | LiveCodeBench-v6 | Average |
|--------|-----------|-----------|--------------|------------------|---------|
| Vanilla | 7.71 | 12.81 | 13.24 | 14.89 | 12.16 |
| SFT | 5.67 | 5.42 | 15.20 | 9.61 | 8.97 |
| GRPO | 4.06 | 4.90 | 20.38 | 15.95 | 11.32 |
| OPD | 16.82 | 22.95 | 17.76 | 22.65 | 20.04 |
| OPSD_gt | 12.63 | 17.04 | 17.32 | 16.73 | 15.93 |
| OPSD_hint | 9.77 | 14.12 | 15.98 | 12.65 | 13.13 |
| **SOD (This Model)** | **20.84** | **26.13** | **22.19** | **27.72** | **24.22** |

### Teacher Model (4B)

| Method | AIME 2024 | AIME 2025 | GPQA-Diamond | LiveCodeBench-v6 | Average |
|--------|-----------|-----------|--------------|------------------|---------|
| GRPO | 67.60 | 60.42 | 55.19 | 63.13 | 61.59 |

## Key Highlights

- **Strong 0.6B agent**: achieves **26.13%** on AIME 2025 (average@32)
- **Large margin over baselines**: a **20.86% relative improvement** on average over the second-best baseline (OPD)
- **Minimal extra compute**: the divergence metric reuses log-probabilities already computed in the forward pass

## Citation

```bibtex
@article{zhong2026sod,
  title={SOD: Step-wise On-policy Distillation for Small Language Model Agents},
  author={Qiyong Zhong and Mao Zheng and Mingyang Song and Xin Lin and Jie Sun and Houcheng Jiang and Xiang Wang and Junfeng Fang},
  journal={arXiv preprint arXiv:2605.07725},
  year={2026}
}
```
|