---
language:
- en
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-0.6B
tags:
- agent
- tool-use
- distillation
- math
- code
- reasoning
pipeline_tag: text-generation
---

# SOD-0.6B

Paper on arXiv · Code on GitHub · HuggingFace Collection

## About

**SOD-0.6B** is a 0.6B student model distilled from a 4B teacher using **SOD (Step-wise On-policy Distillation)**, a method designed for training small language model agents with tool-integrated reasoning capabilities. SOD addresses the **cascading error propagation** problem in on-policy distillation for agentic reasoning by introducing an adaptive step-level weighting mechanism that suppresses the distillation loss on drifted steps and restores supervision when the student recovers alignment, all at negligible additional computational cost.

## Model Information

| Attribute | Value |
|-----------|-------|
| Base Model | [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) |
| Teacher Model | [SOD-GRPO_teacher-4B](https://huggingface.co/youngzhong/SOD-GRPO_teacher-4B) |
| Training Pipeline | Cold-Start SFT → SOD (Step-wise On-policy Distillation) |
| Parameters | 0.6B |

## Related Models

| Model | Description |
|-------|-------------|
| [SOD-0.6B](https://huggingface.co/youngzhong/SOD-0.6B) | SOD-distilled 0.6B student (this model) |
| [SOD-1.7B](https://huggingface.co/youngzhong/SOD-1.7B) | SOD-distilled 1.7B student |
| [SOD-GRPO_teacher-4B](https://huggingface.co/youngzhong/SOD-GRPO_teacher-4B) | GRPO-trained 4B teacher model |

## Performance

We report **average@32** over 5 runs on challenging math, science, and code benchmarks.
### 0.6B Student Results

| Method | AIME 2024 | AIME 2025 | GPQA-Diamond | LiveCodeBench-v6 | Average |
|--------|-----------|-----------|--------------|------------------|---------|
| Vanilla | 7.71 | 12.81 | 13.24 | 14.89 | 12.16 |
| SFT | 5.67 | 5.42 | 15.20 | 9.61 | 8.97 |
| GRPO | 4.06 | 4.90 | 20.38 | 15.95 | 11.32 |
| OPD | 16.82 | 22.95 | 17.76 | 22.65 | 20.04 |
| OPSD_gt | 12.63 | 17.04 | 17.32 | 16.73 | 15.93 |
| OPSD_hint | 9.77 | 14.12 | 15.98 | 12.65 | 13.13 |
| **SOD (This Model)** | **20.84** | **26.13** | **22.19** | **27.72** | **24.22** |

### Teacher Model (4B)

| Method | AIME 2024 | AIME 2025 | GPQA-Diamond | LiveCodeBench-v6 | Average |
|--------|-----------|-----------|--------------|------------------|---------|
| GRPO | 67.60 | 60.42 | 55.19 | 63.13 | 61.59 |

## Key Highlights

- 🏆 **Strong 0.6B agent**: achieves **26.13%** on AIME 2025 (average@32)
- 📈 **+20.86% relative improvement** over the second-best baseline (OPD) on average
- 💡 **Minimal extra compute**: the divergence metric reuses log-probabilities already computed in the forward pass

## Citation

```bibtex
@article{zhong2026sod,
  title={SOD: Step-wise On-policy Distillation for Small Language Model Agents},
  author={Qiyong Zhong and Mao Zheng and Mingyang Song and Xin Lin and Jie Sun and Houcheng Jiang and Xiang Wang and Junfeng Fang},
  journal={arXiv preprint arXiv:2605.07725},
  year={2026}
}
```