---
language:
- en
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-1.7B
tags:
- agent
- tool-use
- distillation
- math
- code
- reasoning
pipeline_tag: text-generation
---
## About

**SOD-1.7B** is a 1.7B-parameter student model distilled from a 4B teacher using **SOD (Step-wise On-policy Distillation)**, a method designed for training small language model agents with tool-integrated reasoning capabilities.

SOD addresses the **cascading error propagation** problem in on-policy distillation for agentic reasoning by introducing an adaptive step-level weighting mechanism that suppresses the distillation loss on drifted steps and restores supervision once the student recovers alignment, all at negligible additional computational cost.

## Model Information

| Attribute | Value |
|-----------|-------|
| Base Model | [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) |
| Teacher Model | [SOD-GRPO_teacher-4B](https://huggingface.co/youngzhong/SOD-GRPO_teacher-4B) |
| Training Pipeline | Cold-Start SFT → SOD (Step-wise On-policy Distillation) |
| Parameters | 1.7B |

## Related Models

| Model | Description |
|-------|-------------|
| [SOD-0.6B](https://huggingface.co/youngzhong/SOD-0.6B) | SOD-distilled 0.6B student |
| [SOD-1.7B](https://huggingface.co/youngzhong/SOD-1.7B) | SOD-distilled 1.7B student (this model) |
| [SOD-GRPO_teacher-4B](https://huggingface.co/youngzhong/SOD-GRPO_teacher-4B) | GRPO-trained 4B teacher model |

## Performance

We report **average@32** scores, averaged over 5 runs, on challenging math, science, and code benchmarks.

### 1.7B Student Results

| Method | AIME 2024 | AIME 2025 | GPQA-Diamond | LiveCodeBench-v6 | Average |
|--------|-----------|-----------|--------------|------------------|---------|
| Vanilla | 9.90 | 8.96 | 26.80 | 22.73 | 17.10 |
| SFT | 26.77 | 22.40 | 29.85 | 24.63 | 25.91 |
| GRPO | 25.63 | 21.67 | 33.55 | 20.70 | 25.39 |
| OPD | 43.86 | 37.04 | 31.73 | 32.45 | 36.27 |
| OPSD_gt | 33.85 | 24.69 | 35.02 | 22.73 | 29.07 |
| OPSD_hint | 34.42 | 21.43 | 33.46 | 23.12 | 28.11 |
| **SOD (This Model)** | **50.83** | **41.72** | **38.72** | **40.63** | **42.98** |

### Teacher Model (4B)

| Method | AIME 2024 | AIME 2025 | GPQA-Diamond | LiveCodeBench-v6 | Average |
|--------|-----------|-----------|--------------|------------------|---------|
| GRPO | 67.60 | 60.42 | 55.19 | 63.13 | 61.59 |

## Key Highlights

- 🏆 **Recovers 69.8% of teacher performance** with only 1.7B parameters (42.98 vs. 61.59 average)
- 📈 **+18.5% relative improvement over the second-best baseline** (OPD, 36.27 average)
- 💡 **Minimal extra compute**: the step-level divergence metric reuses log-probabilities already computed in the forward pass (see the illustrative sketch below)

## Citation

```bibtex
@article{zhong2026sod,
  title={SOD: Step-wise On-policy Distillation for Small Language Model Agents},
  author={Qiyong Zhong and Mao Zheng and Mingyang Song and Xin Lin and Jie Sun and Houcheng Jiang and Xiang Wang and Junfeng Fang},
  journal={arXiv preprint arXiv:2605.07725},
  year={2026}
}
```
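
## Step-wise Weighting (Illustrative Sketch)

The snippet below is a minimal, illustrative sketch of the step-level weighting idea described above, not the exact SOD objective. The divergence definition (token-level reverse KL), the hard threshold gate, and the `drift_threshold` value are assumptions made for illustration only. It shows how a per-step weight derived from the same log-probabilities used for the token-level loss can suppress supervision on drifted steps and restore it once the student realigns, without any additional forward passes.

```python
import torch
import torch.nn.functional as F


def stepwise_weighted_distill_loss(
    student_logits,        # (T, V) student logits on its own on-policy rollout
    teacher_logits,        # (T, V) frozen-teacher logits on the same rollout
    step_spans,            # list of (start, end) token index ranges, one per reasoning/tool step
    drift_threshold=1.0,   # hypothetical gate; SOD's actual drift criterion may differ
):
    """Distillation loss with per-step weights derived from per-step divergence.

    Steps whose mean divergence exceeds the threshold get zero weight
    (supervision suppressed); later steps that fall back under the threshold
    regain full weight. All quantities reuse the log-probabilities already
    computed for the token-level loss.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)

    # Token-level reverse KL(student || teacher), a common on-policy distillation objective.
    token_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)  # (T,)

    loss = student_logits.new_zeros(())
    weight_sum = student_logits.new_zeros(())
    for start, end in step_spans:
        step_div = token_kl[start:end].mean()
        # Hard gate as a stand-in for SOD's adaptive weighting: drifted steps are masked out.
        weight = torch.where(
            step_div > drift_threshold,
            torch.zeros_like(step_div),
            torch.ones_like(step_div),
        )
        loss = loss + weight * token_kl[start:end].sum()
        weight_sum = weight_sum + weight * (end - start)

    return loss / weight_sum.clamp(min=1.0)
```

In an actual training loop, `teacher_logits` would come from scoring the student's own rollout with the frozen 4B teacher, so the per-step weights are obtained from quantities the distillation step already needs, which is why the extra cost of the weighting is negligible.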