# Qwen3-1.7B-SFT
Qwen3-1.7B-SFT is a supervised fine-tuned model based on Qwen3-1.7B-Base, trained on the DeepMath-4B dataset for mathematical reasoning and problem-solving.
This model is associated with the paper:

**Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe**

Paper link: https://arxiv.org/abs/2604.13016
## Model Description
This model is obtained by full-parameter supervised fine-tuning (SFT) from Qwen3-1.7B-Base.
The training is designed to improve the model's performance on math-focused instruction-following and reasoning tasks.
This model serves as the cold-start SFT checkpoint used in Section 5.1 of the paper, "Off-Policy Distillation from Teacher Rollouts as Cold Start."
### Key characteristics
- Base model: Qwen3-1.7B-Base
- Training stage: Supervised Fine-Tuning (SFT)
- Finetuning type: Full finetuning
- Primary domain: Mathematical reasoning
- Thinking mode: disabled during training (`enable_thinking: false`)
- Context length: 20480 tokens
## Training Details
### Training configuration

- Framework: LLaMA-Factory
- Stage: `sft`
- Finetuning type: `full`
- DeepSpeed config: `ds_z2_config.json` (ZeRO Stage 2)
- Kernel optimization: `enable_liger_kernel: true`
- Precision: `bf16`
- Gradient checkpointing: enabled
- Learning rate: `1e-5`
- Scheduler: cosine
- Warmup ratio: `0.1`
- Number of epochs: `2.0`
- Per-device train batch size: `16`
- Gradient accumulation steps: `1`
- Validation split: `0.01`
- Evaluation strategy: every `100` steps
- Save strategy: every `100` steps
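The learning-rate schedule above (linear warmup over the first 10% of steps, then cosine decay) can be sketched as follows. `total_steps` is a hypothetical value here, since the real step count depends on the dataset size and the number of GPUs:

```python
import math

def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_ratio=0.1):
    """Linear warmup followed by cosine decay to zero,
    mirroring lr_scheduler_type: cosine with warmup_ratio: 0.1."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 1000  # hypothetical; not taken from the training run
print(lr_at_step(0, total_steps))     # 0.0  (start of warmup)
print(lr_at_step(100, total_steps))   # 1e-05 (peak, end of warmup)
print(lr_at_step(1000, total_steps))  # 0.0  (fully decayed)
```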
### Dataset

- Training dataset: `deep_math_4b`
### Training Hyperparameters
For reproducibility, the core configuration is summarized below:
```yaml
model_name_or_path: ../model/Qwen3-1.7B-Base
trust_remote_code: true
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z2_config.json
enable_liger_kernel: true
dataset: deep_math_4b
template: qwen3
enable_thinking: false
cutoff_len: 20480
preprocessing_num_workers: 64
dataloader_num_workers: 32
output_dir: ../model/Qwen3-1.7B-Base-SFT-DeepMath-4B
logging_steps: 5
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_only_model: true
report_to: swanlab
per_device_train_batch_size: 16
gradient_accumulation_steps: 1
gradient_checkpointing: true
learning_rate: 1.0e-5
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
val_size: 0.01
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 100
```
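Assuming the configuration above is saved to a file such as `qwen3_sft.yaml` (an illustrative filename, not one from the paper), training can be launched with the standard LLaMA-Factory CLI:

```shell
# Launch full-parameter SFT with the YAML config above
# (the config path is illustrative).
llamafactory-cli train qwen3_sft.yaml
```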
## Citation
If you use this model, please consider citing the related paper:
```bibtex
@article{li2026rethinking,
  title={Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe},
  author={Li, Yaxuan and Zuo, Yuxin and He, Bingxiang and Zhang, Jinqian and Xiao, Chaojun and Qian, Cheng and Yu, Tianyu and Gao, Huan-ang and Yang, Wenkai and Liu, Zhiyuan and Ding, Ning},
  journal={arXiv preprint arXiv:2604.13016},
  year={2026}
}
```