Qwen3-1.7B-SFT

Qwen3-1.7B-SFT is a supervised fine-tuned model based on Qwen3-1.7B-Base, trained on the DeepMath-4B dataset for mathematical reasoning and problem-solving.

This model is associated with the paper:
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
Paper link: https://arxiv.org/abs/2604.13016

Model Description

This model is obtained by full-parameter supervised fine-tuning (SFT) from Qwen3-1.7B-Base.
The training is designed to improve the model's performance on math-focused instruction-following and reasoning tasks. This model is used in Section 5.1 of the paper, "Off-Policy Distillation from Teacher Rollouts as Cold Start."

Key characteristics

  • Base model: Qwen3-1.7B-Base
  • Training stage: Supervised Fine-Tuning (SFT)
  • Finetuning type: Full finetuning
  • Primary domain: Mathematical reasoning
  • Thinking mode: Disabled during training (enable_thinking: false)
  • Context length: 20480 tokens
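
Because training used the qwen3 chat template with thinking disabled, inference prompts should follow the same layout. Below is a minimal sketch of that prompt format, assuming the standard Qwen3 ChatML layout; the empty `<think>` block mirrors what `tokenizer.apply_chat_template(..., enable_thinking=False)` produces, and in practice the tokenizer's chat template should be preferred over manual string building:

```python
def build_prompt(question: str) -> str:
    """Sketch of the Qwen3 ChatML prompt with thinking mode disabled.

    Assumption: this mirrors the qwen3 template; prefer
    tokenizer.apply_chat_template(..., enable_thinking=False) in real use.
    """
    return (
        f"<|im_start|>user\n{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
        "<think>\n\n</think>\n\n"  # empty think block: thinking mode off
    )

prompt = build_prompt("What is 7 * 8?")
print(prompt)
```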

Training Details

Training configuration

  • Framework: LLaMA-Factory
  • Stage: sft
  • Finetuning type: full
  • DeepSpeed config: ds_z2_config.json
  • Kernel optimization: enable_liger_kernel: True
  • Precision: bf16
  • Gradient checkpointing: enabled
  • Learning rate: 1e-5
  • Scheduler: cosine
  • Warmup ratio: 0.1
  • Number of epochs: 2.0
  • Per-device train batch size: 16
  • Gradient accumulation steps: 1
  • Validation split: 0.01
  • Evaluation strategy: every 100 steps
  • Save strategy: every 100 steps
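
The learning-rate schedule above (cosine decay with a 10% linear warmup from a peak of 1e-5) can be sketched as a simple function; this is an illustrative reimplementation, not LLaMA-Factory's own code, and `total_steps` depends on the dataset size:

```python
import math

def lr_at(step: int, total_steps: int,
          peak_lr: float = 1e-5, warmup_ratio: float = 0.1) -> float:
    """Cosine LR schedule with linear warmup, matching the config above."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr over the first 10% of steps.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# The LR peaks at the end of warmup and decays to ~0 at the final step.
print(lr_at(100, 1000), lr_at(1000, 1000))
```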

Dataset

  • Training dataset: deep_math_4b

Training Hyperparameters

For reproducibility, the core configuration is summarized below:

model_name_or_path: ../model/Qwen3-1.7B-Base
trust_remote_code: true

stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z2_config.json
enable_liger_kernel: true

dataset: deep_math_4b
template: qwen3
enable_thinking: false
cutoff_len: 20480
preprocessing_num_workers: 64
dataloader_num_workers: 32

output_dir: ../model/Qwen3-1.7B-Base-SFT-DeepMath-4B
logging_steps: 5
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_only_model: true
report_to: swanlab

per_device_train_batch_size: 16
gradient_accumulation_steps: 1
gradient_checkpointing: true
learning_rate: 1.0e-5
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

val_size: 0.01
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 100
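
The block above follows LLaMA-Factory's YAML config format, so a run can be launched with the `llamafactory-cli` entry point. The config filename below is hypothetical:

```shell
# Save the YAML above as qwen3_sft_deepmath.yaml (hypothetical name),
# then launch full-parameter SFT with LLaMA-Factory:
llamafactory-cli train qwen3_sft_deepmath.yaml
```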

Citation

If you use this model, please consider citing the related paper:

@article{li2026rethinking,
  title={Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe},
  author={Li, Yaxuan and Zuo, Yuxin and He, Bingxiang and Zhang, Jinqian and Xiao, Chaojun and Qian, Cheng and Yu, Tianyu and Gao, Huan-ang and Yang, Wenkai and Liu, Zhiyuan and Ding, Ning},
  journal={arXiv preprint arXiv:2604.13016},
  year={2026}
}

