SeaFill2025/Qwen3-4B-SFT

Qwen3-4B-SFT:

Qwen3-4B-SFT is a reasoning-focused model derived from Qwen3-4B-Base via full-parameter fine-tuning on the verl framework.

Reproducible 'warm-start' SFT checkpoints are scarce in open-source practice. This model fills that gap between base models and reinforcement learning (RL) models: it is tuned for Chain-of-Thought (CoT) reasoning and instruction following, and is intended as a warm-start for RL.

Benchmark Snapshot

Compared to the Base (4B) model, Qwen3-4B-SFT gains between 9.55 and 26.91 percentage points on the reasoning and mathematics benchmarks below. The reported figures are Pass@1 accuracy, computed as the mean of dataset-level accuracies across 16 independent runs.

Dataset	Base (4B)	Qwen3-4B-SFT (this model)	Improvement (Absolute)
AIME 2024	11.25%	20.8%	+9.55%
AIME 2025	6.46%	19.4%	+12.94%
AMC 2023	31.09%	58.0%	+26.91%
GPQA-Diamond	7.77%	29.1%	+21.33%
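
The averaging described above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used for this release: the function names and the toy run data are hypothetical, and only the two figures passed to `absolute_gain` come from the table.

```python
from statistics import mean

def pass_at_1(run_results):
    """Mean of dataset-level accuracies across independent runs.

    run_results: list of runs; each run is a list of booleans,
    one per problem (True = the single sampled answer was correct).
    """
    per_run_accuracy = [sum(run) / len(run) for run in run_results]
    return mean(per_run_accuracy)

def absolute_gain(base_pct, sft_pct):
    """Absolute improvement in percentage points, rounded to 2 decimals."""
    return round(sft_pct - base_pct, 2)

# Hypothetical toy data: 4 runs over 5 problems (the card uses 16 runs).
toy_runs = [
    [True, False, False, True, False],
    [True, False, True, True, False],
    [False, False, False, True, False],
    [True, False, False, True, True],
]

print(f"Pass@1 on toy data: {pass_at_1(toy_runs):.2%}")
print(f"AIME 2024 gain: {absolute_gain(11.25, 20.8)} points")
```

Averaging per-run accuracies, rather than pooling all samples, keeps each run's dataset-level score visible, which is useful when checking run-to-run variance on small sets such as AIME.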

Dataset card used for SFT: https://huggingface.co/datasets/96kevinli29/SFT-Dataset

Qwen3-style reasoning and instruction following

Minimal pattern (illustrative):

<|im_start|>user
… Among options A–D, which is correct? Reason step by step and put the final letter in \boxed{}.
<|im_end|>

<|im_start|>assistant
<think>
Compare A vs B vs C vs D against the stem; eliminate …; D remains consistent with …
</think>
Step-by-step: … (short derivation in the visible channel)
Final answer: \boxed{D}
<|im_end|>

Use a large enough max_new_tokens on hard math so both the reasoning block and the visible \boxed{…} line fit before generation stops.
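
As a rough sketch of how the pattern above might be handled in code, the helpers below build a single-turn prompt in the illustrated layout and pull the final `\boxed{…}` answer out of the visible part of a completion. Both function names are hypothetical; in practice the tokenizer's own chat template should be preferred over hand-built strings.

```python
import re

def build_prompt(user_message: str) -> str:
    """Hand-built single-turn prompt mirroring the illustrated layout.

    Assumption: this matches the chat template closely enough for
    illustration only; use the tokenizer's chat template for real runs.
    """
    return (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

def extract_boxed(completion: str):
    """Return the last \\boxed{...} content after </think>, or None.

    If the reasoning block was opened but never closed, generation most
    likely stopped early (max_new_tokens too small), so nothing is returned.
    Limitation: nested braces such as \\boxed{\\frac{1}{2}} are not parsed.
    """
    if "<think>" in completion and "</think>" not in completion:
        return None
    visible = completion.split("</think>")[-1]
    matches = re.findall(r"\\boxed\{([^{}]*)\}", visible)
    return matches[-1].strip() if matches else None

sample = "<think>\nA fails; maybe \\boxed{A}? No.\n</think>\nFinal answer: \\boxed{D}"
print(extract_boxed(sample))
```

Returning `None` for an unclosed reasoning block makes truncated generations easy to count separately from wrong answers when tuning `max_new_tokens`.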

Configuration Notes

Template: Trained with the Qwen chat template; the model learns to end responses with <|im_end|> (token ID 151645).
Suggested Configuration:
```
{
  "eos_token_id": 151645
}
```

You may adjust settings according to your training or deployment needs.
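
One way to apply the suggested `eos_token_id` at inference time is sketched below with the Hugging Face `transformers` API. The helper names and the default `max_new_tokens` of 8192 are assumptions for illustration, not values stated in this card; `generate_reply` needs `transformers` and `torch` installed and downloads the checkpoint on first use.

```python
IM_END_TOKEN_ID = 151645  # <|im_end|>, as stated in the configuration notes

def generation_kwargs(max_new_tokens: int = 8192) -> dict:
    """Keyword arguments for model.generate().

    Assumption: 8192 is an arbitrary 'large enough' budget, not a value
    from the model card; tune it for your hardware and task.
    """
    return {"max_new_tokens": max_new_tokens, "eos_token_id": IM_END_TOKEN_ID}

def generate_reply(user_message: str, model_id: str = "SeaFill2025/Qwen3-4B-SFT") -> str:
    """Generate one assistant reply (hypothetical helper)."""
    # Imported lazily so the kwargs helper works without these packages.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

    messages = [{"role": "user", "content": user_message}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
    ).to(model.device)

    output_ids = model.generate(**inputs, **generation_kwargs())
    # Drop the prompt tokens and decode only the newly generated part.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Passing `eos_token_id` explicitly at call time has the same intent as the JSON fragment above while leaving the files shipped with the checkpoint untouched.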

Training Infrastructure

Cluster: MeluXina Supercomputer (LuxProvide)
Node Config: 4 NVIDIA A100 GPUs per node.
Final SFT Run: 12 node-hours (4 nodes, i.e. 16× A100, for 3 hours)
Total R&D Investment: ~700 Node-hours (Includes data ablation, hyperparameter sweeps, and extensive benchmark evaluation.)
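
The node-hour figures above follow from simple arithmetic, sketched here for anyone converting them to GPU-hours. The helper names are hypothetical, and the GPU-hour conversion assumes every node-hour used all 4 GPUs, which the card does not state for the R&D total.

```python
GPUS_PER_NODE = 4  # from the node configuration above

def node_hours(num_gpus: int, wall_hours: float) -> float:
    """Node-hours consumed by a job using num_gpus for wall_hours."""
    return (num_gpus / GPUS_PER_NODE) * wall_hours

def to_gpu_hours(n_node_hours: float) -> float:
    """GPU-hours, assuming all GPUs on each node were in use."""
    return n_node_hours * GPUS_PER_NODE

final_run = node_hours(num_gpus=16, wall_hours=3)
print(f"Final SFT run: {final_run:g} node-hours, {to_gpu_hours(final_run):g} GPU-hours")
print(f"Total R&D: ~700 node-hours, ~{to_gpu_hours(700):g} GPU-hours")
```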

Project Links

Training code repository: https://github.com/96kevinli29/base-model-sft-verl

Limitations

Not optimized for factual correctness in all domains
May still produce hallucinations or unsafe outputs
Performance is sensitive to prompt style and decoding settings

Citation

If you use this model, please cite this checkpoint. BibTeX for this release:

@misc{qwen3-4b-sft-2026,
  title        = {{Qwen3-4B-SFT}: Supervised Fine-Tuned {Qwen3}-4B for Reasoning},
  author       = {Hongyang Li and Xiao Li and {Sea-Fill Community}},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SeaFill2025/Qwen3-4B-SFT}},
  note         = {Checkpoint trained with verl; warm-start for pre-RL alignment research. Maintained by Sea-Fill Community.}
}

Model size: 4B params (Safetensors). Tensor type: BF16.
