# Qwen3-0.6B-SFT-Mixed-Reasoning

- **Developed by:** Shreyansh327
- **License:** apache-2.0
- **Finetuned from:** unsloth/qwen3-0.6b
- **Training framework:** Unsloth + Hugging Face TRL (2x faster training)
## Model Overview
Qwen3-0.6B-SFT-Mixed-Reasoning is a supervised fine-tuned version of Qwen3-0.6B, designed to improve multi-step mathematical reasoning while preserving the model's factual knowledge base. This model is a direct product of research into the "Alignment Tax" in Small Language Models: the tendency of fine-tuning to degrade pre-trained knowledge when pushing for improved reasoning behavior.
The key innovation here is the data mixing curriculum: instead of training purely on open-ended reasoning traces (which caused severe catastrophic forgetting in ablation experiments), this model was trained on a carefully balanced mixture of reasoning, math, and factual science data.
## The Problem This Model Solves
In our initial ablation experiments, fine-tuning Qwen3-0.6B on a pure reasoning dataset (Opus 4.6, 500 steps, LoRA r=32) caused:
- A 24.31% drop in ARC-Challenge (factual/science benchmark) accuracy
- The model learned the *structure* of reasoning (`<think>` blocks, `**Answer: B**` formatting) but filled those blocks with overconfident hallucinations
- Degenerate repetition loops frequently appeared during generation

The model had learned to look like it was reasoning without actually preserving its underlying knowledge, a classic manifestation of the alignment tax.
## The Solution: Data Mixing Curriculum
By adopting a mixed-dataset approach at a fixed learning rate of 5e-5, the model was forced to simultaneously rehearse factual science knowledge while learning logical decomposition:
| Dataset | Mix % | Purpose |
|---|---|---|
| Opus 4.6 Reasoning | 50% | Teach structured `<think>` block reasoning and multi-step decomposition |
| GSM8K | 25% | Anchor mathematical accuracy and arithmetic grounding |
| ARC-Challenge | 25% | "Rehearsal" dataset to prevent catastrophic forgetting of factual science knowledge |
## Training Configuration
| Parameter | Value |
|---|---|
| Base Model | unsloth/qwen3-0.6b |
| Method | Supervised Fine-Tuning (SFT) via LoRA |
| LoRA Rank (r) | 32 |
| Learning Rate | 5e-5 |
| Training Steps | ~500 |
| Data Mix | 50% Opus + 25% GSM8K + 25% ARC |
| Repetition Penalty | 1.15 (to suppress degenerate loops) |
| Framework | Unsloth + TRL |
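The repetition penalty in the table acts at decoding time rather than training time. A simplified scalar sketch of the standard penalty semantics (logits of already-seen tokens are divided by the penalty when positive and multiplied when negative, which is how it suppresses degenerate loops):

```python
# Simplified illustration of a repetition penalty of 1.15 at decoding time.
# Tokens that already appear in the generated sequence get their logits
# pushed down, making exact repeats (and repetition loops) less likely.
def apply_repetition_penalty(logits, generated_ids, penalty=1.15):
    """logits: dict mapping token_id -> raw logit; generated_ids: tokens seen so far."""
    adjusted = dict(logits)
    for tok in set(generated_ids):
        if tok in adjusted:
            score = adjusted[tok]
            adjusted[tok] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = {1: 2.0, 2: 1.0, 3: -0.5}
penalized = apply_repetition_penalty(logits, generated_ids=[1, 3])
```

Token 2 was never generated, so its logit is untouched; tokens 1 and 3 are both made less likely.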
## Evaluation Results
Evaluated on a GSM8K subset (n=50) and ARC-Challenge benchmark against the base Qwen3-0.6B checkpoint:
| Metric | Base Qwen3-0.6B | SFT (Opus Only) | SFT (Mixed, This Model) |
|---|---|---|---|
| GSM8K Accuracy | Baseline | Moderate gain | +6% absolute (+23% relative) |
| ARC-Challenge | Baseline | -24.31% | -2.56% |
| Reasoning Style | None | Hallucinatory | Structured + Grounded |
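As a sanity check, the absolute and relative GSM8K figures are mutually consistent: since relative gain = absolute gain / baseline accuracy, a +6-point absolute gain that is +23% in relative terms implies a base accuracy of roughly 26% on this subset:

```python
# Consistency check on the table above:
# relative gain = absolute gain / baseline  =>  baseline = absolute / relative
absolute_gain = 0.06   # +6 percentage points on GSM8K
relative_gain = 0.23   # +23% relative improvement

implied_baseline = absolute_gain / relative_gain
print(f"Implied base GSM8K accuracy: {implied_baseline:.1%}")
```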
### Key Insight
The +6% absolute accuracy improvement on GSM8K was primarily driven by the model's ability to correctly decompose multi-step arithmetic inside <think> blocks. The base model frequently hallucinated final values on 3-step problems. The mixed-SFT model correctly identified intermediate sub-problems before arriving at the final answer.
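For intuition, the kind of decomposition involved is simple sub-problem chaining, e.g. "4 apples at $1.50 each, paid with a $10 bill": first compute the total cost, then the change. A trivial worked version of those intermediate steps:

```python
# Worked example of a multi-step arithmetic decomposition:
# each intermediate value is computed explicitly before the final answer.
price, count, paid = 1.50, 4, 10.00

cost = count * price    # step 1: total cost of the apples
change = paid - cost    # step 2: change from the bill

print(f"cost = ${cost:.2f}, change = ${change:.2f}")
```

The base model tends to skip straight to a (often wrong) final value on problems like this; the mixed-SFT model writes out the intermediate steps inside its `<think>` block.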
## Limitations
- Sample Size: GSM8K evaluation was conducted on a stratified subset (n=50). Full benchmark evaluation is in progress.
- ARC Regression: Even with data mixing, a small -2.56% regression on ARC-Challenge was observed, suggesting that SFT on reasoning data carries some residual alignment tax.
- Scale: This model is 0.6B parameters. Results may not generalize to larger or smaller model families without re-tuning the data mix ratios.
- Verbosity: Unlike GRPO-trained models, SFT models tend to produce longer `<think>` traces by imitating the verbose style of the training data. For production use cases where inference cost is critical, consider the companion GRPO-optimized model.
## How to Get Started
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Shreyansh327/Qwen3-0.6B-SFT-Mixed-Reasoning"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "A store sells apples for $1.50 each. If Maya buys 4 apples and pays with a $10 bill, how much change does she get?"

messages = [
    {"role": "system", "content": "Think through the problem carefully inside <think> tags, then provide your final answer inside <answer> tags."},
    {"role": "user", "content": prompt},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.9,
    repetition_penalty=1.15,
)

# Decode the first (and only) sequence in the batch
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
