# Qwen3-0.6B-SFT-Mixed-Reasoning

- **Developed by:** Shreyansh327
- **License:** apache-2.0
- **Finetuned from:** unsloth/qwen3-0.6b
- **Training framework:** Unsloth + Hugging Face TRL (2x faster training)
## Model Overview
Qwen3-0.6B-SFT-Mixed-Reasoning is a supervised fine-tuned version of Qwen3-0.6B, designed to improve multi-step mathematical reasoning while preserving the model's factual knowledge base. This model is a direct product of research into the "Alignment Tax" in Small Language Models: the tendency of fine-tuning to degrade pre-trained knowledge when pushing for improved reasoning behavior.
The key innovation here is the data mixing curriculum: instead of training purely on open-ended reasoning traces (which caused severe catastrophic forgetting in ablation experiments), this model was trained on a carefully balanced mixture of reasoning, math, and factual science data.
## The Problem This Model Solves
In our initial ablation experiments, fine-tuning Qwen3-0.6B on a pure reasoning dataset (Opus 4.6, 500 steps, LoRA r=32) caused:
- A 24.31% drop in ARC-Challenge (factual/science benchmark) accuracy
- The model learned the *structure* of reasoning (`<think>` blocks, `**Answer: B**` formatting) but filled those blocks with overconfident hallucinations
- Degenerate repetition loops frequently appeared during generation

The model had learned to look like it was reasoning without actually preserving its underlying knowledge, a classic manifestation of the alignment tax.
## The Solution: Data Mixing Curriculum
By adopting a mixed-dataset approach at a fixed learning rate of 5e-5, the model was forced to simultaneously rehearse factual science knowledge while learning logical decomposition:
| Dataset | Mix % | Purpose |
|---|---|---|
| Opus 4.6 Reasoning | 50% | Teach structured `<think>` block reasoning and multi-step decomposition |
| GSM8K | 25% | Anchor mathematical accuracy and arithmetic grounding |
| ARC-Challenge | 25% | "Rehearsal" dataset to prevent catastrophic forgetting of factual science knowledge |
## Training Configuration
| Parameter | Value |
|---|---|
| Base Model | unsloth/qwen3-0.6b |
| Method | Supervised Fine-Tuning (SFT) via LoRA |
| LoRA Rank (r) | 32 |
| Learning Rate | 5e-5 |
| Training Steps | ~500 |
| Data Mix | 50% Opus + 25% GSM8K + 25% ARC |
| Repetition Penalty | 1.15 (to suppress degenerate loops) |
| Framework | Unsloth + TRL |
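The repetition penalty in the table acts at decoding time rather than training time. A simplified scalar sketch of the standard penalty semantics (logits of already-seen tokens are divided by the penalty when positive and multiplied when negative, which is how it suppresses degenerate loops):

```python
# Simplified illustration of a repetition penalty of 1.15 at decoding time.
# Tokens that already appear in the generated sequence get their logits
# pushed down, making exact repeats (and repetition loops) less likely.
def apply_repetition_penalty(logits, generated_ids, penalty=1.15):
    """logits: dict mapping token_id -> raw logit; generated_ids: tokens seen so far."""
    adjusted = dict(logits)
    for tok in set(generated_ids):
        if tok in adjusted:
            score = adjusted[tok]
            adjusted[tok] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = {1: 2.0, 2: 1.0, 3: -0.5}
penalized = apply_repetition_penalty(logits, generated_ids=[1, 3])
```

Token 2 was never generated, so its logit is untouched; tokens 1 and 3 are both made less likely.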
## Evaluation Results
Evaluated on a GSM8K subset (n=50) and ARC-Challenge benchmark against the base Qwen3-0.6B checkpoint:
| Metric | Base Qwen3-0.6B | SFT (Opus Only) | SFT (Mixed, This Model) |
|---|---|---|---|
| GSM8K Accuracy | Baseline | Moderate gain | +6% absolute (+23% relative) |
| ARC-Challenge | Baseline | -24.31% | -2.56% |
| Reasoning Style | None | Hallucinatory | Structured + Grounded |
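As a sanity check, the absolute and relative GSM8K figures are mutually consistent: since relative gain = absolute gain / baseline accuracy, a +6-point absolute gain that is +23% in relative terms implies a base accuracy of roughly 26% on this subset:

```python
# Consistency check on the table above:
# relative gain = absolute gain / baseline  =>  baseline = absolute / relative
absolute_gain = 0.06   # +6 percentage points on GSM8K
relative_gain = 0.23   # +23% relative improvement

implied_baseline = absolute_gain / relative_gain
print(f"Implied base GSM8K accuracy: {implied_baseline:.1%}")
```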
### Key Insight
The +6% absolute accuracy improvement on GSM8K was primarily driven by the model's ability to correctly decompose multi-step arithmetic inside <think> blocks. The base model frequently hallucinated final values on 3-step problems. The mixed-SFT model correctly identified intermediate sub-problems before arriving at the final answer.
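For intuition, the kind of decomposition involved is simple sub-problem chaining, e.g. "4 apples at $1.50 each, paid with a $10 bill": first compute the total cost, then the change. A trivial worked version of those intermediate steps:

```python
# Worked example of a multi-step arithmetic decomposition:
# each intermediate value is computed explicitly before the final answer.
price, count, paid = 1.50, 4, 10.00

cost = count * price    # step 1: total cost of the apples
change = paid - cost    # step 2: change from the bill

print(f"cost = ${cost:.2f}, change = ${change:.2f}")
```

The base model tends to skip straight to a (often wrong) final value on problems like this; the mixed-SFT model writes out the intermediate steps inside its `<think>` block.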
## Limitations
- Sample Size: GSM8K evaluation was conducted on a stratified subset (n=50). Full benchmark evaluation is in progress.
- ARC Regression: Even with data mixing, a small -2.56% regression on ARC-Challenge was observed, suggesting that SFT on reasoning data carries some residual alignment tax.
- Scale: This model is 0.6B parameters. Results may not generalize to larger or smaller model families without re-tuning the data mix ratios.
- Verbosity: Unlike GRPO-trained models, SFT models tend to produce longer `<think>` traces by imitating the verbose style of the training data. For production use cases where inference cost is critical, consider the companion GRPO-optimized model.
## How to Get Started
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Shreyansh327/Qwen3-0.6B-SFT-Mixed-Reasoning"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "A store sells apples for $1.50 each. If Maya buys 4 apples and pays with a $10 bill, how much change does she get?"

messages = [
    {"role": "system", "content": "Think through the problem carefully inside <think> tags, then provide your final answer inside <answer> tags."},
    {"role": "user", "content": prompt},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.9,
    repetition_penalty=1.15,
)

# Decode the first (and only) sequence in the batch
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
