Qwen3-0.6B-SFT-Mixed-Reasoning

Developed by: Shreyansh327
License: apache-2.0
Finetuned from: unsloth/qwen3-0.6b
Training Framework: Unsloth + Hugging Face TRL (2x faster training)


Model Overview

Qwen3-0.6B-SFT-Mixed-Reasoning is a supervised fine-tuned version of Qwen3-0.6B, designed to improve multi-step mathematical reasoning while preserving the model's factual knowledge base. It is a direct product of research into the "Alignment Tax" in Small Language Models: the tendency of fine-tuning to degrade pre-trained knowledge when pushing for improved reasoning behavior.

The key innovation here is the data mixing curriculum: instead of training purely on open-ended reasoning traces (which caused severe catastrophic forgetting in ablation experiments), this model was trained on a carefully balanced mixture of reasoning, math, and factual science data.


The Problem This Model Solves

In our initial ablation experiments, fine-tuning Qwen3-0.6B on a pure reasoning dataset (Opus 4.6, 500 steps, LoRA r=32) caused:

  • A 24.31% drop in accuracy on ARC-Challenge (a factual/science benchmark)
  • Reasoning that looked structured (<think> blocks, **Answer: B** formatting) but was filled with overconfident hallucinations
  • Frequent degenerate repetition loops during generation

The model had learned to look like it was reasoning without actually preserving its underlying knowledge, a classic manifestation of the alignment tax.


The Solution: Data Mixing Curriculum

By adopting a mixed-dataset approach at a fixed learning rate of 5e-5, the model was forced to simultaneously rehearse factual science knowledge while learning logical decomposition:

| Dataset | Mix % | Purpose |
|---|---|---|
| Opus 4.6 Reasoning | 50% | Teach structured <think>-block reasoning and multi-step decomposition |
| GSM8K | 25% | Anchor mathematical accuracy and arithmetic grounding |
| ARC-Challenge | 25% | "Rehearsal" data to prevent catastrophic forgetting of factual science knowledge |
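The 50/25/25 mixture above can be sketched as a simple proportional sampler. This is an illustrative sketch, not the exact preprocessing pipeline used for this model; the source names and toy record format are assumptions.

```python
import random

def mix_datasets(sources, weights, total, seed=42):
    """Draw a shuffled training set that respects the given mixing ratios.

    sources: dict of name -> list of examples
    weights: dict of name -> fraction of the final mix (should sum to 1.0)
    """
    rng = random.Random(seed)
    mixed = []
    for name, frac in weights.items():
        n = round(total * frac)
        # Sample with replacement so a small "rehearsal" set can still fill its quota.
        mixed.extend(rng.choices(sources[name], k=n))
    rng.shuffle(mixed)
    return mixed

# Toy pools standing in for the real datasets.
sources = {
    "opus_reasoning": [{"src": "opus_reasoning", "id": i} for i in range(1000)],
    "gsm8k": [{"src": "gsm8k", "id": i} for i in range(1000)],
    "arc_challenge": [{"src": "arc_challenge", "id": i} for i in range(1000)],
}
weights = {"opus_reasoning": 0.50, "gsm8k": 0.25, "arc_challenge": 0.25}
train_set = mix_datasets(sources, weights, total=2000)
```

Because the rehearsal slices are sampled by quota rather than concatenated whole, the same recipe transfers to source datasets of very different sizes.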

Training Configuration

| Parameter | Value |
|---|---|
| Base Model | unsloth/qwen3-0.6b |
| Method | Supervised Fine-Tuning (SFT) via LoRA |
| LoRA Rank (r) | 32 |
| Learning Rate | 5e-5 |
| Training Steps | ~500 |
| Data Mix | 50% Opus + 25% GSM8K + 25% ARC |
| Repetition Penalty | 1.15 (to suppress degenerate loops) |
| Framework | Unsloth + TRL |
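The repetition penalty of 1.15 follows the standard CTRL-style rule that Hugging Face `generate` applies: the logit of any token already present in the sequence is divided by the penalty when positive and multiplied by it when negative, pushing repeats down. A minimal pure-Python sketch (token IDs and logit values are illustrative):

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.15):
    """Discourage tokens that already appear in the generated sequence.

    logits: list of floats, one per vocabulary id
    generated_ids: token ids produced so far
    """
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty   # shrink positive scores
        else:
            out[tok] *= penalty   # make negative scores more negative
    return out

# Token 2 was already generated, so its logit drops from 3.0 to 3.0 / 1.15.
penalized = apply_repetition_penalty([1.0, -0.5, 3.0, 0.2], generated_ids=[2, 2])
```

A penalty of 1.15 is a light touch: it breaks the degenerate loops seen in the Opus-only ablation without noticeably distorting normal token choices.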

Evaluation Results

Evaluated on a GSM8K subset (n=50) and ARC-Challenge benchmark against the base Qwen3-0.6B checkpoint:

| Metric | Base Qwen3-0.6B | SFT (Opus Only) | SFT (Mixed, This Model) |
|---|---|---|---|
| GSM8K Accuracy | Baseline | Moderate gain | +6% absolute (+23% relative) ✅ |
| ARC-Challenge | Baseline | -24.31% 💀 | -2.56% ✅ |
| Reasoning Style | None | Hallucinatory | Structured + Grounded |

Key Insight

The +6% absolute accuracy improvement on GSM8K was primarily driven by the model's ability to correctly decompose multi-step arithmetic inside <think> blocks. The base model frequently hallucinated final values on 3-step problems. The mixed-SFT model correctly identified intermediate sub-problems before arriving at the final answer.
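To make the decomposition concrete, here is the arithmetic for a 3-step problem of the kind described (the same word problem used in the quickstart prompt), written out as explicit intermediate sub-problems:

```python
# Step-by-step decomposition of: "Apples cost $1.50 each. Maya buys 4
# and pays with a $10 bill. How much change does she get?"
price_per_apple = 1.50
quantity = 4
paid = 10.00

total_cost = price_per_apple * quantity   # step 1: 4 * 1.50 = 6.00
change = paid - total_cost                # step 2: 10.00 - 6.00 = 4.00
```

The mixed-SFT model tends to surface exactly these intermediate values inside its <think> block, whereas the base model often jumps straight to a (frequently wrong) final number.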


Limitations

  • Sample Size: GSM8K evaluation was conducted on a stratified subset (n=50). Full benchmark evaluation is in progress.
  • ARC Regression: Even with data mixing, a small -2.56% regression on ARC-Challenge was observed, suggesting that SFT on reasoning data always carries some residual alignment tax.
  • Scale: This model is 0.6B parameters. Results may not generalize to larger or smaller model families without re-tuning the data mix ratios.
  • Verbosity: Unlike GRPO-trained models, SFT models tend to produce longer <think> traces by imitating the verbose style of training data. For production use cases where inference cost is critical, consider the companion GRPO-optimized model.

How to Get Started

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Shreyansh327/Qwen3-0.6B-SFT-Mixed-Reasoning"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

prompt = "A store sells apples for $1.50 each. If Maya buys 4 apples and pays with a $10 bill, how much change does she get?"
messages = [
    {"role": "system", "content": "Think through the problem carefully inside <think> tags, then provide your final answer inside <answer> tags."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.9,
    repetition_penalty=1.15
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
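Because the system prompt asks for the final answer inside <answer> tags, downstream code can strip the <think> trace and keep only the answer. A minimal parser (the sample completion below is illustrative; real outputs will vary):

```python
import re

def extract_answer(text):
    """Return the contents of the last <answer>...</answer> block, or None."""
    matches = re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

# Illustrative completion in the trained format.
sample = (
    "<think>4 apples cost 4 * 1.50 = 6.00; 10.00 - 6.00 = 4.00</think>"
    "<answer>$4.00</answer>"
)
```

Taking the last match guards against the model echoing tag examples earlier in its trace; returning `None` lets callers fall back gracefully when the model drops the format.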