Qwen3-0.6B STEM Proof Distilled (Thinking Teacher)

A 0.6B-parameter model distilled from Qwen3-30B-A3B-Thinking on 6,122 STEM chain-of-thought samples: a 50x parameter compression. The Thinking-variant teacher produces richer extended reasoning traces than the Instruct variant, transferring deeper deliberation structure into the smallest possible student.

The result: a model under 500MB quantized that produces structured STEM derivations because a 30B thinking model showed it how to reason.

"Structure beats scale." — Convergent Intelligence LLC: Research Division

What Makes This Different

Two key differences from standard small-model distillation:

1. Thinking teacher, not Instruct teacher. The Qwen3-30B-A3B-Thinking variant generates extended internal reasoning before committing to an answer. Its softmax distributions are higher-entropy — it considers more reasoning paths at each step. At distillation temperature T=2.0, this means the 0.6B student sees a much richer landscape of alternative derivation strategies than it would from an Instruct teacher. The student doesn't just learn the answer — it learns the deliberation.

2. Proof-weighted loss. Tokens inside the derivation region (Proof: to Final Answer:) receive 2.5x amplified loss, decaying to 1.5x over training. The model is penalized more for errors in reasoning steps than for errors in answer formatting. At 0.6B, every parameter has to count — proof weighting ensures they're allocated to reasoning capability, not boilerplate reproduction.
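The effect of the distillation temperature on the teacher's distribution can be illustrated with a small sketch (the logit values here are hypothetical, not taken from the actual model):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: dividing logits by T > 1 flattens the distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical teacher logits over four candidate next tokens.
logits = [5.0, 2.0, 1.0, 0.5]
p_sharp = softmax(logits, T=1.0)
p_soft = softmax(logits, T=2.0)

# At T=2.0 the softened distribution carries more probability mass on
# alternative tokens, so the student sees more of the teacher's ranking
# of candidate reasoning paths, not just its top choice.
print(entropy(p_soft) > entropy(p_sharp))  # True
```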

Model Details

| Attribute | Value |
|---|---|
| Architecture | Qwen3 (causal LM, RoPE, GQA) |
| Parameters | 0.6B |
| Base model | Qwen/Qwen3-0.6B |
| Teacher model | Qwen/Qwen3-30B-A3B-Thinking-2507 |
| Compression ratio | 50x (30B → 0.6B) |
| Context length | 1024 tokens (training) |
| Precision | bf16 |
| License | Apache 2.0 |
| Developer | Reaperdoesntknow / Convergent Intelligence LLC: Research Division |

Training

Loss Function

  1. Proof-Weighted Cross-Entropy (55%) — Amplified weight on derivation tokens (2.5x → 1.5x linear decay)
  2. Knowledge Distillation KL Divergence (45%) — Student/teacher softmax divergence at T=2.0, scaled by T²

Combined: L = 0.55 * CE_weighted + 0.45 * KD_kl
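A minimal sketch of the objective, assuming the T² factor is folded into the KD term as described above. The helper names and the single-token treatment of the `Proof:` / `Final Answer:` markers are illustrative simplifications, not the actual training code:

```python
def token_weights(tokens, proof_weight=2.5):
    """Illustrative proof-region weighting: tokens between the 'Proof:' and
    'Final Answer:' markers receive amplified cross-entropy loss.
    Markers are treated as single tokens here for simplicity."""
    weights, in_proof = [], False
    for tok in tokens:
        if tok == "Proof:":
            in_proof = True
        elif tok == "Final Answer:":
            in_proof = False
        weights.append(proof_weight if in_proof else 1.0)
    return weights

def combined_loss(ce_weighted, kd_kl, T=2.0, alpha=0.55):
    """L = 0.55 * CE_weighted + 0.45 * T^2 * KD_kl; the T^2 factor
    compensates for the gradient shrinkage caused by temperature softening."""
    return alpha * ce_weighted + (1.0 - alpha) * (T ** 2) * kd_kl

weights = token_weights(["Problem:", "x", "Proof:", "step", "Final Answer:", "42"])
# Derivation tokens ('Proof:' and 'step') get 2.5x weight; the rest get 1.0.
```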

Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Training samples | 5,815 (95% of 6,122) |
| Eval samples | 307 (5% held out) |
| Effective batch size | 8 |
| Optimizer | AdamW (weight decay 0.01) |
| Learning rate | 1.5e-5 → 1e-6 (cosine, 30-step warmup) |
| Gradient clipping | 1.0 |
| Temperature | 2.0 |
| Proof weight | 2.5 → 1.5 |
| Precision | bf16 |
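The decaying proof weight can be sketched as a simple linear schedule (the function name is illustrative, not from the training code):

```python
def proof_weight(step, total_steps, start=2.5, end=1.5):
    """Linearly decay the proof-region loss weight from `start` to `end`
    over the course of training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

# Early steps penalize derivation errors heavily; the pressure relaxes
# as training progresses.
print(proof_weight(0, 100), proof_weight(50, 100), proof_weight(100, 100))
# 2.5 2.0 1.5
```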

Dataset

6,122 STEM CoT samples from 12 domains (Physics 2,254 / Linear Algebra 667 / Differential Equations 636 / Electromagnetism 580 / Mathematics 576 / Engineering 574 / Classical Mechanics 343 / Theoretical Mechanics 307 / Advanced Calculus 268 / Modern Physics 177 / Physiology 114 / Molecular Biology 71). All from 0xZee.

Training Format

```
Solve the following problem carefully and show a rigorous derivation.

Problem:
{question}

Proof:
{CoT}

Final Answer:
{response}
```
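At inference time, prompts should follow the same template, ending at the `Proof:` marker so the model continues with the derivation. A minimal helper (the function name is illustrative):

```python
def format_prompt(question: str) -> str:
    """Build an inference prompt matching the training format; generation
    continues from the 'Proof:' marker."""
    return (
        "Solve the following problem carefully and show a rigorous derivation.\n"
        "\n"
        "Problem:\n"
        f"{question}\n"
        "\n"
        "Proof:\n"
    )

prompt = format_prompt("Find the eigenvalues of the matrix [[3, 1], [0, 3]].")
```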

Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qwen3-0.6B-STEM-Proof-Distilled-Thinking"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

prompt = """Solve the following problem carefully and show a rigorous derivation.

Problem:
Find the eigenvalues of the matrix [[3, 1], [0, 3]].

Proof:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Intended Uses

Good for: Lightweight STEM reasoning on edge/mobile devices, educational tutoring, proof drafting, component in multi-model pipelines where a small fast reasoner is needed, IoT and embedded inference.

Not for: Formal proof verification, safety-critical analysis, medical or legal advice, or tasks requiring long-context reasoning beyond 1024 tokens.

Limitations

0.6B is a hard capacity constraint. The model will struggle with multi-step proofs requiring more than ~8 reasoning steps, complex multi-variable problems, or domains underrepresented in training data (molecular biology, physiology). It will sometimes generate plausible but incorrect intermediate steps. Always verify.

Mathematical Foundations: Discrepancy Calculus (DISC)

This model is part of a distillation chain built on Discrepancy Calculus — a measure-theoretic framework where the teacher's output distribution is decomposed via the Mesh Fundamental Identity into smooth (AC), jump, and Cantor components. The discrepancy operator $Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|} dt$ quantifies local structural mismatch that standard KL divergence averages away.
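For intuition: when f is differentiable at x, the integrand tends to |f'(x)| as t → x, so Df(x) reduces to |f'(x)| on smooth regions; the operator only carries extra information at the jump and Cantor components. A quick numerical check of the smooth case (the approximation scheme here is ours, not from the paper):

```python
def discrepancy(f, x, eps=1e-4, n=1000):
    """Midpoint Riemann-sum approximation of
    Df(x) = (1/eps) * integral over [x, x+eps] of |f(t) - f(x)| / |t - x| dt."""
    h = eps / n
    total = 0.0
    for i in range(n):
        t = x + (i + 0.5) * h  # midpoint of the i-th subinterval
        total += abs(f(t) - f(x)) / abs(t - x) * h
    return total / eps

# Smooth case: f(x) = x^2 at x = 1 gives Df(1) close to |f'(1)| = 2.
print(discrepancy(lambda x: x * x, 1.0))
```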

Full theory: "On the Formal Analysis of Discrepancy Calculus" (Colca, 2026; Convergent Intelligence LLC: Research Division). Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165).

Related Models

| Model | Description |
|---|---|
| Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT | This model + legal SFT |
| Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF | Quantized for edge deployment |
| Qwen3-1.7B-STEM-Proof-Distilled | Larger 1.7B variant (Instruct teacher) |

Citation

@misc{colca2026distilled06b,
  title={Qwen3-0.6B STEM Proof Distilled: 50x Compression from a Thinking Teacher},
  author={Colca},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-STEM-Proof-Distilled-Thinking},
  note={Convergent Intelligence LLC: Research Division}
}

Convergent Intelligence LLC: Research Division "Where classical analysis fails to see, we begin."


Convergent Intelligence Portfolio

Part of the Qwen3 0.6B Distillation Series by Convergent Intelligence LLC: Research Division



DistilQwen Collection

This model is part of the DistilQwen proof-weighted distillation series. Collection: 9 models | 2,788 downloads

Teacher Variant Comparison

| Teacher | Student Size | Strengths | Models |
|---|---|---|---|
| Qwen3-30B-A3B (Instruct) | 1.7B | Instruction following, structured output, legal reasoning | 3 (833 DL) |
| Qwen3-30B-A3B (Thinking) | 0.6B | Extended deliberation, higher-entropy distributions, proof derivation | 3 (779 DL) ← this model |
| Qwen3-30B-A3B (Coder) | 1.7B | Structured decomposition, STEM derivation, logical inference | 2 (825 DL) |

Methodology

The only BF16 collection in the portfolio. While the broader Convergent Intelligence catalog (43 models, 12,000+ downloads) was trained on CPU at FP32 for $24 total compute, the DistilQwen series was trained on H100 at BF16 with a 30B-parameter teacher. Same methodology, premium hardware. This is what happens when you give the pipeline real compute.

All models use proof-weighted knowledge distillation: 55% cross-entropy with decaying proof weights (2.5× → 1.5×), 45% KL divergence at T=2.0. The proof weight amplifies loss on reasoning-critical tokens, forcing the student to allocate capacity to structural understanding rather than surface-level pattern matching.

Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165)
