Qwen3-0.6B STEM Proof Distilled (Thinking Teacher)
A 0.6B parameter model distilled from Qwen3-30B-A3B-Thinking on 6,122 STEM chain-of-thought samples. 50x parameter compression. The Thinking variant teacher produces richer extended reasoning traces than the Instruct variant, transferring deeper deliberation structure into the smallest possible student.
The result: a model under 500MB quantized that produces structured STEM derivations because a 30B thinking model showed it how to reason.
"Structure beats scale." — Convergent Intelligence LLC: Research Division
What Makes This Different
Two key differences from standard small-model distillation:
1. Thinking teacher, not Instruct teacher. The Qwen3-30B-A3B-Thinking variant generates extended internal reasoning before committing to an answer. Its softmax distributions are higher-entropy — it considers more reasoning paths at each step. At distillation temperature T=2.0, this means the 0.6B student sees a much richer landscape of alternative derivation strategies than it would from an Instruct teacher. The student doesn't just learn the answer — it learns the deliberation.
2. Proof-weighted loss. Tokens inside the derivation region (Proof: to Final Answer:) receive 2.5x amplified loss, decaying to 1.5x over training. The model is penalized more for errors in reasoning steps than for errors in answer formatting. At 0.6B, every parameter has to count — proof weighting ensures they're allocated to reasoning capability, not boilerplate reproduction.
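The effect of the T=2.0 distillation temperature can be illustrated with a toy softmax: raising the temperature flattens the teacher's distribution, so low-probability "alternative path" tokens carry more mass for the student to learn from. This is a minimal sketch in plain Python (the helper names are illustrative, not part of the released code):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Toy teacher logits over a 4-token vocabulary.
logits = [4.0, 2.0, 1.0, 0.5]

p_sharp = softmax(logits, temperature=1.0)
p_soft = softmax(logits, temperature=2.0)

# At T=2.0 the distribution is flatter, so its entropy is higher:
# secondary reasoning-path tokens get more probability mass.
assert entropy(p_soft) > entropy(p_sharp)
```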
Model Details
| Attribute | Value |
|---|---|
| Architecture | Qwen3 (causal LM, RoPE, GQA) |
| Parameters | 0.6B |
| Base model | Qwen/Qwen3-0.6B |
| Teacher model | Qwen/Qwen3-30B-A3B-Thinking-2507 |
| Compression ratio | 50x (30B → 0.6B) |
| Context length | 1024 tokens (training) |
| Precision | bf16 |
| License | Apache 2.0 |
| Developer | Reaperdoesntrun / Convergent Intelligence LLC: Research Division |
Training
Loss Function
- Proof-Weighted Cross-Entropy (55%) — Amplified weight on derivation tokens (2.5x → 1.5x linear decay)
- Knowledge Distillation KL Divergence (45%) — Student/teacher softmax divergence at T=2.0, scaled by T²
Combined: L = 0.55 * CE_weighted + 0.45 * KD_kl
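The combined objective above can be sketched as a per-token loop in plain Python. This is a simplified illustration of the stated formula, not the actual training code; the function name and list-based tensors are assumptions for readability:

```python
import math

def softmax(logits, t=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(z / t for z in logits)
    exps = [math.exp(z / t - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def combined_loss(student_logits, teacher_logits, targets, proof_mask,
                  proof_weight=2.5, temperature=2.0,
                  ce_coef=0.55, kd_coef=0.45):
    """L = 0.55 * proof-weighted CE + 0.45 * T^2-scaled KL(teacher || student)."""
    ce_total, kd_total = 0.0, 0.0
    for s_log, t_log, y, in_proof in zip(student_logits, teacher_logits,
                                         targets, proof_mask):
        # Proof-weighted cross-entropy against the gold token.
        p_student = softmax(s_log)
        w = proof_weight if in_proof else 1.0
        ce_total += -w * math.log(p_student[y])

        # KL divergence on temperature-softened distributions,
        # scaled by T^2 to keep gradient magnitudes comparable.
        p_t = softmax(t_log, temperature)
        p_s = softmax(s_log, temperature)
        kd_total += temperature ** 2 * sum(
            pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)

    n = len(targets)
    return ce_coef * ce_total / n + kd_coef * kd_total / n
```

Tokens inside the proof region (mask `True`) contribute 2.5x cross-entropy at the start of training, decaying toward 1.5x as described above.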
Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Training samples | 5,815 (95% of 6,122) |
| Eval samples | 307 (5% held out) |
| Effective batch size | 8 |
| Optimizer | AdamW (weight decay 0.01) |
| Learning rate | 1.5e-5 → 1e-6 (cosine, 30-step warmup) |
| Gradient clipping | 1.0 |
| Temperature | 2.0 |
| Proof weight | 2.5 → 1.5 |
| Precision | bf16 |
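The learning-rate schedule in the table (1.5e-5 to 1e-6, cosine decay after a 30-step linear warmup) can be sketched as a small function. The function name and step accounting are illustrative assumptions; with 5,815 samples at an effective batch size of 8, one epoch is roughly 727 optimizer steps:

```python
import math

def lr_at(step, total_steps, peak=1.5e-5, floor=1e-6, warmup=30):
    """Linear warmup to `peak`, then cosine decay to `floor`."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```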
Dataset
6,122 STEM CoT samples from 12 domains (Physics 2,254 / Linear Algebra 667 / Differential Equations 636 / Electromagnetism 580 / Mathematics 576 / Engineering 574 / Classical Mechanics 343 / Theoretical Mechanics 307 / Advanced Calculus 268 / Modern Physics 177 / Physiology 114 / Molecular Biology 71). All from 0xZee.
Training Format
```
Solve the following problem carefully and show a rigorous derivation.
Problem:
{question}
Proof:
{CoT}
Final Answer:
{response}
```
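Assembling one training example from this template is a simple string operation. A minimal sketch (the helper name is an assumption, not part of the released pipeline):

```python
def format_sample(question, cot, response):
    """Assemble one training example in the Proof / Final Answer format."""
    return (
        "Solve the following problem carefully and show a rigorous derivation.\n"
        f"Problem:\n{question}\n"
        f"Proof:\n{cot}\n"
        f"Final Answer:\n{response}"
    )
```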
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qwen3-0.6B-STEM-Proof-Distilled-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

prompt = """Solve the following problem carefully and show a rigorous derivation.
Problem:
Find the eigenvalues of the matrix [[3, 1], [0, 3]].
Proof:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
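Because generations follow the training format, the answer can be pulled out of the decoded text by splitting on the `Final Answer:` marker. A minimal, assumed helper (not part of the released code):

```python
def extract_final_answer(generated_text):
    """Return the text after the last 'Final Answer:' marker, or None."""
    marker = "Final Answer:"
    if marker not in generated_text:
        return None
    return generated_text.rsplit(marker, 1)[1].strip()
```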
Intended Uses
Good for: lightweight STEM reasoning on edge and mobile devices, educational tutoring, proof drafting, use as a component in multi-model pipelines where a small, fast reasoner is needed, and IoT/embedded inference.
Not for: Formal proof verification, safety-critical analysis, medical or legal advice, or tasks requiring long-context reasoning beyond 1024 tokens.
Limitations
0.6B is a hard capacity constraint. The model will struggle with multi-step proofs requiring more than ~8 reasoning steps, complex multi-variable problems, or domains underrepresented in training data (molecular biology, physiology). It will sometimes generate plausible but incorrect intermediate steps. Always verify.
Mathematical Foundations: Discrepancy Calculus (DISC)
This model is part of a distillation chain built on Discrepancy Calculus — a measure-theoretic framework where the teacher's output distribution is decomposed via the Mesh Fundamental Identity into smooth (AC), jump, and Cantor components. The discrepancy operator $Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|} dt$ quantifies local structural mismatch that standard KL divergence averages away.
Full theory: "On the Formal Analysis of Discrepancy Calculus" (Colca, 2026; Convergent Intelligence LLC: Research Division). Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165).
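The discrepancy operator above can be approximated numerically with a Riemann sum. This is an illustrative sketch only (function name and step counts are assumptions); note that for smooth f the difference quotient tends to |f'(x)|, so Df collapses to the ordinary derivative's magnitude:

```python
def discrepancy(f, x, eps=1e-5, n=1000):
    """Riemann-sum approximation of
    Df(x) = (1/eps) * integral_x^{x+eps} |f(t) - f(x)| / |t - x| dt."""
    h = eps / n
    total = 0.0
    for k in range(1, n + 1):
        t = x + k * h  # t > x, so |t - x| is never zero
        total += abs(f(t) - f(x)) / abs(t - x) * h
    return total / eps

# Smooth case: f(x) = x^2 at x = 1 gives Df close to |f'(1)| = 2.
```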
Related Models
| Model | Description |
|---|---|
| Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT | This model + legal SFT |
| Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF | Quantized for edge deployment |
| Qwen3-1.7B-STEM-Proof-Distilled | Larger 1.7B variant (Instruct teacher) |
Citation
@misc{colca2026distilled06b,
title={Qwen3-0.6B STEM Proof Distilled: 50x Compression from a Thinking Teacher},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-STEM-Proof-Distilled-Thinking},
note={Convergent Intelligence LLC: Research Division}
}
Convergent Intelligence LLC: Research Division "Where classical analysis fails to see, we begin."
Convergent Intelligence Portfolio
Part of the Qwen3 0.6B Distillation Series by Convergent Intelligence LLC: Research Division
DistilQwen Collection
This model is part of the DistilQwen proof-weighted distillation series. Collection: 9 models | 2,788 downloads
Teacher Variant Comparison
| Teacher | Student Size | Strength | Models |
|---|---|---|---|
| Qwen3-30B-A3B (Instruct) | 1.7B | Instruction following, structured output, legal reasoning | 3 (833 DL) |
| Qwen3-30B-A3B (Thinking) | 0.6B | Extended deliberation, higher-entropy distributions, proof derivation | 3 (779 DL) ← this model |
| Qwen3-30B-A3B (Coder) | 1.7B | Structured decomposition, STEM derivation, logical inference | 2 (825 DL) |
Methodology
This is the only BF16 collection in the portfolio. While the broader Convergent Intelligence catalog (43 models, 12,000+ downloads) was trained on CPU at FP32 for $24 of total compute, the DistilQwen series was trained on an H100 at BF16 with a 30B-parameter teacher: the same methodology on premium hardware.
All models use proof-weighted knowledge distillation: 55% cross-entropy with decaying proof weights (2.5× → 1.5×), 45% KL divergence at T=2.0. The proof weight amplifies loss on reasoning-critical tokens, forcing the student to allocate capacity to structural understanding rather than surface-level pattern matching.
Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165)
Related in this series
- Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT (227 downloads)
- Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF (316 downloads)