Qwen3-1.7B-Distilled-30B-A3B-SFT
A 1.7B parameter model built in two stages: first, knowledge distillation from a 30B MoE teacher on 6,122 STEM chain-of-thought samples to establish a structured reasoning backbone; then, supervised fine-tuning on legal instruction data to layer domain knowledge and instruction-following capability on top of that backbone.
The hypothesis: teach the model how to reason first (distillation), then teach it what to reason about (SFT). The order matters — SFT on a base model teaches pattern matching, SFT on a distilled model teaches application of learned reasoning structures to new domains.
"Structure beats scale, collaboration beats hierarchy, observation beats theory." — Convergent Intelligence LLC: Research Division
Training Pipeline
Stage 1: Knowledge Distillation (STEM Reasoning Backbone)
The base Qwen3-1.7B was distilled from Qwen3-30B-A3B-Instruct-2507, a Mixture-of-Experts model with 30B total parameters and ~3B active per token. The student learns from the full MoE knowledge but deploys as a simple dense model with no routing overhead.
Data: 6,122 STEM chain-of-thought samples merged from 12 domain-specific datasets, each containing structured (question, chain-of-thought derivation, final answer) triples:
| Domain | Samples |
|---|---|
| Physics | 2,254 |
| Linear Algebra | 667 |
| Differential Equations | 636 |
| Electromagnetism | 580 |
| Mathematics | 576 |
| Engineering | 574 |
| Classical Mechanics | 343 |
| Theoretical Mechanics | 307 |
| Advanced Calculus | 268 |
| Modern Physics | 177 |
| Physiology | 114 |
| Molecular Biology | 71 |
All datasets sourced from 0xZee. Shuffled with seed 42, split 95/5 train/eval.
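The shuffle-and-split step above can be sketched with the standard library alone. This is an illustrative reconstruction, not the released preprocessing code; the placeholder records merely stand in for the real (question, derivation, answer) triples:

```python
import random

def shuffle_and_split(samples, seed=42, train_frac=0.95):
    """Deterministically shuffle a copy of the samples and split train/eval."""
    rng = random.Random(seed)
    samples = list(samples)  # copy so the caller's list is untouched
    rng.shuffle(samples)
    cut = int(len(samples) * train_frac)
    return samples[:cut], samples[cut:]

# Placeholder records standing in for the 6,122 STEM CoT triples.
data = [{"question": f"q{i}", "cot": f"d{i}", "answer": f"a{i}"} for i in range(6122)]
train, eval_ = shuffle_and_split(data)
print(len(train), len(eval_))  # 5815 307
```

With 6,122 samples, a 95/5 split reproduces the 5,815/307 train/eval counts listed in the Stage 1 hyperparameters.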
Training format:
```
Solve the following problem carefully and show a rigorous derivation.
Problem:
{question}
Proof:
{chain-of-thought derivation}
Final Answer:
{response}
```
Loss function — two components:

1. **Proof-Weighted Cross-Entropy (55%):** next-token prediction with amplified weight on tokens inside the derivation region (between `Proof:` and `Final Answer:`). The weight starts at 2.5× and decays linearly to 1.5× over training. This forces the student to prioritize learning the reasoning chain, the actual derivation steps, over memorizing answer formatting. The decay prevents a second failure mode: if the proof weight stays too high, the model memorizes derivation templates rather than learning generalizable reasoning.
2. **Knowledge Distillation KL Divergence (45%):** KL divergence between the student and teacher softmax distributions at temperature T = 2.0, scaled by T². At T = 1.0 the student mostly sees the teacher's top-1 choice at each reasoning step. At T = 2.0 it sees the teacher's full probability landscape: which alternative derivation paths were considered, which formulations were close but not chosen. For STEM reasoning, where multiple valid proof strategies exist, this transfer of uncertainty structure is critical.

Combined: `L = 0.55 * CE_weighted + 0.45 * KD_KL`
Why this matters: Standard distillation treats all tokens equally, so the model can minimize loss by getting the answer format right while generating plausible-but-wrong intermediate steps. Proof-weighted loss inverts this — errors in the derivation are penalized more heavily than errors in the answer string. The model learns how to think, not how to look like it's thinking.
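The combined objective can be sketched as a toy, per-token pure-Python computation. This is an illustration of the stated 0.55/0.45 recipe, not the training code; the normalization choices (weighted-average CE, mean KL) are assumptions:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(l / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, targets, proof_mask,
                 proof_weight, T=2.0):
    """Toy version of the Stage 1 objective:
    0.55 * proof-weighted CE + 0.45 * T^2-scaled KL(teacher || student)."""
    ce, kl, wsum = 0.0, 0.0, 0.0
    for s_log, t_log, y, in_proof in zip(student_logits, teacher_logits,
                                         targets, proof_mask):
        p_s = softmax(s_log)                     # student dist at T=1 for CE
        w = proof_weight if in_proof else 1.0    # amplify derivation tokens
        ce += -w * math.log(p_s[y])
        wsum += w
        p_s_T = softmax(s_log, T)                # tempered dists for KD
        p_t_T = softmax(t_log, T)
        kl += (T * T) * sum(pt * math.log(pt / ps)
                            for pt, ps in zip(p_t_T, p_s_T))
    return 0.55 * (ce / wsum) + 0.45 * (kl / len(targets))

loss = distill_loss(
    student_logits=[[0.2, 1.1], [2.0, -1.0]],
    teacher_logits=[[0.0, 1.5], [1.0, 0.0]],
    targets=[1, 0],
    proof_mask=[True, False],   # first token lies in the derivation region
    proof_weight=2.5,
)
print(loss > 0)  # True
```

In a real implementation both terms would be computed over batched logit tensors; the structure (masked weighting plus tempered KL) is the point here.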
Stage 1 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Training samples | 5,815 |
| Eval samples | 307 |
| Effective batch size | 8 (1 × 8 gradient accumulation) |
| Optimizer | AdamW (weight decay 0.01) |
| Learning rate | 1.5e-5 → 1e-6 (cosine, 30-step warmup) |
| Gradient clipping | 1.0 |
| Temperature | 2.0 |
| Proof weight | 2.5 → 1.5 (linear decay) |
| Precision | bf16 |
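The two schedules in the table (linear proof-weight decay, cosine learning rate with warmup) can be sketched as closed-form functions of the step index. The total-step count used below (≈727, i.e. 5,815 samples at effective batch 8 for one epoch) is an estimate, not a reported number:

```python
import math

def proof_weight(step, total_steps, start=2.5, end=1.5):
    """Linear decay of the derivation-token weight over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

def learning_rate(step, total_steps, peak=1.5e-5, floor=1e-6, warmup=30):
    """Cosine decay from peak to floor after a linear warmup."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

print(proof_weight(0, 727), proof_weight(727, 727))  # 2.5 1.5
```

At step 30 the learning rate reaches its 1.5e-5 peak, and by the final step it has decayed to the 1e-6 floor, matching the `1.5e-5 → 1e-6 (cosine, 30-step warmup)` row.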
Stage 1 output: reaperdoesntknow/Qwen3-1.7B-STEM-Proof-Distilled — a 1.7B model with a STEM reasoning backbone that produces structured, step-by-step derivations.
Stage 2: Supervised Fine-Tuning (Legal Domain + Instruction Following)
The distilled model from Stage 1 was then fine-tuned on Alignment-Lab-AI/Lawyer-Instruct using TRL's SFTTrainer.
Why legal after STEM? Legal reasoning shares structural DNA with mathematical reasoning: premise identification, logical chaining, handling of exceptions and edge cases, structured argumentation toward a conclusion. A model that learned to produce rigorous derivations in Stage 1 transfers that structure to legal analysis in Stage 2 — it doesn't just memorize legal templates, it applies the reasoning patterns it already learned to a new domain.
Data: Alignment-Lab-AI/Lawyer-Instruct — instruction/output pairs covering legal concepts, case analysis, statutory interpretation, and legal reasoning. Split 95/5 train/eval.
Training format:
```
### Instruction:
{instruction}
### Response:
{output}
```
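The template above is simple enough to build with one hypothetical helper (not part of the released training code). Passing `output=None` yields an inference prompt that ends at `### Response:`, ready for generation:

```python
def format_example(instruction, output=None):
    """Build the Stage 2 Alpaca-style prompt.

    With output=None, returns an inference prompt ending at '### Response:'.
    With an output string, returns a full training example.
    """
    prompt = f"### Instruction:\n{instruction}\n### Response:\n"
    if output is not None:
        prompt += output
    return prompt

print(format_example("Define consideration in contract law.")
      .endswith("### Response:\n"))  # True
```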
Stage 2 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Effective batch size | 8 (2 × 4 gradient accumulation) |
| Optimizer | AdamW (weight decay 0.01) |
| Learning rate | 5e-6 (cosine, 30-step warmup) |
| Gradient clipping | 1.0 |
| Max sequence length | 1024 |
| Gradient checkpointing | Enabled |
| Precision | bf16 |
| Eval strategy | Every 200 steps |
| Best model selection | Lowest eval loss |
Key design choice: The Stage 2 learning rate (5e-6) is deliberately lower than Stage 1 (1.5e-5). The reasoning backbone from distillation is the foundation — SFT should layer new domain knowledge on top without destabilizing the structured reasoning patterns established in Stage 1. Too high a learning rate here would overwrite the distillation gains.
Model Details
| Attribute | Value |
|---|---|
| Architecture | Qwen3 (causal LM, RoPE, GQA) |
| Parameters | 2,031M (1.7B advertised) |
| Base model | Qwen/Qwen3-1.7B |
| Teacher model | Qwen/Qwen3-30B-A3B-Instruct-2507 |
| Stage 1 data | 6,122 STEM CoT samples (12 datasets) |
| Stage 2 data | Alignment-Lab-AI/Lawyer-Instruct |
| Context length | 1024 tokens (training) |
| Precision | bf16 |
| License | Apache 2.0 |
| Developer | reaperdoesntknow / Convergent Intelligence LLC: Research Division |
Usage
Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Legal instruction-following (Stage 2 format)
prompt = """### Instruction:
Explain the difference between negligence and strict liability in tort law, and provide an example of each.
### Response:
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# STEM derivation (Stage 1 format — still works)
prompt = """Solve the following problem carefully and show a rigorous derivation.
Problem:
Find the general solution to the differential equation y'' - 3y' + 2y = 0.
Proof:
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
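For reference, the expected derivation for the sample ODE: the characteristic equation r² − 3r + 2 = 0 has roots r = 1 and r = 2, giving the general solution y = C₁eˣ + C₂e²ˣ. A quick stdlib finite-difference check that one member of that family satisfies the equation (illustrative only, for verifying model output by hand):

```python
import math

def residual(y, x, h=1e-5):
    """Central-difference estimate of y'' - 3y' + 2y at x."""
    d1 = (y(x + h) - y(x - h)) / (2 * h)
    d2 = (y(x + h) - 2 * y(x) + y(x - h)) / (h * h)
    return d2 - 3 * d1 + 2 * y(x)

# y = 2e^x - e^{2x}: one member of the family C1*e^x + C2*e^{2x}
y = lambda x: 2 * math.exp(x) - math.exp(2 * x)
print(abs(residual(y, 0.7)) < 1e-4)  # True
```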
GGUF
Quantized versions for local/edge deployment are available at reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF.
Prompt Formats
This model responds to two formats from its two training stages:
STEM derivation (Stage 1 — distillation):

```
Solve the following problem carefully and show a rigorous derivation.
Problem:
[Your math/physics/engineering problem]
Proof:
```

Instruction-following (Stage 2 — SFT):

```
### Instruction:
[Your question or task]
### Response:
```
Both formats are active. Use the STEM format for derivation-heavy problems. Use the instruction format for general questions, legal reasoning, and tasks that benefit from instruction-following behavior.
Intended Uses
Good for: Structured legal reasoning, STEM problem solving with step-by-step derivation, instruction-following across technical domains, educational tutoring, proof drafting, edge/mobile deployment via GGUF, component in multi-model pipelines, and retrieval-augmented workflows requiring lightweight reasoning capability.
Not for: Formal proof verification (use Lean/Coq/Isabelle), actual legal counsel (consult a licensed attorney), safety-critical engineering analysis, medical advice, or sole authority in high-stakes settings.
Limitations
This is a 1.7B model. It can produce fluent but incorrect reasoning in both STEM and legal domains — always verify critical outputs independently. It may overgeneralize from training templates, confuse thoroughness with accuracy, or perform unevenly across domains. The 1024 token training context limits performance on long derivations or complex multi-part legal questions. Legal knowledge reflects the training data and should not be treated as current or comprehensive legal guidance.
Stage 2 SFT may partially overwrite some Stage 1 capabilities on underrepresented STEM domains (molecular biology, physiology). Core reasoning structure on well-represented domains (physics, differential equations, linear algebra) should be preserved due to the conservative learning rate.
Mathematical Foundations: Discrepancy Calculus (DISC)
This model is part of a distillation chain built on Discrepancy Calculus — a measure-theoretic framework where the teacher's output distribution is decomposed via the Mesh Fundamental Identity into smooth (AC), jump, and Cantor components. The discrepancy operator $Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|} dt$ quantifies local structural mismatch that standard KL divergence averages away.
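Discrepancy Calculus is the authors' framework; as a purely illustrative numeric toy (not part of the released code), the displayed operator can be approximated with a Riemann sum at small ε. For differentiable f the integrand tends to |f′(x)|, so the approximation should recover the derivative's magnitude:

```python
def discrepancy(f, x, eps=1e-4, n=1000):
    """Riemann-sum approximation of (1/eps) * ∫_x^{x+eps} |f(t)-f(x)|/|t-x| dt."""
    total = 0.0
    dt = eps / n
    for k in range(1, n + 1):
        t = x + k * dt
        total += abs(f(t) - f(x)) / abs(t - x) * dt
    return total / eps

# For smooth f the integrand tends to |f'(x)|: here f(u) = u^2, so Df(1) = 2.
print(round(discrepancy(lambda u: u * u, 1.0), 3))  # 2.0
```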
Full theory: "On the Formal Analysis of Discrepancy Calculus" (Colca, 2026; Convergent Intelligence LLC: Research Division). Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165).
Related Models
| Model | Description |
|---|---|
| Qwen3-1.7B-STEM-Proof-Distilled | Stage 1 only — pure STEM reasoning backbone without legal SFT |
| Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF | This model quantized to Q4_K_M / Q5_K_M / Q8_0 / F16 for llama.cpp |
Citation
```bibtex
@misc{colca2026distilledsft,
  title={Two-Stage Reasoning Transfer: STEM Knowledge Distillation + Legal SFT
         for Lightweight Structured Reasoning},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B-SFT},
  note={Convergent Intelligence LLC: Research Division}
}
```
Acknowledgments
STEM CoT training data from 0xZee. Legal instruction data from Alignment-Lab-AI. Base architecture from Qwen. Distillation and training methodology developed as part of Convergent Intelligence LLC's research program on structured reasoning transfer.
"Where classical analysis fails to see, we begin." — Convergent Intelligence LLC: Research Division