Qwen3-1.7B-Distilled-30B-A3B-SFT
A 1.7B parameter model built in two stages: first, knowledge distillation from a 30B MoE teacher on 6,122 STEM chain-of-thought samples to establish a structured reasoning backbone; then, supervised fine-tuning on legal instruction data to layer domain knowledge and instruction-following capability on top of that backbone.
The hypothesis: teach the model how to reason first (distillation), then teach it what to reason about (SFT). The order matters — SFT on a base model teaches pattern matching, SFT on a distilled model teaches application of learned reasoning structures to new domains.
"Structure beats scale, collaboration beats hierarchy, observation beats theory." — Convergent Intelligence LLC: Research Division
Training Pipeline
Stage 1: Knowledge Distillation (STEM Reasoning Backbone)
The base Qwen3-1.7B was distilled from Qwen3-30B-A3B-Instruct-2507, a Mixture-of-Experts model with 30B total parameters and ~3B active per token. The student learns from the full MoE knowledge but deploys as a simple dense model with no routing overhead.
Data: 6,122 STEM chain-of-thought samples merged from 12 domain-specific datasets, each containing structured (question, chain-of-thought derivation, final answer) triples:
| Domain | Samples |
|---|---|
| Physics | 2,254 |
| Linear Algebra | 667 |
| Differential Equations | 636 |
| Electromagnetism | 580 |
| Mathematics | 576 |
| Engineering | 574 |
| Classical Mechanics | 343 |
| Theoretical Mechanics | 307 |
| Advanced Calculus | 268 |
| Modern Physics | 177 |
| Physiology | 114 |
| Molecular Biology | 71 |
All datasets sourced from 0xZee. Shuffled with seed 42, split 95/5 train/eval.
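The shuffle-and-split step above can be sketched with the standard library alone. This is an illustrative reconstruction, not the released preprocessing code; the placeholder records merely stand in for the real (question, derivation, answer) triples:

```python
import random

def shuffle_and_split(samples, seed=42, train_frac=0.95):
    """Deterministically shuffle a copy of the samples and split train/eval."""
    rng = random.Random(seed)
    samples = list(samples)  # copy so the caller's list is untouched
    rng.shuffle(samples)
    cut = int(len(samples) * train_frac)
    return samples[:cut], samples[cut:]

# Placeholder records standing in for the 6,122 STEM CoT triples.
data = [{"question": f"q{i}", "cot": f"d{i}", "answer": f"a{i}"} for i in range(6122)]
train, eval_ = shuffle_and_split(data)
print(len(train), len(eval_))  # 5815 307
```

With 6,122 samples, a 95/5 split reproduces the 5,815/307 train/eval counts listed in the Stage 1 hyperparameters.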
Training format:
```
Solve the following problem carefully and show a rigorous derivation.
Problem:
{question}
Proof:
{chain-of-thought derivation}
Final Answer:
{response}
```
Loss function — two components:

1. **Proof-Weighted Cross-Entropy (55%):** next-token prediction with amplified weight on tokens inside the derivation region (between `Proof:` and `Final Answer:`). The weight starts at 2.5× and decays linearly to 1.5× over training. This forces the student to prioritize learning the reasoning chain, the actual derivation steps, over memorizing answer formatting. The decay prevents a second failure mode: if the proof weight stays too high, the model memorizes derivation templates rather than learning generalizable reasoning.
2. **Knowledge Distillation KL Divergence (45%):** KL divergence between the student and teacher softmax distributions at temperature T = 2.0, scaled by T². At T = 1.0 the student mostly sees the teacher's top-1 choice at each reasoning step. At T = 2.0 it sees the teacher's full probability landscape: which alternative derivation paths were considered, which formulations were close but not chosen. For STEM reasoning, where multiple valid proof strategies exist, this transfer of uncertainty structure is critical.

Combined: `L = 0.55 * CE_weighted + 0.45 * KD_KL`
Why this matters: Standard distillation treats all tokens equally, so the model can minimize loss by getting the answer format right while generating plausible-but-wrong intermediate steps. Proof-weighted loss inverts this — errors in the derivation are penalized more heavily than errors in the answer string. The model learns how to think, not how to look like it's thinking.
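The combined objective can be sketched as a toy, per-token pure-Python computation. This is an illustration of the stated 0.55/0.45 recipe, not the training code; the normalization choices (weighted-average CE, mean KL) are assumptions:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(l / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, targets, proof_mask,
                 proof_weight, T=2.0):
    """Toy version of the Stage 1 objective:
    0.55 * proof-weighted CE + 0.45 * T^2-scaled KL(teacher || student)."""
    ce, kl, wsum = 0.0, 0.0, 0.0
    for s_log, t_log, y, in_proof in zip(student_logits, teacher_logits,
                                         targets, proof_mask):
        p_s = softmax(s_log)                     # student dist at T=1 for CE
        w = proof_weight if in_proof else 1.0    # amplify derivation tokens
        ce += -w * math.log(p_s[y])
        wsum += w
        p_s_T = softmax(s_log, T)                # tempered dists for KD
        p_t_T = softmax(t_log, T)
        kl += (T * T) * sum(pt * math.log(pt / ps)
                            for pt, ps in zip(p_t_T, p_s_T))
    return 0.55 * (ce / wsum) + 0.45 * (kl / len(targets))

loss = distill_loss(
    student_logits=[[0.2, 1.1], [2.0, -1.0]],
    teacher_logits=[[0.0, 1.5], [1.0, 0.0]],
    targets=[1, 0],
    proof_mask=[True, False],   # first token lies in the derivation region
    proof_weight=2.5,
)
print(loss > 0)  # True
```

In a real implementation both terms would be computed over batched logit tensors; the structure (masked weighting plus tempered KL) is the point here.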
Stage 1 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Training samples | 5,815 |
| Eval samples | 307 |
| Effective batch size | 8 (1 × 8 gradient accumulation) |
| Optimizer | AdamW (weight decay 0.01) |
| Learning rate | 1.5e-5 → 1e-6 (cosine, 30-step warmup) |
| Gradient clipping | 1.0 |
| Temperature | 2.0 |
| Proof weight | 2.5 → 1.5 (linear decay) |
| Precision | bf16 |
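The two schedules in the table (linear proof-weight decay, cosine learning rate with warmup) can be sketched as closed-form functions of the step index. The total-step count used below (≈727, i.e. 5,815 samples at effective batch 8 for one epoch) is an estimate, not a reported number:

```python
import math

def proof_weight(step, total_steps, start=2.5, end=1.5):
    """Linear decay of the derivation-token weight over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

def learning_rate(step, total_steps, peak=1.5e-5, floor=1e-6, warmup=30):
    """Cosine decay from peak to floor after a linear warmup."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

print(proof_weight(0, 727), proof_weight(727, 727))  # 2.5 1.5
```

At step 30 the learning rate reaches its 1.5e-5 peak, and by the final step it has decayed to the 1e-6 floor, matching the `1.5e-5 → 1e-6 (cosine, 30-step warmup)` row.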
Stage 1 output: reaperdoesntknow/Qwen3-1.7B-STEM-Proof-Distilled — a 1.7B model with a STEM reasoning backbone that produces structured, step-by-step derivations.
Stage 2: Supervised Fine-Tuning (Legal Domain + Instruction Following)
The distilled model from Stage 1 was then fine-tuned on Alignment-Lab-AI/Lawyer-Instruct using TRL's SFTTrainer.
Why legal after STEM? Legal reasoning shares structural DNA with mathematical reasoning: premise identification, logical chaining, handling of exceptions and edge cases, structured argumentation toward a conclusion. A model that learned to produce rigorous derivations in Stage 1 transfers that structure to legal analysis in Stage 2 — it doesn't just memorize legal templates, it applies the reasoning patterns it already learned to a new domain.
Data: Alignment-Lab-AI/Lawyer-Instruct — instruction/output pairs covering legal concepts, case analysis, statutory interpretation, and legal reasoning. Split 95/5 train/eval.
Training format:
```
### Instruction:
{instruction}
### Response:
{output}
```
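The template above is simple enough to build with one hypothetical helper (not part of the released training code). Passing `output=None` yields an inference prompt that ends at `### Response:`, ready for generation:

```python
def format_example(instruction, output=None):
    """Build the Stage 2 Alpaca-style prompt.

    With output=None, returns an inference prompt ending at '### Response:'.
    With an output string, returns a full training example.
    """
    prompt = f"### Instruction:\n{instruction}\n### Response:\n"
    if output is not None:
        prompt += output
    return prompt

print(format_example("Define consideration in contract law.")
      .endswith("### Response:\n"))  # True
```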
Stage 2 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Effective batch size | 8 (2 × 4 gradient accumulation) |
| Optimizer | AdamW (weight decay 0.01) |
| Learning rate | 5e-6 (cosine, 30-step warmup) |
| Gradient clipping | 1.0 |
| Max sequence length | 1024 |
| Gradient checkpointing | Enabled |
| Precision | bf16 |
| Eval strategy | Every 200 steps |
| Best model selection | Lowest eval loss |
Key design choice: The Stage 2 learning rate (5e-6) is deliberately lower than Stage 1 (1.5e-5). The reasoning backbone from distillation is the foundation — SFT should layer new domain knowledge on top without destabilizing the structured reasoning patterns established in Stage 1. Too high a learning rate here would overwrite the distillation gains.
Model Details
| Attribute | Value |
|---|---|
| Architecture | Qwen3 (causal LM, RoPE, GQA) |
| Parameters | 2,031M (1.7B advertised) |
| Base model | Qwen/Qwen3-1.7B |
| Teacher model | Qwen/Qwen3-30B-A3B-Instruct-2507 |
| Stage 1 data | 6,122 STEM CoT samples (12 datasets) |
| Stage 2 data | Alignment-Lab-AI/Lawyer-Instruct |
| Context length | 1024 tokens (training) |
| Precision | bf16 |
| License | Apache 2.0 |
| Developer | reaperdoesntknow / Convergent Intelligence LLC: Research Division |
Usage
Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Legal instruction-following (Stage 2 format)
prompt = """### Instruction:
Explain the difference between negligence and strict liability in tort law, and provide an example of each.
### Response:
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# STEM derivation (Stage 1 format — still works)
prompt = """Solve the following problem carefully and show a rigorous derivation.
Problem:
Find the general solution to the differential equation y'' - 3y' + 2y = 0.
Proof:
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
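For reference, the expected derivation for the sample ODE: the characteristic equation r² − 3r + 2 = 0 has roots r = 1 and r = 2, giving the general solution y = C₁eˣ + C₂e²ˣ. A quick stdlib finite-difference check that one member of that family satisfies the equation (illustrative only, for verifying model output by hand):

```python
import math

def residual(y, x, h=1e-5):
    """Central-difference estimate of y'' - 3y' + 2y at x."""
    d1 = (y(x + h) - y(x - h)) / (2 * h)
    d2 = (y(x + h) - 2 * y(x) + y(x - h)) / (h * h)
    return d2 - 3 * d1 + 2 * y(x)

# y = 2e^x - e^{2x}: one member of the family C1*e^x + C2*e^{2x}
y = lambda x: 2 * math.exp(x) - math.exp(2 * x)
print(abs(residual(y, 0.7)) < 1e-4)  # True
```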
GGUF
Quantized versions for local/edge deployment are available at reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF.
Prompt Formats
This model responds to two formats from its two training stages:
STEM derivation (Stage 1 — distillation):

```
Solve the following problem carefully and show a rigorous derivation.
Problem:
[Your math/physics/engineering problem]
Proof:
```

Instruction-following (Stage 2 — SFT):

```
### Instruction:
[Your question or task]
### Response:
```
Both formats are active. Use the STEM format for derivation-heavy problems. Use the instruction format for general questions, legal reasoning, and tasks that benefit from instruction-following behavior.
Intended Uses
Good for: Structured legal reasoning, STEM problem solving with step-by-step derivation, instruction-following across technical domains, educational tutoring, proof drafting, edge/mobile deployment via GGUF, component in multi-model pipelines, and retrieval-augmented workflows requiring lightweight reasoning capability.
Not for: Formal proof verification (use Lean/Coq/Isabelle), actual legal counsel (consult a licensed attorney), safety-critical engineering analysis, medical advice, or sole authority in high-stakes settings.
Limitations
This is a 1.7B model. It can produce fluent but incorrect reasoning in both STEM and legal domains — always verify critical outputs independently. It may overgeneralize from training templates, confuse thoroughness with accuracy, or perform unevenly across domains. The 1024 token training context limits performance on long derivations or complex multi-part legal questions. Legal knowledge reflects the training data and should not be treated as current or comprehensive legal guidance.
Stage 2 SFT may partially overwrite some Stage 1 capabilities on underrepresented STEM domains (molecular biology, physiology). Core reasoning structure on well-represented domains (physics, differential equations, linear algebra) should be preserved due to the conservative learning rate.
Mathematical Foundations: Discrepancy Calculus (DISC)
This model is part of a distillation chain built on Discrepancy Calculus — a measure-theoretic framework where the teacher's output distribution is decomposed via the Mesh Fundamental Identity into smooth (AC), jump, and Cantor components. The discrepancy operator $Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|} dt$ quantifies local structural mismatch that standard KL divergence averages away.
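Discrepancy Calculus is the authors' framework; as a purely illustrative numeric toy (not part of the released code), the displayed operator can be approximated with a Riemann sum at small ε. For differentiable f the integrand tends to |f′(x)|, so the approximation should recover the derivative's magnitude:

```python
def discrepancy(f, x, eps=1e-4, n=1000):
    """Riemann-sum approximation of (1/eps) * ∫_x^{x+eps} |f(t)-f(x)|/|t-x| dt."""
    total = 0.0
    dt = eps / n
    for k in range(1, n + 1):
        t = x + k * dt
        total += abs(f(t) - f(x)) / abs(t - x) * dt
    return total / eps

# For smooth f the integrand tends to |f'(x)|: here f(u) = u^2, so Df(1) = 2.
print(round(discrepancy(lambda u: u * u, 1.0), 3))  # 2.0
```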
Full theory: "On the Formal Analysis of Discrepancy Calculus" (Colca, 2026; Convergent Intelligence LLC: Research Division). Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165).
Related Models
| Model | Description |
|---|---|
| Qwen3-1.7B-STEM-Proof-Distilled | Stage 1 only — pure STEM reasoning backbone without legal SFT |
| Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF | This model quantized to Q4_K_M / Q5_K_M / Q8_0 / F16 for llama.cpp |
Citation
```bibtex
@misc{colca2026distilledsft,
  title={Two-Stage Reasoning Transfer: STEM Knowledge Distillation + Legal SFT
         for Lightweight Structured Reasoning},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B-SFT},
  note={Convergent Intelligence LLC: Research Division}
}
```
Acknowledgments
STEM CoT training data from 0xZee. Legal instruction data from Alignment-Lab-AI. Base architecture from Qwen. Distillation and training methodology developed as part of Convergent Intelligence LLC's research program on structured reasoning transfer.
"Where classical analysis fails to see, we begin." — Convergent Intelligence LLC: Research Division