# Qwen3-1.7B STEM Proof Distilled (DISC v3)
A 1.7B-parameter causal language model distilled from Qwen3-30B-A3B on 6,122 STEM chain-of-thought samples using discrepancy-informed knowledge distillation. The training pipeline emphasizes proof structure, detects reasoning pivot tokens through token-level divergence dynamics, smooths high-entropy student singularities before distillation, and monitors structural drift through discrepancy energy.
> "Structure beats scale, collaboration beats hierarchy, observation beats theory."
> — Convergent Intelligence LLC: Research Division
## What Makes This Different
Standard knowledge distillation treats all tokens uniformly. Even proof-weighted approaches typically apply a static multiplier over the entire derivation span. That helps, but it still misses the internal structure of reasoning: some regions are smooth procedural continuation, while others are pivots where the derivation changes technique, introduces a key lemma, performs a non-obvious transformation, or closes a conceptual gap.
This model was trained with three discrepancy-informed operators applied directly to the training dynamics:
**Discrepancy-Weighted KD via Token-Level KL Structure.** The per-token KL divergence between teacher and student is treated as a sequence. Its discrete discrepancy operator identifies sharp local changes in divergence, corresponding to reasoning pivots. These jump-like tokens receive amplified KD weight automatically, without manual annotation.

**DG-Limit Smoothing for High-Entropy Student Tokens.** At tokens where the student's entropy is unusually high, indicating an unstable or incoherent local representation, student logits are replaced by a neighborhood average before KD is computed. This stabilizes gradient flow at token-level singularities.

**Gap Energy Monitoring and Regularization.** Discrepancy energy tracks structural divergence across the sequence independently of average token loss. If average loss improves while discrepancy energy rises, the model may be learning smooth, easy tokens while degrading on hard reasoning transitions. This signal is logged throughout training and also enters the loss as a small regularizer.
On top of this, proof-weighted cross-entropy emphasizes derivation quality over answer formatting, with proof emphasis decaying from 2.5× to 1.5× over training.
## Model Details
| Attribute | Value |
|---|---|
| Architecture | Qwen3 causal language model |
| Parameters | ~2,031M |
| Base model | Qwen/Qwen3-1.7B |
| Teacher model | Qwen/Qwen3-30B-A3B-Instruct-2507 |
| Training context length | 1024 tokens |
| Precision | bf16 |
| License | Apache 2.0 |
| Developer | Reaperdoesntrun / Convergent Intelligence LLC: Research Division |
## Training

### Methodology: Discrepancy-Informed Knowledge Distillation
The training objective combines three components:
#### 1. Proof-Weighted Cross-Entropy

Standard autoregressive next-token prediction is applied over the full target sequence, but tokens inside the derivation span are given higher weight than the surrounding prompt and answer tokens. The proof region is identified from `Proof:` to `Final Answer:` and mapped to token spans through the tokenizer. Proof emphasis decays linearly from 2.5× to 1.5× over the course of training.
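As a minimal sketch of this component (assuming a precomputed boolean `proof_mask`; span extraction and prompt masking are omitted, and `proof_weighted_ce` is an illustrative name, not the repository's actual function):

```python
import torch
import torch.nn.functional as F

def proof_weighted_ce(logits, targets, proof_mask, step, total_steps,
                      w_start=2.5, w_end=1.5):
    """Cross-entropy with extra weight on derivation-span tokens.

    proof_mask is True for tokens between "Proof:" and "Final Answer:".
    The proof weight decays linearly from w_start to w_end over training
    (2.5x -> 1.5x in the model card).
    """
    w = w_start + (w_end - w_start) * (step / max(total_steps, 1))
    per_tok = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        reduction="none",
    ).view(targets.shape)
    weights = torch.ones_like(per_tok)
    weights[proof_mask] = w                      # amplified derivation tokens
    return (per_tok * weights).sum() / weights.sum()
```

With an all-False mask this reduces to plain mean cross-entropy, which makes the weighting easy to sanity-check.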
#### 2. Discrepancy-Weighted Knowledge Distillation
Teacher-student KL divergence is computed tokenwise across the sequence. The discrete discrepancy operator is then applied to that KL sequence:
- Compute per-token KL divergence between student and teacher
- Compute local discrepancy magnitude along the token axis
- Classify tokens with unusually large discrepancy jumps as reasoning pivots
- Assign amplified KD weight to those tokens
- Keep smooth disagreement tokens at standard weight
This allows the student to spend more learning capacity on structural transitions rather than only average behavior.
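The steps above can be sketched as follows. The discrepancy signal is approximated here by the absolute forward difference of the KL sequence; the mean + 2σ threshold and the 3.0× amplifier come from the hyperparameter table below, but `discrepancy_weighted_kd` and its exact mechanics are this sketch's assumptions:

```python
import torch
import torch.nn.functional as F

def discrepancy_weighted_kd(student_logits, teacher_logits, T=2.0,
                            jump_amp=3.0, n_sigma=2.0):
    """Per-token KL between T-softened distributions, with amplified
    weight on tokens whose KL jumps sharply along the sequence."""
    log_s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    kl = (t * (t.clamp_min(1e-9).log() - log_s)).sum(-1)   # [B, L]
    d = torch.zeros_like(kl)
    d[:, 1:] = (kl[:, 1:] - kl[:, :-1]).abs()              # discrete discrepancy proxy
    thresh = d.mean() + n_sigma * d.std()
    weights = torch.ones_like(kl)
    weights[d > thresh] = jump_amp                         # reasoning pivots
    return (weights * kl).sum() / weights.sum() * (T * T)
```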
#### 3. Gap Energy Regularization
Discrepancy energy is computed from the squared discrepancy signal across valid tokens. It is logged as a monitoring signal and also contributes a small additive regularization term to the total loss, helping discourage structural degradation at reasoning pivots even when mean loss falls.
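A minimal sketch, taking the discrepancy signal to be the discrete forward difference of the per-token KL sequence (that choice, and the name `gap_energy`, are assumptions of this illustration):

```python
import torch

def gap_energy(kl_per_token, valid_mask):
    """E_disc = 0.5 * mean(Df^2) over valid tokens, where Df is the
    discrete forward difference of the per-token KL sequence."""
    df = torch.zeros_like(kl_per_token)
    df[:, 1:] = kl_per_token[:, 1:] - kl_per_token[:, :-1]
    df = df * valid_mask                         # ignore padding positions
    return 0.5 * (df ** 2).sum() / valid_mask.sum().clamp_min(1)
```

A perfectly flat KL sequence has zero gap energy regardless of its level, which is exactly why this term is independent of average token loss.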
#### Combined Objective

The full objective is:

L = α_ce · CE_weighted + α_kd · KD_disc + λ · E_disc

with:

- α_ce = 0.55
- α_kd = 0.45
- λ = 0.02
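Written directly in code with the card's coefficients (the three component losses are assumed to be computed as in the sections above; `total_loss` is an illustrative name):

```python
def total_loss(ce_weighted, kd_disc, e_disc,
               alpha_ce=0.55, alpha_kd=0.45, lam=0.02):
    """Weighted sum of proof-weighted CE, discrepancy-weighted KD,
    and the gap-energy regularizer (coefficients from the model card)."""
    return alpha_ce * ce_weighted + alpha_kd * kd_disc + lam * e_disc
```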
### Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Total samples | 6,122 |
| Train samples | 5,815 |
| Eval samples | 307 |
| Batch size | 1 |
| Effective batch size | 8 (via gradient accumulation) |
| Gradient accumulation | 8 |
| Optimizer | AdamW |
| Weight decay | 0.01 |
| Learning rate | 1.5e-5 |
| Minimum learning rate | 1e-6 |
| Scheduler | cosine decay with 30 warmup steps |
| Gradient clipping | 1.0 |
| Distillation temperature | 2.0 |
| Loss weights (CE / KD / E_disc) | 0.55 / 0.45 / 0.02 |
| Proof weight schedule | 2.5 → 1.5 |
| Jump amplifier | 3.0× |
| Jump threshold | mean + 2σ over discrepancy signal |
| DG smoothing window | 3 tokens |
| DG entropy threshold | mean + 1σ |
| Precision | bf16 autocast |
## Dataset

The model was trained on 6,122 STEM chain-of-thought samples merged from 10 domain-specific datasets:

| Domain | Samples |
|---|---|
| Physics | 2,254 |
| Linear Algebra | 667 |
| Differential Equations | 636 |
| Electromagnetism | 580 |
| Mathematics | 576 |
| Engineering | 574 |
| Classical Mechanics | 343 |
| Theoretical Mechanics | 307 |
| Physiology | 114 |
| Molecular Biology | 71 |
All datasets were sourced from 0xZee, merged, shuffled with seed 42, and split 95/5 into train and evaluation partitions.
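The shuffle-and-split step can be sketched as follows. The exact split mechanics are an assumption (the card reports 5,815 train / 307 eval from 6,122 samples), and `split_dataset` is an illustrative name:

```python
import random

def split_dataset(samples, seed=42, train_frac=0.95):
    """Shuffle with the card's seed, then take a 95/5 train/eval split."""
    rng = random.Random(seed)
    shuffled = samples[:]          # avoid mutating the caller's list
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]
```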
### Training Format

Each sample was formatted as:

```
Solve the following problem carefully and show a rigorous derivation.
Problem:
{question}
Proof:
{CoT}
Final Answer:
{response}
```
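A minimal helper for rendering this template might look like the following; `format_sample` is an illustrative name and the exact whitespace between sections is an assumption:

```python
def format_sample(question, cot, response):
    """Render one training sample in the card's Problem/Proof/Answer template."""
    return (
        "Solve the following problem carefully and show a rigorous derivation.\n"
        f"Problem:\n{question}\n"
        f"Proof:\n{cot}\n"
        f"Final Answer:\n{response}"
    )
```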
## Usage

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntrun/Qwen3-1.7B-STEM-Proof-Distilled"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

prompt = """Solve the following problem carefully and show a rigorous derivation.
Problem:
Prove that if f''(x) + f(x) = 0 for all x, then f(x) = A cos(x) + B sin(x) for some constants A, B.
Proof:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,  # greedy decoding; temperature is ignored when sampling is off
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Planned GGUF Export

GGUF export for llama.cpp-style deployment is planned. Recommended targets include:

| Quantization | Typical Use |
|---|---|
| Q4_K_M | Edge/mobile inference |
| Q5_K_M | Balanced quality/size |
| Q8_0 | Higher-fidelity desktop inference |
| F16 | Reference export |

Example llama.cpp invocation:

```bash
./llama-cli -m qwen3-1.7b-stem-proof.gguf \
  -p "Solve the following problem carefully and show a rigorous derivation.\n\nProblem:\nFind the eigenvalues of [[2,1],[1,2]].\n\nProof:\n" \
  -n 512 --temp 0.0
```
## Prompt Format

For best results, use the same structure as training:

```
Solve the following problem carefully and show a rigorous derivation.
Problem:
[Your problem here]
Proof:
```
## Intended Uses
This model is well-suited to:
• mathematical derivations and worked solutions
• proof-style explanation
• physics and engineering problem solving
• educational tutoring and STEM walkthroughs
• lightweight reasoning deployment where a much larger teacher would be too expensive
• generator components in verifier-generator or retrieval-augmented reasoning systems
## Out-of-Scope Uses
This model is not a substitute for:
• formal proof verification
• symbolic theorem proving
• safety-critical engineering review
• medical or legal advice
• sole-authority decision support in high-stakes settings
## Limitations
The model can still produce fluent but invalid derivations, omit assumptions, overgeneralize familiar proof templates, or confuse rigor with verbosity. Domain balance is uneven: physics, linear algebra, differential equations, and engineering are more represented than physiology and molecular biology. The 1024-token training context also limits performance on very long derivations.
This is a reasoning-oriented language model, not a symbolic algebra engine or formal verifier.
## Technical Deep Dive

### Discrepancy-Weighted KD
Let the per-token KL divergence between student and teacher define a sequence over token position. The discrete discrepancy operator is applied to that sequence to identify local jumps. Smooth tokens represent ordinary teacher-student disagreement. Jump tokens represent structural transitions where the student and teacher diverge sharply. These regions receive amplified KD weight.
Conceptually, this separates reasoning into:
• smooth regions, where the student tracks the teacher locally
• pivot regions, where the proof changes direction or introduces a critical inference
• singular/confused regions, where local student uncertainty is too high for stable pointwise KD
### DG-Limit Smoothing
At tokens where the student’s entropy exceeds its local sequence baseline, logits are replaced by a local neighborhood average before distillation. This acts as a stabilization operator for singular high-uncertainty regions and prevents noisy tokenwise KD from dominating the gradient where the student has not yet formed a coherent local representation.
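A sketch of this operator, assuming the mean + 1σ entropy threshold and 3-token window from the hyperparameter table; averaging logits via 1-D pooling over the token axis is this sketch's choice, not a confirmed implementation detail, and `dg_smooth_logits` is an illustrative name:

```python
import torch
import torch.nn.functional as F

def dg_smooth_logits(student_logits, window=3, n_sigma=1.0):
    """Replace logits at high-entropy tokens with a local token-axis average."""
    probs = F.softmax(student_logits, dim=-1)
    ent = -(probs * probs.clamp_min(1e-9).log()).sum(-1)          # [B, L]
    thresh = ent.mean(dim=-1, keepdim=True) + n_sigma * ent.std(dim=-1, keepdim=True)
    # moving average of logits over neighboring tokens
    pad = window // 2
    x = student_logits.transpose(1, 2)                            # [B, V, L]
    smoothed = F.avg_pool1d(F.pad(x, (pad, pad), mode="replicate"),
                            kernel_size=window, stride=1).transpose(1, 2)
    mask = (ent > thresh).unsqueeze(-1)                           # singular tokens only
    return torch.where(mask, smoothed, student_logits)
```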
### Gap Energy
Gap energy is computed from the squared discrepancy signal across valid tokens:
E_disc = 0.5 × mean(Df²)
This serves two roles: it is a logged diagnostic for structural drift, and it contributes a small additive regularizer to the loss. It helps expose a failure mode in which average loss improves while structural reasoning transitions degrade.
### Why Temperature = 2.0
A higher KD temperature exposes more of the teacher’s uncertainty structure rather than only the argmax token path. In STEM reasoning, where multiple valid derivational continuations may exist, this helps transfer alternative local proof preferences and not just hard next-token imitation.
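The temperature-scaled KD term itself is standard; a minimal sketch (with the usual T² factor so gradient scale stays comparable across temperatures; `kd_kl_at_temperature` is an illustrative name):

```python
import torch
import torch.nn.functional as F

def kd_kl_at_temperature(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on T-softened distributions, scaled by T^2."""
    log_s = F.log_softmax(student_logits / T, dim=-1)
    log_t = F.log_softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_s, log_t, log_target=True,
                    reduction="batchmean") * (T * T)
```

Dividing logits by T > 1 flattens the teacher distribution, so probability mass on plausible alternative continuations, not just the argmax token, contributes to the loss.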
### MoE Teacher → Dense Student
The teacher is a large mixture-of-experts model, while the student is a dense 1.7B model. Distillation transfers reasoning behavior from a high-capacity teacher into a cheaper deployment model without MoE routing overhead at inference time.
## Theoretical Foundation
The discrepancy-informed operators used in this training pipeline are motivated by the broader Discrepancy Calculus (DISC) framework developed within Convergent Intelligence LLC’s research program.
In this context, discrepancy is treated as meaningful structure rather than noise. Applied to teacher-student divergence, this perspective motivates separating smooth disagreement from sharp transition points and treating unstable local regions with averaging-based stabilization rather than purely pointwise supervision.
## Citation

```bibtex
@misc{colca2026discstemdistilled,
  title     = {Qwen3-1.7B STEM Proof Distilled (DISC v3)},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/reaperdoesntrun/Qwen3-1.7B-STEM-Proof-Distilled},
  note      = {Convergent Intelligence LLC: Research Division}
}
```
## Acknowledgments
Training data from 0xZee’s STEM CoT dataset collection. Base architecture from Qwen. Discrepancy-informed training methodology developed within Convergent Intelligence LLC’s research program.
---
Convergent Intelligence LLC: Research Division
“Where classical analysis fails to see, we begin.”
---
## Convergent Intelligence Portfolio
*Part of the [Qwen3 1.7B Distillation Series](https://huggingface.co/reaperdoesntknow) by [Convergent Intelligence LLC: Research Division](https://huggingface.co/reaperdoesntknow)*
## Mathematical Foundations: Discrepancy Calculus (DISC)
This model is part of a distillation chain built on Discrepancy Calculus — a measure-theoretic framework where the teacher's output distribution is decomposed via the Mesh Fundamental Identity into smooth (AC), jump, and Cantor components. The discrepancy operator $Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|} dt$ quantifies local structural mismatch that standard KL divergence averages away.
Full theory: *"On the Formal Analysis of Discrepancy Calculus"* (Colca, 2026; Convergent Intelligence LLC: Research Division). Full methodology: [Structure Over Scale (DOI: 10.57967/hf/8165)](https://doi.org/10.57967/hf/8165).
## Related Models
| Model | Downloads | Format |
|-------|-----------|--------|
| [Qwen3-1.7B-Distilled-30B-A3B-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B-SFT) | 65 | HF |
| [Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF) | 175 | GGUF |
| [Qwen3-1.7B-Thinking-Distil](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Thinking-Distil) | 501 | HF |
### Top Models from Our Lab
| Model | Downloads |
|-------|-----------|
| [LFM2.5-1.2B-Distilled-SFT](https://huggingface.co/reaperdoesntknow/LFM2.5-1.2B-Distilled-SFT) | 342 |
| [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) | 302 |
| [Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF) | 203 |
| [Qwen3-1.7B-Coder-Distilled-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT-GGUF) | 194 |
| [SMOLM2Prover-GGUF](https://huggingface.co/reaperdoesntknow/SMOLM2Prover-GGUF) | 150 |
**Total Portfolio: 41 models | 2,781 total downloads**
*Last updated: 2026-03-28 12:56 UTC*
<!-- DISTILQWEN-SPOTLIGHT-START -->
## DistilQwen Collection
This model is part of the **[DistilQwen](https://huggingface.co/collections/reaperdoesntknow/distilqwen-69bf40ec669117e3f069ef1c)** proof-weighted distillation series.
Collection: **9 models** | **2,788 downloads**
### Teacher Variant Comparison
| Teacher | Student Size | Strength | Models |
|---------|-------------|----------|--------|
| Qwen3-30B-A3B (Instruct) | 1.7B | Instruction following, structured output, legal reasoning | 3 (833 DL) **← this model** |
| Qwen3-30B-A3B (Thinking) | 0.6B | Extended deliberation, higher-entropy distributions, proof derivation | 3 (779 DL) |
| Qwen3-30B-A3B (Coder) | 1.7B | Structured decomposition, STEM derivation, logical inference | 2 (825 DL) |
### Methodology
**The only BF16 collection in the portfolio.** While the broader Convergent Intelligence catalog (43 models, 12,000+ downloads) was trained on CPU at FP32 for $24 total compute, the DistilQwen series was trained on H100 at BF16 with a 30B-parameter teacher. Same methodology, premium hardware. This is what happens when you give the pipeline real compute.
All models use proof-weighted knowledge distillation: 55% cross-entropy with decaying proof weights (2.5× → 1.5×), 45% KL divergence at T=2.0. The proof weight amplifies loss on reasoning-critical tokens, forcing the student to allocate capacity to structural understanding rather than surface-level pattern matching.
Full methodology: [Structure Over Scale (DOI: 10.57967/hf/8165)](https://doi.org/10.57967/hf/8165)
### Related in this series
- [Qwen3-1.7B-Distilled-30B-A3B-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B-SFT) (252 downloads)
- [Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF) (289 downloads)
<!-- DISTILQWEN-SPOTLIGHT-END -->
<!-- cix-keeper-ts:2026-04-13T16:06:10Z -->
<!-- card-refresh: 2026-03-30 -->