# Qwen3-1.7B STEM Proof Distilled (DISC v3)
A 1.7B-parameter causal language model distilled from Qwen3-30B-A3B on 6,122 STEM chain-of-thought samples using discrepancy-informed knowledge distillation. The training pipeline emphasizes proof structure, detects reasoning pivot tokens through token-level divergence dynamics, smooths high-entropy student singularities before distillation, and monitors structural drift through discrepancy energy.
> "Structure beats scale, collaboration beats hierarchy, observation beats theory."
> — Convergent Intelligence LLC: Research Division
## What Makes This Different
Standard knowledge distillation treats all tokens uniformly. Even proof-weighted approaches typically apply a static multiplier over the entire derivation span. That helps, but it still misses the internal structure of reasoning: some regions are smooth procedural continuation, while others are pivots where the derivation changes technique, introduces a key lemma, performs a non-obvious transformation, or closes a conceptual gap.
This model was trained with three discrepancy-informed operators applied directly to the training dynamics:
**Discrepancy-Weighted KD via Token-Level KL Structure.** The per-token KL divergence between teacher and student is treated as a sequence. Its discrete discrepancy operator identifies sharp local changes in divergence, corresponding to reasoning pivots. These jump-like tokens receive amplified KD weight automatically, without manual annotation.

**DG-Limit Smoothing for High-Entropy Student Tokens.** At tokens where the student's entropy is unusually high, indicating an unstable or incoherent local representation, student logits are replaced by a neighborhood average before KD is computed. This stabilizes gradient flow at token-level singularities.

**Gap Energy Monitoring and Regularization.** Discrepancy energy tracks structural divergence across the sequence independently of average token loss. If average loss improves while discrepancy energy rises, the model may be learning smooth, easy tokens while degrading on hard reasoning transitions. This signal is logged throughout training and also enters the loss as a small regularizer.
On top of this, proof-weighted cross-entropy emphasizes derivation quality over answer formatting, with proof emphasis decaying from 2.5× to 1.5× over training.
## Model Details
| Attribute | Value |
|---|---|
| Architecture | Qwen3 causal language model |
| Parameters | ~2,031M |
| Base model | Qwen/Qwen3-1.7B |
| Teacher model | Qwen/Qwen3-30B-A3B-Instruct-2507 |
| Training context length | 1024 tokens |
| Precision | bf16 |
| License | Apache 2.0 |
| Developer | Reaperdoesntrun / Convergent Intelligence LLC: Research Division |
## Training

### Methodology: Discrepancy-Informed Knowledge Distillation
The training objective combines three components:
#### 1. Proof-Weighted Cross-Entropy

Standard autoregressive next-token prediction is applied over the full target sequence, but tokens inside the derivation span are given higher weight than the surrounding prompt and answer tokens. The proof region is identified from `Proof:` to `Final Answer:` and mapped to token spans through the tokenizer. Proof emphasis decays linearly from 2.5× to 1.5× over the course of training.
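As a minimal sketch of this component (assuming a precomputed boolean `proof_mask`; span extraction and prompt masking are omitted, and `proof_weighted_ce` is an illustrative name, not the repository's actual function):

```python
import torch
import torch.nn.functional as F

def proof_weighted_ce(logits, targets, proof_mask, step, total_steps,
                      w_start=2.5, w_end=1.5):
    """Cross-entropy with extra weight on derivation-span tokens.

    proof_mask is True for tokens between "Proof:" and "Final Answer:".
    The proof weight decays linearly from w_start to w_end over training
    (2.5x -> 1.5x in the model card).
    """
    w = w_start + (w_end - w_start) * (step / max(total_steps, 1))
    per_tok = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        reduction="none",
    ).view(targets.shape)
    weights = torch.ones_like(per_tok)
    weights[proof_mask] = w                      # amplified derivation tokens
    return (per_tok * weights).sum() / weights.sum()
```

With an all-False mask this reduces to plain mean cross-entropy, which makes the weighting easy to sanity-check.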
#### 2. Discrepancy-Weighted Knowledge Distillation
Teacher-student KL divergence is computed tokenwise across the sequence. The discrete discrepancy operator is then applied to that KL sequence:
- Compute per-token KL divergence between student and teacher
- Compute local discrepancy magnitude along the token axis
- Classify tokens with unusually large discrepancy jumps as reasoning pivots
- Assign amplified KD weight to those tokens
- Keep smooth disagreement tokens at standard weight
This allows the student to spend more learning capacity on structural transitions rather than only average behavior.
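The steps above can be sketched as follows. The discrepancy signal is approximated here by the absolute forward difference of the KL sequence; the mean + 2σ threshold and the 3.0× amplifier come from the hyperparameter table below, but `discrepancy_weighted_kd` and its exact mechanics are this sketch's assumptions:

```python
import torch
import torch.nn.functional as F

def discrepancy_weighted_kd(student_logits, teacher_logits, T=2.0,
                            jump_amp=3.0, n_sigma=2.0):
    """Per-token KL between T-softened distributions, with amplified
    weight on tokens whose KL jumps sharply along the sequence."""
    log_s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    kl = (t * (t.clamp_min(1e-9).log() - log_s)).sum(-1)   # [B, L]
    d = torch.zeros_like(kl)
    d[:, 1:] = (kl[:, 1:] - kl[:, :-1]).abs()              # discrete discrepancy proxy
    thresh = d.mean() + n_sigma * d.std()
    weights = torch.ones_like(kl)
    weights[d > thresh] = jump_amp                         # reasoning pivots
    return (weights * kl).sum() / weights.sum() * (T * T)
```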
#### 3. Gap Energy Regularization
Discrepancy energy is computed from the squared discrepancy signal across valid tokens. It is logged as a monitoring signal and also contributes a small additive regularization term to the total loss, helping discourage structural degradation at reasoning pivots even when mean loss falls.
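A minimal sketch, taking the discrepancy signal to be the discrete forward difference of the per-token KL sequence (that choice, and the name `gap_energy`, are assumptions of this illustration):

```python
import torch

def gap_energy(kl_per_token, valid_mask):
    """E_disc = 0.5 * mean(Df^2) over valid tokens, where Df is the
    discrete forward difference of the per-token KL sequence."""
    df = torch.zeros_like(kl_per_token)
    df[:, 1:] = kl_per_token[:, 1:] - kl_per_token[:, :-1]
    df = df * valid_mask                         # ignore padding positions
    return 0.5 * (df ** 2).sum() / valid_mask.sum().clamp_min(1)
```

A perfectly flat KL sequence has zero gap energy regardless of its level, which is exactly why this term is independent of average token loss.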
#### Combined Objective

The full objective is:

L = α_ce · CE_weighted + α_kd · KD_disc + λ · E_disc

with:

- α_ce = 0.55
- α_kd = 0.45
- λ = 0.02
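Written directly in code with the card's coefficients (the three component losses are assumed to be computed as in the sections above; `total_loss` is an illustrative name):

```python
def total_loss(ce_weighted, kd_disc, e_disc,
               alpha_ce=0.55, alpha_kd=0.45, lam=0.02):
    """Weighted sum of proof-weighted CE, discrepancy-weighted KD,
    and the gap-energy regularizer (coefficients from the model card)."""
    return alpha_ce * ce_weighted + alpha_kd * kd_disc + lam * e_disc
```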
### Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Total samples | 6,122 |
| Train samples | 5,815 |
| Eval samples | 307 |
| Batch size | 1 |
| Effective batch size | 8 (via gradient accumulation) |
| Gradient accumulation | 8 |
| Optimizer | AdamW |
| Weight decay | 0.01 |
| Learning rate | 1.5e-5 |
| Minimum learning rate | 1e-6 |
| Scheduler | cosine decay with 30 warmup steps |
| Gradient clipping | 1.0 |
| Distillation temperature | 2.0 |
| Loss weights (CE / KD / E_disc) | 0.55 / 0.45 / 0.02 |
| Proof weight schedule | 2.5 → 1.5 |
| Jump amplifier | 3.0× |
| Jump threshold | mean + 2σ over discrepancy signal |
| DG smoothing window | 3 tokens |
| DG entropy threshold | mean + 1σ |
| Precision | bf16 autocast |
## Dataset

The model was trained on 6,122 STEM chain-of-thought samples merged from 10 domain-specific datasets:

| Domain | Samples |
|---|---|
| Physics | 2,254 |
| Linear Algebra | 667 |
| Differential Equations | 636 |
| Electromagnetism | 580 |
| Mathematics | 576 |
| Engineering | 574 |
| Classical Mechanics | 343 |
| Theoretical Mechanics | 307 |
| Physiology | 114 |
| Molecular Biology | 71 |
All datasets were sourced from 0xZee, merged, shuffled with seed 42, and split 95/5 into train and evaluation partitions.
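The shuffle-and-split step can be sketched as follows. The exact split mechanics are an assumption (the card reports 5,815 train / 307 eval from 6,122 samples), and `split_dataset` is an illustrative name:

```python
import random

def split_dataset(samples, seed=42, train_frac=0.95):
    """Shuffle with the card's seed, then take a 95/5 train/eval split."""
    rng = random.Random(seed)
    shuffled = samples[:]          # avoid mutating the caller's list
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]
```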
### Training Format

Each sample was formatted as:

```
Solve the following problem carefully and show a rigorous derivation.
Problem:
{question}
Proof:
{CoT}
Final Answer:
{response}
```
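A minimal helper for rendering this template might look like the following; `format_sample` is an illustrative name and the exact whitespace between sections is an assumption:

```python
def format_sample(question, cot, response):
    """Render one training sample in the card's Problem/Proof/Answer template."""
    return (
        "Solve the following problem carefully and show a rigorous derivation.\n"
        f"Problem:\n{question}\n"
        f"Proof:\n{cot}\n"
        f"Final Answer:\n{response}"
    )
```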
## Usage

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntrun/Qwen3-1.7B-STEM-Proof-Distilled"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

prompt = """Solve the following problem carefully and show a rigorous derivation.
Problem:
Prove that if f''(x) + f(x) = 0 for all x, then f(x) = A cos(x) + B sin(x) for some constants A, B.
Proof:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,  # greedy decoding; temperature is ignored when sampling is off
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Planned GGUF Export

GGUF export for llama.cpp-style deployment is planned. Recommended targets include:

| Quantization | Typical Use |
|---|---|
| Q4_K_M | Edge/mobile inference |
| Q5_K_M | Balanced quality/size |
| Q8_0 | Higher-fidelity desktop inference |
| F16 | Reference export |

Example llama.cpp invocation:

```bash
./llama-cli -m qwen3-1.7b-stem-proof.gguf \
  -p "Solve the following problem carefully and show a rigorous derivation.\n\nProblem:\nFind the eigenvalues of [[2,1],[1,2]].\n\nProof:\n" \
  -n 512 --temp 0.0
```
## Prompt Format

For best results, use the same structure as training:

```
Solve the following problem carefully and show a rigorous derivation.
Problem:
[Your problem here]
Proof:
```
## Intended Uses
This model is well-suited to:
• mathematical derivations and worked solutions
• proof-style explanation
• physics and engineering problem solving
• educational tutoring and STEM walkthroughs
• lightweight reasoning deployment where a much larger teacher would be too expensive
• generator components in verifier-generator or retrieval-augmented reasoning systems
## Out-of-Scope Uses
This model is not a substitute for:
• formal proof verification
• symbolic theorem proving
• safety-critical engineering review
• medical or legal advice
• sole-authority decision support in high-stakes settings
## Limitations
The model can still produce fluent but invalid derivations, omit assumptions, overgeneralize familiar proof templates, or confuse rigor with verbosity. Domain balance is uneven: physics, linear algebra, differential equations, and engineering are more represented than physiology and molecular biology. The 1024-token training context also limits performance on very long derivations.
This is a reasoning-oriented language model, not a symbolic algebra engine or formal verifier.
## Technical Deep Dive

### Discrepancy-Weighted KD
Let the per-token KL divergence between student and teacher define a sequence over token position. The discrete discrepancy operator is applied to that sequence to identify local jumps. Smooth tokens represent ordinary teacher-student disagreement. Jump tokens represent structural transitions where the student and teacher diverge sharply. These regions receive amplified KD weight.
Conceptually, this separates reasoning into:
• smooth regions, where the student tracks the teacher locally
• pivot regions, where the proof changes direction or introduces a critical inference
• singular/confused regions, where local student uncertainty is too high for stable pointwise KD
### DG-Limit Smoothing
At tokens where the student’s entropy exceeds its local sequence baseline, logits are replaced by a local neighborhood average before distillation. This acts as a stabilization operator for singular high-uncertainty regions and prevents noisy tokenwise KD from dominating the gradient where the student has not yet formed a coherent local representation.
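A sketch of this operator, assuming the mean + 1σ entropy threshold and 3-token window from the hyperparameter table; averaging logits via 1-D pooling over the token axis is this sketch's choice, not a confirmed implementation detail, and `dg_smooth_logits` is an illustrative name:

```python
import torch
import torch.nn.functional as F

def dg_smooth_logits(student_logits, window=3, n_sigma=1.0):
    """Replace logits at high-entropy tokens with a local token-axis average."""
    probs = F.softmax(student_logits, dim=-1)
    ent = -(probs * probs.clamp_min(1e-9).log()).sum(-1)          # [B, L]
    thresh = ent.mean(dim=-1, keepdim=True) + n_sigma * ent.std(dim=-1, keepdim=True)
    # moving average of logits over neighboring tokens
    pad = window // 2
    x = student_logits.transpose(1, 2)                            # [B, V, L]
    smoothed = F.avg_pool1d(F.pad(x, (pad, pad), mode="replicate"),
                            kernel_size=window, stride=1).transpose(1, 2)
    mask = (ent > thresh).unsqueeze(-1)                           # singular tokens only
    return torch.where(mask, smoothed, student_logits)
```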
### Gap Energy
Gap energy is computed from the squared discrepancy signal across valid tokens:
E_disc = 0.5 × mean(Df²)
This serves two roles: it is a logged diagnostic for structural drift, and it contributes a small additive regularizer to the loss. It helps expose a failure mode in which average loss improves while structural reasoning transitions degrade.
### Why Temperature = 2.0
A higher KD temperature exposes more of the teacher’s uncertainty structure rather than only the argmax token path. In STEM reasoning, where multiple valid derivational continuations may exist, this helps transfer alternative local proof preferences and not just hard next-token imitation.
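The temperature-scaled KD term itself is standard; a minimal sketch (with the usual T² factor so gradient scale stays comparable across temperatures; `kd_kl_at_temperature` is an illustrative name):

```python
import torch
import torch.nn.functional as F

def kd_kl_at_temperature(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on T-softened distributions, scaled by T^2."""
    log_s = F.log_softmax(student_logits / T, dim=-1)
    log_t = F.log_softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_s, log_t, log_target=True,
                    reduction="batchmean") * (T * T)
```

Dividing logits by T > 1 flattens the teacher distribution, so probability mass on plausible alternative continuations, not just the argmax token, contributes to the loss.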
### MoE Teacher → Dense Student
The teacher is a large mixture-of-experts model, while the student is a dense 1.7B model. Distillation transfers reasoning behavior from a high-capacity teacher into a cheaper deployment model without MoE routing overhead at inference time.
## Theoretical Foundation
The discrepancy-informed operators used in this training pipeline are motivated by the broader Discrepancy Calculus (DISC) framework developed within Convergent Intelligence LLC’s research program.
In this context, discrepancy is treated as meaningful structure rather than noise. Applied to teacher-student divergence, this perspective motivates separating smooth disagreement from sharp transition points and treating unstable local regions with averaging-based stabilization rather than purely pointwise supervision.
## Citation

```bibtex
@misc{colca2026discstemdistilled,
  title     = {Qwen3-1.7B STEM Proof Distilled (DISC v3)},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/reaperdoesntrun/Qwen3-1.7B-STEM-Proof-Distilled},
  note      = {Convergent Intelligence LLC: Research Division}
}
```
## Acknowledgments
Training data from 0xZee’s STEM CoT dataset collection. Base architecture from Qwen. Discrepancy-informed training methodology developed within Convergent Intelligence LLC’s research program.
---
Convergent Intelligence LLC: Research Division
“Where classical analysis fails to see, we begin.”
---
## Convergent Intelligence Portfolio
*Part of the [Qwen3 1.7B Distillation Series](https://huggingface.co/reaperdoesntknow) by [Convergent Intelligence LLC: Research Division](https://huggingface.co/reaperdoesntknow)*
## Mathematical Foundations: Discrepancy Calculus (DISC)
This model is part of a distillation chain built on Discrepancy Calculus — a measure-theoretic framework where the teacher's output distribution is decomposed via the Mesh Fundamental Identity into smooth (AC), jump, and Cantor components. The discrepancy operator $Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|} dt$ quantifies local structural mismatch that standard KL divergence averages away.
Full theory: *"On the Formal Analysis of Discrepancy Calculus"* (Colca, 2026; Convergent Intelligence LLC: Research Division). Full methodology: [Structure Over Scale (DOI: 10.57967/hf/8165)](https://doi.org/10.57967/hf/8165).
## Related Models
| Model | Downloads | Format |
|-------|-----------|--------|
| [Qwen3-1.7B-Distilled-30B-A3B-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B-SFT) | 65 | HF |
| [Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF) | 175 | GGUF |
| [Qwen3-1.7B-Thinking-Distil](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Thinking-Distil) | 501 | HF |
### Top Models from Our Lab
| Model | Downloads |
|-------|-----------|
| [LFM2.5-1.2B-Distilled-SFT](https://huggingface.co/reaperdoesntknow/LFM2.5-1.2B-Distilled-SFT) | 342 |
| [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) | 302 |
| [Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF) | 203 |
| [Qwen3-1.7B-Coder-Distilled-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT-GGUF) | 194 |
| [SMOLM2Prover-GGUF](https://huggingface.co/reaperdoesntknow/SMOLM2Prover-GGUF) | 150 |
**Total Portfolio: 41 models | 2,781 total downloads**
*Last updated: 2026-03-28 12:56 UTC*
<!-- DISTILQWEN-SPOTLIGHT-START -->
## DistilQwen Collection
This model is part of the **[DistilQwen](https://huggingface.co/collections/reaperdoesntknow/distilqwen-69bf40ec669117e3f069ef1c)** proof-weighted distillation series.
Collection: **9 models** | **2,788 downloads**
### Teacher Variant Comparison
| Teacher | Student Size | Strength | Models |
|---------|-------------|----------|--------|
| Qwen3-30B-A3B (Instruct) | 1.7B | Instruction following, structured output, legal reasoning | 3 (833 DL) **← this model** |
| Qwen3-30B-A3B (Thinking) | 0.6B | Extended deliberation, higher-entropy distributions, proof derivation | 3 (779 DL) |
| Qwen3-30B-A3B (Coder) | 1.7B | Structured decomposition, STEM derivation, logical inference | 2 (825 DL) |
### Methodology
**The only BF16 collection in the portfolio.** While the broader Convergent Intelligence catalog (43 models, 12,000+ downloads) was trained on CPU at FP32 for $24 total compute, the DistilQwen series was trained on H100 at BF16 with a 30B-parameter teacher. Same methodology, premium hardware. This is what happens when you give the pipeline real compute.
All models use proof-weighted knowledge distillation: 55% cross-entropy with decaying proof weights (2.5× → 1.5×), 45% KL divergence at T=2.0. The proof weight amplifies loss on reasoning-critical tokens, forcing the student to allocate capacity to structural understanding rather than surface-level pattern matching.
Full methodology: [Structure Over Scale (DOI: 10.57967/hf/8165)](https://doi.org/10.57967/hf/8165)
### Related in this series
- [Qwen3-1.7B-Distilled-30B-A3B-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B-SFT) (252 downloads)
- [Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF) (289 downloads)
<!-- DISTILQWEN-SPOTLIGHT-END -->
<!-- cix-keeper-ts:2026-04-13T16:06:10Z -->
<!-- card-refresh: 2026-03-30 -->