What if a tiny model could learn not just what the right answer is, but how confident a much larger model is about every possible answer? That is the core idea behind knowledge distillation, and it is one of the most powerful tools we have for making small models punch above their weight.
The problem with hard labels
When you train a model normally, it learns from hard labels: the correct answer is class A, everything else is wrong. Simple. But that throws away a lot of useful information. When a large model predicts the next token, it does not just pick one winner. It produces a full probability distribution over the entire vocabulary. Maybe it gives 60% confidence to the word "learning", 15% to "understanding", 10% to "knowledge". Those soft probabilities carry structure that a hard label never could.
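To make the difference concrete, here is a minimal sketch that compares the hard label with the soft distribution from the example above. The `<other>` bucket (the remaining 15% of probability mass) and the use of entropy as a rough measure of "how much structure is there" are illustrative assumptions, not part of any particular training setup.

```python
import math

# Teacher probabilities from the example in the text. The "<other>" bucket is a
# hypothetical stand-in for the rest of the vocabulary so the values sum to 1.
teacher_probs = {"learning": 0.60, "understanding": 0.15, "knowledge": 0.10, "<other>": 0.15}

# The hard label keeps only the winning token and discards everything else.
hard_label = {tok: (1.0 if tok == "learning" else 0.0) for tok in teacher_probs}


def entropy_bits(dist):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


hard_bits = entropy_bits(hard_label)     # a one-hot target has zero entropy
soft_bits = entropy_bits(teacher_probs)  # the soft target spreads mass over alternatives

print(f"hard label: {hard_bits:.2f} bits, soft target: {soft_bits:.2f} bits")
```

The one-hot target says nothing about the alternatives, while the soft target tells the student that "understanding" and "knowledge" are plausible runners-up.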
Training on hard labels is like giving a student only the answer key. Knowledge distillation is like giving them the teacher's thought process too.
How it works
The setup has two models: a large pretrained teacher and a smaller student that we want to train. Instead of training the student only on ground truth labels, we also train it to match the teacher's output distribution. The student loss becomes a combination of two things: the standard cross-entropy against the real data, and a distillation loss against the teacher's soft outputs.
[Diagram: the teacher (GPT-2, Llama 3, etc.; billions of params) outputs P(token | context), the full distribution, which the student (Supra Mini, etc.; millions of params) is trained to match.]
The distillation loss is usually the KL divergence between the teacher and student distributions. A temperature parameter T softens both distributions: the logits are divided by T before the softmax, which helps the student learn from the relative probabilities of the non-top tokens rather than just the peak.
Loss = (1 - α) × CE(student, hard labels)
     + α × T² × KL(teacher_soft || student_soft)
T = temperature (typically 2.0 to 4.0)
α = weight of the distillation loss relative to the hard-label loss
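A minimal sketch of the combined loss for a single token position, in plain Python so it runs without a deep learning framework. The function names are hypothetical, and weighting the hard-label term by (1 - α) is an assumption: some implementations leave that term unweighted.

```python
import math


def softmax(logits, T=1.0):
    """Softmax over logits divided by temperature T (numerically stabilized)."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.5):
    """Combined hard-label and soft-target loss for one token position."""
    # Hard-label term: ordinary cross-entropy of the student at T = 1.
    ce = -math.log(softmax(student_logits)[label])
    # Soft term: KL between the temperature-softened teacher and student.
    kl = kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))
    # Assumption: the hard-label term is weighted by (1 - alpha).
    return (1 - alpha) * ce + alpha * T ** 2 * kl


# Hypothetical logits over a 3-token vocabulary; the true next token is index 0.
loss = distillation_loss([2.0, 1.0, 0.0], [3.0, 1.0, -1.0], label=0, T=2.0, alpha=0.5)
print(f"loss = {loss:.4f}")
```

In a real training loop the same computation would run over every position in the batch, with the teacher's logits computed under no-gradient mode so that only the student is updated.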
Why temperature matters
At T=1, the teacher's distribution is typically sharp: the top token gets most of the probability. At higher temperatures, the distribution flattens out and the student gets to see more signal about which tokens are "almost right". This is especially useful for rare tokens that the model almost never picks but that still carry meaningful relationships. The T² factor in the loss formula compensates for the scale change that temperature introduces: gradients of the softened KL term shrink roughly as 1/T², so multiplying by T² keeps the distillation term on a comparable scale to the hard-label term as T changes.
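The flattening effect is easy to see numerically. The sketch below applies a temperature-scaled softmax to a made-up set of teacher logits; the logit values are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits, T=1.0):
    """Softmax over logits divided by temperature T (numerically stabilized)."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for four candidate tokens.
teacher_logits = [5.0, 3.0, 2.0, 0.0]

for T in (1.0, 2.0, 4.0):
    probs = softmax(teacher_logits, T)
    print(f"T={T}: " + ", ".join(f"{p:.3f}" for p in probs))

p1 = softmax(teacher_logits, 1.0)
p4 = softmax(teacher_logits, 4.0)
```

With these logits the top token's probability falls from roughly 0.84 at T=1 to roughly 0.42 at T=4, while the ranking of the tokens stays the same. The student sees the same ordering, but the runners-up now carry enough mass to contribute to the loss.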
Types of distillation
There are a few different flavors worth knowing about:
- Output distillation (classic): student matches the teacher's final token probabilities. This is the original Hinton et al. approach and still the most common.
- Feature distillation: student also learns to match the teacher's internal hidden states, not just the output. More expensive but can transfer deeper representations.
- Sequence-level distillation: instead of token-level matching, the student learns from full sequences generated by the teacher. Works well for tasks like translation.
- Data-free distillation: no original training data needed. The teacher generates synthetic data that is used to train the student. Very useful when the original dataset is proprietary.
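As one illustration of how these flavors differ from output matching, here is a minimal sketch of a feature distillation loss. It assumes the student's hidden size is smaller than the teacher's, so a learned linear projection (here a hypothetical fixed `weight` matrix) maps the student's hidden state into the teacher's space before a mean squared error is taken. Real implementations vary in which layers they match and how they project.

```python
def project(hidden, weight):
    """Apply a linear map; weight has one row per teacher dimension."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def feature_loss(student_hidden, teacher_hidden, weight):
    """Mean squared error between the projected student state and the teacher state."""
    projected = project(student_hidden, weight)
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_hidden)]
    return sum(squared) / len(teacher_hidden)


# Hypothetical sizes: student hidden dim 2, teacher hidden dim 3.
weight = [[1.0, 0.0],
          [0.0, 1.0],
          [1.0, 1.0]]
student_hidden = [1.0, 2.0]

matched = feature_loss(student_hidden, [1.0, 2.0, 3.0], weight)     # projection hits the target
mismatched = feature_loss(student_hidden, [0.0, 0.0, 0.0], weight)  # projection misses the target
print(matched, mismatched)
```

In training, this term would be added to the output-level loss, and the projection weights would be learned together with the student.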
Real world results
The most famous example is DistilBERT (2019), which retained about 97% of BERT's language-understanding performance while being 40% smaller and 60% faster. More recently, distillation is a core part of how models like Phi-2 and Gemma achieve strong results at small scales despite training on relatively little data compared to frontier models.
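| Model | Params | Teacher | Performance relative to teacher |
|---|---|---|---|
| DistilBERT | 66M | BERT-base (110M) | ~97% on GLUE |
| DistilGPT-2 | 82M | GPT-2 small (124M) | perplexity 21.1 vs 16.3 on WikiText-103 (lower is better) |
| TinyLlama | 1.1B | none: trained from scratch with the Llama 2 architecture and tokenizer | n/a (not a distilled model) |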
What this means for Supra Mini
Right now, all our Supra Mini models are trained from scratch on raw text. Distillation is one of the experiments on our roadmap. At 8M parameters, the student has very limited capacity, so picking the right teacher and the right distillation targets is going to matter a lot. Too large a teacher and the capacity gap can become too wide for the student to bridge. Too much weight on the distillation loss and the student stops learning from the actual data.
It is a balancing act, and we are going to try it. If it works at our scale, the gains could be significant. A Supra Mini that learns from a Llama 3 teacher instead of raw text alone could be a very different model.
The catch
Distillation is not free. Running a teacher model forward pass for every training batch adds significant compute cost, especially if the teacher is large. For a project training on consumer hardware, that overhead is real. There are ways to pre-compute teacher logits and cache them, but that requires disk space proportional to the dataset size times the vocab size, which adds up fast at 5B tokens.
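A back-of-the-envelope estimate shows how fast this adds up. The sketch below assumes a GPT-2-sized vocabulary of 50,257 tokens and 16-bit values, neither of which is stated above, and it also estimates a top-k cache (keeping only the k largest logits plus their token indices per position). Top-k caching is a common workaround that truncates the teacher's distribution, not something described in this post.

```python
def full_logit_cache_bytes(num_tokens, vocab_size, bytes_per_value=2):
    """Storage for one value per vocabulary entry at every token position."""
    return num_tokens * vocab_size * bytes_per_value


def topk_logit_cache_bytes(num_tokens, k, bytes_per_value=2, bytes_per_index=4):
    """Storage when only the top-k values and their token indices are kept."""
    return num_tokens * k * (bytes_per_value + bytes_per_index)


NUM_TOKENS = 5_000_000_000  # 5B tokens, as in the text
VOCAB_SIZE = 50_257         # assumption: GPT-2 vocabulary size

full_bytes = full_logit_cache_bytes(NUM_TOKENS, VOCAB_SIZE)
topk_bytes = topk_logit_cache_bytes(NUM_TOKENS, k=32)

print(f"full cache:   {full_bytes / 1e12:.1f} TB")
print(f"top-32 cache: {topk_bytes / 1e12:.2f} TB")
```

Under these assumptions the full cache comes to about 500 TB, far beyond consumer hardware, while a top-32 cache is just under 1 TB. The trade-off is that the student no longer sees the tail of the teacher's distribution.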
The other catch: distillation only works as well as the teacher. If the teacher has biases or blind spots, the student inherits them. You are not getting GPT-4 quality by distilling into an 8M model. You are getting a more efficient version of whatever the teacher actually learned.
Final thought
Knowledge distillation is one of the most elegant ideas in deep learning. Instead of training a small model to memorize answers, you train it to think like a bigger one. For tiny models like ours, it might be the key to breaking through the ceiling that raw pretraining alone cannot reach.
SupraLabs