<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Knowledge Distillation | SupraLabs Blog</title>
<style>
:root {
--bg: #0f0f0f;
--surface: #1a1a1a;
--border: #333;
--text: #e0e0e0;
--accent: #536bfe;
--muted: #888;
--font-mono: 'JetBrains Mono', 'Fira Code', monospace;
}
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
background-color: var(--bg);
color: var(--text);
font-family: 'Inter', -apple-system, sans-serif;
line-height: 1.6;
padding: 2rem;
}
code, pre, .mono { font-family: var(--font-mono); }
.container { max-width: 900px; margin: 0 auto; }
header {
border-bottom: 2px solid var(--border);
padding-bottom: 2rem;
margin-bottom: 3rem;
display: flex;
justify-content: space-between;
align-items: flex-end;
}
.logo-area h1 {
font-size: 1.2rem;
text-transform: uppercase;
letter-spacing: 2px;
color: var(--accent);
line-height: 1;
display: flex;
align-items: center;
gap: 10px;
}
.logo-area a { text-decoration: none; color: inherit; }
.logo-area {
display: flex;
align-items: center;
gap: 10px;
font-weight: bold;
font-size: 1.2rem;
}
nav a {
color: var(--text);
text-decoration: none;
margin-left: 1.5rem;
font-size: 0.9rem;
border-bottom: 1px solid transparent;
}
nav a:hover { border-bottom: 1px solid var(--accent); }
.post-header { margin-bottom: 3rem; }
.post-header h2 {
font-size: 3rem;
line-height: 1.1;
margin-bottom: 1rem;
font-weight: 800;
}
.post-meta {
font-family: var(--font-mono);
color: var(--accent);
font-size: 0.9rem;
margin-bottom: 2rem;
}
.post-content {
background: var(--surface);
border: 1px solid var(--border);
padding: 3rem;
margin-bottom: 4rem;
}
.post-content h2 {
font-size: 1.8rem;
margin: 2.5rem 0 1rem 0;
color: var(--accent);
}
.post-content h2:first-child { margin-top: 0; }
.post-content p {
margin-bottom: 1.5rem;
font-size: 1.1rem;
color: var(--text);
}
.post-content ul {
margin-bottom: 1.5rem;
padding-left: 1.5rem;
}
.post-content li { margin-bottom: 0.5rem; font-size: 1.1rem; }
.post-content strong { color: #fff; }
.post-content code {
background: #111;
border: 1px solid var(--border);
padding: 2px 6px;
border-radius: 3px;
font-size: 0.95em;
color: var(--accent);
}
.callout {
border-left: 3px solid var(--accent);
background: #111;
padding: 1rem 1.5rem;
margin: 2rem 0;
font-family: var(--font-mono);
font-size: 0.95rem;
color: #ccc;
}
.callout span {
display: block;
color: var(--muted);
font-size: 0.8rem;
margin-bottom: 0.4rem;
}
.table-wrap { overflow-x: auto; margin: 2rem 0; }
table {
width: 100%;
border-collapse: collapse;
font-family: var(--font-mono);
font-size: 0.9rem;
}
th {
background: #111;
color: var(--accent);
padding: 0.75rem 1rem;
text-align: left;
border: 1px solid var(--border);
}
td {
padding: 0.7rem 1rem;
border: 1px solid var(--border);
color: var(--text);
}
tr:nth-child(even) td { background: #111; }
/* Diagram boxes */
.diagram {
display: flex;
align-items: center;
justify-content: center;
gap: 1rem;
margin: 2rem 0;
flex-wrap: wrap;
}
.diagram-box {
background: #111;
border: 1px solid var(--border);
padding: 1.2rem 1.8rem;
text-align: center;
font-family: var(--font-mono);
font-size: 0.85rem;
}
.diagram-box.teacher {
border-color: var(--accent);
color: var(--accent);
}
.diagram-box.student {
border-color: #888;
color: #ccc;
}
.diagram-box small {
display: block;
color: var(--muted);
font-size: 0.7rem;
margin-top: 0.3rem;
}
.diagram-arrow {
color: var(--accent);
font-size: 1.5rem;
font-family: var(--font-mono);
}
.tags { display: flex; gap: 0.5rem; margin-top: 2rem; flex-wrap: wrap; }
.tag {
font-family: var(--font-mono);
font-size: 0.7rem;
padding: 2px 8px;
border: 1px solid var(--border);
border-radius: 4px;
color: var(--muted);
}
footer {
margin-top: 6rem;
padding-bottom: 2rem;
font-size: 0.8rem;
color: var(--muted);
text-align: center;
}
@media (max-width: 600px) {
.post-header h2 { font-size: 2rem; }
.post-content { padding: 1.5rem; }
header { flex-direction: column; align-items: flex-start; gap: 1rem; }
nav a { margin-left: 0; margin-right: 1rem; }
.diagram { flex-direction: column; }
.diagram-arrow { transform: rotate(90deg); }
}
</style>
</head>
<body>
<div class="container">
<header>
<div class="logo-area" style="font-size: 1.5em;">
<a href="./index.html"><h1><img src="./image.png" alt="SupraLabs logo" style="height: 2em"> SupraLabs_</h1></a>
</div>
<nav>
<a href="./index.html#news">News</a>
<a href="https://huggingface.co/SupraLabs" target="_blank" rel="noopener">HuggingFace</a>
<a href="./index.html#hardware">Hardware</a>
</nav>
</header>
<article>
<div class="post-header">
<div class="post-meta">// 2026-05-19 | Research</div>
<h2>Knowledge Distillation:<br>teaching small models<br>to think big.</h2>
</div>
<div class="post-content">
<p>What if a tiny model could learn not just <em>what</em> the right answer is, but <em>how confident</em> a much larger model is about every possible answer? That is the core idea behind knowledge distillation, and it is one of the most powerful tools we have for making small models punch above their weight.</p>
<h2>The problem with hard labels</h2>
<p>When you train a model normally, it learns from hard labels: the correct answer is class A, everything else is wrong. Simple. But that throws away a lot of useful information. When a large model predicts the next token, it does not just pick one winner. It produces a full probability distribution over the entire vocabulary. Maybe it gives 60% confidence to the word "learning", 15% to "understanding", 10% to "knowledge". Those <strong>soft probabilities carry structure</strong> that a hard label never could.</p>
<p>Training on hard labels is like giving a student only the answer key. Knowledge distillation is like giving them the teacher's thought process too.</p>
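<p>One rough way to see how much a hard label discards is to compare the entropy of the two targets. The sketch below reuses the example probabilities from above; the fourth <code>other</code> bucket is an assumption added only so the distribution sums to 1, and entropy is just one possible yardstick.</p>
<pre>
```python
import math

# Example teacher probabilities for the next token (figures from the text above).
# Assumption: the remaining 15% is lumped into a single "other" bucket.
teacher_probs = {"learning": 0.60, "understanding": 0.15, "knowledge": 0.10, "other": 0.15}

# A hard label keeps only the winner and zeroes out everything else.
hard_label = {token: 1.0 if token == "learning" else 0.0 for token in teacher_probs}

def entropy_bits(dist):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in dist.values() if p != 0.0)

hard_bits = entropy_bits(hard_label)
soft_bits = entropy_bits(teacher_probs)
print(f"hard label entropy: {hard_bits:.3f} bits")
print(f"soft label entropy: {soft_bits:.3f} bits")
```
</pre>
<p>The hard label has zero entropy: once you know the winner, there is nothing else to learn from it. The soft target still says how the runner-up tokens rank against each other.</p>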
<h2>How it works</h2>
<p>The setup has two models: a large pretrained <strong>teacher</strong> and a smaller <strong>student</strong> that we want to train. Instead of training the student only on ground truth labels, we also train it to match the teacher's output distribution. The student loss becomes a combination of two things: the standard cross-entropy against the real data, and a distillation loss against the teacher's soft outputs.</p>
<div class="diagram">
<div class="diagram-box teacher">
Teacher model<br>
<small>GPT-2, Llama 3, etc.</small>
<small>billions of params</small>
</div>
<div class="diagram-arrow">&rarr;</div>
<div class="diagram-box">
soft labels<br>
<small>P(token | context)</small>
<small>full distribution</small>
</div>
<div class="diagram-arrow">&rarr;</div>
<div class="diagram-box student">
Student model<br>
<small>Supra Mini, etc.</small>
<small>millions of params</small>
</div>
</div>
<p>The distillation loss is usually the KL divergence between the teacher and student distributions. A temperature parameter <code>T</code> softens both distributions before the loss is computed: each model's logits are divided by <code>T</code> before the softmax. This helps the student focus on the relative ordering of probabilities rather than just the peak.</p>
<div class="callout">
<span>// distillation loss formula (simplified)</span>
L_total = (1 - α) × L_CE(student, ground_truth)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;+ α × T² × KL(teacher_soft || student_soft)<br>
<br>
T = temperature (typically 2.0 to 4.0)<br>
α = weight of distillation vs hard label loss
</div>
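<p>The formula can be sketched in plain Python for a single token position. This is a minimal illustration, not our training code: the function names, the four-token vocabulary, and the logit values are all made up for the example, and a real implementation would run batched on tensors.</p>
<pre>
```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max logit for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, target_index, alpha=0.5, temperature=2.0):
    """L_total = (1 - alpha) * CE + alpha * T^2 * KL(teacher_soft || student_soft)."""
    # Hard-label term: cross-entropy of the student (at T=1) against the true token.
    student_probs = softmax(student_logits)
    ce = -math.log(student_probs[target_index])

    # Soft-label term: KL divergence between the temperature-softened distributions.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher_soft, student_soft))

    return (1 - alpha) * ce + alpha * temperature ** 2 * kl

# Hypothetical logits over a four-token vocabulary; token 0 is the ground truth.
teacher_logits = [4.0, 2.5, 2.0, -1.0]
student_logits = [2.0, 1.0, 0.5, 0.0]
loss = distillation_loss(student_logits, teacher_logits, target_index=0)
print(f"total loss: {loss:.4f}")
```
</pre>
<p>Setting <code>alpha=0</code> reduces this to ordinary cross-entropy training, while <code>alpha=1</code> ignores the ground truth entirely and only imitates the teacher.</p>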
<h2>Why temperature matters</h2>
<p>At <code>T=1</code>, the teacher's distribution is usually sharp: the top token gets most of the probability. At higher temperatures, the distribution flattens out and the student gets to see more signal about which tokens are "almost right". This is especially useful for rare tokens that the model almost never picks but that still carry meaningful relationships. Softening comes at a price, though: the gradients of the soft-label term shrink roughly as 1/T², so the T² factor in the loss formula scales them back up and keeps the two loss terms balanced when the temperature changes.</p>
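<p>Both effects are easy to check numerically. The sketch below uses made-up logits for a four-token vocabulary and relies on the standard result that the gradient of the soft-label KL term with respect to student logit <code>i</code> is <code>(q_i - p_i) / T</code>, where <code>p</code> and <code>q</code> are the softened teacher and student distributions. Treat it as an illustration under those assumptions rather than a statement about any particular model.</p>
<pre>
```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max logit for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_grad(student_logits, teacher_logits, temperature):
    """Gradient of KL(teacher_soft || student_soft) with respect to each student logit."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return [(qi - pi) / temperature for pi, qi in zip(p, q)]

# Hypothetical logits over a four-token vocabulary.
teacher_logits = [4.0, 2.5, 2.0, -1.0]
student_logits = [2.0, 1.0, 0.5, 0.0]

top_probs = []
raw_grads = []
for t in (1.0, 2.0, 4.0):
    top = max(softmax(teacher_logits, t))
    raw = max(abs(g) for g in soft_grad(student_logits, teacher_logits, t))
    top_probs.append(top)
    raw_grads.append(raw)
    print(f"T={t:.0f}  top prob={top:.3f}  max|grad|={raw:.4f}  with T^2={raw * t * t:.4f}")
```
</pre>
<p>As <code>T</code> grows, the teacher's top probability drops and the raw gradient shrinks quickly. Multiplying by T² keeps the gradient on a comparable scale instead of letting the soft-label term fade away.</p>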
<h2>Types of distillation</h2>
<p>There are a few different flavors worth knowing about:</p>
<ul>
<li><strong>Output distillation (classic):</strong> student matches the teacher's final token probabilities. This is the original Hinton et al. approach and still the most common.</li>
<li><strong>Feature distillation:</strong> student also learns to match the teacher's internal hidden states, not just the output. More expensive but can transfer deeper representations.</li>
<li><strong>Sequence-level distillation:</strong> instead of token-level matching, the student learns from full sequences generated by the teacher. Works well for tasks like translation.</li>
<li><strong>Data-free distillation:</strong> no original training data needed. The teacher generates synthetic data that is used to train the student. Very useful when the original dataset is proprietary.</li>
</ul>
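<p>Feature distillation is the flavor that differs most from the loss shown earlier, so here is a minimal sketch of it. It assumes a mean-squared-error objective and a linear projection to bridge the width gap between student and teacher hidden states, which is a common choice but only one of several. The vectors and the projection matrix are hypothetical; in practice the projection would be learned alongside the student.</p>
<pre>
```python
def project(hidden, weights):
    """Linear map from student width to teacher width: one output per weight row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def feature_loss(student_hidden, teacher_hidden, weights):
    """Mean squared error between projected student features and teacher features."""
    projected = project(student_hidden, weights)
    squared = [(s - t) ** 2 for s, t in zip(projected, teacher_hidden)]
    return sum(squared) / len(teacher_hidden)

# Hypothetical hidden states: the student is 2 wide, the teacher is 3 wide.
student_hidden = [0.5, -1.0]
teacher_hidden = [1.0, 0.0, -2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3x2 projection, fixed here for illustration

feat = feature_loss(student_hidden, teacher_hidden, weights)
print(f"feature loss: {feat:.4f}")
```
</pre>
<p>This term is typically added to the output distillation loss with its own weight, not used as a replacement for it.</p>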
<h2>Real world results</h2>
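<p>The most famous example is DistilBERT (2019), which kept about 97% of BERT's language-understanding performance while being 40% smaller and 60% faster. More recently, distillation and closely related techniques have become a core part of small-model recipes: the smaller Gemma 2 models were trained with knowledge distillation from a larger teacher, and Phi-2 leans heavily on synthetic data generated by larger models. Both achieve strong results at small scales despite training on relatively little data compared to frontier models.</p>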
<div class="table-wrap">
<table>
<thead>
<tr>
<th>Model</th>
<th>Params</th>
<th>Teacher</th>
<th>Performance retained</th>
</tr>
</thead>
<tbody>
<tr><td>DistilBERT</td><td>66M</td><td>BERT (110M)</td><td>~97% on GLUE</td></tr>
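<tr><td>DistilGPT-2</td><td>82M</td><td>GPT-2 (124M)</td><td>WikiText-103 perplexity 21.1 vs. the teacher's 16.3</td></tr>
<tr><td>TinyLlama</td><td>1.1B</td><td>none (trained from scratch; reuses the Llama 2 architecture and tokenizer)</td><td>not distilled; listed for scale comparison</td></tr>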
</tbody>
</table>
</div>
<h2>What this means for Supra Mini</h2>
<p>Right now, all our Supra Mini models are trained from scratch on raw text. Distillation is one of the experiments on our roadmap. At 8M parameters, the student has very limited capacity, so picking the right teacher and the right distillation targets is going to matter a lot. Too large a teacher and the gap becomes impossible to bridge. Too much weight on the distillation loss and the student stops learning from the actual data.</p>
<p>It is a balancing act, and we are going to try it. If it works at our scale, the gains could be significant. <strong>A Supra Mini that learns from a Llama 3 teacher instead of raw text alone could be a very different model.</strong></p>
<h2>The catch</h2>
<p>Distillation is not free. Running a teacher model forward pass for every training batch adds significant compute cost, especially if the teacher is large. For a project training on consumer hardware, that overhead is real. There are ways to pre-compute teacher logits and cache them, but that requires disk space proportional to the dataset size times the vocab size, which adds up fast at 5B tokens.</p>
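<p>A quick back-of-the-envelope calculation shows how fast. The 5B token count comes from above; the vocabulary size, the fp16 storage format, and the top-k cutoff are assumptions picked for illustration, not Supra Mini's actual settings. Keeping only the top-k teacher logits per token is one commonly used way to shrink the cache, at the cost of discarding the tail of the distribution.</p>
<pre>
```python
TOKENS = 5_000_000_000   # dataset size mentioned above
VOCAB_SIZE = 32_000      # assumption: not the actual Supra Mini vocabulary
BYTES_FP16 = 2           # assumption: logits cached in half precision
TOP_K = 32               # assumption: illustrative cutoff
BYTES_INDEX = 2          # a uint16 token index is enough for a 32,000-entry vocabulary

# Full cache: one fp16 logit per vocabulary entry per token.
full_bytes = TOKENS * VOCAB_SIZE * BYTES_FP16

# Top-k cache: k (logit, token index) pairs per token.
topk_bytes = TOKENS * TOP_K * (BYTES_FP16 + BYTES_INDEX)

print(f"full logits : {full_bytes / 1e12:.0f} TB")   # 320 TB
print(f"top-{TOP_K} only : {topk_bytes / 1e12:.2f} TB")  # 0.64 TB
```
</pre>
<p>Under these assumptions the full cache is far out of reach for consumer hardware, while the truncated one fits on a single large drive.</p>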
<p>The other catch: distillation only works as well as the teacher. If the teacher has biases or blind spots, the student inherits them. You are not getting GPT-4 quality by distilling into an 8M model. You are getting a more efficient version of whatever the teacher actually learned.</p>
<h2>Final thought</h2>
<p><strong>Knowledge distillation is one of the most elegant ideas in deep learning. Instead of training a small model to memorize answers, you train it to think like a bigger one. For tiny models like ours, it might be the key to breaking through the ceiling that raw pretraining alone cannot reach.</strong></p>
<div class="tags">
<span class="tag">#distillation</span>
<span class="tag">#research</span>
<span class="tag">#tinyml</span>
<span class="tag">#training</span>
<span class="tag">#open-source</span>
<span class="tag">#supra-mini</span>
<span class="tag">#edge-ai</span>
</div>
</div>
</article>
<footer>
<p class="mono">&copy; 2026 SupraLabs // Built for the community.</p>
</footer>
</div>
</body>
</html>