<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Knowledge Distillation | SupraLabs Blog</title>
    <style>
        :root {
            --bg: #0f0f0f;
            --surface: #1a1a1a;
            --border: #333;
            --text: #e0e0e0;
            --accent: #536bfe;
            --muted: #888;
            --font-mono: 'JetBrains Mono', 'Fira Code', monospace;
        }
        * { margin: 0; padding: 0; box-sizing: border-box; }
        body {
            background-color: var(--bg);
            color: var(--text);
            font-family: 'Inter', -apple-system, sans-serif;
            line-height: 1.6;
            padding: 2rem;
        }
        code, pre, .mono { font-family: var(--font-mono); }
        .container { max-width: 900px; margin: 0 auto; }

        header {
            border-bottom: 2px solid var(--border);
            padding-bottom: 2rem;
            margin-bottom: 3rem;
            display: flex;
            justify-content: space-between;
            align-items: flex-end;
        }
        .logo-area h1 {
            font-size: 1.2rem;
            text-transform: uppercase;
            letter-spacing: 2px;
            color: var(--accent);
            line-height: 1;
            display: flex;
            align-items: center;
            gap: 10px;
        }
        .logo-area a { text-decoration: none; color: inherit; }
        .logo-area {
            display: flex;
            align-items: center;
            gap: 10px;
            font-weight: bold;
            font-size: 1.2rem;
        }
        nav a {
            color: var(--text);
            text-decoration: none;
            margin-left: 1.5rem;
            font-size: 0.9rem;
            border-bottom: 1px solid transparent;
        }
        nav a:hover { border-bottom: 1px solid var(--accent); }

        .post-header { margin-bottom: 3rem; }
        .post-header h2 {
            font-size: 3rem;
            line-height: 1.1;
            margin-bottom: 1rem;
            font-weight: 800;
        }
        .post-meta {
            font-family: var(--font-mono);
            color: var(--accent);
            font-size: 0.9rem;
            margin-bottom: 2rem;
        }
        .post-content {
            background: var(--surface);
            border: 1px solid var(--border);
            padding: 3rem;
            margin-bottom: 4rem;
        }
        .post-content h2 {
            font-size: 1.8rem;
            margin: 2.5rem 0 1rem 0;
            color: var(--accent);
        }
        .post-content h2:first-child { margin-top: 0; }
        .post-content p {
            margin-bottom: 1.5rem;
            font-size: 1.1rem;
            color: var(--text);
        }
        .post-content ul {
            margin-bottom: 1.5rem;
            padding-left: 1.5rem;
        }
        .post-content li { margin-bottom: 0.5rem; font-size: 1.1rem; }
        .post-content strong { color: #fff; }

        .post-content code {
            background: #111;
            border: 1px solid var(--border);
            padding: 2px 6px;
            border-radius: 3px;
            font-size: 0.95em;
            color: var(--accent);
        }

        .callout {
            border-left: 3px solid var(--accent);
            background: #111;
            padding: 1rem 1.5rem;
            margin: 2rem 0;
            font-family: var(--font-mono);
            font-size: 0.95rem;
            color: #ccc;
        }
        .callout span {
            display: block;
            color: var(--muted);
            font-size: 0.8rem;
            margin-bottom: 0.4rem;
        }

        .table-wrap { overflow-x: auto; margin: 2rem 0; }
        table {
            width: 100%;
            border-collapse: collapse;
            font-family: var(--font-mono);
            font-size: 0.9rem;
        }
        th {
            background: #111;
            color: var(--accent);
            padding: 0.75rem 1rem;
            text-align: left;
            border: 1px solid var(--border);
        }
        td {
            padding: 0.7rem 1rem;
            border: 1px solid var(--border);
            color: var(--text);
        }
        tr:nth-child(even) td { background: #111; }

        /* Diagram boxes */
        .diagram {
            display: flex;
            align-items: center;
            justify-content: center;
            gap: 1rem;
            margin: 2rem 0;
            flex-wrap: wrap;
        }
        .diagram-box {
            background: #111;
            border: 1px solid var(--border);
            padding: 1.2rem 1.8rem;
            text-align: center;
            font-family: var(--font-mono);
            font-size: 0.85rem;
        }
        .diagram-box.teacher {
            border-color: var(--accent);
            color: var(--accent);
        }
        .diagram-box.student {
            border-color: #888;
            color: #ccc;
        }
        .diagram-box small {
            display: block;
            color: var(--muted);
            font-size: 0.7rem;
            margin-top: 0.3rem;
        }
        .diagram-arrow {
            color: var(--accent);
            font-size: 1.5rem;
            font-family: var(--font-mono);
        }

        .tags { display: flex; gap: 0.5rem; margin-top: 2rem; flex-wrap: wrap; }
        .tag {
            font-family: var(--font-mono);
            font-size: 0.7rem;
            padding: 2px 8px;
            border: 1px solid var(--border);
            border-radius: 4px;
            color: var(--muted);
        }

        footer {
            margin-top: 6rem;
            padding-bottom: 2rem;
            font-size: 0.8rem;
            color: var(--muted);
            text-align: center;
        }

        @media (max-width: 600px) {
            .post-header h2 { font-size: 2rem; }
            .post-content { padding: 1.5rem; }
            header { flex-direction: column; align-items: flex-start; gap: 1rem; }
            nav a { margin-left: 0; margin-right: 1rem; }
            .diagram { flex-direction: column; }
            .diagram-arrow { transform: rotate(90deg); }
        }
    </style>
</head>
<body>

    <div class="container">
        <header>
            <div class="logo-area" style="font-size: 1.5em;">
                <a href="./index.html"><h1><img src="./image.png" alt="SupraLabs logo" style="height: 2em"> SupraLabs_</h1></a>
            </div>
            <nav>
                <a href="./index.html#news">News</a>
                <a href="https://huggingface.co/SupraLabs" target="_blank" rel="noopener">HuggingFace</a>
                <a href="./index.html#hardware">Hardware</a>
            </nav>
        </header>

        <article>
            <div class="post-header">
                <div class="post-meta">// 2026-05-19 | Research</div>
                <h2>Knowledge Distillation:<br>teaching small models<br>to think big.</h2>
            </div>

            <div class="post-content">

                <p>What if a tiny model could learn not just <em>what</em> the right answer is, but <em>how confident</em> a much larger model is about every possible answer? That is the core idea behind knowledge distillation, and it is one of the most powerful tools we have for making small models punch above their weight.</p>

                <h2>The problem with hard labels</h2>
                <p>When you train a model normally, it learns from hard labels: the correct answer is class A, everything else is wrong. Simple. But that throws away a lot of useful information. When a large model predicts the next token, it does not just pick one winner. It produces a full probability distribution over the entire vocabulary. Maybe it gives 60% confidence to the word "learning", 15% to "understanding", 10% to "knowledge". Those <strong>soft probabilities carry structure</strong> that a hard label never could.</p>
                <p>Training on hard labels is like giving a student only the answer key. Knowledge distillation is like giving them the teacher's thought process too.</p>
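                <p>A minimal sketch of that difference, in plain Python. The first three probabilities are the ones quoted above; the <code>(rest)</code> and <code>banana</code> entries and both student distributions are made up for illustration. Two students that agree on the top token get the same hard-label loss, but only the soft-target loss notices that one of them bets on an implausible word.</p>

```python
import math

# Toy next-token vocabulary. The first three probabilities are the ones quoted
# in the text; "(rest)" and "banana" are made-up entries so the row sums to 1.
vocab = ["learning", "understanding", "knowledge", "(rest)", "banana"]
teacher = [0.60, 0.15, 0.10, 0.14, 0.01]

# Hard label: only the index of the correct token survives.
hard_label = vocab.index("learning")


def hard_cross_entropy(student, label):
    """Cross-entropy against a one-hot target: -log p_student(label)."""
    return -math.log(student[label])


def soft_cross_entropy(student, target):
    """Cross-entropy against a full target distribution."""
    return -sum(t * math.log(s) for t, s in zip(target, student) if t > 0)


# Two hypothetical students that agree on the top token but not on the rest.
student_a = [0.60, 0.15, 0.10, 0.14, 0.01]  # spreads mass like the teacher
student_b = [0.60, 0.01, 0.01, 0.08, 0.30]  # bets on an implausible token

for name, student in (("A", student_a), ("B", student_b)):
    print(name,
          round(hard_cross_entropy(student, hard_label), 3),
          round(soft_cross_entropy(student, teacher), 3))
```

                <p>The first printed number is identical for both students, because the hard label only looks at the probability given to "learning". The second number is lower for student A, which is the extra signal the soft targets provide.</p>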

                <h2>How it works</h2>
                <p>The setup has two models: a large pretrained <strong>teacher</strong> and a smaller <strong>student</strong> that we want to train. Instead of training the student only on ground truth labels, we also train it to match the teacher's output distribution. The student loss becomes a combination of two things: the standard cross-entropy against the real data, and a distillation loss against the teacher's soft outputs.</p>

                <div class="diagram">
                    <div class="diagram-box teacher">
                        Teacher model<br>
                        <small>GPT-2, Llama 3, etc.</small>
                        <small>billions of params</small>
                    </div>
                    <div class="diagram-arrow">&rarr;</div>
                    <div class="diagram-box">
                        soft labels<br>
                        <small>P(token | context)</small>
                        <small>full distribution</small>
                    </div>
                    <div class="diagram-arrow">&rarr;</div>
                    <div class="diagram-box student">
                        Student model<br>
                        <small>Supra Mini, etc.</small>
                        <small>millions of params</small>
                    </div>
                </div>

                <p>The distillation loss is usually KL divergence between the teacher and student distributions. A temperature parameter <code>T</code> is applied to soften both distributions before computing the loss, which helps the student focus on the relative ordering of probabilities rather than just the peak.</p>

                <div class="callout">
                    <span>// distillation loss formula (simplified)</span>
                    L_total = (1 - α) × L_CE(student, ground_truth)<br>
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;+ α × T² × KL(teacher_soft || student_soft)<br>
                    <br>
                    T = temperature (typically 2.0 to 4.0)<br>
                    α = weight of distillation vs hard label loss
                </div>
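
                <p>The formula can be sketched for a single token position in plain Python. This is an illustration under assumptions, not our training code: the logits, the default <code>T=2.0</code> and <code>alpha=0.5</code>, and the function names are all hypothetical, and a real implementation would use a tensor library and average over a batch.</p>

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.5):
    """L_total = (1 - alpha) * L_CE + alpha * T^2 * KL(teacher_soft || student_soft)."""
    # Hard-label term: ordinary cross-entropy at T = 1.
    ce = -math.log(softmax(student_logits)[label])
    # Soft-label term: both distributions are softened with the same T.
    teacher_soft = softmax(teacher_logits, T)
    student_soft = softmax(student_logits, T)
    kd = kl_divergence(teacher_soft, student_soft)
    return (1 - alpha) * ce + alpha * T * T * kd


# Hypothetical logits for a 4-token vocabulary; token 0 is the ground truth.
teacher_logits = [4.0, 2.5, 2.0, -1.0]
student_logits = [2.0, 0.5, 1.5, 0.0]
print(distillation_loss(student_logits, teacher_logits, label=0))
```

                <p>Setting <code>alpha=0</code> recovers plain cross-entropy training, and the KL term drops to zero once the student's softened distribution matches the teacher's.</p>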

                <h2>Why temperature matters</h2>
                <p>At <code>T=1</code>, the teacher's distribution is sharp: the top token gets most of the probability. At higher temperatures, the distribution flattens out and the student sees more signal about which tokens are "almost right". This is especially useful for rare tokens that the teacher almost never picks but that still carry meaningful relationships. Softening by <code>T</code> also shrinks the gradients of the KL term by roughly 1/T², so the T² factor in the loss formula scales them back up and keeps the two loss terms in balance when the temperature changes.</p>
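
                <p>To make the flattening concrete, here is a small sketch that applies a temperature-scaled softmax to one set of teacher logits. The logit values are hypothetical, chosen to give one clear winner, two near-misses, and one rare token.</p>

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits: one clear winner, two near-misses, one rare token.
logits = [6.0, 4.0, 3.5, 0.0]

for T in (1.0, 2.0, 4.0):
    probs = softmax(logits, T)
    print(f"T={T}:", [round(p, 3) for p in probs])
```

                <p>The ranking of the tokens is the same at every temperature. What changes is how much probability the winner hands over to the runners-up, which is the part of the signal the student would otherwise barely see.</p>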

                <h2>Types of distillation</h2>
                <p>There are a few different flavors worth knowing about:</p>
                <ul>
                    <li><strong>Output distillation (classic):</strong> student matches the teacher's final token probabilities. This is the original Hinton et al. approach and still the most common.</li>
                    <li><strong>Feature distillation:</strong> student also learns to match the teacher's internal hidden states, not just the output. More expensive but can transfer deeper representations.</li>
                    <li><strong>Sequence-level distillation:</strong> instead of token-level matching, the student learns from full sequences generated by the teacher. Works well for tasks like translation.</li>
                    <li><strong>Data-free distillation:</strong> no original training data needed. The teacher generates synthetic data that is used to train the student. Very useful when the original dataset is proprietary.</li>
                </ul>
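
                <p>As an illustration of the feature distillation variant from the list above, the sketch below compares one student hidden state with one teacher hidden state. Because the two models usually have different widths, it assumes a linear projection from student width to teacher width and a mean squared error loss. That is one common choice rather than the only one, and every size and value here is hypothetical.</p>

```python
def project(hidden, weight):
    """Linear map from student width to teacher width (weight is out x in)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def feature_loss(student_hidden, teacher_hidden, weight):
    """Mean squared error between the projected student state and the teacher state."""
    projected = project(student_hidden, weight)
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_hidden)]
    return sum(squared) / len(teacher_hidden)


# Hypothetical sizes: student width 2, teacher width 3.
student_hidden = [0.5, -1.0]
teacher_hidden = [0.2, 0.4, -0.6]
# In a real setup this projection would be trained together with the student.
weight = [[1.0, 0.0],
          [0.0, 1.0],
          [0.5, 0.5]]

print(feature_loss(student_hidden, teacher_hidden, weight))
```

                <p>In practice a term like this would be added to the output distillation loss with its own weight, which is where the extra cost of feature distillation comes from: the teacher's hidden states have to be kept around in addition to its logits.</p>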

                <h2>Real world results</h2>
                <p>The most famous example is DistilBERT (2019), which kept 97% of BERT's language understanding performance while being 40% smaller and 60% faster. More recently, distillation and closely related techniques, such as training on teacher-generated synthetic data, are part of how small models like Phi-2 and Gemma achieve strong results with far less training data than frontier models.</p>

                <div class="table-wrap">
                    <table>
                        <thead>
                            <tr>
                                <th>Model</th>
                                <th>Params</th>
                                <th>Teacher</th>
                                <th>Performance retained</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr><td>DistilBERT</td><td>66M</td><td>BERT (110M)</td><td>~97% on GLUE</td></tr>
                            <tr><td>DistilGPT-2</td><td>82M</td><td>GPT-2 (124M)</td><td>perplexity 21.1 vs 16.3 (WikiText-103)</td></tr>
                            <tr><td>TinyLlama</td><td>1.1B</td><td>none (pretrained from scratch; reuses the Llama 2 architecture and tokenizer)</td><td>n/a (not distilled, listed for scale)</td></tr>
                        </tbody>
                    </table>
                </div>

                <h2>What this means for Supra Mini</h2>
                <p>Right now, all our Supra Mini models are trained from scratch on raw text. Distillation is one of the experiments on our roadmap. At 8M parameters, the student has very limited capacity, so picking the right teacher and the right distillation targets is going to matter a lot. Too large a teacher and the gap becomes impossible to bridge. Too much weight on the distillation loss and the student stops learning from the actual data.</p>
                <p>It is a balancing act, and we are going to try it. If it works at our scale, the gains could be significant. <strong>A Supra Mini that learns from a Llama 3 teacher instead of raw text alone could be a very different model.</strong></p>

                <h2>The catch</h2>
                <p>Distillation is not free. Running a teacher model forward pass for every training batch adds significant compute cost, especially if the teacher is large. For a project training on consumer hardware, that overhead is real. There are ways to pre-compute teacher logits and cache them, but that requires disk space proportional to the dataset size times the vocab size, which adds up fast at 5B tokens.</p>
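
                <p>A back-of-the-envelope estimate shows how fast that adds up. Only the 5B token count comes from the paragraph above. The vocabulary size, the storage formats, and the idea of keeping just the top-k teacher entries per position are assumptions made for this sketch, not a description of our pipeline.</p>

```python
# Back-of-the-envelope storage estimate for cached teacher outputs.
# Only the 5B token count comes from the text; everything else is an assumption.
NUM_TOKENS = 5_000_000_000
VOCAB_SIZE = 32_000      # hypothetical tokenizer size
BYTES_FP16 = 2           # one half-precision value
BYTES_TOKEN_ID = 4       # one int32 token index
TOP_K = 32               # hypothetical number of teacher entries kept per position


def full_cache_bytes(num_tokens, vocab_size):
    """Every position stores one fp16 logit per vocabulary entry."""
    return num_tokens * vocab_size * BYTES_FP16


def top_k_cache_bytes(num_tokens, k):
    """Every position stores k (token id, fp16 value) pairs."""
    return num_tokens * k * (BYTES_TOKEN_ID + BYTES_FP16)


TB = 10 ** 12
print(f"full logits: {full_cache_bytes(NUM_TOKENS, VOCAB_SIZE) / TB:.0f} TB")
print(f"top-{TOP_K} only: {top_k_cache_bytes(NUM_TOKENS, TOP_K) / TB:.2f} TB")
```

                <p>Under these assumptions the full cache lands in the hundreds of terabytes, while a truncated top-k cache fits on a single consumer drive. The trade-off is that the student no longer sees the tail of the teacher's distribution.</p>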
                <p>The other catch: distillation only works as well as the teacher. If the teacher has biases or blind spots, the student inherits them. You are not getting GPT-4 quality by distilling into an 8M model. You are getting a more efficient version of whatever the teacher actually learned.</p>

                <h2>Final thought</h2>
                <p><strong>Knowledge distillation is one of the most elegant ideas in deep learning. Instead of training a small model to memorize answers, you train it to think like a bigger one. For tiny models like ours, it might be the key to breaking through the ceiling that raw pretraining alone cannot reach.</strong></p>

                <div class="tags">
                    <span class="tag">#distillation</span>
                    <span class="tag">#research</span>
                    <span class="tag">#tinyml</span>
                    <span class="tag">#training</span>
                    <span class="tag">#open-source</span>
                    <span class="tag">#supra-mini</span>
                    <span class="tag">#edge-ai</span>
                </div>
            </div>
        </article>

        <footer>
            <p class="mono">&copy; 2026 SupraLabs // Built for the community.</p>
        </footer>
    </div>

</body>
</html>