<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Why we chose Llama over Mistral, DeepSeek, Qwen and Mamba for Supra Mini v3 | SupraLabs Blog</title>
    <style>
        :root {
            --bg: #0f0f0f;
            --surface: #1a1a1a;
            --border: #333;
            --text: #e0e0e0;
            --accent: #536bfe; /* Supra Blue */
            --muted: #888;
            --font-mono: 'JetBrains Mono', 'Fira Code', monospace;
        }
        * {
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }
        body {
            background-color: var(--bg);
            color: var(--text);
            font-family: 'Inter', -apple-system, sans-serif;
            line-height: 1.6;
            padding: 2rem;
        }
        code, pre, .mono {
            font-family: var(--font-mono);
        }
        .container {
            max-width: 900px;
            margin: 0 auto;
        }
        /* --- Header --- */
        header {
            border-bottom: 2px solid var(--border);
            padding-bottom: 2rem;
            margin-bottom: 3rem;
            display: flex;
            justify-content: space-between;
            align-items: flex-end;
        }
        .logo-area h1 {
            font-size: 1.2rem;
            text-transform: uppercase;
            letter-spacing: 2px;
            color: var(--accent);
            line-height: 1;
            display: flex;
            align-items: center;
            gap: 10px;
        }
        .logo-area a {
            text-decoration: none;
            color: inherit;
        }
        nav a {
            color: var(--text);
            text-decoration: none;
            margin-left: 1.5rem;
            font-size: 0.9rem;
            border-bottom: 1px solid transparent;
        }
        nav a:hover {
            border-bottom: 1px solid var(--accent);
        }
        /* --- Blog Post Layout --- */
        .post-header {
            margin-bottom: 3rem;
        }
        .post-header h2 {
            font-size: 3rem;
            line-height: 1.1;
            margin-bottom: 1rem;
            font-weight: 800;
        }
        .post-meta {
            font-family: var(--font-mono);
            color: var(--accent);
            font-size: 0.9rem;
            margin-bottom: 2rem;
        }
        .post-content {
            background: var(--surface);
            border: 1px solid var(--border);
            padding: 3rem;
            margin-bottom: 4rem;
        }
        .post-content h2 {
            font-size: 1.8rem;
            margin: 2.5rem 0 1rem 0;
            color: var(--accent);
        }
        .post-content h2:first-child {
            margin-top: 0;
        }
        .post-content p {
            margin-bottom: 1.5rem;
            font-size: 1.1rem;
            color: var(--text);
        }
        .post-content ul {
            margin-bottom: 1.5rem;
            padding-left: 1.5rem;
        }
        .post-content li {
            margin-bottom: 0.5rem;
        }
        .post-content strong {
            color: #fff;
        }
        .post-content img.post-logo {
            margin-bottom: 2rem;
            border: 1px solid var(--border);
        }
        /* --- Tags --- */
        .tags {
            display: flex;
            gap: 0.5rem;
            margin-top: 2rem;
        }
        .tag {
            font-family: var(--font-mono);
            font-size: 0.7rem;
            padding: 2px 8px;
            border: 1px solid var(--border);
            border-radius: 4px;
            color: var(--muted);
        }
        /* --- Hardware Stats (Consistent with index) --- */
        .stats-grid {
            display: grid;
            grid-template-columns: 1fr 1fr;
            gap: 1rem;
            margin-top: 4rem;
            border-top: 1px solid var(--border);
            padding-top: 2rem;
        }
        .stat-box {
            padding: 1rem;
            border-left: 2px solid var(--accent);
        }
        .stat-box small {
            display: block;
            color: var(--muted);
            font-family: var(--font-mono);
        }
        footer {
            margin-top: 6rem;
            padding-bottom: 2rem;
            font-size: 0.8rem;
            color: var(--muted);
            text-align: center;
        }
        @media (max-width: 600px) {
            .post-header h2 { font-size: 2rem; }
            .post-content { padding: 1.5rem; }
            header { flex-direction: column; align-items: flex-start; gap: 1rem; }
            nav a { margin-left: 0; margin-right: 1rem; }
            .stats-grid { grid-template-columns: 1fr; }
        }
        .logo-area {
            display: flex;
            align-items: center;
            gap: 10px;
            font-weight: bold;
            font-size: 1.2rem;
        }
    </style>
</head>
<body>

    <div class="container">
        <header>
            <div class="logo-area" style="font-size: 1.5em;">
                <a href="./index.html"><h1><img src="./image.png" style="height: 2em"> SupraLabs_</h1></a>
            </div>
            <nav>
                <a href="./index.html#news">News</a>
                <a href="https://huggingface.co/SupraLabs" target="_blank">HuggingFace</a>
                <a href="./index.html#hardware">Hardware</a>
            </nav>
        </header>

        <article>
            <div class="post-header">
                <div class="post-meta">// 2026-05-12 | Research</div>
                <h2>Why we chose Llama over Mistral,<br>DeepSeek, Qwen and Mamba for Supra Mini v3</h2>
            </div>
            <div class="post-content">
                <p>Welcome back! Before training Supra Mini v3, we asked ourselves a serious question: <strong>is Llama really the best architecture for a ~500k-parameter model?</strong> Here is what we found.</p>
        
                <h2>The Question</h2>
                <p>The AI world is full of exciting architectures. Mistral, DeepSeek, Qwen and Mamba have all shown impressive results. But impressive at what scale? We did the research so you don't have to.</p>
        
                <h2>Mistral</h2>
                <p>Mistral 7B is best known for combining two attention optimizations: <strong>Grouped Query Attention (GQA)</strong> and <strong>Sliding Window Attention (SWA)</strong>. GQA shrinks the key/value cache by sharing each key/value head across a group of query heads. SWA limits attention to a local window so that very long contexts stay affordable.</p>
                <p>The problem? <strong>Both features are designed for billion-parameter models with very long contexts.</strong> At 467k parameters with a context window of 512 tokens, SWA brings zero benefit: the whole context already fits inside a single window, so it behaves like ordinary causal attention. GQA with only 8 heads adds complexity without saving anything meaningful. Mistral is a great architecture, just not at our scale.</p>
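                <p>A minimal sketch of both points in plain Python. The 512-token context and the 8 heads come from this post, and 4,096 is the sliding window used by Mistral 7B. The layer count, head dimension, and fp16 cache are illustrative assumptions, not the Supra Mini v3 configuration.</p>

```python
def causal_mask(seq_len):
    """mask[i][j] is True when query i may attend to key j (j no later than i)."""
    return [[i >= j for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Causal mask that also hides keys more than `window - 1` steps back."""
    return [
        [i >= j and window > i - j for j in range(seq_len)]
        for i in range(seq_len)
    ]


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key/value cache for one sequence (keys plus values)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


CONTEXT = 512          # context window quoted in this post
MISTRAL_WINDOW = 4096  # sliding window used by Mistral 7B

# With the window wider than the context, SWA hides nothing extra.
same_mask = sliding_window_mask(CONTEXT, MISTRAL_WINDOW) == causal_mask(CONTEXT)

# Hypothetical tiny model: 4 layers, head_dim 8, fp16 cache (assumptions).
mha_cache = kv_cache_bytes(layers=4, kv_heads=8, head_dim=8, seq_len=CONTEXT)
gqa_cache = kv_cache_bytes(layers=4, kv_heads=2, head_dim=8, seq_len=CONTEXT)

print(same_mask)                        # True
print((mha_cache - gqa_cache) // 1024)  # 384 (KiB saved)
```

                <p>Under these assumptions, the two masks are identical and sharing key/value heads saves a few hundred kibibytes of cache, which is negligible on any GPU.</p>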
        
                <h2>DeepSeek</h2>
                <p>DeepSeek's flagship models are built around <strong>Mixture of Experts (MoE)</strong>: instead of activating the whole network for every token, a router activates only a subset of "expert" sub-networks. This lets massive models stay efficient at inference time.</p>
                <p>Again, the keyword is <strong>massive</strong>. MoE only makes sense when you have enough parameters to split into meaningful expert groups. At 467k total parameters, splitting into experts would leave each expert with almost nothing to learn. <strong>DeepSeek is the wrong tool for this job.</strong></p>
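                <p>A back-of-the-envelope version of that argument. The 467k total comes from this post. The share of weights assumed to sit in the feed-forward layers (<code>FFN_FRACTION</code>) is a guess made for illustration, and the even split across experts ignores the extra router weights.</p>

```python
TOTAL_PARAMS = 467_000   # model size quoted in this post
FFN_FRACTION = 0.5       # assumption: half of the weights sit in FFN layers


def params_per_expert(total_params, ffn_fraction, num_experts):
    """Weights each expert gets if the FFN budget is divided evenly."""
    ffn_budget = int(total_params * ffn_fraction)
    return ffn_budget // num_experts


for num_experts in (2, 4, 8, 16):
    share = params_per_expert(TOTAL_PARAMS, FFN_FRACTION, num_experts)
    print(f"{num_experts:>2} experts: {share:,} parameters each")
```

                <p>These totals are summed over every layer, so the per-layer share of each expert would be smaller still.</p>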
        
                <h2>Qwen</h2>
                <p>Qwen is a solid architecture with a well-designed tokenizer and good multilingual support. At large scales it performs very competitively. But at sub-1M parameters, <strong>the architectural differences between Qwen and Llama are negligible</strong>: both use RoPE, RMSNorm, and a SwiGLU FFN under the hood. Switching to Qwen would bring no measurable improvement for v3.</p>
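                <p>For readers who have not met these shared blocks, here is a dependency-free sketch of two of them, RMSNorm and a SwiGLU feed-forward layer, operating on a single token vector. The helper names (<code>rms_norm</code>, <code>swiglu_ffn</code>, <code>matvec</code>) are made up for this example and RoPE is left out for brevity. Real implementations use tensor libraries rather than Python lists.</p>

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-channel weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# Toy sizes: hidden size 2, intermediate size 3 (illustrative values only).
x = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
print(swiglu_ffn(x, w_gate, w_up, w_down))
```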
        
                <h2>Mamba (State Space Models)</h2>
                <p>Mamba was the most interesting alternative we considered. State Space Models replace attention entirely with a recurrent mechanism that scales linearly with sequence length, which in theory makes it very efficient for small models and long contexts.</p>
                <p>In theory, <strong>Mamba could be a great fit for tiny models</strong>. In practice, HuggingFace support is still unstable, training on a Kaggle T4 would require significant custom work, and debugging issues mid-training would be painful. The risk outweighs the potential gain for a project at our stage. We are keeping an eye on it for future versions.</p>
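                <p>To make "recurrent and linear in sequence length" concrete, here is a deliberately simplified sketch: a plain diagonal state space recurrence with fixed parameters. It is not Mamba itself, which additionally makes these parameters depend on the input and relies on a hardware-aware scan. The function name <code>ssm_scan</code> and the toy values are assumptions for illustration.</p>

```python
def ssm_scan(inputs, a, b, c):
    """Run h_t = a * h_(t-1) + b * x_t and y_t = c . h_t over a 1-D sequence.

    One pass over the inputs with a fixed-size state, so the cost grows
    linearly with sequence length instead of quadratically as in attention.
    """
    state = [0.0] * len(a)
    outputs = []
    for x in inputs:
        state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs


# A single impulse decays geometrically through a one-dimensional state.
print(ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0]))  # [1.0, 0.5, 0.25]
```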
        
                <h2>So why Llama?</h2>
                <p>At ~500k parameters, <strong>all major transformer architectures share the same core building blocks</strong>: RoPE positional encoding, RMSNorm, and SwiGLU feed-forward networks. The fancy additions of each architecture are optimizations for scale, and scale is exactly what we don't have.</p>
                <p>Llama wins because it is <strong>stable, well documented, natively supported in HuggingFace Transformers, and battle-tested at tiny scales</strong>. No surprises, no custom kernels, no fragile integrations. For Supra Mini v3, the real gains come from vocabulary size, data quality, and training setup, not from the architecture. And those we have already optimized.</p>
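                <p>The weight of vocabulary size at this scale is easy to see by counting parameters. The sketch below counts the weights of a generic Llama-style decoder (no biases, tied input and output embeddings). Every number in the example call is a hypothetical configuration chosen to land near 500k parameters. It is not the published Supra Mini v3 configuration.</p>

```python
def llama_param_count(vocab_size, hidden, layers, intermediate,
                      num_heads, num_kv_heads, tie_embeddings=True):
    """Count weights in a Llama-style decoder without bias terms."""
    head_dim = hidden // num_heads
    kv_dim = num_kv_heads * head_dim

    embedding = vocab_size * hidden
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # q, o + k, v
    mlp = 3 * hidden * intermediate                        # gate, up, down
    norms = 2 * hidden                                     # two RMSNorms per layer
    per_layer = attention + mlp + norms
    final_norm = hidden
    lm_head = 0 if tie_embeddings else vocab_size * hidden

    total = embedding + layers * per_layer + final_norm + lm_head
    return {"embedding": embedding, "layers": layers * per_layer, "total": total}


# Hypothetical tiny configuration (all values are assumptions).
counts = llama_param_count(vocab_size=4096, hidden=64, layers=4,
                           intermediate=176, num_heads=8, num_kv_heads=8)
print(counts["total"])                                 # 463424
print(round(counts["embedding"] / counts["total"], 2)) # 0.57
```

                <p>In this made-up configuration the embedding table alone holds more than half of all weights, which is why the tokenizer's vocabulary size moves the parameter budget far more than any attention variant would.</p>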
        
                <h2>Final thought</h2>
                <p><strong>Choosing the right architecture is not about picking the newest or most complex one; it is about picking the right tool for your scale. For sub-1M models, Llama is that tool.</strong></p>
                <div class="tags">
                    <span class="tag">#research</span>
                    <span class="tag">#architecture</span>
                    <span class="tag">#llama</span>
                    <span class="tag">#open-source</span>
                    <span class="tag">#tinyml</span>
                </div>
            </div>
        </article>

        <footer>
            <p class="mono">&copy; 2026 SupraLabs // Built for the community.</p>
        </footer>
    </div>

</body>
</html>