<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Why we chose Llama over Mistral, DeepSeek, Qwen and Mamba for Supra Mini v3 | SupraLabs Blog</title>
<style>
:root {
--bg: #0f0f0f;
--surface: #1a1a1a;
--border: #333;
--text: #e0e0e0;
--accent: #536bfe; /* Supra Blue */
--muted: #888;
--font-mono: 'JetBrains Mono', 'Fira Code', monospace;
}
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
background-color: var(--bg);
color: var(--text);
font-family: 'Inter', -apple-system, sans-serif;
line-height: 1.6;
padding: 2rem;
}
code, pre, .mono {
font-family: var(--font-mono);
}
.container {
max-width: 900px;
margin: 0 auto;
}
/* --- Header --- */
header {
border-bottom: 2px solid var(--border);
padding-bottom: 2rem;
margin-bottom: 3rem;
display: flex;
justify-content: space-between;
align-items: flex-end;
}
.logo-area h1 {
font-size: 1.2rem;
text-transform: uppercase;
letter-spacing: 2px;
color: var(--accent);
line-height: 1;
display: flex;
align-items: center;
gap: 10px;
}
.logo-area a {
text-decoration: none;
color: inherit;
}
nav a {
color: var(--text);
text-decoration: none;
margin-left: 1.5rem;
font-size: 0.9rem;
border-bottom: 1px solid transparent;
}
nav a:hover {
border-bottom: 1px solid var(--accent);
}
/* --- Blog Post Layout --- */
.post-header {
margin-bottom: 3rem;
}
.post-header h2 {
font-size: 3rem;
line-height: 1.1;
margin-bottom: 1rem;
font-weight: 800;
}
.post-meta {
font-family: var(--font-mono);
color: var(--accent);
font-size: 0.9rem;
margin-bottom: 2rem;
}
.post-content {
background: var(--surface);
border: 1px solid var(--border);
padding: 3rem;
margin-bottom: 4rem;
}
.post-content h2 {
font-size: 1.8rem;
margin: 2.5rem 0 1rem 0;
color: var(--accent);
}
.post-content h2:first-child {
margin-top: 0;
}
.post-content p {
margin-bottom: 1.5rem;
font-size: 1.1rem;
color: var(--text);
}
.post-content ul {
margin-bottom: 1.5rem;
padding-left: 1.5rem;
}
.post-content li {
margin-bottom: 0.5rem;
}
.post-content strong {
color: #fff;
}
.post-content img.post-logo {
margin-bottom: 2rem;
border: 1px solid var(--border);
}
/* --- Tags --- */
.tags {
display: flex;
gap: 0.5rem;
margin-top: 2rem;
}
.tag {
font-family: var(--font-mono);
font-size: 0.7rem;
padding: 2px 8px;
border: 1px solid var(--border);
border-radius: 4px;
color: var(--muted);
}
/* --- Hardware Stats (Consistent with index) --- */
.stats-grid {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 1rem;
margin-top: 4rem;
border-top: 1px solid var(--border);
padding-top: 2rem;
}
.stat-box {
padding: 1rem;
border-left: 2px solid var(--accent);
}
.stat-box small {
display: block;
color: var(--muted);
font-family: var(--font-mono);
}
footer {
margin-top: 6rem;
padding-bottom: 2rem;
font-size: 0.8rem;
color: var(--muted);
text-align: center;
}
@media (max-width: 600px) {
.post-header h2 { font-size: 2rem; }
.post-content { padding: 1.5rem; }
header { flex-direction: column; align-items: flex-start; gap: 1rem; }
nav a { margin-left: 0; margin-right: 1rem; }
.stats-grid { grid-template-columns: 1fr; }
}
.logo-area {
display: flex;
align-items: center;
gap: 10px;
font-weight: bold;
font-size: 1.2rem;
}
</style>
</head>
<body>
<div class="container">
<header>
<div class="logo-area" style="font-size: 1.5em;">
<a href="./index.html"><h1><img src="./image.png" alt="SupraLabs logo" style="height: 2em"> SupraLabs_</h1></a>
</div>
<nav>
<a href="./index.html#news">News</a>
<a href="https://huggingface.co/SupraLabs" target="_blank" rel="noopener">HuggingFace</a>
<a href="./index.html#hardware">Hardware</a>
</nav>
</header>
<article>
<div class="post-header">
<div class="post-meta">// 2026-05-12 | Research</div>
<h2>Why we chose Llama over Mistral,<br>DeepSeek, Qwen and Mamba for Supra Mini v3</h2>
</div>
<div class="post-content">
<p>Welcome back! Before training Supra Mini v3, we asked ourselves a serious question: <strong>is Llama really the best architecture for a ~500k-parameter model?</strong> Here is what we found.</p>
<h2>The Question</h2>
<p>The AI world is full of exciting architectures. Mistral, DeepSeek, Qwen, and Mamba have all shown impressive results. But impressive at what scale? We did the research so you don't have to.</p>
<h2>Mistral</h2>
<p>Mistral 7B is known for two attention features: <strong>Grouped Query Attention (GQA)</strong> and <strong>Sliding Window Attention (SWA)</strong>. GQA reduces memory usage by sharing each key/value head across a group of query heads. SWA limits attention to a local window so that very long contexts stay cheap.</p>
<p>The problem? <strong>Both features pay off in billion-parameter models with very long contexts.</strong> At 467k parameters with a context window of 512 tokens, SWA brings no benefit, because a sliding window only changes anything once the sequence is longer than the window. GQA with only 8 heads adds complexity without saving anything meaningful. Mistral is a great architecture, just not at our scale.</p>
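<p>To make this concrete, here is a minimal sketch in plain Python. It assumes a 4096-token window (the setting Mistral 7B ships with) and a hidden size of 64, which is a hypothetical width chosen for illustration and not the actual Supra Mini v3 value. The function names are ours.</p>
<pre style="overflow-x: auto; margin-bottom: 1.5rem;"><code class="language-python">def causal_positions(seq_len):
    """Row i lists the positions token i may attend to (itself and earlier)."""
    return [list(range(0, i + 1)) for i in range(seq_len)]


def sliding_window_positions(seq_len, window):
    """Like causal_positions, but each token sees at most `window` recent positions."""
    return [list(range(max(0, i - window + 1), i + 1)) for i in range(seq_len)]


def kv_projection_params(hidden, n_heads, n_kv_heads):
    """Weights in the K and V projections of one attention layer (no biases)."""
    head_dim = hidden // n_heads
    return 2 * hidden * (n_kv_heads * head_dim)


# 512-token context (from the post) against an assumed 4096-token window.
same = sliding_window_positions(512, 4096) == causal_positions(512)
print("SWA identical to plain causal attention:", same)

# hidden=64 and n_kv_heads=2 are hypothetical values, used only for illustration.
full = kv_projection_params(hidden=64, n_heads=8, n_kv_heads=8)
grouped = kv_projection_params(hidden=64, n_heads=8, n_kv_heads=2)
print("K/V weights saved per layer by GQA:", full - grouped)
</code></pre>
<p>Under these assumptions the window never truncates anything, and the GQA saving is a few thousand weights per layer.</p>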
<h2>DeepSeek</h2>
<p>DeepSeek's best-known architectural feature is <strong>Mixture of Experts (MoE)</strong>: instead of activating the whole network for every token, only a subset of "expert" sub-networks is activated. This allows massive models to be efficient at inference time.</p>
<p>Again, the keyword is <strong>massive</strong>. MoE only makes sense when you have enough parameters to split into meaningful expert groups. At 467k total parameters, splitting into experts would leave each expert with almost nothing to learn. <strong>DeepSeek is the wrong tool for this job.</strong></p>
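<p>For readers who have not seen MoE routing before, the sketch below shows one common variant (softmax over router scores, keep the top k, renormalize). It is an illustration, not DeepSeek's exact routing, and the expert count of 8 is an assumption we picked for the arithmetic.</p>
<pre style="overflow-x: auto; margin-bottom: 1.5rem;"><code class="language-python">import math


def route_top_k(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    peak = max(router_logits)
    exps = [math.exp(v - peak) for v in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}  # weights of the chosen experts sum to 1


def params_per_expert(total_params, n_experts):
    """Upper bound on expert size if every parameter were given to the experts."""
    return total_params // n_experts


# One token, four experts: only the two best-scoring experts would run.
weights = route_top_k([0.1, 2.0, -1.0, 1.5], k=2)
print("active experts and weights:", weights)

# 467k is the figure from the post; 8 experts is a hypothetical split.
print("parameters per expert, at most:", params_per_expert(467_000, 8))
</code></pre>
<p>Even this generous upper bound ignores the embeddings, attention weights, and the router itself, so real experts would be smaller still.</p>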
<h2>Qwen</h2>
<p>Qwen is a solid architecture with a well-designed tokenizer and good multilingual support. At large scales it performs very competitively. But at sub-1M parameters, <strong>the architectural differences between Qwen and Llama are negligible</strong>: both use RoPE, RMSNorm, and a SwiGLU FFN under the hood. Switching to Qwen would bring no measurable improvement for v3.</p>
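<p>Two of those shared pieces are small enough to write out by hand. The following is a rough pure-Python sketch of RMSNorm and a SwiGLU feed-forward layer, with toy weights that we made up for the example. RoPE is left out for brevity, and a real model would use tensor operations instead of lists.</p>
<pre style="overflow-x: auto; margin-bottom: 1.5rem;"><code class="language-python">import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then scale by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Compute down(silu(gate(x)) * up(x)), the gated feed-forward layer."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


x = [1.0, -2.0, 3.0, 0.5]
normed = rms_norm(x, weight=[1.0] * 4)
print("mean square after RMSNorm:", sum(v * v for v in normed) / len(normed))

# Toy weights: hidden size 4, FFN size 2 (values are arbitrary).
w_gate = [[0.1] * 4, [0.2] * 4]
w_up = [[0.3] * 4, [-0.1] * 4]
w_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
print("FFN output:", swiglu_ffn(normed, w_gate, w_up, w_down))
</code></pre>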
<h2>Mamba (State Space Models)</h2>
<p>Mamba was the most interesting alternative we considered. State Space Models replace attention entirely with a recurrent mechanism that scales linearly with sequence length, which is theoretically very efficient for small models and long contexts.</p>
<p>In theory, <strong>Mamba could be a great fit for tiny models</strong>. In practice, Mamba support in HuggingFace Transformers is still unstable, training on a Kaggle T4 would require significant custom work, and debugging issues mid-training would be painful. The risk outweighs the potential gain for a project at our stage. We are keeping an eye on it for future versions.</p>
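<p>The linear scaling comes from the recurrence itself. Below is a minimal sketch of a scalar state space model with fixed coefficients. It is deliberately simpler than Mamba, which makes the coefficients depend on the input and relies on a hardware-aware scan, neither of which is shown here. The coefficient values are arbitrary.</p>
<pre style="overflow-x: auto; margin-bottom: 1.5rem;"><code class="language-python">def ssm_scan(inputs, a, b, c):
    """Run h_t = a * h_(t-1) + b * x_t and y_t = c * h_t over a 1-D sequence."""
    state = 0.0
    outputs = []
    for x in inputs:  # one constant-cost update per token, so cost grows linearly
        state = a * state + b * x
        outputs.append(c * state)
    return outputs


# An impulse at the first step decays through the state by a factor of `a` per token.
print(ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0))
</code></pre>
<p>Each token touches only the running state, never the full history, which is why no attention matrix is needed.</p>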
<h2>So why Llama?</h2>
<p>At ~500k parameters, <strong>the Transformer-based options above all share the same core building blocks</strong>: RoPE positional encoding, RMSNorm, and SwiGLU feed-forward networks. The fancy additions of each architecture are optimizations for scale, and scale is exactly what we don't have.</p>
<p>Llama wins because it is <strong>stable, well documented, natively supported in HuggingFace Transformers, and battle-tested at tiny scales</strong>. No surprises, no custom kernels, no fragile integrations. For Supra Mini v3, the real gains come from vocabulary size, data quality, and training setup, not from the architecture. And those we have already optimized.</p>
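<p>To see why vocabulary size matters so much at this scale, here is a rough parameter counter for a Llama-style decoder with tied embeddings and no biases. Every number passed to it below is hypothetical and chosen only for illustration. This is not the Supra Mini v3 configuration.</p>
<pre style="overflow-x: auto; margin-bottom: 1.5rem;"><code class="language-python">def llama_param_count(vocab, hidden, layers, n_heads, n_kv_heads, ffn, tied=True):
    """Count the weights of a Llama-style decoder (no biases, RoPE has none)."""
    head_dim = hidden // n_heads
    kv_dim = n_kv_heads * head_dim
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # q and o, then k and v
    mlp = 3 * hidden * ffn  # gate, up, and down projections
    norms = 2 * hidden  # two RMSNorm weight vectors per block
    embeddings = vocab * hidden * (1 if tied else 2)  # untied adds a separate output head
    return embeddings + layers * (attention + mlp + norms) + hidden  # plus final norm


# Hypothetical tiny configurations that differ only in vocabulary size.
shape = dict(hidden=64, layers=4, n_heads=8, n_kv_heads=8, ffn=176)
small_vocab = llama_param_count(vocab=2048, **shape)
large_vocab = llama_param_count(vocab=8192, **shape)

print("2048-token vocabulary:", small_vocab)
print("8192-token vocabulary:", large_vocab)
print("embedding share, small vocabulary:", round(2048 * 64 / small_vocab, 3))
</code></pre>
<p>In this toy setup, the vocabulary choice moves the total far more than any of the architectural options discussed above would.</p>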
<h2>Final thought</h2>
<p><strong>Choosing the right architecture is not about picking the newest or most complex one. It is about picking the right tool for your scale. For sub-1M models, Llama is that tool.</strong></p>
<div class="tags">
<span class="tag">#research</span>
<span class="tag">#architecture</span>
<span class="tag">#llama</span>
<span class="tag">#open-source</span>
<span class="tag">#tinyml</span>
</div>
</div>
</article>
<footer>
<p class="mono">© 2026 SupraLabs // Built for the community.</p>
</footer>
</div>
</body>
</html>