<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Why we chose Llama over Mistral, DeepSeek, Qwen and Mamba for Supra Mini v3 | SupraLabs Blog</title>
    <style>
        :root {
            --bg: #0f0f0f;
            --surface: #1a1a1a;
            --border: #333;
            --text: #e0e0e0;
            --accent: #536bfe; /* Supra Blue */
            --muted: #888;
            --font-mono: 'JetBrains Mono', 'Fira Code', monospace;
        }
        * {
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }
        body {
            background-color: var(--bg);
            color: var(--text);
            font-family: 'Inter', -apple-system, sans-serif;
            line-height: 1.6;
            padding: 2rem;
        }
        code, pre, .mono {
            font-family: var(--font-mono);
        }
        .container {
            max-width: 900px;
            margin: 0 auto;
        }
        /* --- Header --- */
        header {
            border-bottom: 2px solid var(--border);
            padding-bottom: 2rem;
            margin-bottom: 3rem;
            display: flex;
            justify-content: space-between;
            align-items: flex-end;
        }
        .logo-area h1 {
            font-size: 1.2rem;
            text-transform: uppercase;
            letter-spacing: 2px;
            color: var(--accent);
            line-height: 1;
            display: flex;
            align-items: center;
            gap: 10px;
        }
        .logo-area a {
            text-decoration: none;
            color: inherit;
        }
        nav a {
            color: var(--text);
            text-decoration: none;
            margin-left: 1.5rem;
            font-size: 0.9rem;
            border-bottom: 1px solid transparent;
        }
        nav a:hover {
            border-bottom: 1px solid var(--accent);
        }
        /* --- Blog Post Layout --- */
        .post-header {
            margin-bottom: 3rem;
        }
        .post-header h2 {
            font-size: 3rem;
            line-height: 1.1;
            margin-bottom: 1rem;
            font-weight: 800;
        }
        .post-meta {
            font-family: var(--font-mono);
            color: var(--accent);
            font-size: 0.9rem;
            margin-bottom: 2rem;
        }
        .post-content {
            background: var(--surface);
            border: 1px solid var(--border);
            padding: 3rem;
            margin-bottom: 4rem;
        }
        .post-content h2 {
            font-size: 1.8rem;
            margin: 2.5rem 0 1rem 0;
            color: var(--accent);
        }
        .post-content h2:first-child {
            margin-top: 0;
        }
        .post-content p {
            margin-bottom: 1.5rem;
            font-size: 1.1rem;
            color: var(--text);
        }
        .post-content ul {
            margin-bottom: 1.5rem;
            padding-left: 1.5rem;
        }
        .post-content li {
            margin-bottom: 0.5rem;
        }
        .post-content strong {
            color: #fff;
        }
        .post-content img.post-logo {
            margin-bottom: 2rem;
            border: 1px solid var(--border);
        }
        /* --- Tags --- */
        .tags {
            display: flex;
            gap: 0.5rem;
            margin-top: 2rem;
        }
        .tag {
            font-family: var(--font-mono);
            font-size: 0.7rem;
            padding: 2px 8px;
            border: 1px solid var(--border);
            border-radius: 4px;
            color: var(--muted);
        }
        /* --- Hardware Stats (Consistent with index) --- */
        .stats-grid {
            display: grid;
            grid-template-columns: 1fr 1fr;
            gap: 1rem;
            margin-top: 4rem;
            border-top: 1px solid var(--border);
            padding-top: 2rem;
        }
        .stat-box {
            padding: 1rem;
            border-left: 2px solid var(--accent);
        }
        .stat-box small {
            display: block;
            color: var(--muted);
            font-family: var(--font-mono);
        }
        footer {
            margin-top: 6rem;
            padding-bottom: 2rem;
            font-size: 0.8rem;
            color: var(--muted);
            text-align: center;
        }
        @media (max-width: 600px) {
            .post-header h2 { font-size: 2rem; }
            .post-content { padding: 1.5rem; }
            header { flex-direction: column; align-items: flex-start; gap: 1rem; }
            nav a { margin-left: 0; margin-right: 1rem; }
            .stats-grid { grid-template-columns: 1fr; }
        }
        .logo-area {
            display: flex;
            align-items: center;
            gap: 10px;
            font-weight: bold;
            font-size: 1.2rem;
        }
    </style>
</head>
<body>

    <div class="container">
        <header>
            <div class="logo-area" style="font-size: 1.5em;">
                <a href="./index.html"><h1><img src="./image.png" alt="SupraLabs logo" style="height: 2em"> SupraLabs_</h1></a>
            </div>
            <nav>
                <a href="./index.html#news">News</a>
                <a href="https://huggingface.co/SupraLabs" target="_blank" rel="noopener">HuggingFace</a>
                <a href="./index.html#hardware">Hardware</a>
            </nav>
        </header>

        <article>
            <div class="post-header">
                <div class="post-meta">// 2026-05-12 | Research</div>
                <h2>Why we chose Llama over Mistral,<br>DeepSeek, Qwen and Mamba for Supra Mini v3</h2>
            </div>
            <div class="post-content">
                <p>Welcome back! Before training Supra Mini v3, we asked ourselves a serious question: <strong>is Llama really the best architecture for a ~500k parameter model?</strong> Here is what we found.</p>
        
                <h2>The Question</h2>
                <p>The AI world is full of exciting architectures. Mistral, DeepSeek, Qwen, Mamba — all of them have shown impressive results. But impressive at what scale? We did the research so you don't have to.</p>
        
                <h2>Mistral</h2>
                <p>Mistral 7B is best known for two attention features: <strong>Grouped Query Attention (GQA)</strong> and <strong>Sliding Window Attention (SWA)</strong>. Neither originated with Mistral, but Mistral popularized the combination. GQA shrinks the key/value cache by letting groups of query heads share a single key/value head. SWA restricts each token to a fixed-size local window, which keeps the cost manageable on very long contexts.</p>
                <p>The problem? <strong>Both features pay off in billion-parameter models with very long contexts.</strong> At 467k parameters with a context window of 512 tokens, a sliding window at least as long as the context behaves exactly like ordinary causal attention, and a shorter one would save very little at this length. GQA with only 8 heads adds complexity while saving only a small amount of key/value memory. Mistral is a great architecture, just not at our scale.</p>
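                <p>To make that concrete, here is a minimal pure-Python sketch. The function names, the 4096-token window (the size Mistral 7B shipped with) and the tiny 4-layer, 8-head model are assumptions chosen for illustration, not code or settings from our training pipeline.</p>
<pre><code>
```python
# Illustrative sketch only: names and sizes below are assumptions, not Supra Mini v3 code.

def causal_mask(seq_len):
    """Row i marks the positions token i may attend to (1 = allowed)."""
    return [[1 if j in range(i + 1) else 0 for j in range(seq_len)]
            for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Causal mask limited to the `window` most recent positions."""
    return [[1 if j in range(max(0, i - window + 1), i + 1) else 0
             for j in range(seq_len)]
            for i in range(seq_len)]


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys plus values cached for one sequence (fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


SEQ_LEN = 512    # context window from the post
WINDOW = 4096    # assumed window, the size Mistral 7B shipped with

# With window >= context, the two masks are the same matrix.
print(sliding_window_mask(SEQ_LEN, WINDOW) == causal_mask(SEQ_LEN))

# Hypothetical tiny model: 4 layers, 8 heads of dimension 8.
full = kv_cache_bytes(n_layers=4, n_kv_heads=8, head_dim=8, seq_len=SEQ_LEN)
gqa = kv_cache_bytes(n_layers=4, n_kv_heads=2, head_dim=8, seq_len=SEQ_LEN)
print(f"KV cache: {full // 1024} KiB with 8 KV heads, {gqa // 1024} KiB with 2")
```
</code></pre>
                <p>At these assumed sizes the whole key/value cache is about half a megabyte, so the saving from GQA is a few hundred kilobytes at most.</p>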
        
                <h2>DeepSeek</h2>
                <p>DeepSeek's most prominent feature is <strong>Mixture of Experts (MoE)</strong>: instead of activating the whole network for every token, a router activates only a subset of "expert" sub-networks. This keeps the compute per token low even when the total parameter count is massive.</p>
                <p>Again, the keyword is <strong>massive</strong>. MoE only makes sense when you have enough parameters to split into meaningful expert groups. At 467k total parameters, splitting into experts would leave each expert with too little capacity to learn anything useful. <strong>DeepSeek is the wrong tool for this job.</strong></p>
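                <p>The arithmetic behind that claim can be sketched as follows. The layer sizes (<code>d_model = 64</code>, <code>d_ff = 256</code>) and the helper names are hypothetical values picked for illustration, not the Supra Mini v3 configuration.</p>
<pre><code>
```python
# Illustrative sketch only: d_model, d_ff and the expert counts are assumed values.

def swiglu_params(d_model, d_ff):
    """Weights in one SwiGLU feed-forward block (gate, up, down; no biases)."""
    return 3 * d_model * d_ff


def expert_width(d_model, ffn_budget, n_experts):
    """Hidden width each expert gets when the budget is split evenly."""
    return ffn_budget // (n_experts * 3 * d_model)


def router_params(d_model, n_experts):
    """Linear router that scores every expert for each token."""
    return d_model * n_experts


D_MODEL, D_FF = 64, 256                 # hypothetical dense layer
budget = swiglu_params(D_MODEL, D_FF)   # parameters available per layer

for n_experts in (1, 2, 4, 8):
    width = expert_width(D_MODEL, budget, n_experts)
    extra = router_params(D_MODEL, n_experts) if n_experts > 1 else 0
    print(f"{n_experts} expert(s): hidden width {width}, router adds {extra} params")
```
</code></pre>
                <p>Under these assumptions, eight experts would each get a hidden width of only 32, and the router would add parameters on top of that.</p>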
        
                <h2>Qwen</h2>
                <p>Qwen is a solid architecture with a well-designed tokenizer and good multilingual support. At large scales it performs very competitively. But at sub-1M parameters, <strong>the architectural differences between Qwen and Llama are negligible</strong> — both use RoPE, RMSNorm, and SwiGLU FFN under the hood. Switching to Qwen would bring no measurable improvement for v3.</p>
        
                <h2>Mamba (State Space Models)</h2>
                <p>Mamba was the most interesting alternative we considered. State Space Models replace attention entirely with a recurrent mechanism that scales linearly with sequence length — theoretically very efficient for small models and long contexts.</p>
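                <p>The linear scaling is easiest to see in code. The sketch below is a deliberately simplified scalar state space model with fixed coefficients. It is not Mamba's selective scan, and the function names and coefficient values are assumptions for illustration.</p>
<pre><code>
```python
# Illustrative sketch only: a fixed-coefficient scalar SSM, not Mamba's selective scan.

def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run h_t = a * h_(t-1) + b * x_t and emit y_t = c * h_t."""
    h = 0.0
    ys = []
    for x in xs:          # one constant-cost update per token
        h = a * h + b * x
        ys.append(c * h)
    return ys


def causal_attention_pairs(seq_len):
    """Query/key pairs scored by full causal attention."""
    return seq_len * (seq_len + 1) // 2


print(ssm_scan([1.0, 0.0, 0.0], a=0.5))
print("SSM state updates for 512 tokens:", 512)
print("Attention pairs for 512 tokens:", causal_attention_pairs(512))
```
</code></pre>
                <p>The recurrence touches each token once, while causal attention scores every token against all earlier ones, so the gap widens as the context grows.</p>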
                <p>In theory, <strong>Mamba could be a great fit for tiny models</strong>. In practice, HuggingFace support is still unstable, training on a Kaggle T4 would require significant custom work, and debugging issues mid-training would be painful. The risk outweighs the potential gain for a project at our stage. We are keeping an eye on it for future versions.</p>
        
                <h2>So why Llama?</h2>
                <p>At ~500k parameters, <strong>all the Transformer-based candidates share the same core building blocks</strong>: RoPE positional encoding, RMSNorm, SwiGLU feed-forward networks. The fancy additions of each architecture are optimizations for scale, and scale is exactly what we don't have.</p>
                <p>Llama wins because it is <strong>stable, well documented, natively supported in HuggingFace Transformers, and battle-tested at tiny scales</strong>. No surprises, no custom kernels, no fragile integrations. For Supra Mini v3, the real gains come from vocabulary size, data quality, and training setup, not from the architecture. We have already optimized those.</p>
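                <p>A rough parameter count shows why vocabulary size carries so much weight at this scale. The sketch below counts weights for a Llama-style decoder. Every number in the example configuration is hypothetical and is not the actual Supra Mini v3 setup.</p>
<pre><code>
```python
# Illustrative sketch only: this configuration is hypothetical, not the Supra Mini v3 config.

def llama_param_count(vocab, d_model, n_layers, d_ff, n_heads, n_kv_heads,
                      tie_embeddings=True):
    """Count weights in a Llama-style decoder (no biases, RoPE has none)."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    attention = 2 * d_model * d_model + 2 * d_model * kv_dim  # q, o + k, v
    mlp = 3 * d_model * d_ff                                  # SwiGLU
    norms = 2 * d_model                                       # two RMSNorms
    embedding = vocab * d_model
    head = 0 if tie_embeddings else vocab * d_model
    total = embedding + n_layers * (attention + mlp + norms) + d_model + head
    return total, embedding


total, embedding = llama_param_count(
    vocab=2048, d_model=64, n_layers=4, d_ff=172, n_heads=8, n_kv_heads=8)
print(f"total: {total:,}  embedding share: {embedding / total:.0%}")
```
</code></pre>
                <p>With these assumed numbers the embedding table alone is roughly 40% of the model, so a change in vocabulary size moves the budget far more than any attention variant would.</p>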
        
                <h2>Final thought</h2>
                <p><strong>Choosing the right architecture is not about picking the newest or most complex one — it is about picking the right tool for your scale. For sub-1M models, Llama is that tool.</strong></p>
                <div class="tags">
                    <span class="tag">#research</span>
                    <span class="tag">#architecture</span>
                    <span class="tag">#llama</span>
                    <span class="tag">#open-source</span>
                    <span class="tag">#tinyml</span>
                </div>
            </div>
        </article>

        <footer>
            <p class="mono">&copy; 2026 SupraLabs // Built for the community.</p>
        </footer>
    </div>

</body>
</html>