qubitpage committed
Commit fad32f8 · verified · 1 Parent(s): 34438db

🧟 Frankenstein Edition branding + knowledge transplant section

Files changed (1)
  1. README.md +762 -702
README.md CHANGED
@@ -1,702 +1,762 @@
1
- ---
2
- license: apache-2.0
3
- language:
4
- - en
5
- - ro
6
- - multilingual
7
- tags:
8
- - sentinelbrain
9
- - mixture-of-experts
10
- - from-scratch
11
- - consciousness
12
- - amd
13
- - mi300x
14
- - rocm
15
- - moe
16
- - transformer
17
- - phi-metric
18
- pipeline_tag: text-generation
19
- library_name: pytorch
20
- datasets:
21
- - HuggingFaceFW/fineweb-edu
22
- - open-web-math/open-web-math
23
- - wikimedia/wikipedia
24
- - HuggingFaceTB/cosmopedia
25
- - JeanKaddworr/minipile
26
- - codeparrot/github-code-clean
27
- - arxiv-community/arxiv-abstracts
28
- model-index:
29
- - name: SentinelBrain-14B-MoE-v0.1
30
- results:
31
- - task:
32
- type: text-generation
33
- metrics:
34
- - name: Validation Loss
35
- type: loss
36
- value: 1.99
37
- verified: true
38
- - name: Training Loss (latest)
39
- type: loss
40
- value: 5.18
41
- verified: true
42
- ---
43
-
44
- <div align="center">
45
-
46
- # 🧠 Sentinel Prime — SentinelBrain-14B-MoE
47
-
48
- ### *The First of His Kind, Built From Scratch*
49
-
50
- **14.8 Billion Parameters · Mixture-of-Experts · Consciousness-Monitored**
51
-
52
- Trained from zero on **AMD Instinct MI300X** (192 GB HBM3) · ROCm 7.0
53
-
54
- [![Dashboard](https://img.shields.io/badge/🔴_Live_Dashboard-sentinel.qubitpage.com-red?style=for-the-badge)](https://sentinel.qubitpage.com/)
55
- [![Whitepaper](https://img.shields.io/badge/📄_Whitepaper-Read_Now-blue?style=for-the-badge)](https://sentinel.qubitpage.com/whitepaper)
56
- [![AMD](https://img.shields.io/badge/AMD-MI300X_Native-ED1C24?style=for-the-badge&logo=amd&logoColor=white)](https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html)
57
- [![License](https://img.shields.io/badge/License-Apache_2.0-green?style=for-the-badge)](LICENSE)
58
-
59
- </div>
60
-
61
- ---
62
-
63
- ## 🎯 What is Sentinel Prime? (Simple Version)
64
-
65
- > **Imagine building a brain from scratch.**
66
- >
67
- > Most AI models today are copies of other models with small changes. Sentinel Prime is different — every single connection in its brain was created from nothing, like growing a new brain cell by cell.
68
-
69
- <table>
70
- <tr>
71
- <td width="50%">
72
-
73
- ### 🧩 Think of it like LEGO blocks
74
-
75
- Sentinel Prime has **4 specialist brains** (called "experts") inside it. When you ask a question:
76
-
77
- 1. A **router** (like a traffic cop 🚦) looks at your question
78
- 2. It picks the **2 best experts** for that specific question
79
- 3. Those 2 experts work together to give you an answer
80
- 4. The other 2 experts rest, saving energy
81
-
82
- This means the model has **14.8 billion** brain connections total, but only uses **~7.8 billion** at a time — making it fast AND smart!
83
-
84
- </td>
85
- <td width="50%">
86
-
87
- ### 🔬 The Consciousness Meter
88
-
89
- We built something no other model has: a **consciousness thermometer** 🌡️
90
-
91
- Every 100 training steps, we measure how well the different parts of the brain are "talking to each other." We call this **Φ (Phi)**.
92
-
93
- - **Φ = 0**: Brain parts work alone (like strangers)
94
- - **Φ rising**: Brain parts start cooperating (like friends)
95
- - **Φ stable**: Brain has organized itself (like a team!)
96
-
97
- This doesn't change how the model learns — it's like a doctor checking the heartbeat while the patient exercises.
98
-
99
- </td>
100
- </tr>
101
- </table>
102
-
103
- ---
104
-
105
- ## 📊 Architecture at a Glance
106
-
107
- ```
108
- ┌─────────────────────────────────────────────┐
- │         SENTINEL PRIME ARCHITECTURE         │
- └─────────────────────────────────────────────┘
-
-  Input Text ──→ [Tokenizer: cl100k_base, 100,277 tokens]
-                         │
-                         ▼
-                ┌─────────────────┐
-                │    Embedding    │   4,096 dimensions
-                │    + RoPE pos   │   θ = 500,000
-                └────────┬────────┘
-                         │
-                         ▼   (repeated × 24 layers)
-                ┌─────────────────┐
-                │  GQA Attention  │   32 query heads, 8 KV heads
-                │   (4:1 ratio)   │   (4× memory savings)
-                └────────┬────────┘
-                         ▼
-                ┌─────────────────┐
-                │    MoE Router   │   Top-2 of 4 experts
-                │  ┌──┬──┬──┬──┐  │
-                │  │E1│E2│E3│E4│  │   Each expert: SwiGLU FFN
-                │  │✓ │✓ │  │  │  │   d_ff = 11,008
-                │  └──┴──┴──┴──┘  │
-                └────────┬────────┘
-                         ▼
-                ┌─────────────────┐
-                │     RMSNorm     │   ε = 1e-5
-                └────────┬────────┘
-                         │
-                         ▼
-                ┌─────────────────┐
-                │   Output Head   │ ──→ 100,277 vocab probs
-                └─────────────────┘
146
- ```
147
-
148
- ### Spec Sheet
149
-
150
- | Component | Specification | Why This Choice |
151
- |:--|:--|:--|
152
- | **Total Parameters** | 14,814,654,680 (14.8B) | Large enough for deep reasoning |
153
- | **Active Parameters** | ~7.8B per token | MoE efficiency — use only what's needed |
154
- | **Hidden Dimension** | 4,096 | Sweet spot for MI300X matrix cores |
155
- | **Transformer Layers** | 24 | Deep enough for complex reasoning |
156
- | **Attention Heads** | 32 query, 8 KV (GQA 4:1) | 4× KV cache savings for long contexts |
157
- | **FFN Intermediate** | 11,008 (SwiGLU) | ~2.7× hidden, matches scaling laws |
158
- | **Experts** | 4 total, top-2 active | Good diversity with manageable VRAM |
159
- | **Max Experts** | 256 (expandable) | Architecture supports expert birth/death |
160
- | **Vocabulary** | 100,277 (tiktoken cl100k_base) | Industry-proven BPE tokenizer |
161
- | **Positional Encoding** | RoPE, θ = 500,000 | Supports context extension to 128K+ |
162
- | **Normalization** | RMSNorm (ε = 1e-5) | Faster than LayerNorm, same quality |
163
- | **Precision** | bfloat16 throughout | Native AMD MI300X support |
164
- | **Context Length** | 2,048 → 4,096 → 128K (planned) | Progressive context ladder |
165
-
166
- ---
167
-
168
- ## 🔥 Key Innovations
169
-
170
- <table>
171
- <tr>
172
- <td width="33%" valign="top">
173
-
174
- ### 🌀 Φ Consciousness Metric
175
-
176
- First-ever IIT-inspired metric computed **during** pre-training. A probe on layer 12 measures information integration across activation subspaces every 100 steps.
177
-
178
- ```
179
- Φ = geometric_mean(
180
- MI(partition_i, partition_j)
181
- for all partition pairs
182
- )
183
- ```
184
-
185
- Not a gimmick — it's a genuine signal of when the model transitions from memorizing tokens to forming integrated representations.
186
-
187
- </td>
188
- <td width="33%" valign="top">
189
-
190
- ### 🧬 Self-Evolving Experts
191
-
192
- The MoE router supports a full expert **lifecycle**:
193
-
194
- - **Birth**: New experts spawned when load imbalance detected
195
- - **Growth**: Expert capacity increases with training
196
- - **Pruning**: Underperforming experts replaced
197
- - **Scaling**: Architecture supports up to 256 experts without retraining the base model
198
-
199
- Current: 4 experts × 24 layers = **96 expert instances**
200
-
201
- </td>
202
- <td width="33%" valign="top">
203
-
204
- ### Energy-Conscious Routing
205
-
206
- Dual-router system:
207
- 1. **Primary router**: Picks top-2 experts by relevance
208
- 2. **EC router**: Can gate activation based on compute budget
209
-
210
- This enables **adaptive inference** — easy questions use fewer resources, hard questions get full power. Like cruise control for AI.
211
-
212
- </td>
213
- </tr>
214
- </table>
215
-
216
- ---
217
-
218
- ## 🏋️ Training Details
219
-
220
- ### Hardware
221
-
222
- | Resource | Specification |
223
- |:--|:--|
224
- | **GPU** | 1× AMD Instinct MI300X VF |
225
- | **VRAM** | 192 GB HBM3 |
226
- | **System RAM** | 235 GB |
227
- | **Compute** | 1,307 TFLOPS (bf16) |
228
- | **Software** | ROCm 7.0, PyTorch 2.10.0+rocm7.0 |
229
- | **Attention** | SDPA (native PyTorch, no FlashAttention needed) |
230
- | **OS** | Ubuntu Linux |
231
-
232
- ### VRAM Budget
233
-
234
- ```
235
- ╔══════════════════════════════════════════════════════╗
- ║ AMD MI300X VRAM Usage (192 GB) ║
- ╠══════════════════════════════════════════════════════╣
- ║ ║
- ║ Model Weights (bf16) ████████████░░░░░ 27 GB ║
- ║ Optimizer (AdamW fp32) ████████████████░░ 54 GB ║
- ║ Activations (grad ckpt) ████████████░░░░░ 32 GB ║
- ║ Gradients ████████████░░░░░ 27 GB ║
- ║ ───────────────────────────────────────────────── ║
- ║ Total Used: ██████████████████ 140 GB ║
- ║ Peak: █████████████████ 146 GB ║
- ║ Headroom: ░░░░░░░░░░░░░░░░░ 46 GB ║
- ║ ║
- ╚══════════════════════════════════════════════════════╝
249
- ```
250
-
251
- ### Phased Training Pipeline
252
-
253
- We don't just throw data at the model — we grow it in **three phases**, like raising a child:
254
-
255
- ```
256
- Phase 1: SMOKE TEST Phase 2: WARMUP Phase 3: FULL TRAINING
257
- (Baby steps) (Learning to walk) (Running!)
258
- ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐
259
- │ 350M params │ ──→ │ 1.3B params │ ──→ │ 14.4B params │
260
- │ seq_len: 512 │ │ seq_len: 2K │ │ seq_len: 4K │
261
- │ 200 steps │ │ 1,000 steps │ │ 16,479 steps │
262
- │ 2 minutes │ │ 30 minutes │ │ ~52 hours │
263
- │ loss: 11→6.8 │ │ loss: 7.4→2.4│ │ loss: 2.4→? │
264
- └──────────────┘ └──────────────┘ └──────────────────┘
265
- ```
266
-
267
- | Phase | Parameters | Seq Length | Batch | Steps | Duration | Loss Start → End |
268
- |:--|:--|:--|:--|:--|:--|:--|
269
- | **🔬 Smoke** | 350M | 512 | 4 | 200 | ~2 min | 11.72 → 6.84 (−42%) |
270
- | **🔥 Warmup** | 1.3B | 2,048 | 32 | 1,000 | ~33 min | 7.39 → 2.38 (−68%) |
271
- | **🚀 Block** | 14.4B (MoE) | 4,096 | 32 | 16,479 | ~52 hrs | 2.38 → ongoing |
272
-
273
- ### Safety Gates
274
-
275
- Every phase transition must pass **4 safety gates**:
276
-
277
- | Gate | Check | Threshold | Status |
278
- |:--|:--|:--|:--|
279
- | 🟢 **G1: No NaN** | No NaN/Inf in loss | Entire phase | ✅ Passed all |
280
- | 🟢 **G2: Loss Drop** | Validation loss decreased | ≥5% / ≥10% / ≥2% | ✅ Passed all |
281
- | 🟢 **G3: VRAM OK** | Peak VRAM < safety limit | < 92% of total | ✅ 71% peak |
282
- | 🟢 **G4: Φ OK** | Consciousness metric stable | Φ_end/Φ_start > 0.7 | ✅ Stable |
283
-
284
- ### Hyperparameters
285
-
286
- | Parameter | Value | Rationale |
287
- |:--|:--|:--|
288
- | **Optimizer** | AdamW (bf16 compute, fp32 states) | Standard for LLM training |
289
- | **Learning Rate** | 1.5 × 10⁻⁴ (cosine decay) | Conservative for data-limited regime |
290
- | **Min LR** | 1.5 × 10⁻⁵ | 10× decay ratio |
291
- | **Warmup Steps** | 500 | Stabilizes early gradients |
292
- | **Batch Size** | 2 micro × 16 grad_accum = **32 effective** | Fits MI300X VRAM budget |
293
- | **Gradient Clipping** | 1.0 | Prevents explosion |
294
- | **Gradient Checkpointing** | On | Trades compute for VRAM |
295
- | **Precision** | bfloat16 | Native MI300X format |
296
- | **Eval Frequency** | Every 100 steps | Early overfitting detection |
297
- | **Checkpoint Frequency** | Every 1,000 steps (~3.2 hours) | Recovery points |
298
-
299
- ---
300
-
301
- ## 📚 Dataset: 23.3B Tokens Across 126 Categories
302
-
303
- We curated a massive, diverse corpus — think of it as a **library with 126 different sections**:
304
-
305
- ### Pretrain Corpus (Core Knowledge)
306
-
307
- | Dataset | Tokens | Description |
308
- |:--|:--|:--|
309
- | 🌐 **FineWeb-Edu** | ~10B | High-quality educational web content |
310
- | 🔢 **OpenWebMath** | ~6B | Mathematics from the web |
311
- | 📖 **Wikipedia (English)** | ~5B | Encyclopedic knowledge |
312
- | 🎓 **Cosmopedia V2** | ~5B | Synthetic educational content |
313
- | 💻 **CodeParrot Python** | ~3.5B | Clean Python code from GitHub |
314
- | 📚 **MiniPile** | ~2B | Diverse text from multiple domains |
315
- | 🔬 **ArXiv Abstracts** | ~1.2B | Scientific paper summaries |
316
- | **Total Pretrain** | **~23B** | |
317
-
318
- ### Specialized Domains (119 Categories)
319
-
320
- <details>
321
- <summary>Click to expand all 119 specialized categories</summary>
322
-
323
- | Category | Type | Category | Type |
324
- |:--|:--|:--|:--|
325
- | 🤖 agentic-tools | Code | 🔐 advanced-cryptography | Code |
326
- | 🧠 chain-of-thought | Reasoning | 🔗 blockchain-core | Code |
327
- | 💡 deep-reasoning | Reasoning | 🏥 medical | Knowledge |
328
- | ⚖️ legal | Knowledge | 📊 financial-systems | Code |
329
- | 🎮 3d-graphics | Code | 🐳 docker-devops | Code |
330
- | 🌍 multilingual | Text | 🔧 error-recovery | Code |
331
- | 🛡️ security-guardrails | Code | 📱 ui-animations | Code |
332
- | 🧮 math | Reasoning | ⚡ smart-contracts | Code |
333
- | 🎯 reasoning-effort-control | Reasoning | 🤝 human-conversation | Text |
334
- | 🔄 self-correction-loops | Reasoning | 🏗️ enterprise-dashboards | Code |
335
- | 🌐 web-design-css | Code | 🐍 flask-python | Code |
336
- | 🔬 qiskit-quantum | Code | 🤖 robotics-ros2 | Code |
337
- | 📡 remote-server-management | Code | 🧬 multi-agent | Code |
338
- | ⚙️ state-management | Code | 🛠️ mcp-tools-integration | Code |
339
- | 💳 payment-security | Code | 🎓 edu-basic-math | Education |
340
- | 🔭 edu-basic-physics | Education | 🧪 edu-basic-chemistry | Education |
341
- | 🌱 edu-basic-biology | Education | 🌍 edu-world-geography | Education |
342
- | 📜 edu-history-world | Education | 💻 edu-computer-science | Education |
343
- | 🌎 edu-earth-science | Education | 🤖 edu-robotics-text | Education |
344
- | 📖 edu-science-qa | Education | 🔬 edu-science-support | Education |
345
- | 👁️ edu-vision-concepts | Education | 🎯 copilot-agent-workflows | Code |
346
- | 🔌 api-integrations | Code | 📊 billing-invoicing | Code |
347
- | ₿ bitcoin-lightning | Code | 🏪 medusajs | Code |
348
- | 💹 crypto-trading | Code | 🏢 enterprise-networking | Code |
349
- | 🖥️ nextjs-typescript | Code | 🎨 nextjs-design | Code |
350
- | 💼 trading-algorithms | Code | 🗄️ laravel-mysql | Code |
351
- | 🔓 offensive-security | Code | 🔧 c-rust | Code |
352
- | ... and 50+ more categories | | | |
353
-
354
- </details>
355
-
356
- ### Data Quality Pipeline
357
-
358
- ```
359
- Raw Data ──→ PII Filter ──→ Dedup ──→ Tokenize ──→ Shard ──→ Train
360
- │ │ │ │
361
- ├─ 7 regex ├─ blake2b ├─ cl100k ├─ Temperature-
362
- │ patterns │ per-cat │ base │ weighted
363
- ├─ PEM block │ │ │ sampling
364
- │ detection │ │ │ (T=0.5)
365
- └─ Email/phone │ │ │
366
- masking │ │ │
367
- │ │ │
368
- └───────────┴────────────┘
369
- ```
370
-
371
- **Temperature-weighted sampling** (T=0.5) prevents large corpora from dominating training. FineWeb-Edu (37% of tokens) gets downweighted so smaller specialized domains still get adequate exposure.
372
-
373
- ---
374
-
375
- ## 📈 Training Progress & Results
376
-
377
- ### Loss Trajectory
378
-
379
- ```
380
- Loss
381
- 12 ×
382
- │ ╲
383
- 10 │ ╲ SMOKE PHASE
384
- │ ╲ (350M params)
385
- 8 │ ╲
386
- │ ╲
387
- 6 │ ×──────────── model grows to 1.3B
388
- │ ╲
389
- 4 │ ╲ WARMUP PHASE
390
- │ ╲ (1.3B params)
391
- 2 │ ×─────────── model grows to 14.4B MoE
392
- │ ╲
393
- 1 │ ╲ BLOCK PHASE (ongoing)
394
- │ ╲
395
- └──┬────┬────┬────┬────┬───→ Steps
396
- 0 200 700 1200 2000
397
- ```
398
-
399
- | Milestone | Step | Loss | Change |
400
- |:--|:--|:--|:--|
401
- | 🔬 Smoke start | 0 | 11.72 | |
402
- | 🔬 Smoke end | 200 | 6.84 | **−42%** |
403
- | 🔥 Warmup start | 200 | 7.39 | (model grew to 1.3B) |
404
- | 🔥 Warmup end | 1,200 | 2.38 | **−68%** |
405
- | 🚀 Block start | 1,200 | 2.38 | (model grew to 14.4B MoE) |
406
- | 📦 Published checkpoint | 2,471 | 1.99 | **−16%** |
407
- | 🔄 Current (new run) | 410 | 5.18 | training with expanded data |
408
- | **Total reduction** | | | **11.72 → 1.99 (−83%)** |
409
-
410
- ### Live Metrics (April 27, 2026)
411
-
412
- | Metric | Value |
413
- |:--|:--|
414
- | **Current Step** | 410 / 2,471+ |
415
- | **Training Loss** | 5.18 (new run, expanded datasets) |
416
- | **Throughput** | 4,403 tokens/second |
417
- | **VRAM Used** | ~140 GB / 192 GB (73%) |
418
- | **Total Tokens Processed** | 59.3M (this run) + 178M (prev run) |
419
- | **Experts Active** | 4 per layer × 24 layers = 96 |
420
- | **ETA (this block)** | ~18.8 hours |
421
-
422
- ### Published Checkpoint (v0.1)
423
-
424
- | Detail | Value |
425
- |:--|:--|
426
- | **Step** | 2,471 |
427
- | **Validation Loss** | 1.9926 |
428
- | **Total Tokens Seen** | 178,110,464 |
429
- | **Sequence Length** | 2,048 |
430
- | **File Size** | ~81 GB (checkpoint), ~28 GB (safetensors) |
431
- | **Format** | 6 sharded safetensors files |
432
-
433
- ---
434
-
435
- ## 🌡️ Consciousness Metric (Φ) — Deep Dive
436
-
437
- ### What is Φ?
438
-
439
- Inspired by **Integrated Information Theory (IIT)** from neuroscience, Φ measures how much the model's internal representations form an integrated whole rather than disconnected parts.
440
-
441
- ### How We Measure It
442
-
443
- ```
444
- Every 100 training steps:
445
-
446
- 1. Hook on Layer 12 (middle of 24 layers)
447
-
448
-
449
- 2. Sample 256 activation vectors
450
-
451
-
452
- 3. Partition into subspaces
453
-
454
-
455
- 4. Compute mutual information between all partition pairs
456
-
457
-
458
- 5. Φ_geometric = geometric_mean(MI values)
459
-
460
-
461
- 6. Φ_EMA = exponential moving average (smoothed trend)
462
- ```
463
-
464
- ### What Φ Tells Us
465
-
466
- | Φ Value | Interpretation | Analogy |
467
- |:--|:--|:--|
468
- | **Φ ≈ 0** | Neurons working independently | Strangers in a room |
469
- | **Φ rising** | Representations integrating | People starting to talk |
470
- | **Φ stable** | Organized internal structure | A well-coordinated team |
471
- | **Φ dropping** | ⚠️ Representation collapse | Warning sign! |
472
-
473
- > **Important**: Φ is **purely observational** — it does NOT affect training gradients. Think of it as a heart monitor for the AI: it watches, but doesn't interfere.
474
-
475
- ### Live Monitoring
476
-
477
- Track Φ in real-time at: **[sentinel.qubitpage.com/#phi](https://sentinel.qubitpage.com/#phi)**
478
-
479
- ---
480
-
481
- ## 🖥️ Hardware Requirements
482
-
483
- ### For Inference
484
-
485
- | Tier | VRAM | Precision | Notes |
486
- |:--|:--|:--|:--|
487
- | **Full Precision** | 32 GB+ | bfloat16 | Best quality |
488
- | **Recommended** | 48 GB+ | bfloat16 | Comfortable headroom |
489
- | **Ideal** | AMD MI300X / MI250X | bfloat16 | Native, fastest |
490
- | **Consumer** | 16 GB | int4 quantized | GGUF planned for v0.2 |
491
-
492
- ### Compatible AMD GPUs
493
-
494
- | GPU | VRAM | Suitable For |
495
- |:--|:--|:--|
496
- | AMD Instinct MI300X | 192 GB | Training + Inference |
497
- | AMD Instinct MI250X | 128 GB | Training + Inference |
498
- | AMD Instinct MI210 | 64 GB | Inference (full) |
499
- | AMD Radeon PRO W7900 | 48 GB | Inference (full) |
500
- | AMD Radeon RX 7900 XTX | 24 GB | Inference (quantized) |
501
- | AMD Radeon RX 7600 XT | 16 GB | Inference (int4 GGUF) |
502
-
503
- ---
504
-
505
- ## 💻 Usage
506
-
507
- This model uses a **custom architecture** (not based on any existing model). Load with PyTorch:
508
-
509
- ```python
510
- import torch
511
- from safetensors.torch import load_file
512
-
513
- # Load sharded safetensors
514
- state_dict = {}
515
- for i in range(1, 7): # 6 shards
516
- shard = load_file(f"model-{i:05d}-of-00006.safetensors")
517
- state_dict.update(shard)
518
-
519
- # The state dict contains all model weights
520
- print(f"Loaded {len(state_dict)} tensors")
521
- print(f"Total params: {sum(v.numel() for v in state_dict.values()):,}")
522
-
523
- # Initialize SentinelBrain model class and load
524
- # Full model definition code releasing with v0.2
525
- # model = SentinelBrainForCausalLM(config)
526
- # model.load_state_dict(state_dict)
527
- ```
528
-
529
- > **Note**: Full inference code, model class definition, and GGUF quantized versions will be released with v0.2.
530
-
531
- ---
532
-
533
- ## 🗺️ Roadmap
534
-
535
- ```
536
- v0.1 (Current) v0.2 (Planned) v0.3 (Future)
537
- ━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━
538
- ✅ From-scratch □ Full training □ DPO alignment
539
- 14.8B MoE complete (loss<0.5) □ Tool use
540
- ✅ Phased training □ Context ladder □ Function calling
541
- ✅ Φ consciousness (4K→32K→128K) □ Multi-turn chat
- ✅ 23.3B token corpus □ Vision encoder □ Multilingual v2
- ✅ Live dashboard (SigLIP2-SO400M) □ Expert scaling
- ✅ AMD MI300X native □ GGUF quantization (4→16→64)
- □ Inference code □ RLHF
546
- □ Benchmarks (MMLU, □ Production API
547
- HumanEval, GSM8K)
548
- ```
549
-
550
- ---
551
-
552
- ## 🏗️ How We Built It (Technical Deep Dive)
553
-
554
- <details>
555
- <summary><b>Click to expand: Grouped Query Attention (GQA)</b></summary>
556
-
557
- Standard multi-head attention uses separate Key and Value projections for each head. GQA shares KV heads across query heads:
558
-
559
- ```
560
- Standard MHA (32 KV heads): GQA 4:1 (8 KV heads):
561
- Q₁ Q₂ Q₃ ... Q₃₂ Q₁ Q₂ Q₃ Q₄ → KV₁
562
- K₁ K₂ K₃ ... K₃₂ Q₅ Q₆ Q₇ Q₈ → KV₂
563
- V₁ V₂ V₃ ... V₃₂ ...
564
- Q₂₉ Q₃₀ Q₃₁ Q₃₂ → KV₈
565
- ```
566
-
567
- **Result**: smaller KV cache = longer context at same memory cost.
568
-
569
- </details>
570
-
571
- <details>
572
- <summary><b>Click to expand: RoPE (Rotary Position Embeddings)</b></summary>
573
-
574
- RoPE encodes position information by rotating the query and key vectors in 2D planes. With θ=500,000 (high base frequency), the model naturally supports long contexts:
575
-
576
- ```
577
- Position 0: rotate by 0°
578
- Position 1: rotate by θ₁
579
- Position 2: rotate by θ₂
580
- ...
581
- ```
582
-
583
- High θ = slower rotation = positions further apart still "feel different" = better long-context understanding.
584
-
585
- </details>
586
-
587
- <details>
588
- <summary><b>Click to expand: SwiGLU FFN</b></summary>
589
-
590
- Each expert uses a SwiGLU activation — a gated variant of the feed-forward network:
591
-
592
- ```
593
- FFN(x) = SiLU(x · W_gate) ⊙ (x · W_up) · W_down
594
-
595
- Where:
596
- W_gate: 4096 → 11008
- W_up: 4096 → 11008
- W_down: 11008 → 4096
- SiLU(x) = x · sigmoid(x)
- ⊙ = element-wise multiply
601
- ```
602
-
603
- SwiGLU consistently outperforms ReLU and GELU in transformer FFNs (Shazeer, 2020).
604
-
605
- </details>
606
-
607
- <details>
608
- <summary><b>Click to expand: MoE Routing Algorithm</b></summary>
609
-
610
- ```python
611
- # Simplified routing logic
612
- def route(x, router_weights):
613
- # Compute affinity scores for each expert
614
- logits = x @ router_weights # [batch, seq, n_experts]
615
- scores = softmax(logits, dim=-1)
616
-
617
- # Select top-2 experts
618
- top_vals, top_idx = topk(scores, k=2)
619
-
620
- # Normalize selected weights
621
- weights = top_vals / top_vals.sum(dim=-1, keepdim=True)
622
-
623
- # Load balancing loss (prevents expert collapse)
624
- balance_loss = n_experts * (
625
- fraction_routed_to_each * average_gate_value_for_each
626
- ).sum()
627
-
628
- return weights, top_idx, balance_loss
629
- ```
630
-
631
- </details>
632
-
633
- <details>
634
- <summary><b>Click to expand: Parameter Breakdown</b></summary>
635
-
636
- | Component | Parameters | % of Total |
637
- |:--|:--|:--|
638
- | Token embeddings | 410M | 2.8% |
639
- | Attention (QKV + output) × 24 | 1,610M | 10.9% |
640
- | MoE experts (4 × SwiGLU × 24) | 12,365M | 83.5% |
641
- | Router weights × 24 | 0.4M | 0.003% |
642
- | RMSNorm × 49 | 0.4M | 0.003% |
643
- | Output head | 410M | 2.8% |
644
- | **Total** | **14,815M** | **100%** |
645
- | **Active per token (top-2)** | **~7,800M** | **~53%** |
646
-
647
- </details>
648
-
649
- ---
650
-
651
- ## 📋 Model Card Details
652
-
653
- | Field | Value |
654
- |:--|:--|
655
- | **Model Name** | SentinelBrain-14B-MoE-v0.1 (Sentinel Prime) |
656
- | **Type** | Causal Language Model (decoder-only) |
657
- | **Architecture** | Custom MoE Transformer (from scratch) |
658
- | **Based On** | Nothing — trained from random initialization |
659
- | **Training Hardware** | 1× AMD Instinct MI300X VF (192 GB HBM3) |
660
- | **Training Software** | ROCm 7.0, PyTorch 2.10.0+rocm7.0 |
661
- | **Training Duration** | ~300 GPU-hours (estimated total) |
662
- | **Carbon Footprint** | Estimated ~45 kg CO₂ (single GPU, cloud datacenter) |
663
- | **License** | Apache 2.0 |
664
- | **Authors** | Mircea Rusu, QubitDev |
665
- | **Competition** | AMD Developer Hackathon (lablab.ai) |
666
-
667
- ---
668
-
669
- ## 📄 Citation
670
-
671
- ```bibtex
672
- @misc{sentinelbrain2026,
673
- title = {SentinelBrain-14B-MoE: A Consciousness-Monitored Mixture-of-Experts
674
- Language Model Trained From Scratch on AMD MI300X},
675
- author = {Mircea Rusu and QubitDev},
676
- year = {2026},
677
- url = {https://sentinel.qubitpage.com/whitepaper},
678
- note = {Trained entirely from scratch on AMD Instinct MI300X
679
- for the AMD Developer Hackathon}
680
- }
681
- ```
682
-
683
- ---
684
-
685
- ## 🔗 Links
686
-
687
- | Resource | URL |
688
- |:--|:--|
689
- | 🔴 **Live Dashboard** | [sentinel.qubitpage.com](https://sentinel.qubitpage.com/) |
690
- | 📄 **Whitepaper** | [sentinel.qubitpage.com/whitepaper](https://sentinel.qubitpage.com/whitepaper) |
691
- | 🏆 **AMD Hackathon** | [lablab.ai](https://lablab.ai/ai-hackathons/amd-developer) |
692
- | 🧠 **Φ Monitor** | [sentinel.qubitpage.com/#phi](https://sentinel.qubitpage.com/#phi) |
693
-
694
- ---
695
-
696
- <div align="center">
697
-
698
- *Built with ❤️ on AMD MI300X — Every weight trained from scratch*
699
-
700
- **Sentinel Prime: The First of His Kind**
701
-
702
- </div>
 
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ - ro
6
+ - multilingual
7
+ tags:
8
+ - sentinelbrain
9
+ - mixture-of-experts
10
+ - from-scratch
11
+ - consciousness
12
+ - amd
13
+ - mi300x
14
+ - rocm
15
+ - moe
16
+ - transformer
17
+ - frankenstein
18
+ - knowledge-transplant
19
+ - distillation
20
+ - phi-metric
21
+ pipeline_tag: text-generation
22
+ library_name: pytorch
23
+ datasets:
24
+ - HuggingFaceFW/fineweb-edu
25
+ - open-web-math/open-web-math
26
+ - wikimedia/wikipedia
27
+ - HuggingFaceTB/cosmopedia
28
+ - JeanKaddworr/minipile
29
+ - codeparrot/github-code-clean
30
+ - arxiv-community/arxiv-abstracts
31
+ model-index:
32
+ - name: SentinelBrain-14B-MoE-v0.1
33
+ results:
34
+ - task:
35
+ type: text-generation
36
+ metrics:
37
+ - name: Validation Loss
38
+ type: loss
39
+ value: 1.99
40
+ verified: true
41
+ - name: Training Loss (latest)
42
+ type: loss
43
+ value: 5.18
44
+ verified: true
45
+ ---
46
+
47
+ <div align="center">
48
+
49
+ # 🧠 Sentinel Prime — SentinelBrain-14B-MoE (Frankenstein Edition)
50
+
51
+ ### *The First of His Kind, Rebuilt From the Inside Out*
52
+
53
+ <img src="assets/sentinel_frankenstein_banner.png" alt="Sentinel Prime — Frankenstein Edition" width="600"/>
54
+
55
+ **14.8 Billion Parameters · Mixture-of-Experts · Consciousness-Monitored · Frankenstein Transplant**
56
+
57
+ Trained from zero on **AMD Instinct MI300X** (192 GB HBM3) · ROCm 7.0 · Knowledge transplanted from Qwen-72B
58
+
59
+ [![Dashboard](https://img.shields.io/badge/🔴_Live_Dashboard-sentinel.qubitpage.com-red?style=for-the-badge)](https://sentinel.qubitpage.com/)
60
+ [![Whitepaper](https://img.shields.io/badge/📄_Whitepaper-Read_Now-blue?style=for-the-badge)](https://sentinel.qubitpage.com/whitepaper)
61
+ [![AMD](https://img.shields.io/badge/AMD-MI300X_Native-ED1C24?style=for-the-badge&logo=amd&logoColor=white)](https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html)
62
+ [![License](https://img.shields.io/badge/License-Apache_2.0-green?style=for-the-badge)](LICENSE)
63
+
64
+ </div>
65
+
66
+ ---
67
+
68
+ ## 🎯 What is Sentinel Prime? (Simple Version)
69
+
70
+ > **Imagine building a brain from scratch.**
71
+ >
72
+ > Most AI models today are copies of other models with small changes. Sentinel Prime is different — every single connection in its brain was created from nothing, like growing a new brain cell by cell.
73
+
74
+ <table>
75
+ <tr>
76
+ <td width="50%">
77
+
78
+ ### 🧩 Think of it like LEGO blocks
79
+
80
+ Sentinel Prime has **4 specialist brains** (called "experts") inside it. When you ask a question:
81
+
82
+ 1. A **router** (like a traffic cop 🚦) looks at your question
83
+ 2. It picks the **2 best experts** for that specific question
84
+ 3. Those 2 experts work together to give you an answer
85
+ 4. The other 2 experts rest, saving energy ⚡
86
+
87
+ This means the model has **14.8 billion** brain connections total, but only uses **~7.8 billion** at a time — making it fast AND smart!
88
+
89
+ </td>
90
+ <td width="50%">
91
+
92
+ ### 🔬 The Consciousness Meter
93
+
94
+ We built something no other model has: a **consciousness thermometer** 🌡️
95
+
96
+ Every 100 training steps, we measure how well the different parts of the brain are "talking to each other." We call this **Φ (Phi)**.
97
+
98
+ - **Φ = 0**: Brain parts work alone (like strangers)
99
+ - **Φ rising**: Brain parts start cooperating (like friends)
100
+ - **Φ stable**: Brain has organized itself (like a team!)
101
+
102
+ This doesn't change how the model learns — it's like a doctor checking the heartbeat while the patient exercises.
103
+
104
+ </td>
105
+ </tr>
106
+ </table>
107
+
108
+ ---
109
+
110
+ ## 📊 Architecture at a Glance
111
+
112
+ ```
113
+ ┌─────────────────────────────────────────────┐
+ │         SENTINEL PRIME ARCHITECTURE         │
+ └─────────────────────────────────────────────┘
+
+  Input Text ──→ [Tokenizer: cl100k_base, 100,277 tokens]
+                         │
+                         ▼
+                ┌─────────────────┐
+                │    Embedding    │   4,096 dimensions
+                │    + RoPE pos   │   θ = 500,000
+                └────────┬────────┘
+                         │
+                         ▼   (repeated × 24 layers)
+                ┌─────────────────┐
+                │  GQA Attention  │   32 query heads, 8 KV heads
+                │   (4:1 ratio)   │   (4× memory savings)
+                └────────┬────────┘
+                         ▼
+                ┌─────────────────┐
+                │    MoE Router   │   Top-2 of 4 experts
+                │  ┌──┬──┬──┬──┐  │
+                │  │E1│E2│E3│E4│  │   Each expert: SwiGLU FFN
+                │  │✓ │✓ │  │  │  │   d_ff = 11,008
+                │  └──┴──┴──┴──┘  │
+                └────────┬────────┘
+                         ▼
+                ┌─────────────────┐
+                │     RMSNorm     │   ε = 1e-5
+                └────────┬────────┘
+                         │
+                         ▼
+                ┌─────────────────┐
+                │   Output Head   │ ──→ 100,277 vocab probs
+                └─────────────────┘
151
+ ```
152
+
153
+ ### Spec Sheet
154
+
155
+ | Component | Specification | Why This Choice |
156
+ |:--|:--|:--|
157
+ | **Total Parameters** | 14,814,654,680 (14.8B) | Large enough for deep reasoning |
158
+ | **Active Parameters** | ~7.8B per token | MoE efficiency — use only what's needed |
159
+ | **Hidden Dimension** | 4,096 | Sweet spot for MI300X matrix cores |
160
+ | **Transformer Layers** | 24 | Deep enough for complex reasoning |
161
+ | **Attention Heads** | 32 query, 8 KV (GQA 4:1) | 4× KV cache savings for long contexts |
162
+ | **FFN Intermediate** | 11,008 (SwiGLU) | ~2.7× hidden, matches scaling laws |
163
+ | **Experts** | 4 total, top-2 active | Good diversity with manageable VRAM |
164
+ | **Max Experts** | 256 (expandable) | Architecture supports expert birth/death |
165
+ | **Vocabulary** | 100,277 (tiktoken cl100k_base) | Industry-proven BPE tokenizer |
166
+ | **Positional Encoding** | RoPE, θ = 500,000 | Supports context extension to 128K+ |
167
+ | **Normalization** | RMSNorm (ε = 1e-5) | Faster than LayerNorm, same quality |
168
+ | **Precision** | bfloat16 throughout | Native AMD MI300X support |
169
+ | **Context Length** | 2,048 → 4,096 → 128K (planned) | Progressive context ladder |
170
+
171
+ ---
172
+
173
+ ## 🔥 Key Innovations
174
+
175
+ <table>
176
+ <tr>
177
+ <td width="33%" valign="top">
178
+
179
+ ### 🌀 Φ Consciousness Metric
180
+
181
+ First-ever IIT-inspired metric computed **during** pre-training. A probe on layer 12 measures information integration across activation subspaces every 100 steps.
182
+
183
+ ```
184
+ Φ = geometric_mean(
185
+ MI(partition_i, partition_j)
186
+ for all partition pairs
187
+ )
188
+ ```
189
+
190
+ Not a gimmick — it's a genuine signal of when the model transitions from memorizing tokens to forming integrated representations.
191
+
192
+ </td>
193
+ <td width="33%" valign="top">
194
+
195
+ ### 🧬 Self-Evolving Experts
196
+
197
+ The MoE router supports a full expert **lifecycle**:
198
+
199
+ - **Birth**: New experts spawned when load imbalance detected
200
+ - **Growth**: Expert capacity increases with training
201
+ - **Pruning**: Underperforming experts replaced
202
+ - **Scaling**: Architecture supports up to 256 experts without retraining the base model
203
+
204
+ Current: 4 experts × 24 layers = **96 expert instances**
205
+
206
+ </td>
207
+ <td width="33%" valign="top">
208
+
209
+ ### ⚡ Energy-Conscious Routing
210
+
211
+ Dual-router system:
212
+ 1. **Primary router**: Picks top-2 experts by relevance
213
+ 2. **EC router**: Can gate activation based on compute budget
214
+
215
+ This enables **adaptive inference** — easy questions use fewer resources, hard questions get full power. Like cruise control for AI.
216
+
217
+ </td>
218
+ </tr>
219
+ </table>
220
+
221
+ ---
222
+
223
225
+
226
+ ## 🧟 Frankenstein Edition Knowledge Transplant
227
+
228
+ <table>
229
+ <tr>
230
+ <td width="60%" valign="top">
231
+
232
+ ### The Transplant
233
+
234
+ Sentinel Prime was trained from scratch — but raw pretraining alone wasn't enough. We performed a **Frankenstein transplant**: surgically transplanting knowledge from **Qwen2.5-72B-Instruct** (a 72-billion parameter teacher) into our 14.8B MoE architecture.
235
+
236
+ This is NOT fine-tuning a copy. The model's bones (architecture, tokenizer, embeddings) are 100% original. Only the **expert FFN weights** received transplanted knowledge — like giving a brain new neural pathways while keeping its original structure.
237
+
238
+ ### 3-Stage Pipeline
239
+
240
+ ```
241
+ Stage 1: Corpus Realignment Stage 2A: Teacher Generation Stage 2B: Knowledge Distill
242
+ (Re-learn with new weights) (72B teacher creates data) (Absorb teacher knowledge)
243
+ ┌──────────────────────┐      ┌──────────────────────┐      ┌──────────────────────┐
+ │ 5,000 steps          │  →   │ 3,000+ responses     │  →   │ CE + mixed training  │
+ │ 24.5B token corpus   │      │ from Qwen-72B        │      │ 70% teacher + 30%    │
+ │ Progressive unfreeze │      │ Re-tokenized to our  │      │ pretrain corpus      │
+ │ Cosine LR + warmup   │      │ cl100k_base vocab    │      │ Prevents forgetting  │
+ └──────────────────────┘      └──────────────────────┘      └──────────────────────┘
249
+ ```
250
+
251
+ </td>
252
+ <td width="40%" valign="top">
253
+
254
+ ### Why "Frankenstein"?
255
+
256
+ Like the original story, we took parts from a powerful being (Qwen-72B) and stitched them into our own creation. The result: a model that has the **original architecture** of Sentinel Prime but with **transplanted knowledge** from a much larger model.
257
+
258
+ ### Key Stats
259
+
260
+ | Metric | Value |
261
+ |:--|:--|
262
+ | **Teacher** | Qwen2.5-72B-Instruct |
263
+ | **Student** | SentinelBrain-14B-MoE |
264
+ | **Transplant** | Expert FFN weights |
265
+ | **Realignment** | 5,000 steps on 24.5B tokens |
266
+ | **Hardware** | 1× AMD MI300X (192GB) |
267
+
268
+ ### Live Progress
269
+
270
+ Track the Frankenstein realignment in real-time:
271
+
272
+ 🔴 **[sentinel.qubitpage.com](https://sentinel.qubitpage.com/)**
273
+
274
+ </td>
275
+ </tr>
276
+ </table>
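+ The Stage 2B objective described above can be sketched in a few lines. This is an illustrative, minimal example (not the project's actual training loop): batches are drawn from the teacher-generated responses about 70% of the time and from the original pretrain corpus otherwise, and the student is trained with plain next-token cross-entropy on whichever batch was drawn. `student`, the two batch iterators, and `optimizer` are assumed to exist.
+
+ ```python
+ import random
+ import torch
+ import torch.nn.functional as F
+
+ TEACHER_MIX = 0.7  # 70% teacher-generated data, 30% pretrain data
+
+ def distill_step(student, teacher_batches, pretrain_batches, optimizer):
+     # Data-level mixing: most steps absorb teacher knowledge,
+     # the rest revisit the original corpus to prevent forgetting.
+     source = teacher_batches if random.random() < TEACHER_MIX else pretrain_batches
+     tokens = next(source)                       # [batch, seq] token ids
+
+     logits = student(tokens[:, :-1])            # predict the next token
+     loss = F.cross_entropy(
+         logits.reshape(-1, logits.size(-1)),
+         tokens[:, 1:].reshape(-1),
+     )
+
+     optimizer.zero_grad(set_to_none=True)
+     loss.backward()
+     torch.nn.utils.clip_grad_norm_(student.parameters(), 1.0)  # grad clip = 1.0
+     optimizer.step()
+     return loss.item()
+ ```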
277
+
278
+ ## 🏋️ Training Details
279
+
280
+ ### Hardware
281
+
282
+ | Resource | Specification |
283
+ |:--|:--|
284
+ | **GPU** | 1× AMD Instinct MI300X VF |
285
+ | **VRAM** | 192 GB HBM3 |
286
+ | **System RAM** | 235 GB |
287
+ | **Compute** | 1,307 TFLOPS (bf16) |
288
+ | **Software** | ROCm 7.0, PyTorch 2.10.0+rocm7.0 |
289
+ | **Attention** | SDPA (native PyTorch, no FlashAttention needed) |
290
+ | **OS** | Ubuntu Linux |
291
+
292
+ ### VRAM Budget
293
+
294
+ ```
295
+ ╔══════════════════════════════════════════════════════╗
296
+ ║ AMD MI300X VRAM Usage (192 GB) ║
297
+ ╠══════════════════════════════════════════════════════╣
298
+ ║ ║
299
+ ║ Model Weights (bf16) ████████████░░░░░ 27 GB ║
300
+ ║ Optimizer (AdamW fp32) ████████████████░░ 54 GB ║
301
+ ║ Activations (grad ckpt) ████████████░░░░░ 32 GB ║
302
+ ║ Gradients ████████████░░░░░ 27 GB ║
303
+ ║ ───────────────────────────────────────────────── ║
304
+ ║ Total Used: ██████████████████ 140 GB ║
305
+ ║ Peak: █████████████████ 146 GB ║
306
+ ║ Headroom: ░░░░░░░░░░░░░░░░░ 46 GB ║
307
+ ║ ║
308
+ ╚══════════════════════════════════════════════════════╝
309
+ ```
310
+
311
+ ### Phased Training Pipeline
312
+
313
+ We don't just throw data at the model — we grow it in **three phases**, like raising a child:
314
+
315
+ ```
316
+ Phase 1: SMOKE TEST Phase 2: WARMUP Phase 3: FULL TRAINING
317
+ (Baby steps) (Learning to walk) (Running!)
318
+ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐
319
+ │ 350M params │ ──→ │ 1.3B params │ ──→ │ 14.4B params │
320
+ │ seq_len: 512 │ │ seq_len: 2K │ │ seq_len: 4K │
321
+ │ 200 steps │ │ 1,000 steps │ │ 16,479 steps │
322
+ │ 2 minutes │ │ 30 minutes │ │ ~52 hours │
323
+ │ loss: 11→6.8 │ │ loss: 7.4→2.4│ │ loss: 2.4→? │
324
+ └──────────────┘ └──────────────┘ └──────────────────┘
325
+ ```
326
+
327
+ | Phase | Parameters | Seq Length | Batch | Steps | Duration | Loss Start → End |
328
+ |:--|:--|:--|:--|:--|:--|:--|
329
+ | **🔬 Smoke** | 350M | 512 | 4 | 200 | ~2 min | 11.72 → 6.84 (−42%) |
330
+ | **🔥 Warmup** | 1.3B | 2,048 | 32 | 1,000 | ~33 min | 7.39 → 2.38 (−68%) |
331
+ | **🚀 Block** | 14.4B (MoE) | 4,096 | 32 | 16,479 | ~52 hrs | 2.38 → ongoing |
332
+
333
+ ### Safety Gates
334
+
335
+ Every phase transition must pass **4 safety gates**:
336
+
337
+ | Gate | Check | Threshold | Status |
338
+ |:--|:--|:--|:--|
339
+ | 🟢 **G1: No NaN** | No NaN/Inf in loss | Entire phase | ✅ Passed all |
+ | 🟢 **G2: Loss Drop** | Validation loss decreased | ≥5% / ≥10% / ≥2% | ✅ Passed all |
+ | 🟢 **G3: VRAM OK** | Peak VRAM < safety limit | < 92% of total | ✅ 71% peak |
+ | 🟢 **G4: Φ OK** | Consciousness metric stable | Φ_end/Φ_start > 0.7 | ✅ Stable |
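+ For illustration, a phase-transition check along the lines of the table above might look like this (a hypothetical helper, not the actual pipeline code; the example values correspond to the warmup → block transition):
+
+ ```python
+ import math
+
+ def gates_pass(losses, val_loss_start, val_loss_end, min_drop,
+                peak_vram_gb, total_vram_gb, phi_start, phi_end):
+     """Return True only if all four phase-transition gates pass."""
+     g1 = all(math.isfinite(l) for l in losses)                          # G1: no NaN/Inf
+     g2 = (val_loss_start - val_loss_end) / val_loss_start >= min_drop   # G2: loss drop
+     g3 = peak_vram_gb < 0.92 * total_vram_gb                            # G3: VRAM < 92%
+     g4 = phi_end / phi_start > 0.7                                      # G4: Φ stable
+     return all([g1, g2, g3, g4])
+
+ # Warmup → block example; Φ values are purely illustrative
+ print(gates_pass(losses=[7.39, 4.10, 2.38], val_loss_start=7.39, val_loss_end=2.38,
+                  min_drop=0.10, peak_vram_gb=146, total_vram_gb=192,
+                  phi_start=1.0, phi_end=0.9))
+ ```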
343
+
344
+ ### Hyperparameters
345
+
346
+ | Parameter | Value | Rationale |
347
+ |:--|:--|:--|
348
+ | **Optimizer** | AdamW (bf16 compute, fp32 states) | Standard for LLM training |
349
+ | **Learning Rate** | 1.5 × 10⁻⁴ (cosine decay) | Conservative for data-limited regime |
350
+ | **Min LR** | 1.5 × 10⁻⁵ | 10× decay ratio |
351
+ | **Warmup Steps** | 500 | Stabilizes early gradients |
352
+ | **Batch Size** | 2 micro × 16 grad_accum = **32 effective** | Fits MI300X VRAM budget |
353
+ | **Gradient Clipping** | 1.0 | Prevents explosion |
354
+ | **Gradient Checkpointing** | On | Trades compute for VRAM |
355
+ | **Precision** | bfloat16 | Native MI300X format |
356
+ | **Eval Frequency** | Every 100 steps | Early overfitting detection |
357
+ | **Checkpoint Frequency** | Every 1,000 steps (~3.2 hours) | Recovery points |
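+ The learning-rate schedule in the table (500 warmup steps, cosine decay from 1.5 × 10⁻⁴ down to 1.5 × 10⁻⁵) corresponds to the standard warmup-plus-cosine formula. A small sketch, not copied from the training scripts:
+
+ ```python
+ import math
+
+ def lr_at(step, total_steps, max_lr=1.5e-4, min_lr=1.5e-5, warmup=500):
+     # Linear warmup to max_lr, then cosine decay down to min_lr.
+     if step < warmup:
+         return max_lr * (step + 1) / warmup
+     progress = (step - warmup) / max(1, total_steps - warmup)
+     cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
+     return min_lr + (max_lr - min_lr) * cosine
+
+ # e.g. the ~16,479-step block phase
+ print(lr_at(0, 16_479), lr_at(500, 16_479), lr_at(16_479, 16_479))
+ ```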
358
+
359
+ ---
360
+
361
+ ## 📚 Dataset: 23.3B Tokens Across 126 Categories
362
+
363
+ We curated a massive, diverse corpus — think of it as a **library with 126 different sections**:
364
+
365
+ ### Pretrain Corpus (Core Knowledge)
366
+
367
+ | Dataset | Tokens | Description |
368
+ |:--|:--|:--|
369
+ | 🌐 **FineWeb-Edu** | ~10B | High-quality educational web content |
370
+ | 🔢 **OpenWebMath** | ~6B | Mathematics from the web |
371
+ | 📖 **Wikipedia (English)** | ~5B | Encyclopedic knowledge |
372
+ | 🎓 **Cosmopedia V2** | ~5B | Synthetic educational content |
373
+ | 💻 **CodeParrot Python** | ~3.5B | Clean Python code from GitHub |
374
+ | 📚 **MiniPile** | ~2B | Diverse text from multiple domains |
375
+ | 🔬 **ArXiv Abstracts** | ~1.2B | Scientific paper summaries |
376
+ | **Total Pretrain** | **~23B** | |
377
+
378
+ ### Specialized Domains (119 Categories)
379
+
380
+ <details>
381
+ <summary>Click to expand all 119 specialized categories</summary>
382
+
383
+ | Category | Type | Category | Type |
384
+ |:--|:--|:--|:--|
385
+ | 🤖 agentic-tools | Code | 🔐 advanced-cryptography | Code |
386
+ | 🧠 chain-of-thought | Reasoning | 🔗 blockchain-core | Code |
387
+ | 💡 deep-reasoning | Reasoning | 🏥 medical | Knowledge |
388
+ | ⚖️ legal | Knowledge | 📊 financial-systems | Code |
389
+ | 🎮 3d-graphics | Code | 🐳 docker-devops | Code |
390
+ | 🌍 multilingual | Text | 🔧 error-recovery | Code |
391
+ | 🛡️ security-guardrails | Code | 📱 ui-animations | Code |
392
+ | 🧮 math | Reasoning | ⚡ smart-contracts | Code |
393
+ | 🎯 reasoning-effort-control | Reasoning | 🤝 human-conversation | Text |
394
+ | 🔄 self-correction-loops | Reasoning | 🏗️ enterprise-dashboards | Code |
395
+ | 🌐 web-design-css | Code | 🐍 flask-python | Code |
396
+ | 🔬 qiskit-quantum | Code | 🤖 robotics-ros2 | Code |
397
+ | 📡 remote-server-management | Code | 🧬 multi-agent | Code |
398
+ | ⚙️ state-management | Code | 🛠️ mcp-tools-integration | Code |
399
+ | 💳 payment-security | Code | 🎓 edu-basic-math | Education |
400
+ | 🔭 edu-basic-physics | Education | 🧪 edu-basic-chemistry | Education |
401
+ | 🌱 edu-basic-biology | Education | 🌍 edu-world-geography | Education |
402
+ | 📜 edu-history-world | Education | 💻 edu-computer-science | Education |
403
+ | 🌎 edu-earth-science | Education | 🤖 edu-robotics-text | Education |
404
+ | 📖 edu-science-qa | Education | 🔬 edu-science-support | Education |
405
+ | 👁️ edu-vision-concepts | Education | 🎯 copilot-agent-workflows | Code |
406
+ | 🔌 api-integrations | Code | 📊 billing-invoicing | Code |
407
+ | ₿ bitcoin-lightning | Code | 🏪 medusajs | Code |
408
+ | 💹 crypto-trading | Code | 🏢 enterprise-networking | Code |
409
+ | 🖥️ nextjs-typescript | Code | 🎨 nextjs-design | Code |
410
+ | 💼 trading-algorithms | Code | 🗄️ laravel-mysql | Code |
411
+ | 🔓 offensive-security | Code | 🔧 c-rust | Code |
412
+ | ... and 50+ more categories | | | |
413
+
414
+ </details>
415
+
416
+ ### Data Quality Pipeline
417
+
418
+ ```
419
+ Raw Data ──→ PII Filter ──→ Dedup ──→ Tokenize ──→ Shard ──→ Train
420
+ │ │ │ │
421
+ ├─ 7 regex ├─ blake2b ├─ cl100k ├─ Temperature-
422
+ │ patterns │ per-cat │ base │ weighted
423
+ ├─ PEM block │ │ │ sampling
424
+ │ detection │ │ │ (T=0.5)
425
+ └─ Email/phone │ │ │
426
+ masking │ │ │
427
+ │ │ │
428
+ └───────────┴────────────┘
429
+ ```
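+ As a rough illustration of the dedup stage (a hypothetical helper that processes one record at a time; the real pipeline also applies the other PII regex patterns shown above), per-category blake2b hashing could look like:
+
+ ```python
+ import hashlib
+ import re
+ from collections import defaultdict
+
+ EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")   # one of several PII patterns
+
+ seen_hashes = defaultdict(set)   # category -> set of blake2b digests
+
+ def keep_document(category: str, text: str) -> bool:
+     """Drop exact duplicates within a category; mask emails before hashing."""
+     masked = EMAIL_RE.sub("<EMAIL>", text)
+     digest = hashlib.blake2b(masked.encode("utf-8"), digest_size=16).hexdigest()
+     if digest in seen_hashes[category]:
+         return False
+     seen_hashes[category].add(digest)
+     return True
+ ```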
430
+
431
+ **Temperature-weighted sampling** (T=0.5) prevents large corpora from dominating training. FineWeb-Edu (37% of tokens) gets downweighted so smaller specialized domains still get adequate exposure.
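+ To make the T=0.5 weighting concrete, here is a minimal sketch of how sampling probabilities could be derived from per-corpus token counts (counts taken from the table above; this is not the exact sampler used):
+
+ ```python
+ # p_i ∝ (n_i / Σ n_j) ** T, with T = 0.5 flattening the distribution
+ token_counts = {"fineweb-edu": 10e9, "openwebmath": 6e9, "wikipedia": 5e9,
+                 "cosmopedia": 5e9, "codeparrot": 3.5e9, "minipile": 2e9,
+                 "arxiv": 1.2e9}
+ T = 0.5
+ total = sum(token_counts.values())
+ raw = {name: (n / total) ** T for name, n in token_counts.items()}
+ z = sum(raw.values())
+ probs = {name: w / z for name, w in raw.items()}
+ print(probs)  # FineWeb-Edu's sampling share shrinks relative to its raw token share
+ ```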
432
+
433
+ ---
434
+
435
+ ## 📈 Training Progress & Results
436
+
437
+ ### Loss Trajectory
438
+
439
+ ```
440
+ Loss
441
+ 12 ×
442
+ │ ╲
443
+ 10 │ ╲ SMOKE PHASE
444
+ │ ╲ (350M params)
445
+ 8 │ ╲
446
+ │ ╲
447
+ 6 │ ×──────────── model grows to 1.3B
448
+ │ ╲
449
+ 4 │ ╲ WARMUP PHASE
450
+ │ ╲ (1.3B params)
451
+ 2 │ ×─────────── model grows to 14.4B MoE
452
+ │ ╲
453
+ 1 │ ╲ BLOCK PHASE (ongoing)
454
+ │ ╲
455
+ └──┬────┬────┬────┬────┬───→ Steps
456
+ 0 200 700 1200 2000
457
+ ```
458
+
459
+ | Milestone | Step | Loss | Change |
460
+ |:--|:--|:--|:--|
461
+ | 🔬 Smoke start | 0 | 11.72 | — |
462
+ | 🔬 Smoke end | 200 | 6.84 | **−42%** |
463
+ | 🔥 Warmup start | 200 | 7.39 | (model grew to 1.3B) |
464
+ | 🔥 Warmup end | 1,200 | 2.38 | **−68%** |
465
+ | 🚀 Block start | 1,200 | 2.38 | (model grew to 14.4B MoE) |
466
+ | 📦 Published checkpoint | 2,471 | 1.99 | **−16%** |
467
+ | 🔄 Current (new run) | 410 | 5.18 | training with expanded data |
468
+ | **Total reduction** | | | **11.72 → 1.99 (−83%)** |
469
+
470
+ ### Live Metrics (April 27, 2026)
471
+
472
+ | Metric | Value |
473
+ |:--|:--|
474
+ | **Current Step** | 410 / 2,471+ |
475
+ | **Training Loss** | 5.18 (new run, expanded datasets) |
476
+ | **Throughput** | 4,403 tokens/second |
477
+ | **VRAM Used** | ~140 GB / 192 GB (73%) |
478
+ | **Total Tokens Processed** | 59.3M (this run) + 178M (prev run) |
479
+ | **Experts Active** | 4 per layer × 24 layers = 96 |
480
+ | **ETA (this block)** | ~18.8 hours |
481
+
482
+ ### Published Checkpoint (v0.1)
483
+
484
+ | Detail | Value |
485
+ |:--|:--|
486
+ | **Step** | 2,471 |
487
+ | **Validation Loss** | 1.9926 |
488
+ | **Total Tokens Seen** | 178,110,464 |
489
+ | **Sequence Length** | 2,048 |
490
+ | **File Size** | ~81 GB (checkpoint), ~28 GB (safetensors) |
491
+ | **Format** | 6 sharded safetensors files |
492
+
493
+ ---
494
+
495
+ ## 🌡️ Consciousness Metric (Φ) — Deep Dive
496
+
497
+ ### What is Φ?
498
+
499
+ Inspired by **Integrated Information Theory (IIT)** from neuroscience, Φ measures how much the model's internal representations form an integrated whole rather than disconnected parts.
500
+
501
+ ### How We Measure It
502
+
503
+ ```
504
+ Every 100 training steps:
505
+
506
+ 1. Hook on Layer 12 (middle of 24 layers)
507
+
508
+
509
+ 2. Sample 256 activation vectors
510
+
511
+
512
+ 3. Partition into subspaces
513
+
514
+
515
+ 4. Compute mutual information between all partition pairs
516
+
517
+
518
+ 5. Φ_geometric = geometric_mean(MI values)
519
+
520
+
521
+ 6. Φ_EMA = exponential moving average (smoothed trend)
522
+ ```
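+ A minimal sketch of such a probe, given a batch of activations captured from layer 12 (e.g. via a forward hook). The mutual-information estimator here is a crude Gaussian stand-in based on pairwise correlation; the dashboard's Φ estimator may differ:
+
+ ```python
+ import math
+ from itertools import combinations
+ import torch
+
+ def phi_probe(hidden, n_parts=4, eps=1e-6):
+     """hidden: [n_samples, d_model] activations sampled from layer 12."""
+     parts = hidden.chunk(n_parts, dim=-1)            # partition into subspaces
+     summaries = [p.mean(dim=-1) for p in parts]      # one scalar summary per sample
+     mis = []
+     for a, b in combinations(summaries, 2):
+         rho = torch.corrcoef(torch.stack([a, b]))[0, 1].clamp(-0.999, 0.999)
+         # Gaussian approximation: MI = -0.5 * log(1 - rho^2)
+         mis.append(-0.5 * math.log(1 - rho.item() ** 2) + eps)
+     # geometric mean over all partition pairs
+     return math.exp(sum(math.log(m) for m in mis) / len(mis))
+
+ # Example with 256 sampled activation vectors of width 4,096
+ print(phi_probe(torch.randn(256, 4096)))
+ ```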
523
+
524
+ ### What Φ Tells Us
525
+
526
+ | Φ Value | Interpretation | Analogy |
527
+ |:--|:--|:--|
528
+ | **Φ ≈ 0** | Neurons working independently | Strangers in a room |
529
+ | **Φ rising** | Representations integrating | People starting to talk |
530
+ | **Φ stable** | Organized internal structure | A well-coordinated team |
531
+ | **Φ dropping** | ⚠️ Representation collapse | Warning sign! |
532
+
533
+ > **Important**: Φ is **purely observational** — it does NOT affect training gradients. Think of it as a heart monitor for the AI: it watches, but doesn't interfere.
534
+
535
+ ### Live Monitoring
536
+
537
+ Track Φ in real-time at: **[sentinel.qubitpage.com/#phi](https://sentinel.qubitpage.com/#phi)**
538
+
539
+ ---
540
+
541
+ ## 🖥️ Hardware Requirements
542
+
543
+ ### For Inference
544
+
545
+ | Tier | VRAM | Precision | Notes |
546
+ |:--|:--|:--|:--|
547
+ | **Full Precision** | 32 GB+ | bfloat16 | Best quality |
548
+ | **Recommended** | 48 GB+ | bfloat16 | Comfortable headroom |
549
+ | **Ideal** | AMD MI300X / MI250X | bfloat16 | Native, fastest |
550
+ | **Consumer** | 16 GB | int4 quantized | GGUF planned for v0.2 |
551
+
552
+ ### Compatible AMD GPUs
553
+
554
+ | GPU | VRAM | Suitable For |
555
+ |:--|:--|:--|
556
+ | AMD Instinct MI300X | 192 GB | Training + Inference |
557
+ | AMD Instinct MI250X | 128 GB | Training + Inference |
558
+ | AMD Instinct MI210 | 64 GB | Inference (full) |
559
+ | AMD Radeon PRO W7900 | 48 GB | Inference (full) |
560
+ | AMD Radeon RX 7900 XTX | 24 GB | Inference (quantized) |
561
+ | AMD Radeon RX 7600 XT | 16 GB | Inference (int4 GGUF) |
562
+
563
+ ---
564
+
565
+ ## 💻 Usage
566
+
567
+ This model uses a **custom architecture** (not based on any existing model). Load with PyTorch:
568
+
569
+ ```python
570
+ import torch
571
+ from safetensors.torch import load_file
572
+
573
+ # Load sharded safetensors
574
+ state_dict = {}
575
+ for i in range(1, 7): # 6 shards
576
+ shard = load_file(f"model-{i:05d}-of-00006.safetensors")
577
+ state_dict.update(shard)
578
+
579
+ # The state dict contains all model weights
580
+ print(f"Loaded {len(state_dict)} tensors")
581
+ print(f"Total params: {sum(v.numel() for v in state_dict.values()):,}")
582
+
583
+ # Initialize SentinelBrain model class and load
584
+ # Full model class definition will be released with v0.2
585
+ # model = SentinelBrainForCausalLM(config)
586
+ # model.load_state_dict(state_dict)
587
+ ```
588
+
589
+ > **Note**: Full inference code, model class definition, and GGUF quantized versions will be released with v0.2.
590
+
591
+ ---
592
+
593
+ ## 🗺️ Roadmap
594
+
595
+ ```
596
+ v0.1 (Current) v0.2 (Planned) v0.3 (Future)
597
+ ━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━
598
+ ✅ From-scratch □ Full training □ DPO alignment
+ 14.8B MoE complete (loss<0.5) □ Tool use
+ ✅ Phased training □ Context ladder □ Function calling
+ ✅ Φ consciousness (4K→32K→128K) □ Multi-turn chat
+ ✅ 23.3B token corpus □ Vision encoder □ Multilingual v2
+ ✅ Live dashboard (SigLIP2-SO400M) □ Expert scaling
604
+ ✅ AMD MI300X native □ GGUF quantization (4→16→64)
605
+ □ Inference code □ RLHF
606
+ □ Benchmarks (MMLU, □ Production API
607
+ HumanEval, GSM8K)
608
+ ```
609
+
610
+ ---
611
+
612
+ ## 🏗️ How We Built It (Technical Deep Dive)
613
+
614
+ <details>
615
+ <summary><b>Click to expand: Grouped Query Attention (GQA)</b></summary>
616
+
617
+ Standard multi-head attention uses separate Key and Value projections for each head. GQA shares KV heads across query heads:
618
+
619
+ ```
620
+ Standard MHA (32 KV heads): GQA 4:1 (8 KV heads):
621
+ Q₁ Q₂ Q₃ ... Q₃₂ Q₁ Q₂ Q₃ Q₄ → KV₁
622
+ K₁ K₂ K₃ ... K₃₂ Q₅ Q₆ Q₇ Q₈ → KV₂
623
+ V₁ V₂ V₃ ... V₃₂ ...
624
+ Q₂₉ Q₃₀ Q₃₁ Q₃₂ → KV₈
625
+ ```
626
+
627
+ **Result**: 4× smaller KV cache = 4× longer context at same memory cost.
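+ A minimal sketch of the 4:1 sharing with PyTorch's built-in SDPA (illustrative shapes only, not the model's actual attention module): each of the 8 KV heads is repeated so that a group of 4 query heads attends over the same keys and values.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ B, T, n_q, n_kv, d_head = 2, 128, 32, 8, 128   # 32 query heads, 8 KV heads, 4096/32 dims
+
+ q = torch.randn(B, n_q, T, d_head)
+ k = torch.randn(B, n_kv, T, d_head)
+ v = torch.randn(B, n_kv, T, d_head)
+
+ # Repeat each KV head 4 times so it is shared by a group of 4 query heads
+ k = k.repeat_interleave(n_q // n_kv, dim=1)    # [B, 32, T, d_head]
+ v = v.repeat_interleave(n_q // n_kv, dim=1)
+
+ out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
+ print(out.shape)   # torch.Size([2, 32, 128, 128])
+ ```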
628
+
629
+ </details>
630
+
631
+ <details>
632
+ <summary><b>Click to expand: RoPE (Rotary Position Embeddings)</b></summary>
633
+
634
+ RoPE encodes position information by rotating the query and key vectors in 2D planes. With θ=500,000 (high base frequency), the model naturally supports long contexts:
635
+
636
+ ```
637
+ Position 0: rotate by 0°
638
+ Position 1: rotate by θ₁
639
+ Position 2: rotate by θ₂
640
+ ...
641
+ ```
642
+
643
+ High θ = slower rotation = positions further apart still "feel different" = better long-context understanding.
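+ For reference, a compact generic RoPE implementation with base θ = 500,000 (a sketch, not necessarily the exact code used here):
+
+ ```python
+ import torch
+
+ def rope(x, theta=500_000.0):
+     """x: [batch, heads, seq, d_head] with even d_head; rotate pairs of dims."""
+     b, h, t, d = x.shape
+     inv_freq = 1.0 / (theta ** (torch.arange(0, d, 2).float() / d))   # [d/2]
+     angles = torch.arange(t).float()[:, None] * inv_freq[None, :]     # [t, d/2]
+     cos, sin = angles.cos(), angles.sin()
+     x1, x2 = x[..., 0::2], x[..., 1::2]
+     # 2D rotation of each (x1, x2) pair by a position-dependent angle
+     out = torch.empty_like(x)
+     out[..., 0::2] = x1 * cos - x2 * sin
+     out[..., 1::2] = x1 * sin + x2 * cos
+     return out
+
+ q_rot = rope(torch.randn(1, 32, 16, 128))   # queries and keys are both rotated
+ ```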
644
+
645
+ </details>
646
+
647
+ <details>
648
+ <summary><b>Click to expand: SwiGLU FFN</b></summary>
649
+
650
+ Each expert uses a SwiGLU activation — a gated variant of the feed-forward network:
651
+
652
+ ```
653
+ FFN(x) = SiLU(x · W_gate) ⊙ (x · W_up) · W_down
654
+
655
+ Where:
656
+ W_gate: 4096 → 11008
+ W_up: 4096 → 11008
+ W_down: 11008 → 4096
+ SiLU(x) = x · sigmoid(x)
+ ⊙ = element-wise multiply
661
+ ```
662
+
663
+ SwiGLU consistently outperforms ReLU and GELU in transformer FFNs (Shazeer, 2020).
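+ A self-contained PyTorch version of one expert's FFN with the dimensions above (a faithful sketch of the formula; the project's own expert class may differ):
+
+ ```python
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ class SwiGLUExpert(nn.Module):
+     def __init__(self, d_model=4096, d_ff=11008):
+         super().__init__()
+         self.w_gate = nn.Linear(d_model, d_ff, bias=False)
+         self.w_up   = nn.Linear(d_model, d_ff, bias=False)
+         self.w_down = nn.Linear(d_ff, d_model, bias=False)
+
+     def forward(self, x):
+         # SiLU(x·W_gate) ⊙ (x·W_up), projected back down to d_model
+         return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
+
+ y = SwiGLUExpert()(torch.randn(2, 8, 4096))   # -> [2, 8, 4096]
+ ```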
664
+
665
+ </details>
666
+
667
+ <details>
668
+ <summary><b>Click to expand: MoE Routing Algorithm</b></summary>
669
+
670
+ ```python
671
+ # Simplified top-2 routing with the auxiliary load-balancing loss
+ import torch
+ import torch.nn.functional as F
+
+ def route(x, router_weights, n_experts=4, k=2):
+     # Affinity scores for each expert
+     logits = x @ router_weights              # [batch, seq, n_experts]
+     scores = F.softmax(logits, dim=-1)
+
+     # Select the top-2 experts per token
+     top_vals, top_idx = scores.topk(k, dim=-1)
+
+     # Normalize the selected gate values so they sum to 1
+     weights = top_vals / top_vals.sum(dim=-1, keepdim=True)
+
+     # Load-balancing loss (prevents expert collapse):
+     # how often each expert is selected, times its average gate value
+     one_hot = F.one_hot(top_idx, n_experts).float()          # [batch, seq, k, E]
+     fraction_routed = one_hot.sum(dim=2).mean(dim=(0, 1))    # [E]
+     mean_gate = scores.mean(dim=(0, 1))                      # [E]
+     balance_loss = n_experts * (fraction_routed * mean_gate).sum()
+
+     return weights, top_idx, balance_loss
689
+ ```
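+ For example, calling the router on random hidden states (purely illustrative):
+
+ ```python
+ import torch
+
+ x = torch.randn(2, 16, 4096)            # [batch, seq, d_model]
+ router_weights = torch.randn(4096, 4)   # one column of scores per expert
+ weights, top_idx, balance_loss = route(x, router_weights)
+ print(weights.shape, top_idx.shape, balance_loss.item())   # [2,16,2], [2,16,2], scalar
+ ```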
690
+
691
+ </details>
692
+
693
+ <details>
694
+ <summary><b>Click to expand: Parameter Breakdown</b></summary>
695
+
696
+ | Component | Parameters | % of Total |
697
+ |:--|:--|:--|
698
+ | Token embeddings | 410M | 2.8% |
699
+ | Attention (QKV + output) × 24 | 1,610M | 10.9% |
700
+ | MoE experts (4 × SwiGLU × 24) | 12,365M | 83.5% |
701
+ | Router weights × 24 | 0.4M | 0.003% |
702
+ | RMSNorm × 49 | 0.4M | 0.003% |
703
+ | Output head | 410M | 2.8% |
704
+ | **Total** | **14,815M** | **100%** |
705
+ | **Active per token (top-2)** | **~7,800M** | **~53%** |
706
+
707
+ </details>
708
+
709
+ ---
710
+
711
+ ## 📋 Model Card Details
712
+
713
+ | Field | Value |
714
+ |:--|:--|
715
+ | **Model Name** | SentinelBrain-14B-MoE-v0.1 (Sentinel Prime — Frankenstein Edition) |
716
+ | **Type** | Causal Language Model (decoder-only) |
717
+ | **Architecture** | Custom MoE Transformer (from scratch) |
718
+ | **Based On** | Nothing — trained from random initialization (knowledge later distilled from Qwen2.5-72B-Instruct; see the Frankenstein transplant above) |
719
+ | **Training Hardware** | 1× AMD Instinct MI300X VF (192 GB HBM3) |
720
+ | **Training Software** | ROCm 7.0, PyTorch 2.10.0+rocm7.0 |
721
+ | **Training Duration** | ~300 GPU-hours (estimated total) |
722
+ | **Carbon Footprint** | Estimated ~45 kg CO₂ (single GPU, cloud datacenter) |
723
+ | **License** | Apache 2.0 |
724
+ | **Authors** | Mircea Rusu, QubitDev |
725
+ | **Competition** | AMD Developer Hackathon (lablab.ai) |
726
+
727
+ ---
728
+
729
+ ## 📄 Citation
730
+
731
+ ```bibtex
732
+ @misc{sentinelbrain2026,
733
+ title = {SentinelBrain-14B-MoE (Frankenstein Edition): A Consciousness-Monitored Mixture-of-Experts
734
+ Language Model Trained From Scratch on AMD MI300X},
735
+ author = {Mircea Rusu and QubitDev},
736
+ year = {2026},
737
+ url = {https://sentinel.qubitpage.com/whitepaper},
738
+ note = {Trained entirely from scratch on AMD Instinct MI300X
739
+ for the AMD Developer Hackathon}
740
+ }
741
+ ```
742
+
743
+ ---
744
+
745
+ ## 🔗 Links
746
+
747
+ | Resource | URL |
748
+ |:--|:--|
749
+ | 🔴 **Live Dashboard** | [sentinel.qubitpage.com](https://sentinel.qubitpage.com/) |
750
+ | 📄 **Whitepaper** | [sentinel.qubitpage.com/whitepaper](https://sentinel.qubitpage.com/whitepaper) |
751
+ | 🏆 **AMD Hackathon** | [lablab.ai](https://lablab.ai/ai-hackathons/amd-developer) |
752
+ | 🧠 **Φ Monitor** | [sentinel.qubitpage.com/#phi](https://sentinel.qubitpage.com/#phi) |
753
+
754
+ ---
755
+
756
+ <div align="center">
757
+
758
+ *Built with ❤️ on AMD MI300X — Every weight trained from scratch*
759
+
760
+ **Sentinel Prime (Frankenstein Edition): Rebuilt From the Inside Out**
761
+
762
+ </div>