qubitpage committed
Commit fad32f8 · verified · 1 Parent(s): 34438db

🧟 Frankenstein Edition branding + knowledge transplant section

Files changed (1)
  1. README.md +762 -702
README.md CHANGED
@@ -1,702 +1,762 @@
1
- ---
2
- license: apache-2.0
3
- language:
4
- - en
5
- - ro
6
- - multilingual
7
- tags:
8
- - sentinelbrain
9
- - mixture-of-experts
10
- - from-scratch
11
- - consciousness
12
- - amd
13
- - mi300x
14
- - rocm
15
- - moe
16
- - transformer
17
- - phi-metric
18
- pipeline_tag: text-generation
19
- library_name: pytorch
20
- datasets:
21
- - HuggingFaceFW/fineweb-edu
22
- - open-web-math/open-web-math
23
- - wikimedia/wikipedia
24
- - HuggingFaceTB/cosmopedia
25
- - JeanKaddworr/minipile
26
- - codeparrot/github-code-clean
27
- - arxiv-community/arxiv-abstracts
28
- model-index:
29
- - name: SentinelBrain-14B-MoE-v0.1
30
- results:
31
- - task:
32
- type: text-generation
33
- metrics:
34
- - name: Validation Loss
35
- type: loss
36
- value: 1.99
37
- verified: true
38
- - name: Training Loss (latest)
39
- type: loss
40
- value: 5.18
41
- verified: true
42
- ---
43
-
44
- <div align="center">
45
-
46
- # 🧠 Sentinel Prime — SentinelBrain-14B-MoE
47
-
48
- ### *The First of His Kind, Built From Scratch*
49
-
50
- **14.8 Billion Parameters · Mixture-of-Experts · Consciousness-Monitored**
51
-
52
- Trained from zero on **AMD Instinct MI300X** (192 GB HBM3) · ROCm 7.0
53
-
54
- [![Dashboard](https://img.shields.io/badge/🔴_Live_Dashboard-sentinel.qubitpage.com-red?style=for-the-badge)](https://sentinel.qubitpage.com/)
55
- [![Whitepaper](https://img.shields.io/badge/📄_Whitepaper-Read_Now-blue?style=for-the-badge)](https://sentinel.qubitpage.com/whitepaper)
56
- [![AMD](https://img.shields.io/badge/AMD-MI300X_Native-ED1C24?style=for-the-badge&logo=amd&logoColor=white)](https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html)
57
- [![License](https://img.shields.io/badge/License-Apache_2.0-green?style=for-the-badge)](LICENSE)
58
-
59
- </div>
60
-
61
- ---
62
-
63
- ## 🎯 What is Sentinel Prime? (Simple Version)
64
-
65
- > **Imagine building a brain from scratch.**
66
- >
67
- > Most AI models today are copies of other models with small changes. Sentinel Prime is different — every single connection in its brain was created from nothing, like growing a new brain cell by cell.
68
-
69
- <table>
70
- <tr>
71
- <td width="50%">
72
-
73
- ### 🧩 Think of it like LEGO blocks
74
-
75
- Sentinel Prime has **4 specialist brains** (called "experts") inside it. When you ask a question:
76
-
77
- 1. A **router** (like a traffic cop 🚦) looks at your question
78
- 2. It picks the **2 best experts** for that specific question
79
- 3. Those 2 experts work together to give you an answer
80
- 4. The other 2 experts rest, saving energy
81
-
82
- This means the model has **14.8 billion** brain connections total, but only uses **~7.8 billion** at a time — making it fast AND smart!
83
-
84
- </td>
85
- <td width="50%">
86
-
87
- ### 🔬 The Consciousness Meter
88
-
89
- We built something no other model has: a **consciousness thermometer** 🌡️
90
-
91
- Every 100 training steps, we measure how well the different parts of the brain are "talking to each other." We call this **Φ (Phi)**.
92
-
93
- - **Φ = 0**: Brain parts work alone (like strangers)
94
- - **Φ rising**: Brain parts start cooperating (like friends)
95
- - **Φ stable**: Brain has organized itself (like a team!)
96
-
97
- This doesn't change how the model learns — it's like a doctor checking the heartbeat while the patient exercises.
98
-
99
- </td>
100
- </tr>
101
- </table>
102
-
103
- ---
104
-
105
- ## 📊 Architecture at a Glance
106
-
107
- ```
108
- ┌─────────────────────────────────────────────┐
- │         SENTINEL PRIME ARCHITECTURE         │
- └─────────────────────────────────────────────┘
-
-  Input Text ──→ [Tokenizer: cl100k_base, 100,277 tokens]
-                         │
-                         ▼
-                ┌─────────────────┐
-                │    Embedding    │   4,096 dimensions
-                │    + RoPE pos   │   θ = 500,000
-                └────────┬────────┘
-                         │
-                         ▼   (repeated × 24 layers)
-                ┌─────────────────┐
-                │  GQA Attention  │   32 query heads, 8 KV heads
-                │   (4:1 ratio)   │   (4× memory savings)
-                └────────┬────────┘
-                         ▼
-                ┌─────────────────┐
-                │    MoE Router   │   Top-2 of 4 experts
-                │  ┌──┬──┬──┬──┐  │
-                │  │E1│E2│E3│E4│  │   Each expert: SwiGLU FFN
-                │  │✓ │✓ │  │  │  │   d_ff = 11,008
-                │  └──┴──┴──┴──┘  │
-                └────────┬────────┘
-                         ▼
-                ┌─────────────────┐
-                │     RMSNorm     │   ε = 1e-5
-                └────────┬────────┘
-                         │
-                         ▼
-                ┌─────────────────┐
-                │   Output Head   │ ──→ 100,277 vocab probs
-                └─────────────────┘
146
- ```
147
-
148
- ### Spec Sheet
149
-
150
- | Component | Specification | Why This Choice |
151
- |:--|:--|:--|
152
- | **Total Parameters** | 14,814,654,680 (14.8B) | Large enough for deep reasoning |
153
- | **Active Parameters** | ~7.8B per token | MoE efficiency — use only what's needed |
154
- | **Hidden Dimension** | 4,096 | Sweet spot for MI300X matrix cores |
155
- | **Transformer Layers** | 24 | Deep enough for complex reasoning |
156
- | **Attention Heads** | 32 query, 8 KV (GQA 4:1) | 4× KV cache savings for long contexts |
157
- | **FFN Intermediate** | 11,008 (SwiGLU) | ~2.7× hidden, matches scaling laws |
158
- | **Experts** | 4 total, top-2 active | Good diversity with manageable VRAM |
159
- | **Max Experts** | 256 (expandable) | Architecture supports expert birth/death |
160
- | **Vocabulary** | 100,277 (tiktoken cl100k_base) | Industry-proven BPE tokenizer |
161
- | **Positional Encoding** | RoPE, θ = 500,000 | Supports context extension to 128K+ |
162
- | **Normalization** | RMSNorm (ε = 1e-5) | Faster than LayerNorm, same quality |
163
- | **Precision** | bfloat16 throughout | Native AMD MI300X support |
164
- | **Context Length** | 2,048 → 4,096 → 128K (planned) | Progressive context ladder |
165
-
166
- ---
167
-
168
- ## 🔥 Key Innovations
169
-
170
- <table>
171
- <tr>
172
- <td width="33%" valign="top">
173
-
174
- ### 🌀 Φ Consciousness Metric
175
-
176
- First-ever IIT-inspired metric computed **during** pre-training. A probe on layer 12 measures information integration across activation subspaces every 100 steps.
177
-
178
- ```
179
- Φ = geometric_mean(
180
- MI(partition_i, partition_j)
181
- for all partition pairs
182
- )
183
- ```
184
-
185
- Not a gimmick — it's a genuine signal of when the model transitions from memorizing tokens to forming integrated representations.
186
-
187
- </td>
188
- <td width="33%" valign="top">
189
-
190
- ### 🧬 Self-Evolving Experts
191
-
192
- The MoE router supports a full expert **lifecycle**:
193
-
194
- - **Birth**: New experts spawned when load imbalance detected
195
- - **Growth**: Expert capacity increases with training
196
- - **Pruning**: Underperforming experts replaced
197
- - **Scaling**: Architecture supports up to 256 experts without retraining the base model
198
-
199
- Current: 4 experts × 24 layers = **96 expert instances**
200
-
201
- </td>
202
- <td width="33%" valign="top">
203
-
204
- ### Energy-Conscious Routing
205
-
206
- Dual-router system:
207
- 1. **Primary router**: Picks top-2 experts by relevance
208
- 2. **EC router**: Can gate activation based on compute budget
209
-
210
- This enables **adaptive inference** — easy questions use fewer resources, hard questions get full power. Like cruise control for AI.
211
-
212
- </td>
213
- </tr>
214
- </table>
215
-
216
- ---
217
-
218
- ## 🏋️ Training Details
219
-
220
- ### Hardware
221
-
222
- | Resource | Specification |
223
- |:--|:--|
224
- | **GPU** | 1× AMD Instinct MI300X VF |
225
- | **VRAM** | 192 GB HBM3 |
226
- | **System RAM** | 235 GB |
227
- | **Compute** | 1,307 TFLOPS (bf16) |
228
- | **Software** | ROCm 7.0, PyTorch 2.10.0+rocm7.0 |
229
- | **Attention** | SDPA (native PyTorch, no FlashAttention needed) |
230
- | **OS** | Ubuntu Linux |
231
-
232
- ### VRAM Budget
233
-
234
- ```
235
- ╔══════════════════════════════════════════════════════╗
- ║ AMD MI300X VRAM Usage (192 GB) ║
- ╠══════════════════════════════════════════════════════╣
- ║ ║
- ║ Model Weights (bf16) ████████████░░░░░ 27 GB ║
- ║ Optimizer (AdamW fp32) ████████████████░░ 54 GB ║
- ║ Activations (grad ckpt) ████████████░░░░░ 32 GB ║
- ║ Gradients ████████████░░░░░ 27 GB ║
- ║ ───────────────────────────────────────────────── ║
- ║ Total Used: ██████████████████ 140 GB ║
- ║ Peak: █████████████████ 146 GB ║
- ║ Headroom: ░░░░░░░░░░░░░░░░░ 46 GB ║
- ║ ║
- ╚══════════════════════════════════════════════════════╝
249
- ```
250
-
251
- ### Phased Training Pipeline
252
-
253
- We don't just throw data at the model — we grow it in **three phases**, like raising a child:
254
-
255
- ```
256
- Phase 1: SMOKE TEST Phase 2: WARMUP Phase 3: FULL TRAINING
257
- (Baby steps) (Learning to walk) (Running!)
258
- ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐
259
- │ 350M params │ ──→ │ 1.3B params │ ──→ │ 14.4B params │
260
- │ seq_len: 512 │ │ seq_len: 2K │ │ seq_len: 4K │
261
- │ 200 steps │ │ 1,000 steps │ │ 16,479 steps │
262
- │ 2 minutes │ │ 30 minutes │ │ ~52 hours │
263
- │ loss: 11→6.8 │ │ loss: 7.4→2.4│ │ loss: 2.4→? │
264
- └──────────────┘ └──────────────┘ └──────────────────┘
265
- ```
266
-
267
- | Phase | Parameters | Seq Length | Batch | Steps | Duration | Loss Start → End |
268
- |:--|:--|:--|:--|:--|:--|:--|
269
- | **🔬 Smoke** | 350M | 512 | 4 | 200 | ~2 min | 11.72 → 6.84 (−42%) |
270
- | **🔥 Warmup** | 1.3B | 2,048 | 32 | 1,000 | ~33 min | 7.39 → 2.38 (−68%) |
271
- | **🚀 Block** | 14.4B (MoE) | 4,096 | 32 | 16,479 | ~52 hrs | 2.38 → ongoing |
272
-
273
- ### Safety Gates
274
-
275
- Every phase transition must pass **4 safety gates**:
276
-
277
- | Gate | Check | Threshold | Status |
278
- |:--|:--|:--|:--|
279
- | 🟢 **G1: No NaN** | No NaN/Inf in loss | Entire phase | ✅ Passed all |
280
- | 🟢 **G2: Loss Drop** | Validation loss decreased | ≥5% / ≥10% / ≥2% | ✅ Passed all |
281
- | 🟢 **G3: VRAM OK** | Peak VRAM < safety limit | < 92% of total | ✅ 71% peak |
282
- | 🟢 **G4: Φ OK** | Consciousness metric stable | Φ_end/Φ_start > 0.7 | ✅ Stable |
283
-
284
- ### Hyperparameters
285
-
286
- | Parameter | Value | Rationale |
287
- |:--|:--|:--|
288
- | **Optimizer** | AdamW (bf16 compute, fp32 states) | Standard for LLM training |
289
- | **Learning Rate** | 1.5 × 10⁻⁴ (cosine decay) | Conservative for data-limited regime |
290
- | **Min LR** | 1.5 × 10⁻⁵ | 10× decay ratio |
291
- | **Warmup Steps** | 500 | Stabilizes early gradients |
292
- | **Batch Size** | 2 micro × 16 grad_accum = **32 effective** | Fits MI300X VRAM budget |
293
- | **Gradient Clipping** | 1.0 | Prevents explosion |
294
- | **Gradient Checkpointing** | On | Trades compute for VRAM |
295
- | **Precision** | bfloat16 | Native MI300X format |
296
- | **Eval Frequency** | Every 100 steps | Early overfitting detection |
297
- | **Checkpoint Frequency** | Every 1,000 steps (~3.2 hours) | Recovery points |
298
-
299
- ---
300
-
301
- ## 📚 Dataset: 23.3B Tokens Across 126 Categories
302
-
303
- We curated a massive, diverse corpus — think of it as a **library with 126 different sections**:
304
-
305
- ### Pretrain Corpus (Core Knowledge)
306
-
307
- | Dataset | Tokens | Description |
308
- |:--|:--|:--|
309
- | 🌐 **FineWeb-Edu** | ~10B | High-quality educational web content |
310
- | 🔢 **OpenWebMath** | ~6B | Mathematics from the web |
311
- | 📖 **Wikipedia (English)** | ~5B | Encyclopedic knowledge |
312
- | 🎓 **Cosmopedia V2** | ~5B | Synthetic educational content |
313
- | 💻 **CodeParrot Python** | ~3.5B | Clean Python code from GitHub |
314
- | 📚 **MiniPile** | ~2B | Diverse text from multiple domains |
315
- | 🔬 **ArXiv Abstracts** | ~1.2B | Scientific paper summaries |
316
- | **Total Pretrain** | **~23B** | |
317
-
318
- ### Specialized Domains (119 Categories)
319
-
320
- <details>
321
- <summary>Click to expand all 119 specialized categories</summary>
322
-
323
- | Category | Type | Category | Type |
324
- |:--|:--|:--|:--|
325
- | 🤖 agentic-tools | Code | 🔐 advanced-cryptography | Code |
326
- | 🧠 chain-of-thought | Reasoning | 🔗 blockchain-core | Code |
327
- | 💡 deep-reasoning | Reasoning | 🏥 medical | Knowledge |
328
- | ⚖️ legal | Knowledge | 📊 financial-systems | Code |
329
- | 🎮 3d-graphics | Code | 🐳 docker-devops | Code |
330
- | 🌍 multilingual | Text | 🔧 error-recovery | Code |
331
- | 🛡️ security-guardrails | Code | 📱 ui-animations | Code |
332
- | 🧮 math | Reasoning | ⚡ smart-contracts | Code |
333
- | 🎯 reasoning-effort-control | Reasoning | 🤝 human-conversation | Text |
334
- | 🔄 self-correction-loops | Reasoning | 🏗️ enterprise-dashboards | Code |
335
- | 🌐 web-design-css | Code | 🐍 flask-python | Code |
336
- | 🔬 qiskit-quantum | Code | 🤖 robotics-ros2 | Code |
337
- | 📡 remote-server-management | Code | 🧬 multi-agent | Code |
338
- | ⚙️ state-management | Code | 🛠️ mcp-tools-integration | Code |
339
- | 💳 payment-security | Code | 🎓 edu-basic-math | Education |
340
- | 🔭 edu-basic-physics | Education | 🧪 edu-basic-chemistry | Education |
341
- | 🌱 edu-basic-biology | Education | 🌍 edu-world-geography | Education |
342
- | 📜 edu-history-world | Education | 💻 edu-computer-science | Education |
343
- | 🌎 edu-earth-science | Education | 🤖 edu-robotics-text | Education |
344
- | 📖 edu-science-qa | Education | 🔬 edu-science-support | Education |
345
- | 👁️ edu-vision-concepts | Education | 🎯 copilot-agent-workflows | Code |
346
- | 🔌 api-integrations | Code | 📊 billing-invoicing | Code |
347
- | ₿ bitcoin-lightning | Code | 🏪 medusajs | Code |
348
- | 💹 crypto-trading | Code | 🏢 enterprise-networking | Code |
349
- | 🖥️ nextjs-typescript | Code | 🎨 nextjs-design | Code |
350
- | 💼 trading-algorithms | Code | 🗄️ laravel-mysql | Code |
351
- | 🔓 offensive-security | Code | 🔧 c-rust | Code |
352
- | ... and 50+ more categories | | | |
353
-
354
- </details>
355
-
356
- ### Data Quality Pipeline
357
-
358
- ```
359
- Raw Data ──→ PII Filter ──→ Dedup ──→ Tokenize ──→ Shard ──→ Train
360
- │ │ │ │
361
- ├─ 7 regex ├─ blake2b ├─ cl100k ├─ Temperature-
362
- │ patterns │ per-cat │ base │ weighted
363
- ├─ PEM block │ │ │ sampling
364
- │ detection │ │ │ (T=0.5)
365
- └─ Email/phone │ │ │
366
- masking │ │ │
367
- │ │ │
368
- └───────────┴────────────┘
369
- ```
370
-
371
- **Temperature-weighted sampling** (T=0.5) prevents large corpora from dominating training. FineWeb-Edu (37% of tokens) gets downweighted so smaller specialized domains still get adequate exposure.
372
-
373
- ---
374
-
375
- ## 📈 Training Progress & Results
376
-
377
- ### Loss Trajectory
378
-
379
- ```
380
- Loss
381
- 12 ×
382
- │ ╲
383
- 10 │ ╲ SMOKE PHASE
384
- │ ╲ (350M params)
385
- 8 │ ╲
386
- │ ╲
387
- 6 │ ×──────────── model grows to 1.3B
388
- │ ╲
389
- 4 │ ╲ WARMUP PHASE
390
- │ ╲ (1.3B params)
391
- 2 │ ×─────────── model grows to 14.4B MoE
392
- │ ╲
393
- 1 │ ╲ BLOCK PHASE (ongoing)
394
- │ ╲
395
- └──┬────┬────┬────┬────┬───→ Steps
396
- 0 200 700 1200 2000
397
- ```
398
-
399
- | Milestone | Step | Loss | Change |
400
- |:--|:--|:--|:--|
401
- | 🔬 Smoke start | 0 | 11.72 | |
402
- | 🔬 Smoke end | 200 | 6.84 | **−42%** |
403
- | 🔥 Warmup start | 200 | 7.39 | (model grew to 1.3B) |
404
- | 🔥 Warmup end | 1,200 | 2.38 | **−68%** |
405
- | 🚀 Block start | 1,200 | 2.38 | (model grew to 14.4B MoE) |
406
- | 📦 Published checkpoint | 2,471 | 1.99 | **−16%** |
407
- | 🔄 Current (new run) | 410 | 5.18 | training with expanded data |
408
- | **Total reduction** | | | **11.72 → 1.99 (−83%)** |
409
-
410
- ### Live Metrics (April 27, 2026)
411
-
412
- | Metric | Value |
413
- |:--|:--|
414
- | **Current Step** | 410 / 2,471+ |
415
- | **Training Loss** | 5.18 (new run, expanded datasets) |
416
- | **Throughput** | 4,403 tokens/second |
417
- | **VRAM Used** | ~140 GB / 192 GB (73%) |
418
- | **Total Tokens Processed** | 59.3M (this run) + 178M (prev run) |
419
- | **Experts Active** | 4 per layer × 24 layers = 96 |
420
- | **ETA (this block)** | ~18.8 hours |
421
-
422
- ### Published Checkpoint (v0.1)
423
-
424
- | Detail | Value |
425
- |:--|:--|
426
- | **Step** | 2,471 |
427
- | **Validation Loss** | 1.9926 |
428
- | **Total Tokens Seen** | 178,110,464 |
429
- | **Sequence Length** | 2,048 |
430
- | **File Size** | ~81 GB (checkpoint), ~28 GB (safetensors) |
431
- | **Format** | 6 sharded safetensors files |
432
-
433
- ---
434
-
435
- ## 🌡️ Consciousness Metric (Φ) — Deep Dive
436
-
437
- ### What is Φ?
438
-
439
- Inspired by **Integrated Information Theory (IIT)** from neuroscience, Φ measures how much the model's internal representations form an integrated whole rather than disconnected parts.
440
-
441
- ### How We Measure It
442
-
443
- ```
444
- Every 100 training steps:
445
-
446
- 1. Hook on Layer 12 (middle of 24 layers)
447
-
448
-
449
- 2. Sample 256 activation vectors
450
-
451
-
452
- 3. Partition into subspaces
453
-
454
-
455
- 4. Compute mutual information between all partition pairs
456
-
457
-
458
- 5. Φ_geometric = geometric_mean(MI values)
459
-
460
-
461
- 6. Φ_EMA = exponential moving average (smoothed trend)
462
- ```
463
-
464
- ### What Φ Tells Us
465
-
466
- | Φ Value | Interpretation | Analogy |
467
- |:--|:--|:--|
468
- | **Φ ≈ 0** | Neurons working independently | Strangers in a room |
469
- | **Φ rising** | Representations integrating | People starting to talk |
470
- | **Φ stable** | Organized internal structure | A well-coordinated team |
471
- | **Φ dropping** | ⚠️ Representation collapse | Warning sign! |
472
-
473
- > **Important**: Φ is **purely observational** — it does NOT affect training gradients. Think of it as a heart monitor for the AI: it watches, but doesn't interfere.
474
-
475
- ### Live Monitoring
476
-
477
- Track Φ in real-time at: **[sentinel.qubitpage.com/#phi](https://sentinel.qubitpage.com/#phi)**
478
-
479
- ---
480
-
481
- ## 🖥️ Hardware Requirements
482
-
483
- ### For Inference
484
-
485
- | Tier | VRAM | Precision | Notes |
486
- |:--|:--|:--|:--|
487
- | **Full Precision** | 32 GB+ | bfloat16 | Best quality |
488
- | **Recommended** | 48 GB+ | bfloat16 | Comfortable headroom |
489
- | **Ideal** | AMD MI300X / MI250X | bfloat16 | Native, fastest |
490
- | **Consumer** | 16 GB | int4 quantized | GGUF planned for v0.2 |
491
-
492
- ### Compatible AMD GPUs
493
-
494
- | GPU | VRAM | Suitable For |
495
- |:--|:--|:--|
496
- | AMD Instinct MI300X | 192 GB | Training + Inference |
497
- | AMD Instinct MI250X | 128 GB | Training + Inference |
498
- | AMD Instinct MI210 | 64 GB | Inference (full) |
499
- | AMD Radeon PRO W7900 | 48 GB | Inference (full) |
500
- | AMD Radeon RX 7900 XTX | 24 GB | Inference (quantized) |
501
- | AMD Radeon RX 7600 XT | 16 GB | Inference (int4 GGUF) |
502
-
503
- ---
504
-
505
- ## 💻 Usage
506
-
507
- This model uses a **custom architecture** (not based on any existing model). Load with PyTorch:
508
-
509
- ```python
510
- import torch
511
- from safetensors.torch import load_file
512
-
513
- # Load sharded safetensors
514
- state_dict = {}
515
- for i in range(1, 7): # 6 shards
516
- shard = load_file(f"model-{i:05d}-of-00006.safetensors")
517
- state_dict.update(shard)
518
-
519
- # The state dict contains all model weights
520
- print(f"Loaded {len(state_dict)} tensors")
521
- print(f"Total params: {sum(v.numel() for v in state_dict.values()):,}")
522
-
523
- # Initialize SentinelBrain model class and load
524
- # Full model definition code releasing with v0.2
525
- # model = SentinelBrainForCausalLM(config)
526
- # model.load_state_dict(state_dict)
527
- ```
528
-
529
- > **Note**: Full inference code, model class definition, and GGUF quantized versions will be released with v0.2.
530
-
531
- ---
532
-
533
- ## 🗺️ Roadmap
534
-
535
- ```
536
- v0.1 (Current) v0.2 (Planned) v0.3 (Future)
537
- ━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━
538
- ✅ From-scratch □ Full training □ DPO alignment
539
- 14.8B MoE complete (loss<0.5) □ Tool use
540
- ✅ Phased training □ Context ladder □ Function calling
541
- ✅ Φ consciousness (4K→32K→128K) □ Multi-turn chat
- ✅ 23.3B token corpus □ Vision encoder □ Multilingual v2
- ✅ Live dashboard (SigLIP2-SO400M) □ Expert scaling
- ✅ AMD MI300X native □ GGUF quantization (4→16→64)
- □ Inference code □ RLHF
546
- □ Benchmarks (MMLU, □ Production API
547
- HumanEval, GSM8K)
548
- ```
549
-
550
- ---
551
-
552
- ## 🏗️ How We Built It (Technical Deep Dive)
553
-
554
- <details>
555
- <summary><b>Click to expand: Grouped Query Attention (GQA)</b></summary>
556
-
557
- Standard multi-head attention uses separate Key and Value projections for each head. GQA shares KV heads across query heads:
558
-
559
- ```
560
- Standard MHA (32 KV heads): GQA 4:1 (8 KV heads):
561
- Q₁ Q₂ Q₃ ... Q₃₂ Q₁ Q₂ Q₃ Q₄ → KV₁
562
- K₁ K₂ K₃ ... K₃₂ Q₅ Q₆ Q₇ Q₈ → KV₂
563
- V₁ V₂ V₃ ... V₃₂ ...
564
- Q₂₉ Q₃₀ Q₃₁ Q₃₂ → KV₈
565
- ```
566
-
567
- **Result**: smaller KV cache = longer context at same memory cost.
568
-
569
- </details>
570
-
571
- <details>
572
- <summary><b>Click to expand: RoPE (Rotary Position Embeddings)</b></summary>
573
-
574
- RoPE encodes position information by rotating the query and key vectors in 2D planes. With θ=500,000 (high base frequency), the model naturally supports long contexts:
575
-
576
- ```
577
- Position 0: rotate by 0°
578
- Position 1: rotate by θ₁
579
- Position 2: rotate by θ₂
580
- ...
581
- ```
582
-
583
- High θ = slower rotation = positions further apart still "feel different" = better long-context understanding.
584
-
585
- </details>
586
-
587
- <details>
588
- <summary><b>Click to expand: SwiGLU FFN</b></summary>
589
-
590
- Each expert uses a SwiGLU activation — a gated variant of the feed-forward network:
591
-
592
- ```
593
- FFN(x) = SiLU(x · W_gate) ⊙ (x · W_up) · W_down
594
-
595
- Where:
596
- W_gate: 4096 → 11008
- W_up: 4096 → 11008
- W_down: 11008 → 4096
- SiLU(x) = x · sigmoid(x)
- ⊙ = element-wise multiply
601
- ```
602
-
603
- SwiGLU consistently outperforms ReLU and GELU in transformer FFNs (Shazeer, 2020).
604
-
605
- </details>
606
-
607
- <details>
608
- <summary><b>Click to expand: MoE Routing Algorithm</b></summary>
609
-
610
- ```python
611
- # Simplified routing logic
612
- def route(x, router_weights):
613
- # Compute affinity scores for each expert
614
- logits = x @ router_weights # [batch, seq, n_experts]
615
- scores = softmax(logits, dim=-1)
616
-
617
- # Select top-2 experts
618
- top_vals, top_idx = topk(scores, k=2)
619
-
620
- # Normalize selected weights
621
- weights = top_vals / top_vals.sum(dim=-1, keepdim=True)
622
-
623
- # Load balancing loss (prevents expert collapse)
624
- balance_loss = n_experts * (
625
- fraction_routed_to_each * average_gate_value_for_each
626
- ).sum()
627
-
628
- return weights, top_idx, balance_loss
629
- ```
630
-
631
- </details>
632
-
633
- <details>
634
- <summary><b>Click to expand: Parameter Breakdown</b></summary>
635
-
636
- | Component | Parameters | % of Total |
637
- |:--|:--|:--|
638
- | Token embeddings | 410M | 2.8% |
639
- | Attention (QKV + output) × 24 | 1,610M | 10.9% |
640
- | MoE experts (4 × SwiGLU × 24) | 12,365M | 83.5% |
641
- | Router weights × 24 | 0.4M | 0.003% |
642
- | RMSNorm × 49 | 0.4M | 0.003% |
643
- | Output head | 410M | 2.8% |
644
- | **Total** | **14,815M** | **100%** |
645
- | **Active per token (top-2)** | **~7,800M** | **~53%** |
646
-
647
- </details>
648
-
649
- ---
650
-
651
- ## 📋 Model Card Details
652
-
653
- | Field | Value |
654
- |:--|:--|
655
- | **Model Name** | SentinelBrain-14B-MoE-v0.1 (Sentinel Prime) |
656
- | **Type** | Causal Language Model (decoder-only) |
657
- | **Architecture** | Custom MoE Transformer (from scratch) |
658
- | **Based On** | Nothing — trained from random initialization |
659
- | **Training Hardware** | 1× AMD Instinct MI300X VF (192 GB HBM3) |
660
- | **Training Software** | ROCm 7.0, PyTorch 2.10.0+rocm7.0 |
661
- | **Training Duration** | ~300 GPU-hours (estimated total) |
662
- | **Carbon Footprint** | Estimated ~45 kg CO₂ (single GPU, cloud datacenter) |
663
- | **License** | Apache 2.0 |
664
- | **Authors** | Mircea Rusu, QubitDev |
665
- | **Competition** | AMD Developer Hackathon (lablab.ai) |
666
-
667
- ---
668
-
669
- ## 📄 Citation
670
-
671
- ```bibtex
672
- @misc{sentinelbrain2026,
673
- title = {SentinelBrain-14B-MoE: A Consciousness-Monitored Mixture-of-Experts
674
- Language Model Trained From Scratch on AMD MI300X},
675
- author = {Mircea Rusu and QubitDev},
676
- year = {2026},
677
- url = {https://sentinel.qubitpage.com/whitepaper},
678
- note = {Trained entirely from scratch on AMD Instinct MI300X
679
- for the AMD Developer Hackathon}
680
- }
681
- ```
682
-
683
- ---
684
-
685
- ## 🔗 Links
686
-
687
- | Resource | URL |
688
- |:--|:--|
689
- | 🔴 **Live Dashboard** | [sentinel.qubitpage.com](https://sentinel.qubitpage.com/) |
690
- | 📄 **Whitepaper** | [sentinel.qubitpage.com/whitepaper](https://sentinel.qubitpage.com/whitepaper) |
691
- | 🏆 **AMD Hackathon** | [lablab.ai](https://lablab.ai/ai-hackathons/amd-developer) |
692
- | 🧠 **Φ Monitor** | [sentinel.qubitpage.com/#phi](https://sentinel.qubitpage.com/#phi) |
693
-
694
- ---
695
-
696
- <div align="center">
697
-
698
- *Built with ❤️ on AMD MI300X — Every weight trained from scratch*
699
-
700
- **Sentinel Prime: The First of His Kind**
701
-
702
- </div>
 
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ - ro
6
+ - multilingual
7
+ tags:
8
+ - sentinelbrain
9
+ - mixture-of-experts
10
+ - from-scratch
11
+ - consciousness
12
+ - amd
13
+ - mi300x
14
+ - rocm
15
+ - moe
16
+ - transformer
17
+ - frankenstein
18
+ - knowledge-transplant
19
+ - distillation
20
+ - phi-metric
21
+ pipeline_tag: text-generation
22
+ library_name: pytorch
23
+ datasets:
24
+ - HuggingFaceFW/fineweb-edu
25
+ - open-web-math/open-web-math
26
+ - wikimedia/wikipedia
27
+ - HuggingFaceTB/cosmopedia
28
+ - JeanKaddworr/minipile
29
+ - codeparrot/github-code-clean
30
+ - arxiv-community/arxiv-abstracts
31
+ model-index:
32
+ - name: SentinelBrain-14B-MoE-v0.1
33
+ results:
34
+ - task:
35
+ type: text-generation
36
+ metrics:
37
+ - name: Validation Loss
38
+ type: loss
39
+ value: 1.99
40
+ verified: true
41
+ - name: Training Loss (latest)
42
+ type: loss
43
+ value: 5.18
44
+ verified: true
45
+ ---
46
+
47
+ <div align="center">
48
+
49
+ # 🧠 Sentinel Prime — SentinelBrain-14B-MoE (Frankenstein Edition)
50
+
51
+ ### *The First of His Kind, Rebuilt From the Inside Out*
52
+
53
+ <img src="assets/sentinel_frankenstein_banner.png" alt="Sentinel Prime — Frankenstein Edition" width="600"/>
54
+
55
+ **14.8 Billion Parameters · Mixture-of-Experts · Consciousness-Monitored · Frankenstein Transplant**
56
+
57
+ Trained from zero on **AMD Instinct MI300X** (192 GB HBM3) · ROCm 7.0 · Knowledge transplanted from Qwen-72B
58
+
59
+ [![Dashboard](https://img.shields.io/badge/🔴_Live_Dashboard-sentinel.qubitpage.com-red?style=for-the-badge)](https://sentinel.qubitpage.com/)
60
+ [![Whitepaper](https://img.shields.io/badge/📄_Whitepaper-Read_Now-blue?style=for-the-badge)](https://sentinel.qubitpage.com/whitepaper)
61
+ [![AMD](https://img.shields.io/badge/AMD-MI300X_Native-ED1C24?style=for-the-badge&logo=amd&logoColor=white)](https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html)
62
+ [![License](https://img.shields.io/badge/License-Apache_2.0-green?style=for-the-badge)](LICENSE)
63
+
64
+ </div>
65
+
66
+ ---
67
+
68
+ ## 🎯 What is Sentinel Prime? (Simple Version)
69
+
70
+ > **Imagine building a brain from scratch.**
71
+ >
72
+ > Most AI models today are copies of other models with small changes. Sentinel Prime is different — every single connection in its brain was created from nothing, like growing a new brain cell by cell.
73
+
74
+ <table>
75
+ <tr>
76
+ <td width="50%">
77
+
78
+ ### 🧩 Think of it like LEGO blocks
79
+
80
+ Sentinel Prime has **4 specialist brains** (called "experts") inside it. When you ask a question:
81
+
82
+ 1. A **router** (like a traffic cop 🚦) looks at your question
83
+ 2. It picks the **2 best experts** for that specific question
84
+ 3. Those 2 experts work together to give you an answer
85
+ 4. The other 2 experts rest, saving energy ⚡
86
+
87
+ This means the model has **14.8 billion** brain connections total, but only uses **~7.8 billion** at a time — making it fast AND smart!
88
+
89
+ </td>
90
+ <td width="50%">
91
+
92
+ ### 🔬 The Consciousness Meter
93
+
94
+ We built something no other model has: a **consciousness thermometer** 🌡️
95
+
96
+ Every 100 training steps, we measure how well the different parts of the brain are "talking to each other." We call this **Φ (Phi)**.
97
+
98
+ - **Φ = 0**: Brain parts work alone (like strangers)
99
+ - **Φ rising**: Brain parts start cooperating (like friends)
100
+ - **Φ stable**: Brain has organized itself (like a team!)
101
+
102
+ This doesn't change how the model learns — it's like a doctor checking the heartbeat while the patient exercises.
103
+
104
+ </td>
105
+ </tr>
106
+ </table>
107
+
108
+ ---
109
+
110
+ ## 📊 Architecture at a Glance
111
+
112
+ ```
113
+ ┌─────────────────────────────────────────────┐
+ │         SENTINEL PRIME ARCHITECTURE         │
+ └─────────────────────────────────────────────┘
+
+  Input Text ──→ [Tokenizer: cl100k_base, 100,277 tokens]
+                         │
+                         ▼
+                ┌─────────────────┐
+                │    Embedding    │   4,096 dimensions
+                │    + RoPE pos   │   θ = 500,000
+                └────────┬────────┘
+                         │
+                         ▼   (repeated × 24 layers)
+                ┌─────────────────┐
+                │  GQA Attention  │   32 query heads, 8 KV heads
+                │   (4:1 ratio)   │   (4× memory savings)
+                └────────┬────────┘
+                         ▼
+                ┌─────────────────┐
+                │    MoE Router   │   Top-2 of 4 experts
+                │  ┌──┬──┬──┬──┐  │
+                │  │E1│E2│E3│E4│  │   Each expert: SwiGLU FFN
+                │  │✓ │✓ │  │  │  │   d_ff = 11,008
+                │  └──┴──┴──┴──┘  │
+                └────────┬────────┘
+                         ▼
+                ┌─────────────────┐
+                │     RMSNorm     │   ε = 1e-5
+                └────────┬────────┘
+                         │
+                         ▼
+                ┌─────────────────┐
+                │   Output Head   │ ──→ 100,277 vocab probs
+                └─────────────────┘
151
+ ```
152
+
153
+ ### Spec Sheet
154
+
155
+ | Component | Specification | Why This Choice |
156
+ |:--|:--|:--|
157
+ | **Total Parameters** | 14,814,654,680 (14.8B) | Large enough for deep reasoning |
158
+ | **Active Parameters** | ~7.8B per token | MoE efficiency — use only what's needed |
159
+ | **Hidden Dimension** | 4,096 | Sweet spot for MI300X matrix cores |
160
+ | **Transformer Layers** | 24 | Deep enough for complex reasoning |
161
+ | **Attention Heads** | 32 query, 8 KV (GQA 4:1) | 4× KV cache savings for long contexts |
162
+ | **FFN Intermediate** | 11,008 (SwiGLU) | ~2.7× hidden, matches scaling laws |
163
+ | **Experts** | 4 total, top-2 active | Good diversity with manageable VRAM |
164
+ | **Max Experts** | 256 (expandable) | Architecture supports expert birth/death |
165
+ | **Vocabulary** | 100,277 (tiktoken cl100k_base) | Industry-proven BPE tokenizer |
166
+ | **Positional Encoding** | RoPE, θ = 500,000 | Supports context extension to 128K+ |
167
+ | **Normalization** | RMSNorm (ε = 1e-5) | Faster than LayerNorm, same quality |
168
+ | **Precision** | bfloat16 throughout | Native AMD MI300X support |
169
+ | **Context Length** | 2,048 → 4,096 → 128K (planned) | Progressive context ladder |
170
+
171
+ ---
172
+
173
+ ## 🔥 Key Innovations
174
+
175
+ <table>
176
+ <tr>
177
+ <td width="33%" valign="top">
178
+
179
+ ### 🌀 Φ Consciousness Metric
180
+
181
+ First-ever IIT-inspired metric computed **during** pre-training. A probe on layer 12 measures information integration across activation subspaces every 100 steps.
182
+
183
+ ```
184
+ Φ = geometric_mean(
185
+ MI(partition_i, partition_j)
186
+ for all partition pairs
187
+ )
188
+ ```
189
+
190
+ Not a gimmick — it's a genuine signal of when the model transitions from memorizing tokens to forming integrated representations.
191
+
192
+ </td>
193
+ <td width="33%" valign="top">
194
+
195
+ ### 🧬 Self-Evolving Experts
196
+
197
+ The MoE router supports a full expert **lifecycle**:
198
+
199
+ - **Birth**: New experts spawned when load imbalance detected
200
+ - **Growth**: Expert capacity increases with training
201
+ - **Pruning**: Underperforming experts replaced
202
+ - **Scaling**: Architecture supports up to 256 experts without retraining the base model
203
+
204
+ Current: 4 experts × 24 layers = **96 expert instances**
205
+
206
+ </td>
207
+ <td width="33%" valign="top">
208
+
209
+ ### ⚡ Energy-Conscious Routing
210
+
211
+ Dual-router system:
212
+ 1. **Primary router**: Picks top-2 experts by relevance
213
+ 2. **EC router**: Can gate activation based on compute budget
214
+
215
+ This enables **adaptive inference** — easy questions use fewer resources, hard questions get full power. Like cruise control for AI.
216
+
217
+ </td>
218
+ </tr>
219
+ </table>
220
+
221
+ ---
222
+
223
225
+
226
+ ## 🧟 Frankenstein Edition Knowledge Transplant
227
+
228
+ <table>
229
+ <tr>
230
+ <td width="60%" valign="top">
231
+
232
+ ### The Transplant
233
+
234
+ Sentinel Prime was trained from scratch — but raw pretraining alone wasn't enough. We performed a **Frankenstein transplant**: surgically transplanting knowledge from **Qwen2.5-72B-Instruct** (a 72-billion parameter teacher) into our 14.8B MoE architecture.
235
+
236
+ This is NOT fine-tuning a copy. The model's bones (architecture, tokenizer, embeddings) are 100% original. Only the **expert FFN weights** received transplanted knowledge — like giving a brain new neural pathways while keeping its original structure.
237
+
238
+ ### 3-Stage Pipeline
239
+
240
+ ```
241
+ Stage 1: Corpus Realignment Stage 2A: Teacher Generation Stage 2B: Knowledge Distill
242
+ (Re-learn with new weights) (72B teacher creates data) (Absorb teacher knowledge)
243
+ ┌──────────────────────┐      ┌──────────────────────┐      ┌──────────────────────┐
+ │ 5,000 steps          │  →   │ 3,000+ responses     │  →   │ CE + mixed training  │
+ │ 24.5B token corpus   │      │ from Qwen-72B        │      │ 70% teacher + 30%    │
+ │ Progressive unfreeze │      │ Re-tokenized to our  │      │ pretrain corpus      │
+ │ Cosine LR + warmup   │      │ cl100k_base vocab    │      │ Prevents forgetting  │
+ └──────────────────────┘      └──────────────────────┘      └──────────────────────┘
249
+ ```
250
+
251
+ </td>
252
+ <td width="40%" valign="top">
253
+
254
+ ### Why "Frankenstein"?
255
+
256
+ Like the original story, we took parts from a powerful being (Qwen-72B) and stitched them into our own creation. The result: a model that has the **original architecture** of Sentinel Prime but with **transplanted knowledge** from a much larger model.
257
+
258
+ ### Key Stats
259
+
260
+ | Metric | Value |
261
+ |:--|:--|
262
+ | **Teacher** | Qwen2.5-72B-Instruct |
263
+ | **Student** | SentinelBrain-14B-MoE |
264
+ | **Transplant** | Expert FFN weights |
265
+ | **Realignment** | 5,000 steps on 24.5B tokens |
266
+ | **Hardware** | 1× AMD MI300X (192GB) |
267
+
268
+ ### Live Progress
269
+
270
+ Track the Frankenstein realignment in real-time:
271
+
272
+ 🔴 **[sentinel.qubitpage.com](https://sentinel.qubitpage.com/)**
273
+
274
+ </td>
275
+ </tr>
276
+ </table>
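+ The Stage 2B objective described above can be sketched in a few lines. This is an illustrative, minimal example (not the project's actual training loop): batches are drawn from the teacher-generated responses about 70% of the time and from the original pretrain corpus otherwise, and the student is trained with plain next-token cross-entropy on whichever batch was drawn. `student`, the two batch iterators, and `optimizer` are assumed to exist.
+
+ ```python
+ import random
+ import torch
+ import torch.nn.functional as F
+
+ TEACHER_MIX = 0.7  # 70% teacher-generated data, 30% pretrain data
+
+ def distill_step(student, teacher_batches, pretrain_batches, optimizer):
+     # Data-level mixing: most steps absorb teacher knowledge,
+     # the rest revisit the original corpus to prevent forgetting.
+     source = teacher_batches if random.random() < TEACHER_MIX else pretrain_batches
+     tokens = next(source)                       # [batch, seq] token ids
+
+     logits = student(tokens[:, :-1])            # predict the next token
+     loss = F.cross_entropy(
+         logits.reshape(-1, logits.size(-1)),
+         tokens[:, 1:].reshape(-1),
+     )
+
+     optimizer.zero_grad(set_to_none=True)
+     loss.backward()
+     torch.nn.utils.clip_grad_norm_(student.parameters(), 1.0)  # grad clip = 1.0
+     optimizer.step()
+     return loss.item()
+ ```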
277
+
278
+ ## 🏋️ Training Details
279
+
280
+ ### Hardware
281
+
282
+ | Resource | Specification |
283
+ |:--|:--|
284
+ | **GPU** | 1× AMD Instinct MI300X VF |
285
+ | **VRAM** | 192 GB HBM3 |
286
+ | **System RAM** | 235 GB |
287
+ | **Compute** | 1,307 TFLOPS (bf16) |
288
+ | **Software** | ROCm 7.0, PyTorch 2.10.0+rocm7.0 |
289
+ | **Attention** | SDPA (native PyTorch, no FlashAttention needed) |
290
+ | **OS** | Ubuntu Linux |
291
+
292
+ ### VRAM Budget
293
+
294
+ ```
295
+ ╔══════════════════════════════════════════════════════╗
296
+ ║ AMD MI300X VRAM Usage (192 GB) ║
297
+ ╠══════════════════════════════════════════════════════╣
298
+ ║ ║
299
+ ║ Model Weights (bf16) ████████████░░░░░ 27 GB ║
300
+ ║ Optimizer (AdamW fp32) ████████████████░░ 54 GB ║
301
+ ║ Activations (grad ckpt) ████████████░░░░░ 32 GB ║
302
+ ║ Gradients ████████████░░░░░ 27 GB ║
303
+ ║ ───────────────────────────────────────────────── ║
304
+ ║ Total Used: ██████████████████ 140 GB ║
305
+ ║ Peak: █████████████████ 146 GB ║
306
+ ║ Headroom: ░░░░░░░░░░░░░░░░░ 46 GB ║
307
+ ║ ║
308
+ ╚══════════════════════════════════════════════════════╝
309
+ ```
310
+
311
+ ### Phased Training Pipeline
312
+
313
+ We don't just throw data at the model — we grow it in **three phases**, like raising a child:
314
+
315
+ ```
316
+ Phase 1: SMOKE TEST Phase 2: WARMUP Phase 3: FULL TRAINING
317
+ (Baby steps) (Learning to walk) (Running!)
318
+ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐
319
+ │ 350M params │ ──→ │ 1.3B params │ ──→ │ 14.4B params │
320
+ │ seq_len: 512 │ │ seq_len: 2K │ │ seq_len: 4K │
321
+ │ 200 steps │ │ 1,000 steps │ │ 16,479 steps │
322
+ │ 2 minutes │ │ 30 minutes │ │ ~52 hours │
323
+ │ loss: 11→6.8 │ │ loss: 7.4→2.4│ │ loss: 2.4→? │
324
+ └──────────────┘ └──────────────┘ └──────────────────┘
325
+ ```
326
+
327
+ | Phase | Parameters | Seq Length | Batch | Steps | Duration | Loss Start → End |
328
+ |:--|:--|:--|:--|:--|:--|:--|
329
+ | **🔬 Smoke** | 350M | 512 | 4 | 200 | ~2 min | 11.72 → 6.84 (−42%) |
330
+ | **🔥 Warmup** | 1.3B | 2,048 | 32 | 1,000 | ~33 min | 7.39 → 2.38 (−68%) |
331
+ | **🚀 Block** | 14.4B (MoE) | 4,096 | 32 | 16,479 | ~52 hrs | 2.38 → ongoing |
332
+
333
+ ### Safety Gates
334
+
335
+ Every phase transition must pass **4 safety gates**:
336
+
337
+ | Gate | Check | Threshold | Status |
338
+ |:--|:--|:--|:--|
339
+ | 🟢 **G1: No NaN** | No NaN/Inf in loss | Entire phase | ✅ Passed all |
+ | 🟢 **G2: Loss Drop** | Validation loss decreased | ≥5% / ≥10% / ≥2% | ✅ Passed all |
+ | 🟢 **G3: VRAM OK** | Peak VRAM < safety limit | < 92% of total | ✅ 71% peak |
+ | 🟢 **G4: Φ OK** | Consciousness metric stable | Φ_end/Φ_start > 0.7 | ✅ Stable |
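+ For illustration, a phase-transition check along the lines of the table above might look like this (a hypothetical helper, not the actual pipeline code; the example values correspond to the warmup → block transition):
+
+ ```python
+ import math
+
+ def gates_pass(losses, val_loss_start, val_loss_end, min_drop,
+                peak_vram_gb, total_vram_gb, phi_start, phi_end):
+     """Return True only if all four phase-transition gates pass."""
+     g1 = all(math.isfinite(l) for l in losses)                          # G1: no NaN/Inf
+     g2 = (val_loss_start - val_loss_end) / val_loss_start >= min_drop   # G2: loss drop
+     g3 = peak_vram_gb < 0.92 * total_vram_gb                            # G3: VRAM < 92%
+     g4 = phi_end / phi_start > 0.7                                      # G4: Φ stable
+     return all([g1, g2, g3, g4])
+
+ # Warmup → block example; Φ values are purely illustrative
+ print(gates_pass(losses=[7.39, 4.10, 2.38], val_loss_start=7.39, val_loss_end=2.38,
+                  min_drop=0.10, peak_vram_gb=146, total_vram_gb=192,
+                  phi_start=1.0, phi_end=0.9))
+ ```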
343
+
344
+ ### Hyperparameters
345
+
346
+ | Parameter | Value | Rationale |
347
+ |:--|:--|:--|
348
+ | **Optimizer** | AdamW (bf16 compute, fp32 states) | Standard for LLM training |
349
+ | **Learning Rate** | 1.5 × 10⁻⁴ (cosine decay) | Conservative for data-limited regime |
350
+ | **Min LR** | 1.5 × 10⁻⁵ | 10× decay ratio |
351
+ | **Warmup Steps** | 500 | Stabilizes early gradients |
352
+ | **Batch Size** | 2 micro × 16 grad_accum = **32 effective** | Fits MI300X VRAM budget |
353
+ | **Gradient Clipping** | 1.0 | Prevents explosion |
354
+ | **Gradient Checkpointing** | On | Trades compute for VRAM |
355
+ | **Precision** | bfloat16 | Native MI300X format |
356
+ | **Eval Frequency** | Every 100 steps | Early overfitting detection |
357
+ | **Checkpoint Frequency** | Every 1,000 steps (~3.2 hours) | Recovery points |
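+ The learning-rate schedule in the table (500 warmup steps, cosine decay from 1.5 × 10⁻⁴ down to 1.5 × 10⁻⁵) corresponds to the standard warmup-plus-cosine formula. A small sketch, not copied from the training scripts:
+
+ ```python
+ import math
+
+ def lr_at(step, total_steps, max_lr=1.5e-4, min_lr=1.5e-5, warmup=500):
+     # Linear warmup to max_lr, then cosine decay down to min_lr.
+     if step < warmup:
+         return max_lr * (step + 1) / warmup
+     progress = (step - warmup) / max(1, total_steps - warmup)
+     cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
+     return min_lr + (max_lr - min_lr) * cosine
+
+ # e.g. the ~16,479-step block phase
+ print(lr_at(0, 16_479), lr_at(500, 16_479), lr_at(16_479, 16_479))
+ ```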
358
+
359
+ ---
360
+
361
+ ## 📚 Dataset: 23.3B Tokens Across 126 Categories
362
+
363
+ We curated a massive, diverse corpus — think of it as a **library with 126 different sections**:
364
+
365
+ ### Pretrain Corpus (Core Knowledge)
366
+
367
+ | Dataset | Tokens | Description |
368
+ |:--|:--|:--|
369
+ | 🌐 **FineWeb-Edu** | ~10B | High-quality educational web content |
370
+ | 🔢 **OpenWebMath** | ~6B | Mathematics from the web |
371
+ | 📖 **Wikipedia (English)** | ~5B | Encyclopedic knowledge |
372
+ | 🎓 **Cosmopedia V2** | ~5B | Synthetic educational content |
373
+ | 💻 **CodeParrot Python** | ~3.5B | Clean Python code from GitHub |
374
+ | 📚 **MiniPile** | ~2B | Diverse text from multiple domains |
375
+ | 🔬 **ArXiv Abstracts** | ~1.2B | Scientific paper summaries |
376
+ | **Total Pretrain** | **~23B** | |
377
+
378
+ ### Specialized Domains (119 Categories)
379
+
380
+ <details>
381
+ <summary>Click to expand all 119 specialized categories</summary>
382
+
383
+ | Category | Type | Category | Type |
384
+ |:--|:--|:--|:--|
385
+ | 🤖 agentic-tools | Code | 🔐 advanced-cryptography | Code |
386
+ | 🧠 chain-of-thought | Reasoning | 🔗 blockchain-core | Code |
387
+ | 💡 deep-reasoning | Reasoning | 🏥 medical | Knowledge |
388
+ | ⚖️ legal | Knowledge | 📊 financial-systems | Code |
389
+ | 🎮 3d-graphics | Code | 🐳 docker-devops | Code |
390
+ | 🌍 multilingual | Text | 🔧 error-recovery | Code |
391
+ | 🛡️ security-guardrails | Code | 📱 ui-animations | Code |
392
+ | 🧮 math | Reasoning | ⚡ smart-contracts | Code |
393
+ | 🎯 reasoning-effort-control | Reasoning | 🤝 human-conversation | Text |
394
+ | 🔄 self-correction-loops | Reasoning | 🏗️ enterprise-dashboards | Code |
395
+ | 🌐 web-design-css | Code | 🐍 flask-python | Code |
396
+ | 🔬 qiskit-quantum | Code | 🤖 robotics-ros2 | Code |
397
+ | 📡 remote-server-management | Code | 🧬 multi-agent | Code |
398
+ | ⚙️ state-management | Code | 🛠️ mcp-tools-integration | Code |
399
+ | 💳 payment-security | Code | 🎓 edu-basic-math | Education |
400
+ | 🔭 edu-basic-physics | Education | 🧪 edu-basic-chemistry | Education |
401
+ | 🌱 edu-basic-biology | Education | 🌍 edu-world-geography | Education |
402
+ | 📜 edu-history-world | Education | 💻 edu-computer-science | Education |
403
+ | 🌎 edu-earth-science | Education | 🤖 edu-robotics-text | Education |
404
+ | 📖 edu-science-qa | Education | 🔬 edu-science-support | Education |
405
+ | 👁️ edu-vision-concepts | Education | 🎯 copilot-agent-workflows | Code |
406
+ | 🔌 api-integrations | Code | 📊 billing-invoicing | Code |
407
+ | ₿ bitcoin-lightning | Code | 🏪 medusajs | Code |
408
+ | 💹 crypto-trading | Code | 🏢 enterprise-networking | Code |
409
+ | 🖥️ nextjs-typescript | Code | 🎨 nextjs-design | Code |
410
+ | 💼 trading-algorithms | Code | 🗄️ laravel-mysql | Code |
411
+ | 🔓 offensive-security | Code | 🔧 c-rust | Code |
412
+ | ... and 50+ more categories | | | |
413
+
414
+ </details>
415
+
416
+ ### Data Quality Pipeline
417
+
418
+ ```
419
+ Raw Data ──→ PII Filter ──→ Dedup ──→ Tokenize ──→ Shard ──→ Train
420
+ │ │ │ │
421
+ ├─ 7 regex ├─ blake2b ├─ cl100k ├─ Temperature-
422
+ │ patterns │ per-cat │ base │ weighted
423
+ ├─ PEM block │ │ │ sampling
424
+ │ detection │ │ │ (T=0.5)
425
+ └─ Email/phone │ │ │
426
+ masking │ │ │
427
+ │ │ │
428
+ └───────────┴────────────┘
429
+ ```
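+ As a rough illustration of the dedup stage (a hypothetical helper that processes one record at a time; the real pipeline also applies the other PII regex patterns shown above), per-category blake2b hashing could look like:
+
+ ```python
+ import hashlib
+ import re
+ from collections import defaultdict
+
+ EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")   # one of several PII patterns
+
+ seen_hashes = defaultdict(set)   # category -> set of blake2b digests
+
+ def keep_document(category: str, text: str) -> bool:
+     """Drop exact duplicates within a category; mask emails before hashing."""
+     masked = EMAIL_RE.sub("<EMAIL>", text)
+     digest = hashlib.blake2b(masked.encode("utf-8"), digest_size=16).hexdigest()
+     if digest in seen_hashes[category]:
+         return False
+     seen_hashes[category].add(digest)
+     return True
+ ```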
430
+
431
+ **Temperature-weighted sampling** (T=0.5) prevents large corpora from dominating training. FineWeb-Edu (37% of tokens) gets downweighted so smaller specialized domains still get adequate exposure.
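+ To make the T=0.5 weighting concrete, here is a minimal sketch of how sampling probabilities could be derived from per-corpus token counts (counts taken from the table above; this is not the exact sampler used):
+
+ ```python
+ # p_i ∝ (n_i / Σ n_j) ** T, with T = 0.5 flattening the distribution
+ token_counts = {"fineweb-edu": 10e9, "openwebmath": 6e9, "wikipedia": 5e9,
+                 "cosmopedia": 5e9, "codeparrot": 3.5e9, "minipile": 2e9,
+                 "arxiv": 1.2e9}
+ T = 0.5
+ total = sum(token_counts.values())
+ raw = {name: (n / total) ** T for name, n in token_counts.items()}
+ z = sum(raw.values())
+ probs = {name: w / z for name, w in raw.items()}
+ print(probs)  # FineWeb-Edu's sampling share shrinks relative to its raw token share
+ ```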
432
+
433
+ ---
434
+
435
+ ## 📈 Training Progress & Results
436
+
437
+ ### Loss Trajectory
438
+
439
+ ```
440
+ Loss
441
+ 12 ×
442
+ │ ╲
443
+ 10 │ ╲ SMOKE PHASE
444
+ │ ╲ (350M params)
445
+ 8 │ ╲
446
+ │ ╲
447
+ 6 │ ×──────────── model grows to 1.3B
448
+ │ ╲
449
+ 4 │ ╲ WARMUP PHASE
450
+ │ ╲ (1.3B params)
451
+ 2 │ ×─────────── model grows to 14.4B MoE
452
+ │ ╲
453
+ 1 │ ╲ BLOCK PHASE (ongoing)
454
+ │ ╲
455
+ └──┬────┬────┬────┬────┬───→ Steps
456
+ 0 200 700 1200 2000
457
+ ```
458
+
459
+ | Milestone | Step | Loss | Change |
460
+ |:--|:--|:--|:--|
461
+ | 🔬 Smoke start | 0 | 11.72 | — |
462
+ | 🔬 Smoke end | 200 | 6.84 | **−42%** |
463
+ | 🔥 Warmup start | 200 | 7.39 | (model grew to 1.3B) |
464
+ | 🔥 Warmup end | 1,200 | 2.38 | **−68%** |
465
+ | 🚀 Block start | 1,200 | 2.38 | (model grew to 14.4B MoE) |
466
+ | 📦 Published checkpoint | 2,471 | 1.99 | **−16%** |
467
+ | 🔄 Current (new run) | 410 | 5.18 | training with expanded data |
468
+ | **Total reduction** | | | **11.72 → 1.99 (−83%)** |
469
+
470
+ ### Live Metrics (April 27, 2026)
471
+
472
+ | Metric | Value |
473
+ |:--|:--|
474
+ | **Current Step** | 410 / 2,471+ |
475
+ | **Training Loss** | 5.18 (new run, expanded datasets) |
476
+ | **Throughput** | 4,403 tokens/second |
477
+ | **VRAM Used** | ~140 GB / 192 GB (73%) |
478
+ | **Total Tokens Processed** | 59.3M (this run) + 178M (prev run) |
479
+ | **Experts Active** | 4 per layer × 24 layers = 96 |
480
+ | **ETA (this block)** | ~18.8 hours |
481
+
482
+ ### Published Checkpoint (v0.1)
483
+
484
+ | Detail | Value |
485
+ |:--|:--|
486
+ | **Step** | 2,471 |
487
+ | **Validation Loss** | 1.9926 |
488
+ | **Total Tokens Seen** | 178,110,464 |
489
+ | **Sequence Length** | 2,048 |
490
+ | **File Size** | ~81 GB (checkpoint), ~28 GB (safetensors) |
491
+ | **Format** | 6 sharded safetensors files |
492
+
493
+ ---
494
+
495
+ ## 🌡️ Consciousness Metric (Φ) — Deep Dive
496
+
497
+ ### What is Φ?
498
+
499
+ Inspired by **Integrated Information Theory (IIT)** from neuroscience, Φ measures how much the model's internal representations form an integrated whole rather than disconnected parts.
500
+
501
+ ### How We Measure It
502
+
503
+ ```
504
+ Every 100 training steps:
505
+
506
+ 1. Hook on Layer 12 (middle of 24 layers)
507
+
508
+
509
+ 2. Sample 256 activation vectors
510
+
511
+
512
+ 3. Partition into subspaces
513
+
514
+
515
+ 4. Compute mutual information between all partition pairs
516
+
517
+
518
+ 5. Φ_geometric = geometric_mean(MI values)
519
+
520
+
521
+ 6. Φ_EMA = exponential moving average (smoothed trend)
522
+ ```
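+ A minimal sketch of such a probe, given a batch of activations captured from layer 12 (e.g. via a forward hook). The mutual-information estimator here is a crude Gaussian stand-in based on pairwise correlation; the dashboard's Φ estimator may differ:
+
+ ```python
+ import math
+ from itertools import combinations
+ import torch
+
+ def phi_probe(hidden, n_parts=4, eps=1e-6):
+     """hidden: [n_samples, d_model] activations sampled from layer 12."""
+     parts = hidden.chunk(n_parts, dim=-1)            # partition into subspaces
+     summaries = [p.mean(dim=-1) for p in parts]      # one scalar summary per sample
+     mis = []
+     for a, b in combinations(summaries, 2):
+         rho = torch.corrcoef(torch.stack([a, b]))[0, 1].clamp(-0.999, 0.999)
+         # Gaussian approximation: MI = -0.5 * log(1 - rho^2)
+         mis.append(-0.5 * math.log(1 - rho.item() ** 2) + eps)
+     # geometric mean over all partition pairs
+     return math.exp(sum(math.log(m) for m in mis) / len(mis))
+
+ # Example with 256 sampled activation vectors of width 4,096
+ print(phi_probe(torch.randn(256, 4096)))
+ ```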
523
+
524
+ ### What Φ Tells Us
525
+
526
+ | Φ Value | Interpretation | Analogy |
527
+ |:--|:--|:--|
528
+ | **Φ ≈ 0** | Neurons working independently | Strangers in a room |
529
+ | **Φ rising** | Representations integrating | People starting to talk |
530
+ | **Φ stable** | Organized internal structure | A well-coordinated team |
531
+ | **Φ dropping** | ⚠️ Representation collapse | Warning sign! |
532
+
533
+ > **Important**: Φ is **purely observational** — it does NOT affect training gradients. Think of it as a heart monitor for the AI: it watches, but doesn't interfere.
534
+
535
+ ### Live Monitoring
536
+
537
+ Track Φ in real-time at: **[sentinel.qubitpage.com/#phi](https://sentinel.qubitpage.com/#phi)**
538
+
539
+ ---
540
+
541
+ ## 🖥️ Hardware Requirements
542
+
543
+ ### For Inference
544
+
545
+ | Tier | VRAM | Precision | Notes |
546
+ |:--|:--|:--|:--|
547
+ | **Full Precision** | 32 GB+ | bfloat16 | Best quality |
548
+ | **Recommended** | 48 GB+ | bfloat16 | Comfortable headroom |
549
+ | **Ideal** | AMD MI300X / MI250X | bfloat16 | Native, fastest |
550
+ | **Consumer** | 16 GB | int4 quantized | GGUF planned for v0.2 |
551
+
552
+ ### Compatible AMD GPUs
553
+
554
+ | GPU | VRAM | Suitable For |
555
+ |:--|:--|:--|
556
+ | AMD Instinct MI300X | 192 GB | Training + Inference |
557
+ | AMD Instinct MI250X | 128 GB | Training + Inference |
558
+ | AMD Instinct MI210 | 64 GB | Inference (full) |
559
+ | AMD Radeon PRO W7900 | 48 GB | Inference (full) |
560
+ | AMD Radeon RX 7900 XTX | 24 GB | Inference (quantized) |
561
+ | AMD Radeon RX 7600 XT | 16 GB | Inference (int4 GGUF) |
562
+
563
+ ---
564
+
565
+ ## 💻 Usage
566
+
567
+ This model uses a **custom architecture** (not based on any existing model). Load with PyTorch:
568
+
569
+ ```python
570
+ import torch
571
+ from safetensors.torch import load_file
572
+
573
+ # Load sharded safetensors
574
+ state_dict = {}
575
+ for i in range(1, 7): # 6 shards
576
+ shard = load_file(f"model-{i:05d}-of-00006.safetensors")
577
+ state_dict.update(shard)
578
+
579
+ # The state dict contains all model weights
580
+ print(f"Loaded {len(state_dict)} tensors")
581
+ print(f"Total params: {sum(v.numel() for v in state_dict.values()):,}")
582
+
583
+ # Initialize SentinelBrain model class and load
584
+ # Full model class definition will be released with v0.2
585
+ # model = SentinelBrainForCausalLM(config)
586
+ # model.load_state_dict(state_dict)
587
+ ```
588
+
589
+ > **Note**: Full inference code, model class definition, and GGUF quantized versions will be released with v0.2.
590
+
591
+ ---
592
+
593
+ ## 🗺️ Roadmap
594
+
595
+ ```
596
+ v0.1 (Current) v0.2 (Planned) v0.3 (Future)
597
+ ━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━
598
+ ✅ From-scratch □ Full training □ DPO alignment
+ 14.8B MoE complete (loss<0.5) □ Tool use
+ ✅ Phased training □ Context ladder □ Function calling
+ ✅ Φ consciousness (4K→32K→128K) □ Multi-turn chat
+ ✅ 23.3B token corpus □ Vision encoder □ Multilingual v2
+ ✅ Live dashboard (SigLIP2-SO400M) □ Expert scaling
604
+ ✅ AMD MI300X native □ GGUF quantization (4→16→64)
605
+ □ Inference code □ RLHF
606
+ □ Benchmarks (MMLU, □ Production API
607
+ HumanEval, GSM8K)
608
+ ```
609
+
610
+ ---
611
+
612
+ ## 🏗️ How We Built It (Technical Deep Dive)
613
+
614
+ <details>
615
+ <summary><b>Click to expand: Grouped Query Attention (GQA)</b></summary>
616
+
617
+ Standard multi-head attention uses separate Key and Value projections for each head. GQA shares KV heads across query heads:
618
+
619
+ ```
620
+ Standard MHA (32 KV heads): GQA 4:1 (8 KV heads):
621
+ Q₁ Q₂ Q₃ ... Q₃₂ Q₁ Q₂ Q₃ Q₄ → KV₁
622
+ K₁ K₂ K₃ ... K₃₂ Q₅ Q₆ Q₇ Q₈ → KV₂
623
+ V₁ V₂ V₃ ... V₃₂ ...
624
+ Q₂₉ Q₃₀ Q₃₁ Q₃₂ → KV₈
625
+ ```
626
+
627
+ **Result**: 4× smaller KV cache = 4× longer context at same memory cost.
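+ A minimal sketch of the 4:1 sharing with PyTorch's built-in SDPA (illustrative shapes only, not the model's actual attention module): each of the 8 KV heads is repeated so that a group of 4 query heads attends over the same keys and values.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ B, T, n_q, n_kv, d_head = 2, 128, 32, 8, 128   # 32 query heads, 8 KV heads, 4096/32 dims
+
+ q = torch.randn(B, n_q, T, d_head)
+ k = torch.randn(B, n_kv, T, d_head)
+ v = torch.randn(B, n_kv, T, d_head)
+
+ # Repeat each KV head 4 times so it is shared by a group of 4 query heads
+ k = k.repeat_interleave(n_q // n_kv, dim=1)    # [B, 32, T, d_head]
+ v = v.repeat_interleave(n_q // n_kv, dim=1)
+
+ out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
+ print(out.shape)   # torch.Size([2, 32, 128, 128])
+ ```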
628
+
629
+ </details>
630
+
631
+ <details>
632
+ <summary><b>Click to expand: RoPE (Rotary Position Embeddings)</b></summary>
633
+
634
+ RoPE encodes position information by rotating the query and key vectors in 2D planes. With θ=500,000 (high base frequency), the model naturally supports long contexts:
635
+
636
+ ```
637
+ Position 0: rotate by 0°
638
+ Position 1: rotate by θ₁
639
+ Position 2: rotate by θ₂
640
+ ...
641
+ ```
642
+
643
+ High θ = slower rotation = positions further apart still "feel different" = better long-context understanding.
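+ For reference, a compact generic RoPE implementation with base θ = 500,000 (a sketch, not necessarily the exact code used here):
+
+ ```python
+ import torch
+
+ def rope(x, theta=500_000.0):
+     """x: [batch, heads, seq, d_head] with even d_head; rotate pairs of dims."""
+     b, h, t, d = x.shape
+     inv_freq = 1.0 / (theta ** (torch.arange(0, d, 2).float() / d))   # [d/2]
+     angles = torch.arange(t).float()[:, None] * inv_freq[None, :]     # [t, d/2]
+     cos, sin = angles.cos(), angles.sin()
+     x1, x2 = x[..., 0::2], x[..., 1::2]
+     # 2D rotation of each (x1, x2) pair by a position-dependent angle
+     out = torch.empty_like(x)
+     out[..., 0::2] = x1 * cos - x2 * sin
+     out[..., 1::2] = x1 * sin + x2 * cos
+     return out
+
+ q_rot = rope(torch.randn(1, 32, 16, 128))   # queries and keys are both rotated
+ ```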
644
+
645
+ </details>
646
+
647
+ <details>
648
+ <summary><b>Click to expand: SwiGLU FFN</b></summary>
649
+
650
+ Each expert uses a SwiGLU activation — a gated variant of the feed-forward network:
651
+
652
+ ```
653
+ FFN(x) = SiLU(x · W_gate) ⊙ (x · W_up) · W_down
654
+
655
+ Where:
656
+ W_gate: 4096 → 11008
+ W_up: 4096 → 11008
+ W_down: 11008 → 4096
+ SiLU(x) = x · sigmoid(x)
+ ⊙ = element-wise multiply
661
+ ```
662
+
663
+ SwiGLU consistently outperforms ReLU and GELU in transformer FFNs (Shazeer, 2020).
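+ A self-contained PyTorch version of one expert's FFN with the dimensions above (a faithful sketch of the formula; the project's own expert class may differ):
+
+ ```python
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ class SwiGLUExpert(nn.Module):
+     def __init__(self, d_model=4096, d_ff=11008):
+         super().__init__()
+         self.w_gate = nn.Linear(d_model, d_ff, bias=False)
+         self.w_up   = nn.Linear(d_model, d_ff, bias=False)
+         self.w_down = nn.Linear(d_ff, d_model, bias=False)
+
+     def forward(self, x):
+         # SiLU(x·W_gate) ⊙ (x·W_up), projected back down to d_model
+         return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
+
+ y = SwiGLUExpert()(torch.randn(2, 8, 4096))   # -> [2, 8, 4096]
+ ```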
664
+
665
+ </details>
666
+
667
+ <details>
668
+ <summary><b>Click to expand: MoE Routing Algorithm</b></summary>
669
+
670
+ ```python
671
+ # Simplified top-2 routing with the auxiliary load-balancing loss
+ import torch
+ import torch.nn.functional as F
+
+ def route(x, router_weights, n_experts=4, k=2):
+     # Affinity scores for each expert
+     logits = x @ router_weights              # [batch, seq, n_experts]
+     scores = F.softmax(logits, dim=-1)
+
+     # Select the top-2 experts per token
+     top_vals, top_idx = scores.topk(k, dim=-1)
+
+     # Normalize the selected gate values so they sum to 1
+     weights = top_vals / top_vals.sum(dim=-1, keepdim=True)
+
+     # Load-balancing loss (prevents expert collapse):
+     # how often each expert is selected, times its average gate value
+     one_hot = F.one_hot(top_idx, n_experts).float()          # [batch, seq, k, E]
+     fraction_routed = one_hot.sum(dim=2).mean(dim=(0, 1))    # [E]
+     mean_gate = scores.mean(dim=(0, 1))                      # [E]
+     balance_loss = n_experts * (fraction_routed * mean_gate).sum()
+
+     return weights, top_idx, balance_loss
689
+ ```
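+ For example, calling the router on random hidden states (purely illustrative):
+
+ ```python
+ import torch
+
+ x = torch.randn(2, 16, 4096)            # [batch, seq, d_model]
+ router_weights = torch.randn(4096, 4)   # one column of scores per expert
+ weights, top_idx, balance_loss = route(x, router_weights)
+ print(weights.shape, top_idx.shape, balance_loss.item())   # [2,16,2], [2,16,2], scalar
+ ```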
690
+
691
+ </details>
692
+
693
+ <details>
694
+ <summary><b>Click to expand: Parameter Breakdown</b></summary>
695
+
696
+ | Component | Parameters | % of Total |
697
+ |:--|:--|:--|
698
+ | Token embeddings | 410M | 2.8% |
699
+ | Attention (QKV + output) × 24 | 1,610M | 10.9% |
700
+ | MoE experts (4 × SwiGLU × 24) | 12,365M | 83.5% |
701
+ | Router weights × 24 | 0.4M | 0.003% |
702
+ | RMSNorm × 49 | 0.4M | 0.003% |
703
+ | Output head | 410M | 2.8% |
704
+ | **Total** | **14,815M** | **100%** |
705
+ | **Active per token (top-2)** | **~7,800M** | **~53%** |
706
+
707
+ </details>
708
+
709
+ ---
710
+
711
+ ## 📋 Model Card Details
712
+
713
+ | Field | Value |
714
+ |:--|:--|
715
+ | **Model Name** | SentinelBrain-14B-MoE-v0.1 (Sentinel Prime — Frankenstein Edition) |
716
+ | **Type** | Causal Language Model (decoder-only) |
717
+ | **Architecture** | Custom MoE Transformer (from scratch) |
718
+ | **Based On** | Nothing — trained from random initialization (knowledge later distilled from Qwen2.5-72B-Instruct; see the Frankenstein transplant above) |
719
+ | **Training Hardware** | 1× AMD Instinct MI300X VF (192 GB HBM3) |
720
+ | **Training Software** | ROCm 7.0, PyTorch 2.10.0+rocm7.0 |
721
+ | **Training Duration** | ~300 GPU-hours (estimated total) |
722
+ | **Carbon Footprint** | Estimated ~45 kg CO₂ (single GPU, cloud datacenter) |
723
+ | **License** | Apache 2.0 |
724
+ | **Authors** | Mircea Rusu, QubitDev |
725
+ | **Competition** | AMD Developer Hackathon (lablab.ai) |
726
+
727
+ ---
728
+
729
+ ## 📄 Citation
730
+
731
+ ```bibtex
732
+ @misc{sentinelbrain2026,
733
+ title = {SentinelBrain-14B-MoE (Frankenstein Edition): A Consciousness-Monitored Mixture-of-Experts
734
+ Language Model Trained From Scratch on AMD MI300X},
735
+ author = {Mircea Rusu and QubitDev},
736
+ year = {2026},
737
+ url = {https://sentinel.qubitpage.com/whitepaper},
738
+ note = {Trained entirely from scratch on AMD Instinct MI300X
739
+ for the AMD Developer Hackathon}
740
+ }
741
+ ```
742
+
743
+ ---
744
+
745
+ ## 🔗 Links
746
+
747
+ | Resource | URL |
748
+ |:--|:--|
749
+ | 🔴 **Live Dashboard** | [sentinel.qubitpage.com](https://sentinel.qubitpage.com/) |
750
+ | 📄 **Whitepaper** | [sentinel.qubitpage.com/whitepaper](https://sentinel.qubitpage.com/whitepaper) |
751
+ | 🏆 **AMD Hackathon** | [lablab.ai](https://lablab.ai/ai-hackathons/amd-developer) |
752
+ | 🧠 **Φ Monitor** | [sentinel.qubitpage.com/#phi](https://sentinel.qubitpage.com/#phi) |
753
+
754
+ ---
755
+
756
+ <div align="center">
757
+
758
+ *Built with ❤️ on AMD MI300X — Every weight trained from scratch*
759
+
760
+ **Sentinel Prime (Frankenstein Edition): Rebuilt From the Inside Out**
761
+
762
+ </div>