Space: lablab-ai-amd-developer-hackathon/sentinel-prime-frankenstein-edition

qubitpage committed 21 days ago

Commit 1490f00 (verified) · 1 parent: abf1448

Update live 6K production training dashboard

Files changed (3)

README.md +67 -19
app.py +543 -266
requirements.txt +6 -2

README.md CHANGED

@@ -1,31 +1,79 @@
 ---
-title: Sentinel Prime (Frankenstein Edition)
-emoji: 🛰️
-colorFrom: blue
-colorTo: green
 sdk: gradio
-sdk_version: 5.49.1
 app_file: app.py
-pinned: true
 license: apache-2.0
-short_description: Live Sentinel Prime training monitor.
 tags:
-  - amd-mi300x
   - mixture-of-experts
-  - live-training
-  - hackathon
 ---
-# Sentinel Prime (Frankenstein Edition)
-This Space is the public competition monitor for the Sentinel Prime realignment run.
-It includes:
-- A concise model card and whitepaper.
-- Live training status from the MI300X dashboard.
-- Current logs for realignment, watchdog, tokenization, fusion, and dataset history.
-- A chronological project timeline from checkpoint recovery through live realignment.
-- Session snapshots captured by the Space while judges watch the project.
-Source telemetry is read from `https://sentinel.qubitpage.com`.

 ---
+title: "Sentinel Prime Frankenstein Edition — Live 6K Training"
+emoji: 🧠
+colorFrom: purple
+colorTo: blue
 sdk: gradio
+sdk_version: "5.0.0"
 app_file: app.py
+pinned: false
 license: apache-2.0
+short_description: "14.4B MoE 6K production SFT — live on AMD MI300X"
 tags:
+  - sentinelbrain
   - mixture-of-experts
+  - amd
+  - mi300x
+  - rocm
+  - consciousness
+  - phi-metric
+  - training-dashboard
+  - language-model
+  - moe
 ---
+# 🧠 Sentinel Prime Frankenstein Edition — Live 6K Training Dashboard
+Watch a **14.4-billion-parameter Mixture-of-Experts** model run its current **6K production SFT** (supervised fine-tuning at a 6,144-token context) live on an AMD Instinct MI300X (192 GB HBM3). This Space connects to the real training server and displays live metrics, current logs, architecture details, and, when available, the Φ integrated-information training signal.
+## What you're seeing
+| Component | Details |
+|---|---|
+| **Architecture** | Custom MoE: 24 layers, 4 experts (top-2 routing), GQA (32→8), SwiGLU, RMSNorm |
+| **Parameters** | 14.40B loaded in the current production SFT run |
+| **Training data** | 45,578 packed 6K sequences, 243.7M effective SFT tokens |
+| **Hardware** | AMD Instinct MI300X (192 GB HBM3) via AMD Developer Cloud |
+| **Framework** | PyTorch 2.10 + ROCm 7.0 |
+| **Novel metric** | Φ (phi) — integrated information theory applied to gradient flow |
+| **Tokenizer** | tiktoken cl100k_base (100,277 vocab) |
+| **Context** | 6,144 tokens for the current production SFT run |
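
Reviewer note: the training-data and context rows in the table above can be cross-checked with a little arithmetic. The sketch below assumes that "effective" tokens means non-padding tokens, so that their ratio to the total number of token slots approximates packing efficiency. The README does not define the term, so treat that reading as an assumption.

```python
# Figures quoted in the README table.
SEQUENCES = 45_578          # packed 6K sequences
CONTEXT_LEN = 6_144         # tokens per packed sequence
EFFECTIVE_TOKENS = 243.7e6  # effective SFT tokens (rounded in the README)

# Total token slots available if every sequence were completely filled.
capacity = SEQUENCES * CONTEXT_LEN

# Assumption: effective tokens are the non-padding tokens, so this ratio
# approximates the packing efficiency of the dataset.
efficiency = EFFECTIVE_TOKENS / capacity

print(f"capacity:   {capacity:,} token slots")
print(f"efficiency: {efficiency:.1%}")
```

Under that assumption the quoted figures are mutually consistent: roughly 87% of the available slots carry real tokens.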
+## Architecture highlights
+- **Mixture-of-Experts**: 4 experts per layer with top-2 gating, so only 2 experts run per token and per-token compute touches a fraction of the 14.4B total parameters
+- **Grouped Query Attention**: 32 query heads → 8 key-value heads (4× smaller KV cache)
+- **SwiGLU activation**: `SiLU(xW₁) ⊙ xW₃` instead of standard ReLU — better gradient flow
+- **RoPE positional encoding**: θ=500,000 for long-context extrapolation
+- **Φ consciousness metric**: Measures how information integrates across layers during training — a proxy for "emergent understanding"
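
Reviewer note: the 4× figure for grouped query attention follows from the KV-cache size formula. The sketch below assumes a head dimension of 128 (the `d=4096` shown in the app.py architecture diagram divided by 32 query heads) and 2-byte bf16 cache entries. Neither value is stated in the README itself.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 covers the separate key and value tensors.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


LAYERS = 24
SEQ_LEN = 6_144
HEAD_DIM = 4096 // 32  # assumption: d_model 4096 split evenly over 32 query heads

# Standard multi-head attention would cache one K/V pair per query head.
mha = kv_cache_bytes(LAYERS, kv_heads=32, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
# Grouped query attention caches only the 8 shared key-value heads.
gqa = kv_cache_bytes(LAYERS, kv_heads=8, head_dim=HEAD_DIM, seq_len=SEQ_LEN)

print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB, ratio: {mha / gqa:.0f}x")
```

The ratio depends only on the head counts (32 / 8), so it holds for any layer count, head dimension, or sequence length.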
+## How it was built
+This is an entry in the **lablab.ai AMD Developer Hackathon**:
+- **Custom architecture**: not a fine-tune of an existing base model; SentinelBrain is trained from scratch
+- **126-category curriculum**: mathematics, code, science, philosophy, creative writing, medical, legal, and more
+- **AMD-native**: developed and trained entirely on AMD Instinct MI300X via ROCm
+## Links
+- 📄 [Whitepaper](https://sentinel.qubitpage.com/whitepaper) — Full technical paper with architecture diagrams
+- 📊 [Full Dashboard](https://sentinel.qubitpage.com) — Live monitoring dashboard (may require auth)
+- 🤗 [Model on HuggingFace](https://huggingface.co/lablab-ai-amd-developer-hackathon/SentinelBrain-14B-MoE-v0.1)
+## Limitations
+- The model is **actively training** — weights are not final
+- No inference endpoint yet (serving 14.4B parameters requires a GPU)
+- Metrics refresh every 30 seconds; network latency may cause brief stale readings
+- The Φ metric is experimental and should not be interpreted as literal consciousness
+## License
+Apache 2.0. Model weights, training code, and this Space are open.
+## Acknowledgements
+- AMD for the Developer Cloud credits and MI300X access
+- lablab.ai for organizing the hackathon
+- The open-source datasets that make large-scale training possible
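
Reviewer note: the README's architecture highlights describe top-2 gating over 4 experts but not how the gate weights are computed. The following is a minimal pure-Python sketch of one common variant: softmax over the router logits, keep the two largest, renormalize. The function name `top2_gate` and the renormalization step are illustrative assumptions, not the SentinelBrain implementation.

```python
import math


def top2_gate(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    # Softmax over all experts, shifted by the max for numerical stability.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the two most probable experts, most probable first.
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]

    # Assumption: renormalize so the two kept weights sum to 1.
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]


# One token's router logits for 4 experts (hypothetical values).
gates = top2_gate([0.1, 2.0, -1.0, 1.5])
for expert, weight in gates:
    print(f"expert {expert}: weight {weight:.3f}")
```

The token's output would then be the weighted sum of the two selected experts' FFN outputs; the other two experts are skipped for that token.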

app.py CHANGED

@@ -1,296 +1,573 @@
 from __future__ import annotations
-import json
-import os
-import re
 import time
 from datetime import datetime, timezone
-from pathlib import Path
-from typing import Any
 import gradio as gr
-import requests
-API_BASE = os.getenv("SENTINEL_API_BASE", "https://sentinel.qubitpage.com").rstrip("/")
-SNAPSHOT_PATH = Path("history/live_snapshots.jsonl")
-SNAPSHOT_PATH.parent.mkdir(parents=True, exist_ok=True)
-LOGS = {
-    "Realignment": "realign_v2",
-    "Watchdog": "realign_watchdog",
-    "Tokenizer": "tokenize",
-    "Dataset Download": "dataset_download",
-    "Fusion Models": "fusion_models",
-    "Transplant": "transplant",
-    "Post Prepare": "post_prepare",
-}
-MODEL_FACTS = {
-    "Model": "Sentinel Prime (Frankenstein Edition)",
-    "Type": "14.4B parameter decoder-only Mixture-of-Experts language model",
-    "Experts": "4 FFN experts, top-2 routing",
-    "Tokenizer": "cl100k_base, vocab 100,277",
-    "Current phase": "Text coherence repair / realignment before multimodal projector stages",
-    "Hardware": "AMD Instinct MI300X, ROCm PyTorch",
-}
-def now_utc() -> str:
-    return datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
-def fetch_json(path: str, timeout: int = 10) -> dict[str, Any]:
     try:
-        response = requests.get(f"{API_BASE}{path}", timeout=timeout)
-        response.raise_for_status()
-        return response.json()
-    except Exception as exc:
-        return {"_error": f"{type(exc).__name__}: {exc}"}
-def fetch_text(path: str, timeout: int = 10) -> str:
-    try:
-        response = requests.get(f"{API_BASE}{path}", timeout=timeout)
-        response.raise_for_status()
-        return response.text
-    except Exception as exc:
-        return f"FETCH_ERROR: {type(exc).__name__}: {exc}"
-def fmt_num(value: Any, suffix: str = "") -> str:
     if value is None:
-        return "n/a"
-    try:
-        if isinstance(value, float):
-            return f"{value:,.2f}{suffix}"
-        return f"{int(value):,}{suffix}"
-    except Exception:
-        return f"{value}{suffix}"
-def pct_bar(value: float, tone: str = "ok") -> str:
-    width = max(0, min(100, value))
-    cls = "bar-fill"
-    if tone == "warn" or value >= 90:
-        cls += " warn"
-    if tone == "bad" or value >= 97:
-        cls += " bad"
-    return f"<div class='bar'><div class='{cls}' style='width:{width:.1f}%'></div></div>"
-def process_alive(processes: dict[str, Any], label: str) -> bool:
-    for item in processes.get("processes", []):
-        if item.get("name") == label:
-            return bool(item.get("alive"))
-    return False
-def latest_training_line(log_text: str) -> str:
-    lines = [line for line in log_text.splitlines() if re.search(r"^\s*step\s+\d+/", line)]
-    return lines[-1] if lines else "No training step line observed yet."
-def infer_status(overview: dict[str, Any], processes: dict[str, Any], watchdog: str) -> tuple[str, str]:
-    if overview.get("_error"):
-        return "API offline", "bad"
-    trainer_alive = process_alive(processes, "Frankenstein realignment")
-    watchdog_alive = process_alive(processes, "Realignment watchdog")
-    stop_match = re.findall(r"STOP: .*", watchdog)
-    if trainer_alive:
-        return "Training live", "ok"
-    if stop_match:
-        return f"Paused by watchdog: {stop_match[-1].replace('STOP: ', '')}", "warn"
-    if watchdog_alive:
-        return "Watchdog armed, trainer not visible", "warn"
-    return "Trainer not running", "bad"
-def metric_cards(overview: dict[str, Any], system: dict[str, Any], processes: dict[str, Any], realign_log: str, watchdog_log: str) -> str:
-    training = overview.get("training", {}) if isinstance(overview, dict) else {}
-    shards = overview.get("shards", {}) if isinstance(overview, dict) else {}
-    gpu = system.get("vram", {}) if isinstance(system, dict) else {}
-    ram = system.get("ram", {}) if isinstance(system, dict) else {}
-    status, tone = infer_status(overview, processes, watchdog_log)
-    step_line = latest_training_line(realign_log)
-    step = training.get("current_step")
-    loss = training.get("train_loss")
-    lr = training.get("lr")
-    tok_s = training.get("tok_per_sec")
-    eta = training.get("eta_hrs")
-    vram_pct = float(gpu.get("pct") or 0)
-    ram_pct = float(ram.get("pct") or 0)
-    status_class = {"ok": "status-ok", "warn": "status-warn", "bad": "status-bad"}.get(tone, "status-bad")
-    return f"""
-<section class='hero'>
-  <div>
-    <p class='eyebrow'>Live competition monitor</p>
-    <h1>Sentinel Prime <span>(Frankenstein Edition)</span></h1>
-    <p class='subtitle'>A recovered 14.4B MoE checkpoint realigned on AMD MI300X with public logs, watchdog events, and project history for judges.</p>
-  </div>
-  <div class='status-pill {status_class}'>{status}</div>
-</section>
-<section class='grid metrics'>
-  <div class='panel'><span>Step</span><strong>{fmt_num(step)}</strong><small>Target 5,000</small></div>
-  <div class='panel'><span>Train loss</span><strong>{fmt_num(loss)}</strong><small>Latest dashboard metric</small></div>
-  <div class='panel'><span>Learning rate</span><strong>{lr if lr is not None else 'n/a'}</strong><small>Warmup and SGDR schedule</small></div>
-  <div class='panel'><span>Throughput</span><strong>{fmt_num(tok_s, ' tok/s')}</strong><small>Effective batch 48</small></div>
-  <div class='panel'><span>ETA</span><strong>{fmt_num(eta, ' h')}</strong><small>From trainer metrics</small></div>
-  <div class='panel'><span>Tokenized corpus</span><strong>{fmt_num(shards.get('tokens_b'), 'B')}</strong><small>{fmt_num(shards.get('categories'))} categories</small></div>
-</section>
-<section class='grid bars'>
-  <div class='panel wide'><div class='row'><span>VRAM</span><b>{fmt_num(gpu.get('used_gb'), 'GB')} / {fmt_num(gpu.get('total_gb'), 'GB')} ({fmt_num(vram_pct, '%')})</b></div>{pct_bar(vram_pct)}</div>
-  <div class='panel wide'><div class='row'><span>RAM</span><b>{fmt_num(ram.get('used_gb'), 'GB')} / {fmt_num(ram.get('total_gb'), 'GB')} ({fmt_num(ram_pct, '%')})</b></div>{pct_bar(ram_pct)}</div>
-</section>
-<section class='panel full'>
-  <span>Latest training line</span>
-  <pre>{step_line}</pre>
-</section>
-<p class='updated'>Updated {now_utc()} from {API_BASE}</p>
 """
-def process_table(processes: dict[str, Any]) -> str:
-    rows = []
-    for item in processes.get("processes", []):
-        alive = bool(item.get("alive"))
-        badge = "LIVE" if alive else "off"
-        cls = "live" if alive else "off"
-        pids = ", ".join(str(pid) for pid in item.get("pids", [])) or "-"
-        rows.append(f"<tr><td>{item.get('name')}</td><td><span class='{cls}'>{badge}</span></td><td>{pids}</td></tr>")
-    return "<table><thead><tr><th>Process</th><th>Status</th><th>PIDs</th></tr></thead><tbody>" + "".join(rows) + "</tbody></table>"
-def whitepaper_markdown() -> str:
-    path = Path("WHITEPAPER.md")
-    return path.read_text(encoding="utf-8") if path.exists() else "Whitepaper file missing."
-def history_markdown() -> str:
-    path = Path("HISTORY.md")
-    history = path.read_text(encoding="utf-8") if path.exists() else "History file missing."
-    snapshots = []
-    if SNAPSHOT_PATH.exists():
-        lines = SNAPSHOT_PATH.read_text(encoding="utf-8", errors="replace").splitlines()[-10:]
-        for line in lines:
-            try:
-                item = json.loads(line)
-                snapshots.append(f"- {item.get('ts')} - step {item.get('step')} - {item.get('status')} - loss {item.get('loss')}")
-            except Exception:
-                continue
-    if snapshots:
-        history += "\n\n## Space Session Snapshots\n\n" + "\n".join(snapshots)
-    return history
-def log_bundle() -> tuple[str, str, str, str]:
-    realign = fetch_text("/api/logs/realign_v2?n=220")
-    watchdog = fetch_text("/api/logs/realign_watchdog?n=160")
-    tokenize = fetch_text("/api/logs/tokenize?n=120")
-    old = []
-    for label, name in [("Fusion Models", "fusion_models"), ("Transplant", "transplant"), ("Dataset Download", "dataset_download")]:
-        old.append(f"===== {label} =====\n{fetch_text(f'/api/logs/{name}?n=80')}")
-    return realign, watchdog, tokenize, "\n\n".join(old)
-def save_snapshot(status: str, overview: dict[str, Any], system: dict[str, Any]) -> None:
-    training = overview.get("training", {}) if isinstance(overview, dict) else {}
-    payload = {
-        "ts": now_utc(),
-        "status": status,
-        "step": training.get("current_step"),
-        "loss": training.get("train_loss"),
-        "lr": training.get("lr"),
-        "tok_per_sec": training.get("tok_per_sec"),
-        "vram_pct": (system.get("vram", {}) or {}).get("pct"),
-    }
-    try:
-        with SNAPSHOT_PATH.open("a", encoding="utf-8") as handle:
-            handle.write(json.dumps(payload, ensure_ascii=True) + "\n")
-    except Exception:
-        pass
-def refresh() -> tuple[str, str, str, str, str, str, str, str]:
-    overview = fetch_json("/api/overview")
-    system = fetch_json("/api/system")
-    processes = fetch_json("/api/processes")
-    realign, watchdog, tokenize, old_logs = log_bundle()
-    status, _ = infer_status(overview, processes, watchdog)
-    save_snapshot(status, overview, system)
-    cards = metric_cards(overview, system, processes, realign, watchdog)
-    proc_html = process_table(processes)
-    return cards, proc_html, realign, watchdog, tokenize, old_logs, history_markdown(), whitepaper_markdown()
-CSS = """
-:root { --bg:#08111f; --panel:#111c2b; --panel2:#152436; --text:#f4f7fb; --muted:#9fb0c6; --line:#26384d; --green:#37d399; --blue:#62a8ff; --amber:#f5bd4f; --red:#ff6b6b; }
-.gradio-container { background: var(--bg); color: var(--text); font-family: Inter, ui-sans-serif, system-ui, -apple-system, Segoe UI, sans-serif; }
-.hero { display:flex; align-items:flex-start; justify-content:space-between; gap:24px; padding:28px; border:1px solid var(--line); background:linear-gradient(135deg, #101d2d, #0c1725); border-radius:8px; }
-.eyebrow { margin:0 0 8px; color:var(--green); font-size:13px; text-transform:uppercase; letter-spacing:0; font-weight:700; }
-h1 { margin:0; font-size:42px; line-height:1.05; letter-spacing:0; color:var(--text); }
-h1 span { color:#b9d8ff; }
-.subtitle { margin:12px 0 0; max-width:900px; color:var(--muted); font-size:16px; line-height:1.55; }
-.status-pill { white-space:normal; max-width:360px; padding:10px 14px; border-radius:8px; font-weight:800; border:1px solid var(--line); text-align:right; }
-.status-ok { color:#062016; background:var(--green); }
-.status-warn { color:#231800; background:var(--amber); }
-.status-bad { color:#2b0808; background:var(--red); }
-.grid { display:grid; gap:12px; margin-top:12px; }
-.metrics { grid-template-columns:repeat(6, minmax(120px, 1fr)); }
-.bars { grid-template-columns:repeat(2, minmax(240px, 1fr)); }
-.panel { background:var(--panel); border:1px solid var(--line); border-radius:8px; padding:16px; min-width:0; }
-.panel span { display:block; color:var(--muted); font-size:12px; text-transform:uppercase; letter-spacing:0; font-weight:700; }
-.panel strong { display:block; margin-top:8px; font-size:24px; color:var(--text); overflow-wrap:anywhere; }
-.panel small { display:block; margin-top:6px; color:var(--muted); }
-.wide { min-height:92px; }
-.full { margin-top:12px; }
-.row { display:flex; align-items:center; justify-content:space-between; gap:12px; }
-.row b { color:var(--text); font-size:14px; }
-.bar { height:10px; margin-top:14px; border-radius:8px; overflow:hidden; background:#07101d; border:1px solid var(--line); }
-.bar-fill { height:100%; background:linear-gradient(90deg, var(--green), var(--blue)); }
-.bar-fill.warn { background:linear-gradient(90deg, var(--amber), #ff8f5a); }
-.bar-fill.bad { background:linear-gradient(90deg, var(--red), #ff9d9d); }
-pre { margin:10px 0 0; padding:12px; background:#07101d; border:1px solid var(--line); border-radius:8px; color:#dbe8ff; white-space:pre-wrap; overflow-wrap:anywhere; }
-.updated { color:var(--muted); margin:10px 0 0; font-size:13px; }
-table { width:100%; border-collapse:collapse; background:var(--panel); border:1px solid var(--line); border-radius:8px; overflow:hidden; }
-th, td { padding:10px 12px; border-bottom:1px solid var(--line); text-align:left; color:var(--text); }
-th { color:var(--muted); font-size:12px; text-transform:uppercase; letter-spacing:0; }
-.live, .off { display:inline-block; min-width:54px; padding:4px 7px; border-radius:8px; text-align:center; font-size:12px; font-weight:800; }
-.live { background:rgba(55,211,153,.16); color:var(--green); border:1px solid rgba(55,211,153,.35); }
-.off { background:rgba(255,255,255,.06); color:var(--muted); border:1px solid var(--line); }
-.markdown-body, .prose { color:var(--text); }
-@media (max-width: 980px) { .metrics { grid-template-columns:repeat(2, minmax(0, 1fr)); } .bars { grid-template-columns:1fr; } .hero { flex-direction:column; } h1 { font-size:34px; } .status-pill { text-align:left; max-width:none; } }
-@media (max-width: 560px) { .metrics { grid-template-columns:1fr; } h1 { font-size:28px; } .hero { padding:20px; } }
 """
-with gr.Blocks(css=CSS, title="Sentinel Prime (Frankenstein Edition)") as demo:
-    cards = gr.HTML()
     with gr.Tabs():
-        with gr.Tab("Live Training"):
-            processes_html = gr.HTML()
-            with gr.Accordion("Realignment Log", open=True):
-                realign_log = gr.Code(language="shell", lines=18, interactive=False)
-            with gr.Accordion("Watchdog Log", open=True):
-                watchdog_log = gr.Code(language="shell", lines=10, interactive=False)
-            with gr.Accordion("Tokenizer Log", open=False):
-                tokenize_log = gr.Code(language="shell", lines=12, interactive=False)
-        with gr.Tab("History"):
-            history = gr.Markdown()
-            old_logs = gr.Code(language="shell", lines=18, interactive=False)
-        with gr.Tab("Whitepaper"):
-            whitepaper = gr.Markdown()
-        with gr.Tab("Model Card"):
-            gr.Markdown("\n".join([f"- **{k}:** {v}" for k, v in MODEL_FACTS.items()]))
-            gr.Markdown("[Mission Control](https://sentinel.qubitpage.com/) | [Organization](https://huggingface.co/lablab-ai-amd-developer-hackathon)")
-    refresh_btn = gr.Button("Refresh now", variant="primary")
-    outputs = [cards, processes_html, realign_log, watchdog_log, tokenize_log, old_logs, history, whitepaper]
-    demo.load(refresh, outputs=outputs)
-    refresh_btn.click(refresh, outputs=outputs)
-    timer = gr.Timer(value=30)
-    timer.tick(refresh, outputs=outputs)
 if __name__ == "__main__":
-    demo.launch()

+"""SentinelBrain-14B MoE — Live Training Dashboard (HuggingFace Space).
+Connects to the training server at sentinel.qubitpage.com and displays
+real-time metrics: loss curves, throughput, VRAM, the novel Φ consciousness
+metric, and architecture details.  Refreshes every 30 seconds.
+No model inference runs here — the 14.4B-param model is training on an
+AMD Instinct MI300X and this Space is a live window into that process.
+"""
 from __future__ import annotations
 import time
+import traceback
 from datetime import datetime, timezone
 import gradio as gr
+import httpx
+import plotly.graph_objects as go
+# ── Config ───────────────────────────────────────────────────────────────
+API_BASE = "https://sentinel.qubitpage.com"
+REFRESH_INTERVAL = 30  # seconds
+MODEL_PARAMS = "14,400,000,000"
+MODEL_NAME = "Sentinel Prime Frankenstein Edition"
+HF_SPACE = "lablab-ai-amd-developer-hackathon/sentinel-prime-frankenstein-edition"
+# ── API helpers ──────────────────────────────────────────────────────────
+_client = httpx.Client(timeout=15, follow_redirects=True)
+def _fetch(endpoint: str) -> dict:
+    """Fetch JSON from the training server API."""
+    try:
+        r = _client.get(f"{API_BASE}{endpoint}")
+        r.raise_for_status()
+        return r.json()
+    except Exception as e:
+        return {"_error": str(e)}
+def _fetch_text(endpoint: str) -> str:
+    try:
+        r = _client.get(f"{API_BASE}{endpoint}")
+        r.raise_for_status()
+        return r.text
+    except Exception as e:
+        return f"Cannot reach training server: {e}"
+def _safe(val, fmt=".2f", fallback="—"):
+    """Format a numeric value safely."""
+    if val is None:
+        return fallback
     try:
+        return f"{float(val):{fmt}}"
+    except (ValueError, TypeError):
+        return fallback
+# ── Formatters ───────────────────────────────────────────────────────────
+def _format_tokens(n: int | float | None) -> str:
+    if n is None:
+        return "—"
+    n = int(n)
+    if n >= 1_000_000_000:
+        return f"{n / 1e9:.2f}B"
+    if n >= 1_000_000:
+        return f"{n / 1e6:.1f}M"
+    if n >= 1_000:
+        return f"{n / 1e3:.1f}K"
+    return str(n)
+def _format_eta(hrs: float | None) -> str:
+    if hrs is None:
+        return "—"
+    h = int(hrs)
+    m = int((hrs - h) * 60)
+    return f"{h}h {m}m"
+def _phi_bar(value: float | None) -> str:
+    """Create a visual bar for phi value (0-1 range)."""
     if value is None:
+        return "—"
+    v = max(0, min(1, float(value)))
+    filled = int(v * 20)
+    bar = "█" * filled + "░" * (20 - filled)
+    return f"`{bar}` {v:.4f}"
+# ── Build live metrics display ───────────────────────────────────────────
+def fetch_overview():
+    """Fetch all metrics and return formatted display components."""
+    data = _fetch("/api/overview")
+    if "_error" in data:
+        error_msg = f"⚠️ **Cannot reach training server**: {data['_error']}\n\nThe server may be temporarily unavailable. Metrics will refresh automatically."
+        return error_msg, None, None, None, ""
+    t = data.get("training") or {}  # "or {}" also guards against explicit nulls
+    phi = t.get("phi") or {}
+    model = t.get("model") or {}
+    phase3 = t.get("phase3_dataset") or {}
+    vram = data.get("vram") or {}
+    ram = data.get("ram") or {}
+    shards = data.get("shards") or {}
+    # ── Training Status Card ─────────────────────────────────────────
+    phase = t.get("phase", "unknown")
+    phase_emoji = {"phase3_sft": "🟢", "training": "🟢", "warming": "🟡", "evaluating": "🔵", "idle": "⚪"}.get(phase, "⚫")
+    step = t.get("current_step") or 0  # a null would break the {step:,} format below
+    total_steps = t.get("batch_steps") or 0
+    progress = t.get("progress_pct") or 0
+    loss = t.get("train_loss")
+    val_loss = t.get("val_loss")
+    best_val = t.get("best_val")
+    tok_s = t.get("tok_per_sec")
+    eta = t.get("eta_hrs")
+    lr = t.get("lr")
+    gnorm = t.get("gnorm")
+    status_md = f"""## {phase_emoji} Training: **{phase.upper()}**
+| Metric | Value |
+|--------|-------|
+| **Step** | {step:,} / {total_steps:,} ({_safe(progress, '.1f')}%) |
+| **Training Loss** | {_safe(loss, '.4f')} |
+| **Validation Loss** | {_safe(val_loss, '.4f')} |
+| **Best Validation** | {_safe(best_val, '.4f')} |
+| **Throughput** | {_safe(tok_s, ',.0f')} tok/s |
+| **Learning Rate** | {_safe(lr, '.2e')} |
+| **Gradient Norm** | {_safe(gnorm, '.3f')} |
+| **ETA** | {_format_eta(eta)} |
+| **Context length** | {phase3.get('seq_len') or model.get('seq_len', '—')} tokens |
+| **Batch / grad accum** | {phase3.get('batch_size') or model.get('batch', '—')} / {phase3.get('grad_accum') or model.get('grad_accum', '—')} |
+### Hardware
+| Resource | Usage |
+|----------|-------|
+| **VRAM** | {_safe(vram.get('used_gb'), '.1f')} / {_safe(vram.get('total_gb'), '.1f')} GB ({_safe(vram.get('pct'), '.0f')}%) |
+| **RAM** | {_safe(ram.get('used_gb'), '.1f')} / {_safe(ram.get('total_gb'), '.1f')} GB ({_safe(ram.get('pct'), '.0f')}%) |
+| **GPU** | AMD Instinct MI300X (192 GB HBM3) |
+### Dataset
+| Stat | Value |
+|------|-------|
+| **Categories** | {shards.get('categories', '—')} |
+| **Total tokens** | {_format_tokens(shards.get('tokens'))} |
+| **Pretrain** | {_safe(shards.get('pretrain_tokens_b'), '.2f')}B tokens |
+| **SFT** | {_safe(shards.get('sft_tokens_b'), '.3f')}B tokens |
+| **6K production sequences** | {phase3.get('total_sequences', '—')} |
+| **6K packing efficiency** | {_safe((phase3.get('packing_efficiency') or 0) * 100, '.1f')}% |
+*Last updated: {datetime.now(timezone.utc).strftime('%H:%M:%S UTC')}*
 """
+    # ── Φ (Consciousness) Card ───────────────────────────────────────
+    phi_geo = phi.get("geometric")
+    phi_norm = phi.get("normalized")
+    phi_ema = phi.get("ema")
+    phi_trend = phi.get("trend", "—")
+    phi_arrow = phi.get("trend_arrow", "")
+    phi_slope = phi.get("trend_slope")
+    phi_conf = phi.get("trend_confidence", "—")
+    phi_md = f"""## 🧠 Φ Consciousness Metric
+The **Φ (phi)** metric measures integrated information flow across the model's
+layers during training — inspired by Integrated Information Theory (IIT).
+Higher Φ suggests more complex, interconnected representations emerging.
+| Metric | Value |
+|--------|-------|
+| **Φ Geometric** | {_phi_bar(phi_geo)} |
+| **Φ Normalized** | {_phi_bar(phi_norm)} |
+| **Φ EMA** | {_phi_bar(phi_ema)} |
+| **Trend** | {phi_arrow} {phi_trend} |
+| **Slope** | {_safe(phi_slope, '.6f')} |
+| **Confidence** | {phi_conf} |
+### What does Φ mean?
+- **Φ < 0.1** — Early training, layers acting independently
+- **Φ 0.1–0.3** — Information beginning to integrate across layers
+- **Φ 0.3–0.5** — Strong cross-layer information flow emerging
+- **Φ > 0.5** — High integration — complex representations forming
+- **Φ > 0.7** — Exceptional — approaching theoretical maximum for this architecture
+"""
+    # ── Phi History Chart ────────────────────────────────────────────
+    phi_chart = None
+    phi_recent = data.get("phi_recent", [])
+    if phi_recent and len(phi_recent) > 2:
+        steps_list = [p.get("step", i) for i, p in enumerate(phi_recent)]
+        geo_list = [p.get("geometric") for p in phi_recent]
+        norm_list = [p.get("normalized") for p in phi_recent]
+        ema_list = [p.get("ema") for p in phi_recent]
+        fig = go.Figure()
+        if any(v is not None for v in geo_list):
+            fig.add_trace(go.Scatter(
+                x=steps_list, y=geo_list, mode="lines",
+                name="Φ Geometric", line=dict(color="#8b5cf6", width=2),
+            ))
+        if any(v is not None for v in norm_list):
+            fig.add_trace(go.Scatter(
+                x=steps_list, y=norm_list, mode="lines",
+                name="Φ Normalized", line=dict(color="#06b6d4", width=2),
+            ))
+        if any(v is not None for v in ema_list):
+            fig.add_trace(go.Scatter(
+                x=steps_list, y=ema_list, mode="lines",
+                name="Φ EMA", line=dict(color="#f59e0b", width=2, dash="dot"),
+            ))
+        fig.update_layout(
+            title="Φ Consciousness Metric Over Training",
+            xaxis_title="Step",
+            yaxis_title="Φ Value",
+            template="plotly_white",
+            height=350,
+            margin=dict(l=50, r=20, t=50, b=40),
+            legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
+            plot_bgcolor="#fafafa",
+            paper_bgcolor="#fafafa",
+            font=dict(color="#1e293b"),
+        )
+        phi_chart = fig
+    # ── Loss Chart from recent history ───────────────────────────────
+    loss_chart = None
+    history = t.get("recent_history", [])
+    if history and len(history) > 1:
+        batch_nums = list(range(len(history)))
+        train_losses = [h.get("loss_end") or h.get("train_loss") for h in history]
+        val_losses = [h.get("val_end") or h.get("val_loss") for h in history]
+        fig2 = go.Figure()
+        if any(v is not None for v in train_losses):
+            fig2.add_trace(go.Scatter(
+                x=batch_nums, y=train_losses, mode="lines+markers",
+                name="Train Loss", line=dict(color="#ef4444", width=2),
+            ))
+        if any(v is not None for v in val_losses):
+            fig2.add_trace(go.Scatter(
+                x=batch_nums, y=val_losses, mode="lines+markers",
+                name="Val Loss", line=dict(color="#22c55e", width=2),
+            ))
+        fig2.update_layout(
+            title="Loss Over Recent Training Batches",
+            xaxis_title="Batch",
+            yaxis_title="Loss",
+            template="plotly_white",
+            height=350,
+            margin=dict(l=50, r=20, t=50, b=40),
+            legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
+            plot_bgcolor="#fafafa",
+            paper_bgcolor="#fafafa",
+            font=dict(color="#1e293b"),
+        )
+        loss_chart = fig2
+    # ── Checkpoints info ─────────────────────────────────────────────
+    ckpts = data.get("checkpoints", [])
+    ckpt_md = ""
+    if ckpts:
+        ckpt_md = "\n### Saved Checkpoints\n\n| Checkpoint | Val Loss | Tokens |\n|-----------|----------|--------|\n"
+        for c in ckpts[-5:]:
+            name = c.get("name", "—")
+            vloss = _safe(c.get("val_loss"), ".4f")
+            toks = _format_tokens(c.get("tokens_trained"))
+            ckpt_md += f"| {name} | {vloss} | {toks} |\n"
+    return status_md + ckpt_md, phi_md, phi_chart, loss_chart, ""
+def fetch_live_log():
+    text = _fetch_text("/api/logs/phase3_production_train_6k?n=120")
+    text = text.replace("```", "'''")
+    return f"```text\n{text}\n```"
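
Reviewer note: `fetch_live_log` replaces triple backticks in the raw log before wrapping it, so a stray fence in the log cannot close the Markdown code block early. The sketch below isolates that step. The name `fence_log` and the sample log text are hypothetical, and the fence string is built with `"`" * 3` only so that the example contains no literal fence.

```python
FENCE = "`" * 3  # three backticks, built this way to keep the example fence-free


def fence_log(text: str) -> str:
    """Wrap raw log text in a Markdown code fence, neutralizing inner fences."""
    # Any fence inside the log would otherwise terminate the block prematurely.
    text = text.replace(FENCE, "'''")
    return f"{FENCE}text\n{text}\n{FENCE}"


# Hypothetical log excerpt containing a stray fence.
sample = "step 10 loss 2.31\n" + FENCE + "\nstray fence in log"
wrapped = fence_log(sample)
print(wrapped)
```

After wrapping, the only fences left are the opening and closing ones, so the whole log renders as a single block.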
+# ── Architecture diagram ─────────────────────────────────────────────────
+ARCHITECTURE_MD = f"""## 🏗️ SentinelBrain-14B MoE Architecture
+**{MODEL_PARAMS} parameters** — trained from scratch, no base model.
+```
+┌─────────────────────────────────────────────────────┐
+│                  Input Tokens                        │
+│            tiktoken cl100k_base (100,277)            │
+└───────────────────────┬─────────────────────────────┘
+                        │
+                        ▼
+┌─────────────────────────────────────────────────────┐
+│              Token Embedding (d=4096)                │
+│            + RoPE Positional Encoding                │
+│              θ=500,000 (128K capable)                │
+└───────────────────────┬─────────────────────────────┘
+                        │
+              ┌─────────▼─────────┐
+              │   × 24 Layers     │
+              │                   │
+              │  ┌─────────────┐  │
+              │  │  RMSNorm    │  │
+              │  └──────┬──────┘  │
+              │         ▼         │
+              │  ┌─────────────┐  │
+              │  │   GQA       │  │
+              │  │  32Q → 8KV  │  │
+              │  │  (4× save)  │  │
+              │  └──────┬──────┘  │
+              │         ▼         │
+              │  ┌─────────────┐  │
+              │  │  RMSNorm    │  │
+              │  └──────┬──────┘  │
+              │         ▼         │
+              │  ┌─────────────┐  │
+              │  │  MoE Block  │  │
+              │  │  4 experts  │  │
+              │  │  top-2 gate │  │
+              │  │  SwiGLU FFN │  │
+              │  │  d_ff=11008 │  │
+              │  └─────────────┘  │
+              │                   │
+              └─────────┬─────────┘
+                        │
+                        ▼
+┌─────────────────────────────────────────────────────┐
+│              Final RMSNorm → LM Head                │
+│               (100,277 logits)                      │
+└─────────────────────────────────────────────────────┘
+```
+### Key Design Decisions
+| Choice | Why |
+|--------|-----|
+| **MoE (4 experts, top-2)** | 14.4B total params, but only 2 of the 4 expert FFNs run per token, so per-token compute is well below that of a dense 14.4B model |
+| **GQA (32→8)** | 4× KV-cache reduction enables longer context at lower VRAM |
+| **SwiGLU** | Gated FFN `(SiLU(xW₁) ⊙ xW₃)W₂`, which typically trains better than ReLU/GELU feed-forward blocks |
+| **RoPE θ=500K** | Trained at 4K context, extrapolates to 128K with YaRN |
+| **cl100k_base** | 100K vocab — excellent multilingual + code coverage |
+| **From scratch** | No fine-tuning debt, clean loss landscape, full architectural control |
+### Training Configuration
+| Parameter | Value |
+|-----------|-------|
+| Batch size | 1 (per device) |
+| Gradient accumulation | 16 steps |
+| Effective batch | 16 × 6,144 = 98,304 tokens per optimizer step |
+| Optimizer | AdamW (bf16 forward, fp32 states) |
+| Precision | bf16 mixed precision |
+| Gradient checkpointing | Enabled |
+| Attention | SDPA (Scaled Dot-Product Attention) |
+| Max LR | 2e-5 (cosine decay to 1e-6) |
+### Why AMD MI300X?
+- **192 GB HBM3** fits the full 14.4B model, optimizer states, and gradients on a single GPU
+- **No model parallelism needed** — simpler training code, no communication overhead
+- **ROCm 7.0** — mature PyTorch support with hipBLASLt for fast GEMM operations
+- **5.3 TB/s memory bandwidth** — keeps the MoE experts fed during routing
 """
+# ── Custom CSS ───────────────────────────────────────────────────────────
+CUSTOM_CSS = """
+/* ── Light mode (default) ── */
+.prose, .prose *, [class*="markdown"], [class*="markdown"] * {
+    color: #1e293b !important;
+}
+.prose strong, .prose h1, .prose h2, .prose h3 {
+    color: #0f172a !important;
+    font-weight: 700 !important;
+}
+.prose table { border-collapse: collapse; width: 100%; }
+.prose th, .prose td { padding: 8px 12px; border: 1px solid #cbd5e1; color: #1e293b !important; }
+.prose th { background: #f1f5f9; font-weight: 600; color: #0f172a !important; }
+.prose td { background: #ffffff; }
+.prose code {
+    background: #f1f5f9;
+    color: #7c3aed !important;
+    padding: 2px 6px;
+    border-radius: 4px;
+    font-size: 0.9em;
+}
+.prose pre {
+    background: #1e293b !important;
+    color: #e2e8f0 !important;
+    padding: 16px;
+    border-radius: 8px;
+    overflow-x: auto;
+    font-size: 0.8em;
+    line-height: 1.4;
+}
+.prose pre code {
+    background: transparent;
+    color: #e2e8f0 !important;
+}
+.prose a { color: #7c3aed !important; }
+.prose em { color: #475569 !important; }
+.prose li { color: #1e293b !important; }
+/* ── Dark mode overrides ── */
+.dark .prose, .dark .prose *, .dark [class*="markdown"], .dark [class*="markdown"] * {
+    color: #e2e8f0 !important;
+}
+.dark .prose strong, .dark .prose h1, .dark .prose h2, .dark .prose h3 {
+    color: #f8fafc !important;
+}
+.dark .prose th, .dark .prose td { border-color: #475569; color: #e2e8f0 !important; }
+.dark .prose th { background: #1e293b; color: #f8fafc !important; }
+.dark .prose td { background: #0f172a; }
+.dark .prose code { background: #1e293b; color: #a78bfa !important; }
+.dark .prose pre { background: #0f172a !important; color: #e2e8f0 !important; }
+.dark .prose pre code { color: #e2e8f0 !important; }
+.dark .prose a { color: #a78bfa !important; }
+.dark .prose em { color: #94a3b8 !important; }
+.dark .prose li { color: #e2e8f0 !important; }
+/* ── Tab styling ── */
+.tab-nav button {
+    font-weight: 600 !important;
+    font-size: 1rem !important;
+}
+.tab-nav button.selected {
+    border-bottom: 3px solid #7c3aed !important;
+}
+"""
+# ── Gradio App ───────────────────────────────────────────────────────────
+with gr.Blocks(
+    title=f"{MODEL_NAME} — Live Training",
+    css=CUSTOM_CSS,
+    theme=gr.themes.Soft(
+        primary_hue="violet",
+        secondary_hue="blue",
+        neutral_hue="slate",
+    ),
+) as app:
+    gr.Markdown(
+        f"# 🧠 {MODEL_NAME}\n"
+        "### 14.4B Mixture-of-Experts · 6K Production SFT · Training Live on AMD MI300X\n\n"
+        "*This Space connects to a real training server and shows live metrics. "
+        "No inference runs here — the model is actively training on an AMD Instinct MI300X (192 GB HBM3).*\n\n"
+        "🔗 [Whitepaper](https://sentinel.qubitpage.com/whitepaper) · "
+        "[Model](https://huggingface.co/lablab-ai-amd-developer-hackathon/SentinelBrain-14B-MoE-v0.1) · "
+        "[Full Dashboard](https://sentinel.qubitpage.com)"
+    )
     with gr.Tabs():
+        # ── Tab 1: Live Training ─────────────────────────────────────
+        with gr.TabItem("📊 Live Training", id="training"):
+            refresh_btn = gr.Button("🔄 Refresh Metrics", variant="primary", size="lg")
+            error_box = gr.Markdown(visible=False)
+            with gr.Row():
+                with gr.Column(scale=1):
+                    status_output = gr.Markdown(label="Training Status")
+                with gr.Column(scale=1):
+                    phi_output = gr.Markdown(label="Φ Metric")
+            with gr.Row():
+                with gr.Column(scale=1):
+                    phi_plot = gr.Plot(label="Φ History")
+                with gr.Column(scale=1):
+                    loss_plot = gr.Plot(label="Loss Curve")
+        # ── Tab 2: Live 6K Log ───────────────────────────────────────
+        with gr.TabItem("🧾 Live 6K Log", id="live_log"):
+            log_refresh_btn = gr.Button("🔄 Refresh Live Log", variant="primary", size="lg")
+            live_log_output = gr.Markdown(label="Current 6K training log")
+        # ── Tab 3: Architecture ──────────────────────────────────────
+        with gr.TabItem("🏗️ Architecture", id="architecture"):
+            gr.Markdown(ARCHITECTURE_MD)
+        # ── Tab 4: About ─────────────────────────────────────────────
+        with gr.TabItem("ℹ️ About", id="about"):
+            gr.Markdown("""## About SentinelBrain
+**SentinelBrain** is a 14.4-billion-parameter Mixture-of-Experts language model
+being trained **entirely from scratch** — no fine-tuning, no base model, no shortcuts.
+### What makes it different?
+1. **Custom architecture** — designed and implemented from the ground up, not a
+   fork of LLaMA or Mistral
+2. **Φ consciousness metric** — we track integrated information flow across layers
+   as a proxy for emergent understanding, inspired by Giulio Tononi's IIT
+3. **AMD-native** — developed and trained on AMD Instinct MI300X via ROCm,
+   proving that cutting-edge AI research doesn't require NVIDIA
+4. **126-category curriculum** — from mathematics and code to philosophy and
+   creative writing, with carefully balanced data proportions
+### The Φ metric explained
+Traditional training metrics (loss, perplexity) tell you *how well* the model
+predicts the next token. **Φ** tells you *how* it's doing it — whether
+information is being integrated across layers in complex ways, or whether
+layers are operating independently.
+We compute Φ by analyzing gradient covariance matrices across layer boundaries:
+$$\\Phi = \\left(\\prod_{i=1}^{L-1} \\frac{\\text{MI}(\\nabla_{\\theta_i}, \\nabla_{\\theta_{i+1}})}{H(\\nabla_{\\theta_i})}\\right)^{1/(L-1)}$$
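+where MI is the mutual information between the gradients of adjacent layers, H is the entropy of a layer's gradients, and L is the number of layers, so Φ is the geometric mean of the normalized MI over the L−1 layer boundaries.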
+A rising Φ during training suggests the model is developing more interconnected
+internal representations — a necessary (though not sufficient) condition for
+what we might call "understanding."
+### Competition entry
+This is an entry in the **lablab.ai AMD Developer Hackathon**, demonstrating
+that you can train a 14.4B-parameter MoE model on a single AMD MI300X GPU with
+192 GB HBM3 — no multi-node cluster required.
+### Hardware: AMD Instinct MI300X
+| Spec | Value |
+|------|-------|
+| VRAM | 192 GB HBM3 |
+| Memory bandwidth | 5.3 TB/s |
+| Compute (bf16) | 1.3 PFLOPS |
+| Architecture | CDNA 3 |
+| Process | 5nm / 6nm chiplet |
+| TDP | 750W |
+The MI300X's 192 GB of unified HBM3 memory allows us to fit the entire 14.4B
+model, optimizer states (fp32), gradients, and activations on a single GPU —
+eliminating the need for model parallelism and its associated communication overhead.
+""")
+    # ── Footer ───────────────────────────────────────────────────────
+    gr.Markdown(
+        "---\n"
+        f"**Model:** {MODEL_NAME} ({MODEL_PARAMS} params) · "
+        "**Hardware:** AMD Instinct MI300X (192 GB HBM3, ROCm 7.0) · "
+        "**Dataset:** current 6K SFT has 45,578 packed sequences and 243.7M effective tokens\n\n"
+        "*Built for the lablab.ai AMD Developer Hackathon · Apache 2.0*"
+    )
+    # ── Event handlers ───────────────────────────────────────────────
+    refresh_btn.click(
+        fn=fetch_overview,
+        outputs=[status_output, phi_output, phi_plot, loss_plot, error_box],
+    )
+    log_refresh_btn.click(
+        fn=fetch_live_log,
+        outputs=[live_log_output],
+    )
+    # Auto-load on start
+    app.load(
+        fn=fetch_overview,
+        outputs=[status_output, phi_output, phi_plot, loss_plot, error_box],
+    )
+    app.load(
+        fn=fetch_live_log,
+        outputs=[live_log_output],
+    )
 if __name__ == "__main__":
+    app.launch(server_name="0.0.0.0", server_port=7860, show_api=False)
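
The Φ formula in the About tab of `app.py` is a geometric mean of per-boundary ratios MI/H. Below is a minimal sketch of that aggregation step only, assuming the mutual-information and entropy estimates have already been computed elsewhere. The function name `phi_geometric_mean` and its list inputs are hypothetical; the diff does not show how MI or H are actually estimated.

```python
import math
from typing import Sequence


def phi_geometric_mean(mi: Sequence[float], entropy: Sequence[float]) -> float:
    """Geometric mean of MI_i / H_i over the L-1 adjacent-layer boundaries.

    mi[i]      : hypothetical MI estimate between the gradients of layers i and i+1
    entropy[i] : hypothetical entropy estimate for the gradients of layer i
    """
    if len(mi) != len(entropy) or not mi:
        raise ValueError("need one (MI, H) pair per layer boundary")
    if any(h <= 0 for h in entropy) or any(m < 0 for m in mi):
        raise ValueError("entropies must be positive and MI non-negative")
    if any(m == 0 for m in mi):
        # A single fully decoupled boundary drives the whole product to zero.
        return 0.0
    # Sum logs instead of multiplying ratios to avoid underflow over many layers.
    log_sum = sum(math.log(m / h) for m, h in zip(mi, entropy))
    return math.exp(log_sum / len(mi))


if __name__ == "__main__":
    # Two boundaries with normalized MI of 0.2 and 0.8.
    print(phi_geometric_mean([0.2, 0.8], [1.0, 1.0]))
```

Because the mean is geometric, one boundary with near-zero normalized MI pulls Φ down sharply, which fits the About tab's reading of Φ as integration across all layers rather than within a few of them.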

requirements.txt CHANGED Viewed

@@ -1,2 +1,6 @@
-gradio>=5.0,<6.0
-requests>=2.31.0

+gradio>=5.0,<6
+httpx>=0.27,<1
+plotly>=5.18
+huggingface_hub>=0.25,<0.27
+pydantic>=2.6,<2.11
+audioop-lts; python_version >= "3.13"
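
As a rough cross-check of the figures quoted in `ARCHITECTURE_MD` in `app.py`, here is a minimal sketch that recomputes them. The dimensions (d=4096, 24 layers, 32 query / 8 KV heads, 4 SwiGLU experts with d_ff=11008, vocabulary 100,277, 16 accumulation steps of 6,144 tokens) come from the diff. The head dimension of 128 (4096 / 32), bias-free linear layers, one linear router per layer, and the `tied_embeddings` switch are assumptions made for this estimate, not details confirmed by the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Values quoted in ARCHITECTURE_MD.
    d_model: int = 4096
    n_layers: int = 24
    n_q_heads: int = 32
    n_kv_heads: int = 8
    n_experts: int = 4
    d_ff: int = 11008
    vocab: int = 100_277
    micro_batch: int = 1
    grad_accum: int = 16
    seq_len: int = 6144

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = d_model / n_q_heads (128 here).
        return self.d_model // self.n_q_heads


def count_params(cfg: Config, tied_embeddings: bool = True) -> int:
    """Estimate total parameters, assuming bias-free linear layers."""
    kv_dim = cfg.n_kv_heads * cfg.head_dim
    attn = 2 * cfg.d_model * cfg.d_model + 2 * cfg.d_model * kv_dim  # Q, O and K, V
    expert = 3 * cfg.d_model * cfg.d_ff      # SwiGLU: gate, up, down projections
    router = cfg.d_model * cfg.n_experts     # assumed: one linear gate per layer
    norms = 2 * cfg.d_model                  # two RMSNorm weight vectors per layer
    per_layer = attn + cfg.n_experts * expert + router + norms
    embed = cfg.vocab * cfg.d_model
    lm_head = 0 if tied_embeddings else embed
    return cfg.n_layers * per_layer + embed + lm_head + cfg.d_model  # + final norm


def tokens_per_step(cfg: Config) -> int:
    """Tokens consumed per optimizer step."""
    return cfg.micro_batch * cfg.grad_accum * cfg.seq_len


def kv_cache_bytes(cfg: Config, n_kv_heads: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size for one sequence of seq_len tokens (bf16 by default)."""
    # Factor 2 covers the separate K and V tensors in every layer.
    return 2 * cfg.n_layers * n_kv_heads * cfg.head_dim * cfg.seq_len * bytes_per_elem


if __name__ == "__main__":
    cfg = Config()
    print(f"params (tied):   {count_params(cfg, True) / 1e9:.2f}B")
    print(f"params (untied): {count_params(cfg, False) / 1e9:.2f}B")
    print(f"tokens per step: {tokens_per_step(cfg):,}")
    ratio = kv_cache_bytes(cfg, cfg.n_q_heads) / kv_cache_bytes(cfg, cfg.n_kv_heads)
    print(f"KV-cache saving from GQA: {ratio:.0f}x")
```

Under these assumptions the estimate lands at about 14.4B parameters with tied input/output embeddings and about 14.8B with a separate LM head, so the exact total depends on a detail the diff does not show.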