qubitpage committed on
Commit
1490f00
·
verified ·
1 Parent(s): abf1448

Update live 6K production training dashboard

Files changed (3)
  1. README.md +67 -19
  2. app.py +543 -266
  3. requirements.txt +6 -2
README.md CHANGED
@@ -1,31 +1,79 @@
1
  ---
2
- title: Sentinel Prime (Frankenstein Edition)
3
- emoji: 🛰️
4
- colorFrom: blue
5
- colorTo: green
6
  sdk: gradio
7
- sdk_version: 5.49.1
8
  app_file: app.py
9
- pinned: true
10
  license: apache-2.0
11
- short_description: Live Sentinel Prime training monitor.
12
  tags:
13
- - amd-mi300x
14
  - mixture-of-experts
15
- - live-training
16
- - hackathon
17
  ---
18
 
19
- # Sentinel Prime (Frankenstein Edition)
20
 
21
- This Space is the public competition monitor for the Sentinel Prime realignment run.
22
 
23
- It includes:
24
 
25
- - A concise model card and whitepaper.
26
- - Live training status from the MI300X dashboard.
27
- - Current logs for realignment, watchdog, tokenization, fusion, and dataset history.
28
- - A chronological project timeline from checkpoint recovery through live realignment.
29
- - Session snapshots captured by the Space while judges watch the project.
30
 
31
- Source telemetry is read from `https://sentinel.qubitpage.com`.
1
  ---
2
+ title: "Sentinel Prime Frankenstein Edition — Live 6K Training"
3
+ emoji: 🧠
4
+ colorFrom: purple
5
+ colorTo: blue
6
  sdk: gradio
7
+ sdk_version: "5.0.0"
8
  app_file: app.py
9
+ pinned: false
10
  license: apache-2.0
11
+ short_description: "14.4B MoE 6K production SFT — live on AMD MI300X"
12
  tags:
13
+ - sentinelbrain
14
  - mixture-of-experts
15
+ - amd
16
+ - mi300x
17
+ - rocm
18
+ - consciousness
19
+ - phi-metric
20
+ - training-dashboard
21
+ - language-model
22
+ - moe
23
  ---
24
 
25
+ # 🧠 Sentinel Prime Frankenstein Edition — Live 6K Training Dashboard
26
 
27
+ Watch a **14.4-billion-parameter Mixture-of-Experts** model running the current **6K production SFT** live on an AMD Instinct MI300X (192 GB HBM3). This Space connects to the real training server and displays live metrics, current logs, architecture details, and the Φ integrated-information training signal when available.
28
 
29
+ ## What you're seeing
30
 
31
+ | Component | Details |
32
+ |---|---|
33
+ | **Architecture** | Custom MoE: 24 layers, 4 experts (top-2 routing), GQA (32→8), SwiGLU, RMSNorm |
34
+ | **Parameters** | 14.40B loaded in the current production SFT run |
35
+ | **Training data** | 45,578 packed 6K sequences, 243.7M effective SFT tokens |
36
+ | **Hardware** | AMD Instinct MI300X (192 GB HBM3) via AMD Developer Cloud |
37
+ | **Framework** | PyTorch 2.10 + ROCm 7.0 |
38
+ | **Novel metric** | Φ (phi) — integrated information theory applied to gradient flow |
39
+ | **Tokenizer** | tiktoken cl100k_base (100,277 vocab) |
40
+ | **Context** | 6,144 tokens for the current production SFT run |
41
 
42
+ ## Architecture highlights
43
+
44
+ - **Mixture-of-Experts**: 4 experts per layer, top-2 gating — only 2 experts active per token, giving 14.4B total params with efficient active-parameter routing
45
+ - **Grouped Query Attention**: 32 query heads → 8 key-value heads (4× smaller KV cache)
46
+ - **SwiGLU activation**: `SiLU(xW₁) ⊙ xW₃` instead of standard ReLU — better gradient flow
47
+ - **RoPE positional encoding**: θ=500,000 for long-context extrapolation
48
+ - **Φ consciousness metric**: Measures how information integrates across layers during training — a proxy for "emergent understanding"
49
+
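Reviewer note: the top-2 gating described in the Architecture highlights bullets can be sketched in a few lines of Python. This is a minimal illustration, not SentinelBrain's actual router. The function name `top2_route`, the example logits, and the choice to renormalize the two selected weights so that they sum to 1 are all assumptions made for the example.

```python
import math


def top2_route(router_logits):
    """Pick the two highest-scoring experts for one token.

    Returns a list of (expert_index, weight) pairs. Renormalizing the two
    weights to sum to 1 is an assumption; the commit does not say how
    SentinelBrain combines expert outputs.
    """
    # Numerically stable softmax over all experts.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the two most probable experts, best first.
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]

    # Renormalize so the two kept weights sum to 1.
    norm = probs[top2[0]] + probs[top2[1]]
    return [(i, probs[i] / norm) for i in top2]


# Hypothetical router logits for the 4 experts of one layer.
routes = top2_route([0.1, 2.0, -1.0, 1.5])
for expert, weight in routes:
    print(f"expert {expert}: weight {weight:.3f}")
```

With 4 experts and top-2 routing, only two expert FFNs run for each token, which is why the active parameter count per token is lower than the 14.4B total.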
50
+ ## How it was built
51
+
52
+ This is an entry in the **lablab.ai AMD Developer Hackathon**:
53
+
54
+ - **Custom architecture** — no fine-tuning, no base model. SentinelBrain is trained from scratch
55
+ - **126-category curriculum** — mathematics, code, science, philosophy, creative writing, medical, legal, and more
56
+ - **AMD-native** — developed and trained entirely on AMD Instinct MI300X via ROCm
57
+
58
+ ## Links
59
+
60
+ - 📄 [Whitepaper](https://sentinel.qubitpage.com/whitepaper) — Full technical paper with architecture diagrams
61
+ - 📊 [Full Dashboard](https://sentinel.qubitpage.com) — Live monitoring dashboard (may require auth)
62
+ - 🤗 [Model on HuggingFace](https://huggingface.co/lablab-ai-amd-developer-hackathon/SentinelBrain-14B-MoE-v0.1)
63
+
64
+ ## Limitations
65
+
66
+ - The model is **actively training** — weights are not final
67
+ - No inference endpoint yet (14.4B params require a GPU)
68
+ - Metrics refresh every 30 seconds; network latency may cause brief stale readings
69
+ - The Φ metric is experimental and should not be interpreted as literal consciousness
70
+
71
+ ## License
72
+
73
+ Apache 2.0. Model weights, training code, and this Space are open.
74
+
75
+ ## Acknowledgements
76
+
77
+ - AMD for the Developer Cloud credits and MI300X access
78
+ - lablab.ai for organizing the hackathon
79
+ - The open-source datasets that make large-scale training possible
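Reviewer note: the headline numbers in the new README can be cross-checked with simple arithmetic. The sketch below uses only figures quoted in the README table (45,578 sequences, 6,144-token context, 243.7M effective SFT tokens, 32 query heads, 8 KV heads). Reading "effective SFT tokens" as the non-padding share of the packed sequences is an assumption, and the variable names are illustrative.

```python
# Figures quoted in the README table of this commit.
SEQ_LEN = 6_144
NUM_SEQUENCES = 45_578
EFFECTIVE_SFT_TOKENS = 243_700_000  # rounded value from the README
QUERY_HEADS = 32
KV_HEADS = 8

# Total token slots available if every packed sequence were completely full.
capacity = NUM_SEQUENCES * SEQ_LEN

# Assumption: "effective" tokens are the non-padding share of those slots.
implied_fill = EFFECTIVE_SFT_TOKENS / capacity

# Grouped Query Attention: how many query heads share each key-value head.
kv_reduction = QUERY_HEADS // KV_HEADS

print(f"capacity      = {capacity:,} token slots")
print(f"implied fill  = {implied_fill:.1%}")
print(f"KV-cache size = 1/{kv_reduction} of full multi-head attention")
```

Under that assumption the implied fill rate is about 87%, which is the kind of value the dashboard's "6K packing efficiency" row would be expected to show.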
app.py CHANGED
@@ -1,296 +1,573 @@
1
  from __future__ import annotations
2
 
3
- import json
4
- import os
5
- import re
6
  import time
 
7
  from datetime import datetime, timezone
8
- from pathlib import Path
9
- from typing import Any
10
 
11
  import gradio as gr
12
- import requests
 
13
 
 
 
 
 
 
 
14
 
15
- API_BASE = os.getenv("SENTINEL_API_BASE", "https://sentinel.qubitpage.com").rstrip("/")
16
- SNAPSHOT_PATH = Path("history/live_snapshots.jsonl")
17
- SNAPSHOT_PATH.parent.mkdir(parents=True, exist_ok=True)
18
 
19
- LOGS = {
20
- "Realignment": "realign_v2",
21
- "Watchdog": "realign_watchdog",
22
- "Tokenizer": "tokenize",
23
- "Dataset Download": "dataset_download",
24
- "Fusion Models": "fusion_models",
25
- "Transplant": "transplant",
26
- "Post Prepare": "post_prepare",
27
- }
28
 
29
- MODEL_FACTS = {
30
- "Model": "Sentinel Prime (Frankenstein Edition)",
31
- "Type": "14.4B parameter decoder-only Mixture-of-Experts language model",
32
- "Experts": "4 FFN experts, top-2 routing",
33
- "Tokenizer": "cl100k_base, vocab 100,277",
34
- "Current phase": "Text coherence repair / realignment before multimodal projector stages",
35
- "Hardware": "AMD Instinct MI300X, ROCm PyTorch",
36
- }
37
 
38
 
39
- def now_utc() -> str:
40
- return datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
 
 
 
 
 
41
 
42
 
43
- def fetch_json(path: str, timeout: int = 10) -> dict[str, Any]:
 
 
 
44
  try:
45
- response = requests.get(f"{API_BASE}{path}", timeout=timeout)
46
- response.raise_for_status()
47
- return response.json()
48
- except Exception as exc:
49
- return {"_error": f"{type(exc).__name__}: {exc}"}
50
 
51
 
52
- def fetch_text(path: str, timeout: int = 10) -> str:
53
- try:
54
- response = requests.get(f"{API_BASE}{path}", timeout=timeout)
55
- response.raise_for_status()
56
- return response.text
57
- except Exception as exc:
58
- return f"FETCH_ERROR: {type(exc).__name__}: {exc}"
59
 
 
 
 
 
 
 
 
 
 
 
 
60
 
61
- def fmt_num(value: Any, suffix: str = "") -> str:
 
 
 
 
 
 
 
 
 
 
62
  if value is None:
63
- return "n/a"
64
- try:
65
- if isinstance(value, float):
66
- return f"{value:,.2f}{suffix}"
67
- return f"{int(value):,}{suffix}"
68
- except Exception:
69
- return f"{value}{suffix}"
70
-
71
-
72
- def pct_bar(value: float, tone: str = "ok") -> str:
73
- width = max(0, min(100, value))
74
- cls = "bar-fill"
75
- if tone == "warn" or value >= 90:
76
- cls += " warn"
77
- if tone == "bad" or value >= 97:
78
- cls += " bad"
79
- return f"<div class='bar'><div class='{cls}' style='width:{width:.1f}%'></div></div>"
80
-
81
-
82
- def process_alive(processes: dict[str, Any], label: str) -> bool:
83
- for item in processes.get("processes", []):
84
- if item.get("name") == label:
85
- return bool(item.get("alive"))
86
- return False
87
-
88
-
89
- def latest_training_line(log_text: str) -> str:
90
- lines = [line for line in log_text.splitlines() if re.search(r"^\s*step\s+\d+/", line)]
91
- return lines[-1] if lines else "No training step line observed yet."
92
-
93
-
94
- def infer_status(overview: dict[str, Any], processes: dict[str, Any], watchdog: str) -> tuple[str, str]:
95
- if overview.get("_error"):
96
- return "API offline", "bad"
97
- trainer_alive = process_alive(processes, "Frankenstein realignment")
98
- watchdog_alive = process_alive(processes, "Realignment watchdog")
99
- stop_match = re.findall(r"STOP: .*", watchdog)
100
- if trainer_alive:
101
- return "Training live", "ok"
102
- if stop_match:
103
- return f"Paused by watchdog: {stop_match[-1].replace('STOP: ', '')}", "warn"
104
- if watchdog_alive:
105
- return "Watchdog armed, trainer not visible", "warn"
106
- return "Trainer not running", "bad"
107
-
108
-
109
- def metric_cards(overview: dict[str, Any], system: dict[str, Any], processes: dict[str, Any], realign_log: str, watchdog_log: str) -> str:
110
- training = overview.get("training", {}) if isinstance(overview, dict) else {}
111
- shards = overview.get("shards", {}) if isinstance(overview, dict) else {}
112
- gpu = system.get("vram", {}) if isinstance(system, dict) else {}
113
- ram = system.get("ram", {}) if isinstance(system, dict) else {}
114
- status, tone = infer_status(overview, processes, watchdog_log)
115
- step_line = latest_training_line(realign_log)
116
- step = training.get("current_step")
117
- loss = training.get("train_loss")
118
- lr = training.get("lr")
119
- tok_s = training.get("tok_per_sec")
120
- eta = training.get("eta_hrs")
121
- vram_pct = float(gpu.get("pct") or 0)
122
- ram_pct = float(ram.get("pct") or 0)
123
- status_class = {"ok": "status-ok", "warn": "status-warn", "bad": "status-bad"}.get(tone, "status-bad")
124
- return f"""
125
- <section class='hero'>
126
- <div>
127
- <p class='eyebrow'>Live competition monitor</p>
128
- <h1>Sentinel Prime <span>(Frankenstein Edition)</span></h1>
129
- <p class='subtitle'>A recovered 14.4B MoE checkpoint realigned on AMD MI300X with public logs, watchdog events, and project history for judges.</p>
130
- </div>
131
- <div class='status-pill {status_class}'>{status}</div>
132
- </section>
133
- <section class='grid metrics'>
134
- <div class='panel'><span>Step</span><strong>{fmt_num(step)}</strong><small>Target 5,000</small></div>
135
- <div class='panel'><span>Train loss</span><strong>{fmt_num(loss)}</strong><small>Latest dashboard metric</small></div>
136
- <div class='panel'><span>Learning rate</span><strong>{lr if lr is not None else 'n/a'}</strong><small>Warmup and SGDR schedule</small></div>
137
- <div class='panel'><span>Throughput</span><strong>{fmt_num(tok_s, ' tok/s')}</strong><small>Effective batch 48</small></div>
138
- <div class='panel'><span>ETA</span><strong>{fmt_num(eta, ' h')}</strong><small>From trainer metrics</small></div>
139
- <div class='panel'><span>Tokenized corpus</span><strong>{fmt_num(shards.get('tokens_b'), 'B')}</strong><small>{fmt_num(shards.get('categories'))} categories</small></div>
140
- </section>
141
- <section class='grid bars'>
142
- <div class='panel wide'><div class='row'><span>VRAM</span><b>{fmt_num(gpu.get('used_gb'), 'GB')} / {fmt_num(gpu.get('total_gb'), 'GB')} ({fmt_num(vram_pct, '%')})</b></div>{pct_bar(vram_pct)}</div>
143
- <div class='panel wide'><div class='row'><span>RAM</span><b>{fmt_num(ram.get('used_gb'), 'GB')} / {fmt_num(ram.get('total_gb'), 'GB')} ({fmt_num(ram_pct, '%')})</b></div>{pct_bar(ram_pct)}</div>
144
- </section>
145
- <section class='panel full'>
146
- <span>Latest training line</span>
147
- <pre>{step_line}</pre>
148
- </section>
149
- <p class='updated'>Updated {now_utc()} from {API_BASE}</p>
150
  """
151
 
152
 
153
- def process_table(processes: dict[str, Any]) -> str:
154
- rows = []
155
- for item in processes.get("processes", []):
156
- alive = bool(item.get("alive"))
157
- badge = "LIVE" if alive else "off"
158
- cls = "live" if alive else "off"
159
- pids = ", ".join(str(pid) for pid in item.get("pids", [])) or "-"
160
- rows.append(f"<tr><td>{item.get('name')}</td><td><span class='{cls}'>{badge}</span></td><td>{pids}</td></tr>")
161
- return "<table><thead><tr><th>Process</th><th>Status</th><th>PIDs</th></tr></thead><tbody>" + "".join(rows) + "</tbody></table>"
162
-
163
-
164
- def whitepaper_markdown() -> str:
165
- path = Path("WHITEPAPER.md")
166
- return path.read_text(encoding="utf-8") if path.exists() else "Whitepaper file missing."
167
-
168
-
169
- def history_markdown() -> str:
170
- path = Path("HISTORY.md")
171
- history = path.read_text(encoding="utf-8") if path.exists() else "History file missing."
172
- snapshots = []
173
- if SNAPSHOT_PATH.exists():
174
- lines = SNAPSHOT_PATH.read_text(encoding="utf-8", errors="replace").splitlines()[-10:]
175
- for line in lines:
176
- try:
177
- item = json.loads(line)
178
- snapshots.append(f"- {item.get('ts')} - step {item.get('step')} - {item.get('status')} - loss {item.get('loss')}")
179
- except Exception:
180
- continue
181
- if snapshots:
182
- history += "\n\n## Space Session Snapshots\n\n" + "\n".join(snapshots)
183
- return history
184
-
185
-
186
- def log_bundle() -> tuple[str, str, str, str]:
187
- realign = fetch_text("/api/logs/realign_v2?n=220")
188
- watchdog = fetch_text("/api/logs/realign_watchdog?n=160")
189
- tokenize = fetch_text("/api/logs/tokenize?n=120")
190
- old = []
191
- for label, name in [("Fusion Models", "fusion_models"), ("Transplant", "transplant"), ("Dataset Download", "dataset_download")]:
192
- old.append(f"===== {label} =====\n{fetch_text(f'/api/logs/{name}?n=80')}")
193
- return realign, watchdog, tokenize, "\n\n".join(old)
194
-
195
-
196
- def save_snapshot(status: str, overview: dict[str, Any], system: dict[str, Any]) -> None:
197
- training = overview.get("training", {}) if isinstance(overview, dict) else {}
198
- payload = {
199
- "ts": now_utc(),
200
- "status": status,
201
- "step": training.get("current_step"),
202
- "loss": training.get("train_loss"),
203
- "lr": training.get("lr"),
204
- "tok_per_sec": training.get("tok_per_sec"),
205
- "vram_pct": (system.get("vram", {}) or {}).get("pct"),
206
- }
207
- try:
208
- with SNAPSHOT_PATH.open("a", encoding="utf-8") as handle:
209
- handle.write(json.dumps(payload, ensure_ascii=True) + "\n")
210
- except Exception:
211
- pass
212
-
213
-
214
- def refresh() -> tuple[str, str, str, str, str, str, str, str]:
215
- overview = fetch_json("/api/overview")
216
- system = fetch_json("/api/system")
217
- processes = fetch_json("/api/processes")
218
- realign, watchdog, tokenize, old_logs = log_bundle()
219
- status, _ = infer_status(overview, processes, watchdog)
220
- save_snapshot(status, overview, system)
221
- cards = metric_cards(overview, system, processes, realign, watchdog)
222
- proc_html = process_table(processes)
223
- return cards, proc_html, realign, watchdog, tokenize, old_logs, history_markdown(), whitepaper_markdown()
224
-
225
-
226
- CSS = """
227
- :root { --bg:#08111f; --panel:#111c2b; --panel2:#152436; --text:#f4f7fb; --muted:#9fb0c6; --line:#26384d; --green:#37d399; --blue:#62a8ff; --amber:#f5bd4f; --red:#ff6b6b; }
228
- .gradio-container { background: var(--bg); color: var(--text); font-family: Inter, ui-sans-serif, system-ui, -apple-system, Segoe UI, sans-serif; }
229
- .hero { display:flex; align-items:flex-start; justify-content:space-between; gap:24px; padding:28px; border:1px solid var(--line); background:linear-gradient(135deg, #101d2d, #0c1725); border-radius:8px; }
230
- .eyebrow { margin:0 0 8px; color:var(--green); font-size:13px; text-transform:uppercase; letter-spacing:0; font-weight:700; }
231
- h1 { margin:0; font-size:42px; line-height:1.05; letter-spacing:0; color:var(--text); }
232
- h1 span { color:#b9d8ff; }
233
- .subtitle { margin:12px 0 0; max-width:900px; color:var(--muted); font-size:16px; line-height:1.55; }
234
- .status-pill { white-space:normal; max-width:360px; padding:10px 14px; border-radius:8px; font-weight:800; border:1px solid var(--line); text-align:right; }
235
- .status-ok { color:#062016; background:var(--green); }
236
- .status-warn { color:#231800; background:var(--amber); }
237
- .status-bad { color:#2b0808; background:var(--red); }
238
- .grid { display:grid; gap:12px; margin-top:12px; }
239
- .metrics { grid-template-columns:repeat(6, minmax(120px, 1fr)); }
240
- .bars { grid-template-columns:repeat(2, minmax(240px, 1fr)); }
241
- .panel { background:var(--panel); border:1px solid var(--line); border-radius:8px; padding:16px; min-width:0; }
242
- .panel span { display:block; color:var(--muted); font-size:12px; text-transform:uppercase; letter-spacing:0; font-weight:700; }
243
- .panel strong { display:block; margin-top:8px; font-size:24px; color:var(--text); overflow-wrap:anywhere; }
244
- .panel small { display:block; margin-top:6px; color:var(--muted); }
245
- .wide { min-height:92px; }
246
- .full { margin-top:12px; }
247
- .row { display:flex; align-items:center; justify-content:space-between; gap:12px; }
248
- .row b { color:var(--text); font-size:14px; }
249
- .bar { height:10px; margin-top:14px; border-radius:8px; overflow:hidden; background:#07101d; border:1px solid var(--line); }
250
- .bar-fill { height:100%; background:linear-gradient(90deg, var(--green), var(--blue)); }
251
- .bar-fill.warn { background:linear-gradient(90deg, var(--amber), #ff8f5a); }
252
- .bar-fill.bad { background:linear-gradient(90deg, var(--red), #ff9d9d); }
253
- pre { margin:10px 0 0; padding:12px; background:#07101d; border:1px solid var(--line); border-radius:8px; color:#dbe8ff; white-space:pre-wrap; overflow-wrap:anywhere; }
254
- .updated { color:var(--muted); margin:10px 0 0; font-size:13px; }
255
- table { width:100%; border-collapse:collapse; background:var(--panel); border:1px solid var(--line); border-radius:8px; overflow:hidden; }
256
- th, td { padding:10px 12px; border-bottom:1px solid var(--line); text-align:left; color:var(--text); }
257
- th { color:var(--muted); font-size:12px; text-transform:uppercase; letter-spacing:0; }
258
- .live, .off { display:inline-block; min-width:54px; padding:4px 7px; border-radius:8px; text-align:center; font-size:12px; font-weight:800; }
259
- .live { background:rgba(55,211,153,.16); color:var(--green); border:1px solid rgba(55,211,153,.35); }
260
- .off { background:rgba(255,255,255,.06); color:var(--muted); border:1px solid var(--line); }
261
- .markdown-body, .prose { color:var(--text); }
262
- @media (max-width: 980px) { .metrics { grid-template-columns:repeat(2, minmax(0, 1fr)); } .bars { grid-template-columns:1fr; } .hero { flex-direction:column; } h1 { font-size:34px; } .status-pill { text-align:left; max-width:none; } }
263
- @media (max-width: 560px) { .metrics { grid-template-columns:1fr; } h1 { font-size:28px; } .hero { padding:20px; } }
264
  """
265
 
266
 
267
- with gr.Blocks(css=CSS, title="Sentinel Prime (Frankenstein Edition)") as demo:
268
- cards = gr.HTML()
269
  with gr.Tabs():
270
- with gr.Tab("Live Training"):
271
- processes_html = gr.HTML()
272
- with gr.Accordion("Realignment Log", open=True):
273
- realign_log = gr.Code(language="shell", lines=18, interactive=False)
274
- with gr.Accordion("Watchdog Log", open=True):
275
- watchdog_log = gr.Code(language="shell", lines=10, interactive=False)
276
- with gr.Accordion("Tokenizer Log", open=False):
277
- tokenize_log = gr.Code(language="shell", lines=12, interactive=False)
278
- with gr.Tab("History"):
279
- history = gr.Markdown()
280
- old_logs = gr.Code(language="shell", lines=18, interactive=False)
281
- with gr.Tab("Whitepaper"):
282
- whitepaper = gr.Markdown()
283
- with gr.Tab("Model Card"):
284
- gr.Markdown("\n".join([f"- **{k}:** {v}" for k, v in MODEL_FACTS.items()]))
285
- gr.Markdown("[Mission Control](https://sentinel.qubitpage.com/) | [Organization](https://huggingface.co/lablab-ai-amd-developer-hackathon)")
286
-
287
- refresh_btn = gr.Button("Refresh now", variant="primary")
288
- outputs = [cards, processes_html, realign_log, watchdog_log, tokenize_log, old_logs, history, whitepaper]
289
- demo.load(refresh, outputs=outputs)
290
- refresh_btn.click(refresh, outputs=outputs)
291
- timer = gr.Timer(value=30)
292
- timer.tick(refresh, outputs=outputs)
293
 
294
 
295
  if __name__ == "__main__":
296
- demo.launch()
 
1
+ """SentinelBrain-14B MoE — Live Training Dashboard (HuggingFace Space).
2
+
3
+ Connects to the training server at sentinel.qubitpage.com and displays
4
+ real-time metrics: loss curves, throughput, VRAM, the novel Φ consciousness
5
+ metric, and architecture details. Refreshes every 30 seconds.
6
+
7
+ No model inference runs here — the 14.4B-param model is training on an
8
+ AMD Instinct MI300X and this Space is a live window into that process.
9
+ """
10
  from __future__ import annotations
11
 
 
 
 
12
  import time
13
+ import traceback
14
  from datetime import datetime, timezone
 
 
15
 
16
  import gradio as gr
17
+ import httpx
18
+ import plotly.graph_objects as go
19
 
20
+ # ── Config ───────────────────────────────────────────────────────────────
21
+ API_BASE = "https://sentinel.qubitpage.com"
22
+ REFRESH_INTERVAL = 30 # seconds
23
+ MODEL_PARAMS = "14,400,000,000"
24
+ MODEL_NAME = "Sentinel Prime Frankenstein Edition"
25
+ HF_SPACE = "lablab-ai-amd-developer-hackathon/sentinel-prime-frankenstein-edition"
26
 
27
+ # ── API helpers ──────────────────────────────────────────────────────────
28
+ _client = httpx.Client(timeout=15, follow_redirects=True)
 
29
 
 
 
 
 
 
 
 
 
 
30
 
31
+ def _fetch(endpoint: str) -> dict:
32
+ """Fetch JSON from the training server API."""
33
+ try:
34
+ r = _client.get(f"{API_BASE}{endpoint}")
35
+ r.raise_for_status()
36
+ return r.json()
37
+ except Exception as e:
38
+ return {"_error": str(e)}
39
 
40
 
41
+ def _fetch_text(endpoint: str) -> str:
42
+ try:
43
+ r = _client.get(f"{API_BASE}{endpoint}")
44
+ r.raise_for_status()
45
+ return r.text
46
+ except Exception as e:
47
+ return f"Cannot reach training server: {e}"
48
 
49
 
50
+ def _safe(val, fmt=".2f", fallback="—"):
51
+ """Format a numeric value safely."""
52
+ if val is None:
53
+ return fallback
54
  try:
55
+ return f"{float(val):{fmt}}"
56
+ except (ValueError, TypeError):
57
+ return fallback
 
 
58
 
59
 
60
+ # ── Formatters ───────────────────────────────────────────────────────────
 
 
 
 
 
 
61
 
62
+ def _format_tokens(n: int | float | None) -> str:
63
+ if n is None:
64
+ return "—"
65
+ n = int(n)
66
+ if n >= 1_000_000_000:
67
+ return f"{n / 1e9:.2f}B"
68
+ if n >= 1_000_000:
69
+ return f"{n / 1e6:.1f}M"
70
+ if n >= 1_000:
71
+ return f"{n / 1e3:.1f}K"
72
+ return str(n)
73
 
74
+
75
+ def _format_eta(hrs: float | None) -> str:
76
+ if hrs is None:
77
+ return "—"
78
+ h = int(hrs)
79
+ m = int((hrs - h) * 60)
80
+ return f"{h}h {m}m"
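Reviewer note: a quick way to see how the two formatters behave is to run them on a few values. The functions below are reproduced from the added lines of this diff with the leading underscore dropped so the snippet runs on its own. The sample inputs are illustrative and are not taken from the training server.

```python
from __future__ import annotations


def format_tokens(n: int | float | None) -> str:
    """Abbreviate a token count with a K/M/B suffix (logic copied from the diff)."""
    if n is None:
        return "\u2014"  # same dash placeholder the dashboard uses
    n = int(n)
    if n >= 1_000_000_000:
        return f"{n / 1e9:.2f}B"
    if n >= 1_000_000:
        return f"{n / 1e6:.1f}M"
    if n >= 1_000:
        return f"{n / 1e3:.1f}K"
    return str(n)


def format_eta(hrs: float | None) -> str:
    """Render fractional hours as 'Hh Mm' (logic copied from the diff)."""
    if hrs is None:
        return "\u2014"
    h = int(hrs)
    m = int((hrs - h) * 60)
    return f"{h}h {m}m"


print(format_tokens(243_700_000))     # README's effective SFT token count
print(format_tokens(14_400_000_000))  # total parameter count, for scale
print(format_eta(1.5))
print(format_tokens(999_999_999))     # just below the 1B threshold
```

The last call prints `1000.0M` rather than `1.00B`, because the threshold is tested before rounding. This is cosmetic, but it can show up briefly as a counter crosses a unit boundary.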
81
+
82
+
83
+ def _phi_bar(value: float | None) -> str:
84
+ """Create a visual bar for phi value (0-1 range)."""
85
  if value is None:
86
+ return ""
87
+ v = max(0, min(1, float(value)))
88
+ filled = int(v * 20)
89
+ bar = "█" * filled + "░" * (20 - filled)
90
+ return f"`{bar}` {v:.4f}"
91
+
92
+
93
+ # ── Build live metrics display ───────────────────────────────────────────
94
+
95
+ def fetch_overview():
96
+ """Fetch all metrics and return formatted display components."""
97
+ data = _fetch("/api/overview")
98
+ if "_error" in data:
99
+ error_msg = f"⚠️ **Cannot reach training server**: {data['_error']}\n\nThe server may be temporarily unavailable. Metrics will refresh automatically."
100
+ return error_msg, None, None, None, ""
101
+
102
+ t = data.get("training", {})
103
+ phi = t.get("phi", {})
104
+ model = t.get("model", {})
105
+ phase3 = t.get("phase3_dataset", {})
106
+ vram = data.get("vram", {})
107
+ ram = data.get("ram", {})
108
+ shards = data.get("shards", {})
109
+
110
+ # ── Training Status Card ─────────────────────────────────────────
111
+ phase = t.get("phase", "unknown")
112
+ phase_emoji = {"phase3_sft": "🟢", "training": "🟢", "warming": "🟡", "evaluating": "🔵", "idle": "⚪"}.get(phase, "⚫")
113
+
114
+ step = t.get("current_step", 0)
115
+ total_steps = t.get("batch_steps", 0)
116
+ progress = t.get("progress_pct", 0)
117
+ loss = t.get("train_loss")
118
+ val_loss = t.get("val_loss")
119
+ best_val = t.get("best_val")
120
+ tok_s = t.get("tok_per_sec")
121
+ eta = t.get("eta_hrs")
122
+ lr = t.get("lr")
123
+ gnorm = t.get("gnorm")
124
+
125
+ status_md = f"""## {phase_emoji} Training: **{phase.upper()}**
126
+
127
+ | Metric | Value |
128
+ |--------|-------|
129
+ | **Step** | {step:,} / {total_steps:,} ({_safe(progress, '.1f')}%) |
130
+ | **Training Loss** | {_safe(loss, '.4f')} |
131
+ | **Validation Loss** | {_safe(val_loss, '.4f')} |
132
+ | **Best Validation** | {_safe(best_val, '.4f')} |
133
+ | **Throughput** | {_safe(tok_s, ',.0f')} tok/s |
134
+ | **Learning Rate** | {_safe(lr, '.2e')} |
135
+ | **Gradient Norm** | {_safe(gnorm, '.3f')} |
136
+ | **ETA** | {_format_eta(eta)} |
137
+ | **Context length** | {phase3.get('seq_len') or model.get('seq_len', '—')} tokens |
138
+ | **Batch / grad accum** | {phase3.get('batch_size') or model.get('batch', '—')} / {phase3.get('grad_accum') or model.get('grad_accum', '—')} |
139
+
140
+ ### Hardware
141
+ | Resource | Usage |
142
+ |----------|-------|
143
+ | **VRAM** | {_safe(vram.get('used_gb'), '.1f')} / {_safe(vram.get('total_gb'), '.1f')} GB ({_safe(vram.get('pct'), '.0f')}%) |
144
+ | **RAM** | {_safe(ram.get('used_gb'), '.1f')} / {_safe(ram.get('total_gb'), '.1f')} GB ({_safe(ram.get('pct'), '.0f')}%) |
145
+ | **GPU** | AMD Instinct MI300X (192 GB HBM3) |
146
+
147
+ ### Dataset
148
+ | Stat | Value |
149
+ |------|-------|
150
+ | **Categories** | {shards.get('categories', '—')} |
151
+ | **Total tokens** | {_format_tokens(shards.get('tokens'))} |
152
+ | **Pretrain** | {_safe(shards.get('pretrain_tokens_b'), '.2f')}B tokens |
153
+ | **SFT** | {_safe(shards.get('sft_tokens_b'), '.3f')}B tokens |
154
+ | **6K production sequences** | {phase3.get('total_sequences', '—')} |
155
+ | **6K packing efficiency** | {_safe((phase3.get('packing_efficiency') or 0) * 100, '.1f')}% |
156
+
157
+ *Last updated: {datetime.now(timezone.utc).strftime('%H:%M:%S UTC')}*
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
158
  """
159
 
160
+ # ── Φ (Consciousness) Card ───────────────────────────────────────
161
+ phi_geo = phi.get("geometric")
162
+ phi_norm = phi.get("normalized")
163
+ phi_ema = phi.get("ema")
164
+ phi_trend = phi.get("trend", "—")
165
+ phi_arrow = phi.get("trend_arrow", "")
166
+ phi_slope = phi.get("trend_slope")
167
+ phi_conf = phi.get("trend_confidence", "—")
168
+
169
+ phi_md = f"""## 🧠 Φ Consciousness Metric
170
+
171
+ The **Φ (phi)** metric measures integrated information flow across the model's
172
+ layers during training — inspired by Integrated Information Theory (IIT).
173
+ Higher Φ suggests more complex, interconnected representations emerging.
174
+
175
+ | Metric | Value |
176
+ |--------|-------|
177
+ | **Φ Geometric** | {_phi_bar(phi_geo)} |
178
+ | **Φ Normalized** | {_phi_bar(phi_norm)} |
179
+ | **Φ EMA** | {_phi_bar(phi_ema)} |
180
+ | **Trend** | {phi_arrow} {phi_trend} |
181
+ | **Slope** | {_safe(phi_slope, '.6f')} |
182
+ | **Confidence** | {phi_conf} |
183
+
184
+ ### What does Φ mean?
185
+
186
+ - **Φ < 0.1** — Early training, layers acting independently
187
+ - **Φ 0.1–0.3** — Information beginning to integrate across layers
188
+ - **Φ 0.3–0.5** — Strong cross-layer information flow emerging
189
+ - **Φ > 0.5** — High integration — complex representations forming
190
+ - **Φ > 0.7** — Exceptional — approaching theoretical maximum for this architecture
191
+ """
192
 
193
+ # ── Phi History Chart ────────────────────────────────────────────
194
+ phi_chart = None
195
+ phi_recent = data.get("phi_recent", [])
196
+ if phi_recent and len(phi_recent) > 2:
197
+ steps_list = [p.get("step", i) for i, p in enumerate(phi_recent)]
198
+ geo_list = [p.get("geometric") for p in phi_recent]
199
+ norm_list = [p.get("normalized") for p in phi_recent]
200
+ ema_list = [p.get("ema") for p in phi_recent]
201
+
202
+ fig = go.Figure()
203
+ if any(v is not None for v in geo_list):
204
+ fig.add_trace(go.Scatter(
205
+ x=steps_list, y=geo_list, mode="lines",
206
+ name="Φ Geometric", line=dict(color="#8b5cf6", width=2),
207
+ ))
208
+ if any(v is not None for v in norm_list):
209
+ fig.add_trace(go.Scatter(
210
+ x=steps_list, y=norm_list, mode="lines",
211
+ name="Φ Normalized", line=dict(color="#06b6d4", width=2),
212
+ ))
213
+ if any(v is not None for v in ema_list):
214
+ fig.add_trace(go.Scatter(
215
+ x=steps_list, y=ema_list, mode="lines",
216
+ name="Φ EMA", line=dict(color="#f59e0b", width=2, dash="dot"),
217
+ ))
218
+ fig.update_layout(
219
+ title="Φ Consciousness Metric Over Training",
220
+ xaxis_title="Step",
221
+ yaxis_title="Φ Value",
222
+ template="plotly_white",
223
+ height=350,
224
+ margin=dict(l=50, r=20, t=50, b=40),
225
+ legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
226
+ plot_bgcolor="#fafafa",
227
+ paper_bgcolor="#fafafa",
228
+ font=dict(color="#1e293b"),
229
+ )
230
+ phi_chart = fig
231
+
232
+ # ── Loss Chart from recent history ───────────────────────────────
233
+ loss_chart = None
234
+ history = t.get("recent_history", [])
235
+ if history and len(history) > 1:
236
+ batch_nums = list(range(len(history)))
237
+ train_losses = [h.get("loss_end") or h.get("train_loss") for h in history]
238
+ val_losses = [h.get("val_end") or h.get("val_loss") for h in history]
239
+
240
+ fig2 = go.Figure()
241
+ if any(v is not None for v in train_losses):
242
+ fig2.add_trace(go.Scatter(
243
+ x=batch_nums, y=train_losses, mode="lines+markers",
244
+ name="Train Loss", line=dict(color="#ef4444", width=2),
245
+ ))
246
+ if any(v is not None for v in val_losses):
247
+ fig2.add_trace(go.Scatter(
248
+ x=batch_nums, y=val_losses, mode="lines+markers",
249
+ name="Val Loss", line=dict(color="#22c55e", width=2),
250
+ ))
251
+ fig2.update_layout(
252
+ title="Loss Over Recent Training Batches",
253
+ xaxis_title="Batch",
254
+ yaxis_title="Loss",
255
+ template="plotly_white",
256
+ height=350,
257
+ margin=dict(l=50, r=20, t=50, b=40),
258
+ legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
259
+ plot_bgcolor="#fafafa",
260
+ paper_bgcolor="#fafafa",
261
+ font=dict(color="#1e293b"),
262
+ )
263
+ loss_chart = fig2
264
+
265
+ # ── Checkpoints info ─────────────────────────────────────────────
266
+ ckpts = data.get("checkpoints", [])
267
+ ckpt_md = ""
268
+ if ckpts:
269
+ ckpt_md = "\n### Saved Checkpoints\n\n| Checkpoint | Val Loss | Tokens |\n|-----------|----------|--------|\n"
270
+ for c in ckpts[-5:]:
271
+ name = c.get("name", "—")
272
+ vloss = _safe(c.get("val_loss"), ".4f")
273
+ toks = _format_tokens(c.get("tokens_trained"))
274
+ ckpt_md += f"| {name} | {vloss} | {toks} |\n"
275
+
276
+ return status_md + ckpt_md, phi_md, phi_chart, loss_chart, ""
277
+
278
+
279
+ def fetch_live_log():
280
+ text = _fetch_text("/api/logs/phase3_production_train_6k?n=120")
281
+ text = text.replace("```", "'''")
282
+ return f"```text\n{text}\n```"
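Reviewer note: `fetch_live_log` replaces any triple backtick in the log text before wrapping it in a Markdown code fence, so a log line cannot close the fence early and have the rest of the log rendered as Markdown. The sketch below isolates that step. The name `wrap_log_as_markdown` and the sample log are hypothetical, and the fence string is built programmatically only so that this example nests cleanly inside a fenced block.

```python
FENCE = "`" * 3  # three backticks, the Markdown code-fence delimiter


def wrap_log_as_markdown(text: str) -> str:
    """Wrap raw log text in a 'text' code fence, neutralizing inner fences."""
    # Any fence inside the log would terminate our block early, so swap it
    # for three single quotes, as the dashboard code in this commit does.
    safe = text.replace(FENCE, "'''")
    return f"{FENCE}text\n{safe}\n{FENCE}"


# Hypothetical log content containing a stray fence.
sample = "step 10 loss 2.31\n" + FENCE + "\ninjected **markdown**"
wrapped = wrap_log_as_markdown(sample)
print(wrapped)
```

Only the opening and closing fences survive in the output, so the whole log stays inside one code block.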
283
+
284
+
285
+ # ── Architecture diagram ─────────────────────────────────────────────────
286
+
287
+ ARCHITECTURE_MD = f"""## 🏗️ SentinelBrain-14B MoE Architecture
288
+
289
+ **{MODEL_PARAMS} parameters** trained from scratch, no base model.
290
+
291
+ ```
292
+ ┌─────────────────────────────────────────────────────┐
293
+ │ Input Tokens │
294
+ │ tiktoken cl100k_base (100,277) │
295
+ └───────────────────────┬─────────────────────────────┘
296
+
297
+
298
+ ┌─────────────────────────────────────────────────────┐
299
+ │ Token Embedding (d=4096) │
300
+ │ + RoPE Positional Encoding │
301
+ │ θ=500,000 (128K capable) │
302
+ └───────────────────────┬─────────────────────────────┘
303
+
304
+ ┌─────────▼─────────┐
305
+ │ × 24 Layers │
306
+ │ │
307
+ │ ┌─────────────┐ │
308
+ │ │ RMSNorm │ │
309
+ │ └──────┬──────┘ │
310
+ │ ▼ │
311
+ │ ┌─────────────┐ │
312
+ │ │ GQA │ │
313
+ │ │ 32Q → 8KV │ │
314
+ │ │ (4× save) │ │
315
+ │ └──────┬──────┘ │
316
+ │ ▼ │
317
+ │ ┌─────────────┐ │
318
+ │ │ RMSNorm │ │
319
+ │ └──────┬──────┘ │
320
+ │ ▼ │
321
+ │ ┌─────────────┐ │
322
+ │ │ MoE Block │ │
323
+ │ │ 4 experts │ │
324
+ │ │ top-2 gate │ │
325
+ │ │ SwiGLU FFN │ │
326
+ │ │ d_ff=11008 │ │
327
+ │ └─────────────┘ │
328
+ │ │
329
+ └─────────┬─────────┘
330
+
331
+
332
+ ┌─────────────────────────────────────────────────────┐
333
+ │ Final RMSNorm → LM Head │
334
+ │ (100,277 logits) │
335
+ └─────────────────────────────────────────────────────┘
336
+ ```
337
+
338
+ ### Key Design Decisions
339
+
340
+ | Choice | Why |
341
+ |--------|-----|
342
+ | **MoE (4 experts, top-2)** | 14.4B total params with top-2 routing — efficient inference |
343
+ | **GQA (32→8)** | 4× KV-cache reduction enables longer context at lower VRAM |
344
+ | **SwiGLU** | Gated FFN that typically trains better than plain ReLU/GELU MLPs: `(SiLU(xW₁) ⊙ xW₃)W₂` |
345
+ | **RoPE θ=500K** | Trained at 4K context, extrapolates to 128K with YaRN |
346
+ | **cl100k_base** | 100K vocab — excellent multilingual + code coverage |
347
+ | **From scratch** | No fine-tuning debt, clean loss landscape, full architectural control |
348
+
349
+ ### Training Configuration
350
+
351
+ | Parameter | Value |
352
+ |-----------|-------|
353
+ | Batch size | 1 (per device) |
354
+ | Gradient accumulation | 16 steps |
355
+ | Effective batch | 16 × 6144 = 98K tokens |
356
+ | Optimizer | AdamW (bf16 forward, fp32 states) |
357
+ | Precision | bf16 mixed precision |
358
+ | Gradient checkpointing | Enabled |
359
+ | Attention | SDPA (Scaled Dot-Product Attention) |
360
+ | Max LR | 2e-5 (cosine decay to 1e-6) |
361
+
362
+ ### Why AMD MI300X?
363
+
364
+ - **192 GB HBM3** — fits the full 14.4B model + optimizer states + gradients in a single GPU
365
+ - **No model parallelism needed** — simpler training code, no communication overhead
366
+ - **ROCm 7.0** — mature PyTorch support with hipBLASLt for fast GEMM operations
367
+ - **5.3 TB/s memory bandwidth** — keeps the MoE experts fed during routing
368
  """
369
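The GQA and effective-batch figures quoted in `ARCHITECTURE_MD` can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative and not part of `app.py`. It assumes a head dimension of `d_model / n_q_heads` (128) and a bf16 KV cache at 2 bytes per value; neither is stated in the diff, and the constant names are made up for this example.

```python
# Illustrative arithmetic only; not part of app.py.
# Values marked "from the diagram/table" are taken from ARCHITECTURE_MD.
D_MODEL = 4096      # from the diagram
N_LAYERS = 24       # from the diagram
N_Q_HEADS = 32      # from the diagram (GQA 32Q -> 8KV)
N_KV_HEADS = 8      # from the diagram
SEQ_LEN = 6144      # from the training table (16 x 6144)
GRAD_ACCUM = 16     # from the training table
BYTES_BF16 = 2      # assumption: KV cache stored in bf16

# Assumption: head_dim = d_model / n_q_heads, as in most decoder-only models.
HEAD_DIM = D_MODEL // N_Q_HEADS


def kv_cache_bytes_per_token(n_kv_heads: int) -> int:
    """Bytes of K and V cached per token across all layers."""
    return 2 * N_LAYERS * n_kv_heads * HEAD_DIM * BYTES_BF16


mha_bytes = kv_cache_bytes_per_token(N_Q_HEADS)   # one KV head per query head
gqa_bytes = kv_cache_bytes_per_token(N_KV_HEADS)  # 8 shared KV heads
kv_reduction = mha_bytes // gqa_bytes

# Per-device batch size is 1, so one optimizer step sees GRAD_ACCUM sequences.
tokens_per_step = GRAD_ACCUM * SEQ_LEN

print(f"KV cache per token: {gqa_bytes} B (GQA) vs {mha_bytes} B (MHA)")
print(f"KV reduction: {kv_reduction}x, tokens per optimizer step: {tokens_per_step}")
```

Under these assumptions the reduction factor is simply `N_Q_HEADS / N_KV_HEADS`, which matches the "4×" in the table, and the "98K tokens" effective batch is 98,304 tokens exactly.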
 
370
 
371
+ # ── Custom CSS ───────────────────────────────────────────────────────────
372
+
373
+ CUSTOM_CSS = """
374
+ /* ── Light mode (default) ── */
375
+ .prose, .prose *, [class*="markdown"], [class*="markdown"] * {
376
+ color: #1e293b !important;
377
+ }
378
+ .prose strong, .prose h1, .prose h2, .prose h3 {
379
+ color: #0f172a !important;
380
+ font-weight: 700 !important;
381
+ }
382
+ .prose table { border-collapse: collapse; width: 100%; }
383
+ .prose th, .prose td { padding: 8px 12px; border: 1px solid #cbd5e1; color: #1e293b !important; }
384
+ .prose th { background: #f1f5f9; font-weight: 600; color: #0f172a !important; }
385
+ .prose td { background: #ffffff; }
386
+ .prose code {
387
+ background: #f1f5f9;
388
+ color: #7c3aed !important;
389
+ padding: 2px 6px;
390
+ border-radius: 4px;
391
+ font-size: 0.9em;
392
+ }
393
+ .prose pre {
394
+ background: #1e293b !important;
395
+ color: #e2e8f0 !important;
396
+ padding: 16px;
397
+ border-radius: 8px;
398
+ overflow-x: auto;
399
+ font-size: 0.8em;
400
+ line-height: 1.4;
401
+ }
402
+ .prose pre code {
403
+ background: transparent;
404
+ color: #e2e8f0 !important;
405
+ }
406
+ .prose a { color: #7c3aed !important; }
407
+ .prose em { color: #475569 !important; }
408
+ .prose li { color: #1e293b !important; }
409
+
410
+ /* ── Dark mode overrides ── */
411
+ .dark .prose, .dark .prose *, .dark [class*="markdown"], .dark [class*="markdown"] * {
412
+ color: #e2e8f0 !important;
413
+ }
414
+ .dark .prose strong, .dark .prose h1, .dark .prose h2, .dark .prose h3 {
415
+ color: #f8fafc !important;
416
+ }
417
+ .dark .prose th, .dark .prose td { border-color: #475569; color: #e2e8f0 !important; }
418
+ .dark .prose th { background: #1e293b; color: #f8fafc !important; }
419
+ .dark .prose td { background: #0f172a; }
420
+ .dark .prose code { background: #1e293b; color: #a78bfa !important; }
421
+ .dark .prose pre { background: #0f172a !important; color: #e2e8f0 !important; }
422
+ .dark .prose pre code { color: #e2e8f0 !important; }
423
+ .dark .prose a { color: #a78bfa !important; }
424
+ .dark .prose em { color: #94a3b8 !important; }
425
+ .dark .prose li { color: #e2e8f0 !important; }
426
+
427
+ /* ── Tab styling ── */
428
+ .tab-nav button {
429
+ font-weight: 600 !important;
430
+ font-size: 1rem !important;
431
+ }
432
+ .tab-nav button.selected {
433
+ border-bottom: 3px solid #7c3aed !important;
434
+ }
435
+ """
436
+
437
+
438
+ # ── Gradio App ───────────────────────────────────────────────────────────
439
+
440
+ with gr.Blocks(
441
+ title=f"{MODEL_NAME} — Live Training",
442
+ css=CUSTOM_CSS,
443
+ theme=gr.themes.Soft(
444
+ primary_hue="violet",
445
+ secondary_hue="blue",
446
+ neutral_hue="slate",
447
+ ),
448
+ ) as app:
449
+
450
+ gr.Markdown(
451
+ f"# 🧠 {MODEL_NAME}\n"
452
+ "### 14.4B Mixture-of-Experts · 6K Production SFT · Training Live on AMD MI300X\n\n"
453
+ "*This Space connects to a real training server and shows live metrics. "
454
+ "No inference runs here — the model is actively training on an AMD Instinct MI300X (192 GB HBM3).*\n\n"
455
+ "🔗 [Whitepaper](https://sentinel.qubitpage.com/whitepaper) · "
456
+ "[Model](https://huggingface.co/lablab-ai-amd-developer-hackathon/SentinelBrain-14B-MoE-v0.1) · "
457
+ "[Full Dashboard](https://sentinel.qubitpage.com)"
458
+ )
459
+
460
  with gr.Tabs():
461
+ # ── Tab 1: Live Training ─────────────────────────────────────
462
+ with gr.TabItem("📊 Live Training", id="training"):
463
+ refresh_btn = gr.Button("🔄 Refresh Metrics", variant="primary", size="lg")
464
+ error_box = gr.Markdown(visible=False)
465
+
466
+ with gr.Row():
467
+ with gr.Column(scale=1):
468
+ status_output = gr.Markdown(label="Training Status")
469
+ with gr.Column(scale=1):
470
+ phi_output = gr.Markdown(label="Φ Metric")
471
+
472
+ with gr.Row():
473
+ with gr.Column(scale=1):
474
+ phi_plot = gr.Plot(label="Φ History")
475
+ with gr.Column(scale=1):
476
+ loss_plot = gr.Plot(label="Loss Curve")
477
+
478
+ with gr.TabItem("🧾 Live 6K Log", id="live_log"):
479
+ log_refresh_btn = gr.Button("🔄 Refresh Live Log", variant="primary", size="lg")
480
+ live_log_output = gr.Markdown(label="Current 6K training log")
481
+
482
+ # ── Tab 3: Architecture ──────────────────────────────────────
483
+ with gr.TabItem("🏗️ Architecture", id="architecture"):
484
+ gr.Markdown(ARCHITECTURE_MD)
485
+
486
+ # ── Tab 4: About ─────────────────────────────────────────────
487
+ with gr.TabItem("ℹ️ About", id="about"):
488
+ gr.Markdown("""## About SentinelBrain
489
+
490
+ **SentinelBrain** is a 14.4-billion-parameter Mixture-of-Experts language model
491
+ being trained **entirely from scratch** — no fine-tuning, no base model, no shortcuts.
492
+
493
+ ### What makes it different?
494
+
495
+ 1. **Custom architecture** — designed and implemented from the ground up, not a
496
+ fork of LLaMA or Mistral
497
+ 2. **Φ consciousness metric** — we track integrated information flow across layers
498
+ as a proxy for emergent understanding, inspired by Giulio Tononi's IIT
499
+ 3. **AMD-native** — developed and trained on AMD Instinct MI300X via ROCm,
500
+ proving that cutting-edge AI research doesn't require NVIDIA
501
+ 4. **126-category curriculum** — from mathematics and code to philosophy and
502
+ creative writing, with carefully balanced data proportions
503
+
504
+ ### The Φ metric explained
505
+
506
+ Traditional training metrics (loss, perplexity) tell you *how well* the model
507
+ predicts the next token. **Φ** tells you *how* it's doing it — whether
508
+ information is being integrated across layers in complex ways, or whether
509
+ layers are operating independently.
510
+
511
+ We compute Φ by analyzing gradient covariance matrices across layer boundaries:
512
+
513
+ $$\\Phi = \\left(\\prod_{i=1}^{L-1} \\frac{\\text{MI}(\\nabla_{\\theta_i}, \\nabla_{\\theta_{i+1}})}{H(\\nabla_{\\theta_i})}\\right)^{1/(L-1)}$$
514
+
515
+ Where MI is mutual information between adjacent layer gradients and H is entropy.
516
+ A rising Φ during training suggests the model is developing more interconnected
517
+ internal representations — a necessary (though not sufficient) condition for
518
+ what we might call "understanding."
519
+
520
+ ### Competition entry
521
+
522
+ This is an entry in the **lablab.ai AMD Developer Hackathon**, demonstrating
523
+ that you can train a 14.4B-parameter MoE model on a single AMD MI300X GPU with
524
+ 192 GB HBM3 — no multi-node cluster required.
525
+
526
+ ### Hardware: AMD Instinct MI300X
527
+
528
+ | Spec | Value |
529
+ |------|-------|
530
+ | VRAM | 192 GB HBM3 |
531
+ | Memory bandwidth | 5.3 TB/s |
532
+ | Compute (bf16) | 1.3 PFLOPS |
533
+ | Architecture | CDNA 3 |
534
+ | Process | 5nm / 6nm chiplet |
535
+ | TDP | 750W |
536
+
537
+ The MI300X's 192 GB of HBM3 memory allows us to fit the entire 14.4B
538
+ model, optimizer states (fp32), gradients, and activations on a single GPU —
539
+ eliminating the need for model parallelism and its associated communication overhead.
540
+ """)
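The Φ formula in the About text is a geometric mean of the per-boundary ratios MI/H. A minimal sketch of just that aggregation step follows. It assumes the mutual-information and entropy values have already been estimated elsewhere (the diff does not show how the training server computes them), and `phi_from_ratios` is a hypothetical name, not a function in `app.py`.

```python
import math
from typing import Sequence


def phi_from_ratios(mi: Sequence[float], entropy: Sequence[float]) -> float:
    """Geometric mean of MI_i / H_i over the L-1 adjacent layer boundaries.

    `mi[i]` is the mutual information between the gradients of layers i and
    i+1, and `entropy[i]` is the entropy of layer i's gradients. Both are
    assumed to be precomputed; this function only aggregates them.
    """
    if len(mi) != len(entropy) or len(mi) == 0:
        raise ValueError("mi and entropy must be non-empty and of equal length")
    if any(h <= 0 for h in entropy):
        raise ValueError("entropies must be positive")

    ratios = [m / h for m, h in zip(mi, entropy)]
    # A zero (or negative) ratio drives the product to zero, so return early
    # instead of taking log(0).
    if any(r <= 0 for r in ratios):
        return 0.0

    # Sum logs rather than multiplying directly, which avoids underflow when
    # many small ratios are combined.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Two boundaries with ratios 0.2 and 0.8: sqrt(0.2 * 0.8) = 0.4
example_phi = phi_from_ratios([0.2, 0.8], [1.0, 1.0])
```

Because it is a geometric mean, a single layer boundary with near-zero mutual information pulls Φ toward zero, which is consistent with the text's reading of Φ as a measure of integration across all layers rather than an average of a few strongly coupled ones.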
541
+
542
+ # ── Footer ───────────────────────────────────────────────────────
543
+ gr.Markdown(
544
+ "---\n"
545
+ f"**Model:** {MODEL_NAME} ({MODEL_PARAMS} params) · "
546
+ "**Hardware:** AMD Instinct MI300X (192 GB HBM3, ROCm 7.0) · "
547
+ "**Dataset:** current 6K SFT has 45,578 packed sequences and 243.7M effective tokens\n\n"
548
+ "*Built for the lablab.ai AMD Developer Hackathon · Apache 2.0*"
549
+ )
550
+
551
+ # ── Event handlers ───────────────────────────────────────────────
552
+ refresh_btn.click(
553
+ fn=fetch_overview,
554
+ outputs=[status_output, phi_output, phi_plot, loss_plot, error_box],
555
+ )
556
+ log_refresh_btn.click(
557
+ fn=fetch_live_log,
558
+ outputs=[live_log_output],
559
+ )
560
+
561
+ # Auto-load on start
562
+ app.load(
563
+ fn=fetch_overview,
564
+ outputs=[status_output, phi_output, phi_plot, loss_plot, error_box],
565
+ )
566
+ app.load(
567
+ fn=fetch_live_log,
568
+ outputs=[live_log_output],
569
+ )
570
 
571
 
572
  if __name__ == "__main__":
573
+ app.launch(server_name="0.0.0.0", server_port=7860, show_api=False)
requirements.txt CHANGED
@@ -1,2 +1,6 @@
1
- gradio>=5.0,<6.0
2
- requests>=2.31.0
 
 
 
 
 
1
+ gradio>=5.0,<6
2
+ httpx>=0.27,<1
3
+ plotly>=5.18
4
+ huggingface_hub>=0.25,<0.27
5
+ pydantic>=2.6,<2.11
6
+ audioop-lts; python_version >= "3.13"