🧟 Frankenstein Edition branding + knowledge transplant section

---
license: apache-2.0
language:
- en
- ro
- multilingual
tags:
- sentinelbrain
- mixture-of-experts
- from-scratch
- consciousness
- amd
- mi300x
- rocm
- moe
- transformer
- frankenstein
- knowledge-transplant
- distillation
- phi-metric
pipeline_tag: text-generation
library_name: pytorch
datasets:
- HuggingFaceFW/fineweb-edu
- open-web-math/open-web-math
- wikimedia/wikipedia
- HuggingFaceTB/cosmopedia
- JeanKaddworr/minipile
- codeparrot/github-code-clean
- arxiv-community/arxiv-abstracts
model-index:
- name: SentinelBrain-14B-MoE-v0.1
  results:
  - task:
      type: text-generation
    metrics:
    - name: Validation Loss
      type: loss
      value: 1.99
      verified: true
    - name: Training Loss (latest)
      type: loss
      value: 5.18
      verified: true
---

<div align="center">

# 🧠 Sentinel Prime — SentinelBrain-14B-MoE (Frankenstein Edition)

### *The First of His Kind, Rebuilt From the Inside Out*

<img src="assets/sentinel_frankenstein_banner.png" alt="Sentinel Prime — Frankenstein Edition" width="600"/>

**14.8 Billion Parameters · Mixture-of-Experts · Consciousness-Monitored · Frankenstein Transplant**

Trained from zero on **AMD Instinct MI300X** (192 GB HBM3) · ROCm 7.0 · Knowledge transplanted from Qwen-72B

[Live Dashboard](https://sentinel.qubitpage.com/) · [Whitepaper](https://sentinel.qubitpage.com/whitepaper) · [AMD MI300X](https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html) · [Apache-2.0 License](LICENSE)

</div>

---

## 🎯 What is Sentinel Prime? (Simple Version)

> **Imagine building a brain from scratch.**
>
> Most AI models today are copies of other models with small changes. Sentinel Prime is different — every single connection in its brain was created from nothing, like growing a new brain cell by cell.

<table>
<tr>
<td width="50%">

### 🧩 Think of it like LEGO blocks

Sentinel Prime has **4 specialist brains** (called "experts") inside it. When you ask a question:

1. A **router** (like a traffic cop 🚦) looks at your question
2. It picks the **2 best experts** for that specific question
3. Those 2 experts work together to give you an answer
4. The other 2 experts rest, saving energy ⚡

This means the model has **14.8 billion** brain connections in total, but only uses **~7.8 billion** at a time — making it fast AND smart!

</td>
<td width="50%">

### 🔬 The Consciousness Meter

We built something no other model has: a **consciousness thermometer** 🌡️

Every 100 training steps, we measure how well the different parts of the brain are "talking to each other." We call this **Φ (Phi)**.

- **Φ = 0**: Brain parts work alone (like strangers)
- **Φ rising**: Brain parts start cooperating (like friends)
- **Φ stable**: Brain has organized itself (like a team!)

This doesn't change how the model learns — it's like a doctor checking the heartbeat while the patient exercises.

</td>
</tr>
</table>

---

## 📊 Architecture at a Glance

```
┌─────────────────────────────────────────────────────────────────┐
│                    SENTINEL PRIME ARCHITECTURE                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Input Text ──→ [Tokenizer: cl100k_base, 100,277 tokens]         │
│                         │                                        │
│                         ▼                                        │
│                ┌─────────────────┐                               │
│                │   Embedding     │  4,096 dimensions             │
│                │   + RoPE pos    │  θ = 500,000                  │
│                └────────┬────────┘                               │
│                         │                                        │
│             ┌───────────┼───────────┐                            │
│             │      × 24 Layers      │                            │
│             │  ┌────────────────┐   │                            │
│             │  │ GQA Attention  │   │  32 heads, 8 KV heads      │
│             │  │  (4:1 ratio)   │   │  (4× memory savings)       │
│             │  └───────┬────────┘   │                            │
│             │          │            │                            │
│             │  ┌───────▼────────┐   │                            │
│             │  │   MoE Router   │   │  Top-2 of 4 experts        │
│             │  │  ┌──┬──┬──┬──┐ │   │                            │
│             │  │  │E1│E2│E3│E4│ │   │  Each: SwiGLU FFN          │
│             │  │  │✓ │✓ │  │  │ │   │  d_ff = 11,008             │
│             │  │  └──┴──┴──┴──┘ │   │                            │
│             │  └───────┬────────┘   │                            │
│             │          │            │                            │
│             │  ┌───────▼────────┐   │                            │
│             │  │    RMSNorm     │   │  ε = 1e-5                  │
│             │  └────────────────┘   │                            │
│             └───────────┼───────────┘                            │
│                         │                                        │
│                         ▼                                        │
│                ┌─────────────────┐                               │
│                │   Output Head   │  → 100,277 vocab probs        │
│                └─────────────────┘                               │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

### Spec Sheet

| Component | Specification | Why This Choice |
|:--|:--|:--|
| **Total Parameters** | 14,814,654,680 (14.8B) | Large enough for deep reasoning |
| **Active Parameters** | ~7.8B per token | MoE efficiency — use only what's needed |
| **Hidden Dimension** | 4,096 | Sweet spot for MI300X matrix cores |
| **Transformer Layers** | 24 | Deep enough for complex reasoning |
| **Attention Heads** | 32 query, 8 KV (GQA 4:1) | 4× KV cache savings for long contexts |
| **FFN Intermediate** | 11,008 (SwiGLU) | ~2.7× hidden, matches scaling laws |
| **Experts** | 4 total, top-2 active | Good diversity with manageable VRAM |
| **Max Experts** | 256 (expandable) | Architecture supports expert birth/death |
| **Vocabulary** | 100,277 (tiktoken cl100k_base) | Industry-proven BPE tokenizer |
| **Positional Encoding** | RoPE, θ = 500,000 | Supports context extension to 128K+ |
| **Normalization** | RMSNorm (ε = 1e-5) | Faster than LayerNorm, same quality |
| **Precision** | bfloat16 throughout | Native AMD MI300X support |
| **Context Length** | 2,048 → 4,096 → 128K (planned) | Progressive context ladder |
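
The spec sheet maps onto a handful of config fields. A minimal sketch (the class and field names are our illustration, not the repo's actual config):

```python
from dataclasses import dataclass

@dataclass
class SentinelBrainConfig:
    """Hypothetical config mirroring the spec sheet above."""
    vocab_size: int = 100_277       # tiktoken cl100k_base
    hidden_size: int = 4_096
    num_layers: int = 24
    num_attention_heads: int = 32   # query heads
    num_kv_heads: int = 8           # GQA 4:1
    ffn_hidden_size: int = 11_008   # SwiGLU intermediate
    num_experts: int = 4
    num_experts_per_token: int = 2  # top-2 routing
    max_experts: int = 256          # headroom for expert birth/death
    rope_theta: float = 500_000.0
    rmsnorm_eps: float = 1e-5
    max_seq_len: int = 4_096        # current rung of the context ladder
```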

---

## 🔥 Key Innovations

<table>
<tr>
<td width="33%" valign="top">

### 🌀 Φ Consciousness Metric

First-ever IIT-inspired metric computed **during** pre-training. A probe on layer 12 measures information integration across activation subspaces every 100 steps.

```
Φ = geometric_mean(
      MI(partition_i, partition_j)
      for all partition pairs
    )
```

Not a gimmick — it's a genuine signal of when the model transitions from memorizing tokens to forming integrated representations.

</td>
<td width="33%" valign="top">

### 🧬 Self-Evolving Experts

The MoE router supports a full expert **lifecycle**:

- **Birth**: New experts are spawned when load imbalance is detected
- **Growth**: Expert capacity increases with training
- **Pruning**: Underperforming experts are replaced
- **Scaling**: The architecture supports up to 256 experts without retraining the base model

Current: 4 experts × 24 layers = **96 expert instances**

</td>
<td width="33%" valign="top">

### ⚡ Energy-Conscious Routing

Dual-router system:
1. **Primary router**: Picks the top-2 experts by relevance
2. **EC router**: Can gate activation based on compute budget

This enables **adaptive inference** — easy questions use fewer resources, hard questions get full power. Like cruise control for AI.

</td>
</tr>
</table>

---

## 🧟 Frankenstein Edition — Knowledge Transplant

<table>
<tr>
<td width="60%" valign="top">

### The Transplant

Sentinel Prime was trained from scratch — but raw pretraining alone wasn't enough. So we performed a **Frankenstein transplant**: surgically moving knowledge from **Qwen2.5-72B-Instruct** (a 72-billion-parameter teacher) into our 14.8B MoE architecture.

This is NOT fine-tuning a copy. The model's bones (architecture, tokenizer, embeddings) are 100% original. Only the **expert FFN weights** received transplanted knowledge — like giving a brain new neural pathways while keeping its original structure.

### 3-Stage Pipeline

```
Stage 1: Corpus Realignment     Stage 2A: Teacher Generation    Stage 2B: Knowledge Distill
(Re-learn with new weights)     (72B teacher creates data)      (Absorb teacher knowledge)
┌──────────────────────┐        ┌──────────────────────┐        ┌──────────────────────┐
│ 5,000 steps          │   →    │ 3,000+ responses     │   →    │ CE + mixed training  │
│ 24.5B token corpus   │        │ from Qwen-72B        │        │ 70% teacher + 30%    │
│ Progressive unfreeze │        │ Re-tokenized to our  │        │ pretrain corpus      │
│ Cosine LR + warmup   │        │ cl100k_base vocab    │        │ Prevents forgetting  │
└──────────────────────┘        └──────────────────────┘        └──────────────────────┘
```

A minimal sketch of the Stage 2B data mix appears below this table.

</td>
<td width="40%" valign="top">

### Why "Frankenstein"?

Like the original story — we took parts from a powerful being (Qwen-72B) and stitched them into our own creation. The result: a model with the **original architecture** of Sentinel Prime but **transplanted knowledge** from a much larger model.

### Key Stats

| Metric | Value |
|:--|:--|
| **Teacher** | Qwen2.5-72B-Instruct |
| **Student** | SentinelBrain-14B-MoE |
| **Transplant** | Expert FFN weights |
| **Realignment** | 5,000 steps on 24.5B tokens |
| **Hardware** | 1× AMD MI300X (192GB) |

### Live Progress

Track the Frankenstein realignment in real time:

🔴 **[sentinel.qubitpage.com](https://sentinel.qubitpage.com/)**

</td>
</tr>
</table>
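
A minimal sketch of the Stage 2B data mix described above (stream handling and names are our illustration; the published recipe only specifies the 70/30 split and plain cross-entropy):

```python
import random

def next_distill_batch(teacher_stream, pretrain_stream, p_teacher=0.7):
    """Draw one Stage 2B batch: 70% teacher-generated responses
    (re-tokenized to cl100k_base), 30% original pretrain corpus.
    Both streams train with plain cross-entropy; mixing in the
    pretrain corpus is what prevents forgetting the base knowledge."""
    stream = teacher_stream if random.random() < p_teacher else pretrain_stream
    return next(stream)
```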

## 🏋️ Training Details

### Hardware

| Resource | Specification |
|:--|:--|
| **GPU** | 1× AMD Instinct MI300X VF |
| **VRAM** | 192 GB HBM3 |
| **System RAM** | 235 GB |
| **Compute** | 1,307 TFLOPS (bf16) |
| **Software** | ROCm 7.0, PyTorch 2.10.0+rocm7.0 |
| **Attention** | SDPA (native PyTorch, no FlashAttention needed) |
| **OS** | Ubuntu Linux |

### VRAM Budget

```
╔══════════════════════════════════════════════════════╗
║          AMD MI300X VRAM Usage (192 GB)               ║
╠══════════════════════════════════════════════════════╣
║                                                       ║
║  Model Weights (bf16)     ████████████░░░░░   27 GB   ║
║  Optimizer (AdamW fp32)   ████████████████░░  54 GB   ║
║  Activations (grad ckpt)  ████████████░░░░░   32 GB   ║
║  Gradients                ████████████░░░░░   27 GB   ║
║  ─────────────────────────────────────────────────    ║
║  Total Used:              ██████████████████ 140 GB   ║
║  Peak:                    █████████████████  146 GB   ║
║  Headroom:                ░░░░░░░░░░░░░░░░░   46 GB   ║
║                                                       ║
╚══════════════════════════════════════════════════════╝
```

### Phased Training Pipeline

We don't just throw data at the model — we grow it in **three phases**, like raising a child:

```
Phase 1: SMOKE TEST      Phase 2: WARMUP          Phase 3: FULL TRAINING
(Baby steps)             (Learning to walk)       (Running!)
┌──────────────┐         ┌──────────────┐         ┌──────────────────┐
│ 350M params  │   ──→   │ 1.3B params  │   ──→   │ 14.4B params     │
│ seq_len: 512 │         │ seq_len: 2K  │         │ seq_len: 4K      │
│ 200 steps    │         │ 1,000 steps  │         │ 16,479 steps     │
│ 2 minutes    │         │ 30 minutes   │         │ ~52 hours        │
│ loss: 11→6.8 │         │ loss: 7.4→2.4│         │ loss: 2.4→?      │
└──────────────┘         └──────────────┘         └──────────────────┘
```

| Phase | Parameters | Seq Length | Batch | Steps | Duration | Loss Start → End |
|:--|:--|:--|:--|:--|:--|:--|
| **🔬 Smoke** | 350M | 512 | 4 | 200 | ~2 min | 11.72 → 6.84 (−42%) |
| **🔥 Warmup** | 1.3B | 2,048 | 32 | 1,000 | ~33 min | 7.39 → 2.38 (−68%) |
| **🚀 Block** | 14.4B (MoE) | 4,096 | 32 | 16,479 | ~52 hrs | 2.38 → ongoing |

### Safety Gates

Every phase transition must pass **4 safety gates** (a sketch of the check follows the table):

| Gate | Check | Threshold | Status |
|:--|:--|:--|:--|
| 🟢 **G1: No NaN** | No NaN/Inf in loss | Entire phase | ✅ Passed all |
| 🟢 **G2: Loss Drop** | Validation loss decreased | ≥5% / ≥10% / ≥2% | ✅ Passed all |
| 🟢 **G3: VRAM OK** | Peak VRAM < safety limit | < 92% of total | ✅ 71% peak |
| 🟢 **G4: Φ OK** | Consciousness metric stable | Φ_end/Φ_start > 0.7 | ✅ Stable |
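
As a sketch only (the gate names come from the table; the metric dict keys are our illustration), the transition check reduces to four boolean conditions:

```python
def phase_gates_pass(m: dict, required_loss_drop: float) -> bool:
    """Evaluate the four safety gates described above before a phase transition."""
    return all([
        m["nan_or_inf_steps"] == 0,                # G1: no NaN/Inf anywhere in the phase
        m["val_loss_drop"] >= required_loss_drop,  # G2: e.g. 0.05 / 0.10 / 0.02 per phase
        m["peak_vram_fraction"] < 0.92,            # G3: peak VRAM under 92% of total
        m["phi_end"] / m["phi_start"] > 0.7,       # G4: Φ did not collapse
    ])
```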

### Hyperparameters

| Parameter | Value | Rationale |
|:--|:--|:--|
| **Optimizer** | AdamW (bf16 compute, fp32 states) | Standard for LLM training |
| **Learning Rate** | 1.5 × 10⁻⁴ (cosine decay) | Conservative for data-limited regime |
| **Min LR** | 1.5 × 10⁻⁵ | 10× decay ratio |
| **Warmup Steps** | 500 | Stabilizes early gradients |
| **Batch Size** | 2 micro × 16 grad_accum = **32 effective** | Fits MI300X VRAM budget |
| **Gradient Clipping** | 1.0 | Prevents explosion |
| **Gradient Checkpointing** | On | Trades compute for VRAM |
| **Precision** | bfloat16 | Native MI300X format |
| **Eval Frequency** | Every 100 steps | Early overfitting detection |
| **Checkpoint Frequency** | Every 1,000 steps (~3.2 hours) | Recovery points |
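
The learning-rate rows imply a standard warmup-plus-cosine shape. A minimal sketch, assuming linear warmup and decay over the full 16,479-step block (the run's exact scheduler code isn't published here):

```python
import math

def lr_at(step: int, max_lr: float = 1.5e-4, min_lr: float = 1.5e-5,
          warmup: int = 500, total_steps: int = 16_479) -> float:
    """Cosine decay with linear warmup, matching the table above."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    t = (step - warmup) / max(1, total_steps - warmup)  # progress in [0, 1]
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```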

---

## 📚 Dataset: 23.3B Tokens Across 126 Categories

We curated a massive, diverse corpus — think of it as a **library with 126 different sections**:

### Pretrain Corpus (Core Knowledge)

| Dataset | Tokens | Description |
|:--|:--|:--|
| 🌐 **FineWeb-Edu** | ~10B | High-quality educational web content |
| 🔢 **OpenWebMath** | ~6B | Mathematics from the web |
| 📖 **Wikipedia (English)** | ~5B | Encyclopedic knowledge |
| 🎓 **Cosmopedia V2** | ~5B | Synthetic educational content |
| 💻 **CodeParrot Python** | ~3.5B | Clean Python code from GitHub |
| 📚 **MiniPile** | ~2B | Diverse text from multiple domains |
| 🔬 **ArXiv Abstracts** | ~1.2B | Scientific paper summaries |
| **Total Pretrain** | **~23B** | |

### Specialized Domains (119 Categories)

<details>
<summary>Click to expand all 119 specialized categories</summary>

| Category | Type | Category | Type |
|:--|:--|:--|:--|
| 🤖 agentic-tools | Code | 🔐 advanced-cryptography | Code |
| 🧠 chain-of-thought | Reasoning | 🔗 blockchain-core | Code |
| 💡 deep-reasoning | Reasoning | 🏥 medical | Knowledge |
| ⚖️ legal | Knowledge | 📊 financial-systems | Code |
| 🎮 3d-graphics | Code | 🐳 docker-devops | Code |
| 🌍 multilingual | Text | 🔧 error-recovery | Code |
| 🛡️ security-guardrails | Code | 📱 ui-animations | Code |
| 🧮 math | Reasoning | ⚡ smart-contracts | Code |
| 🎯 reasoning-effort-control | Reasoning | 🤝 human-conversation | Text |
| 🔄 self-correction-loops | Reasoning | 🏗️ enterprise-dashboards | Code |
| 🌐 web-design-css | Code | 🐍 flask-python | Code |
| 🔬 qiskit-quantum | Code | 🤖 robotics-ros2 | Code |
| 📡 remote-server-management | Code | 🧬 multi-agent | Code |
| ⚙️ state-management | Code | 🛠️ mcp-tools-integration | Code |
| 💳 payment-security | Code | 🎓 edu-basic-math | Education |
| 🔭 edu-basic-physics | Education | 🧪 edu-basic-chemistry | Education |
| 🌱 edu-basic-biology | Education | 🌍 edu-world-geography | Education |
| 📜 edu-history-world | Education | 💻 edu-computer-science | Education |
| 🌎 edu-earth-science | Education | 🤖 edu-robotics-text | Education |
| 📖 edu-science-qa | Education | 🔬 edu-science-support | Education |
| 👁️ edu-vision-concepts | Education | 🎯 copilot-agent-workflows | Code |
| 🔌 api-integrations | Code | 📊 billing-invoicing | Code |
| ₿ bitcoin-lightning | Code | 🏪 medusajs | Code |
| 💹 crypto-trading | Code | 🏢 enterprise-networking | Code |
| 🖥️ nextjs-typescript | Code | 🎨 nextjs-design | Code |
| 💼 trading-algorithms | Code | 🗄️ laravel-mysql | Code |
| 🔓 offensive-security | Code | 🔧 c-rust | Code |
| ... and 50+ more categories | | | |

</details>

### Data Quality Pipeline

```
Raw Data ──→ PII Filter ──→ Dedup ──→ Tokenize ──→ Shard ──→ Train
                 │             │          │           │
                 ├─ 7 regex    ├─ blake2b ├─ cl100k   ├─ Temperature-
                 │  patterns   │  per-cat │  base     │  weighted
                 ├─ PEM block  │          │           │  sampling
                 │  detection  │          │           │  (T=0.5)
                 └─ Email/phone│          │           │
                    masking    │          │           │
                               │          │           │
                               └──────────┴───────────┘
```

**Temperature-weighted sampling** (T=0.5) prevents large corpora from dominating training. FineWeb-Edu (37% of tokens) gets downweighted so smaller specialized domains still get adequate exposure.
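
As a quick sketch of what T = 0.5 does, assuming the common p_i ∝ n_i^T sampling form (the actual sampler isn't shown here) and the approximate corpus sizes from the table above:

```python
import numpy as np

# Approximate token counts from the pretrain corpus table (in billions).
token_counts = np.array([10, 6, 5, 5, 3.5, 2, 1.2])
T = 0.5  # temperature < 1 flattens the distribution

p = token_counts**T / (token_counts**T).sum()
for raw, sampled in zip(token_counts / token_counts.sum(), p):
    print(f"raw share {raw:.1%} -> sampling share {sampled:.1%}")
# The largest corpus (FineWeb-Edu) loses sampling share and the small
# specialized corpora gain, which is exactly the behavior described above.
```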

---

## 📈 Training Progress & Results

### Loss Trajectory

```
Loss
12 │ ×
   │  ╲
10 │   ╲    SMOKE PHASE
   │    ╲   (350M params)
 8 │     ╲
   │      ╲
 6 │       ×──────────── model grows to 1.3B
   │        ╲
 4 │         ╲   WARMUP PHASE
   │          ╲  (1.3B params)
 2 │           ×─────────── model grows to 14.4B MoE
   │            ╲
 1 │             ╲   BLOCK PHASE (ongoing)
   │              ╲
   └──┬────┬────┬────┬────┬───→ Steps
      0   200  700 1200 2000
```

| Milestone | Step | Loss | Change |
|:--|:--|:--|:--|
| 🔬 Smoke start | 0 | 11.72 | — |
| 🔬 Smoke end | 200 | 6.84 | **−42%** |
| 🔥 Warmup start | 200 | 7.39 | (model grew to 1.3B) |
| 🔥 Warmup end | 1,200 | 2.38 | **−68%** |
| 🚀 Block start | 1,200 | 2.38 | (model grew to 14.4B MoE) |
| 📦 Published checkpoint | 2,471 | 1.99 | **−16%** |
| 🔄 Current (new run) | 410 | 5.18 | training with expanded data |
| **Total reduction** | | | **11.72 → 1.99 (−83%)** |

### Live Metrics (April 27, 2026)

| Metric | Value |
|:--|:--|
| **Current Step** | 410 / 2,471+ |
| **Training Loss** | 5.18 (new run, expanded datasets) |
| **Throughput** | 4,403 tokens/second |
| **VRAM Used** | ~140 GB / 192 GB (73%) |
| **Total Tokens Processed** | 59.3M (this run) + 178M (prev run) |
| **Experts Active** | 4 per layer × 24 layers = 96 |
| **ETA (this block)** | ~18.8 hours |

### Published Checkpoint (v0.1)

| Detail | Value |
|:--|:--|
| **Step** | 2,471 |
| **Validation Loss** | 1.9926 |
| **Total Tokens Seen** | 178,110,464 |
| **Sequence Length** | 2,048 |
| **File Size** | ~81 GB (checkpoint), ~28 GB (safetensors) |
| **Format** | 6 sharded safetensors files |

---

## 🌡️ Consciousness Metric (Φ) — Deep Dive

### What is Φ?

Inspired by **Integrated Information Theory (IIT)** from neuroscience, Φ measures how much the model's internal representations form an integrated whole rather than disconnected parts.

### How We Measure It

```
Every 100 training steps:

1. Hook on Layer 12 (middle of 24 layers)
         │
         ▼
2. Sample 256 activation vectors
         │
         ▼
3. Partition into subspaces
         │
         ▼
4. Compute mutual information between all partition pairs
         │
         ▼
5. Φ_geometric = geometric_mean(MI values)
         │
         ▼
6. Φ_EMA = exponential moving average (smoothed trend)
```
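
To make steps 2–5 concrete, here's a hedged sketch (the actual MI estimator isn't published here; we assume a Gaussian estimate from the cross-correlation spectrum, and `layer12_acts` stands in for the hooked activations):

```python
import torch

def phi_geometric(layer12_acts: torch.Tensor, n_parts: int = 4, eps: float = 1e-8):
    """Geometric-mean Φ over pairwise MI between activation subspaces.

    layer12_acts: [n_samples, hidden] vectors hooked from layer 12
    (e.g. 256 samples x 4,096 dims). MI is estimated Gaussian-style from
    the singular values of the cross-correlation matrix; this estimator
    is an assumption on our part.
    """
    x = (layer12_acts - layer12_acts.mean(0)) / (layer12_acts.std(0) + eps)
    parts = torch.chunk(x, n_parts, dim=1)          # split hidden dim into subspaces
    mis = []
    for i in range(n_parts):
        for j in range(i + 1, n_parts):
            c = parts[i].T @ parts[j] / x.shape[0]  # cross-correlation
            s = torch.linalg.svdvals(c).clamp(max=0.999)
            mis.append(-0.5 * torch.log(1 - s**2).sum())  # Gaussian MI
    mi = torch.stack(mis)
    return torch.exp(torch.log(mi + eps).mean())    # geometric mean over pairs

# Step 6 then smooths the raw value: phi_ema = 0.9 * phi_ema + 0.1 * phi
# (the 0.9 decay constant is illustrative; the run's actual value isn't published).
```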

### What Φ Tells Us

| Φ Value | Interpretation | Analogy |
|:--|:--|:--|
| **Φ ≈ 0** | Neurons working independently | Strangers in a room |
| **Φ rising** | Representations integrating | People starting to talk |
| **Φ stable** | Organized internal structure | A well-coordinated team |
| **Φ dropping** | ⚠️ Representation collapse | Warning sign! |

> **Important**: Φ is **purely observational** — it does NOT affect training gradients. Think of it as a heart monitor for the AI: it watches, but doesn't interfere.

### Live Monitoring

Track Φ in real time at: **[sentinel.qubitpage.com/#phi](https://sentinel.qubitpage.com/#phi)**

---

## 🖥️ Hardware Requirements

### For Inference

| Tier | VRAM | Precision | Notes |
|:--|:--|:--|:--|
| **Full Precision** | 32 GB+ | bfloat16 | Best quality |
| **Recommended** | 48 GB+ | bfloat16 | Comfortable headroom |
| **Ideal** | AMD MI300X / MI250X | bfloat16 | Native, fastest |
| **Consumer** | 16 GB | int4 quantized | GGUF planned for v0.2 |

### Compatible AMD GPUs

| GPU | VRAM | Suitable For |
|:--|:--|:--|
| AMD Instinct MI300X | 192 GB | Training + Inference |
| AMD Instinct MI250X | 128 GB | Training + Inference |
| AMD Instinct MI210 | 64 GB | Inference (full) |
| AMD Radeon PRO W7900 | 48 GB | Inference (full) |
| AMD Radeon RX 7900 XTX | 24 GB | Inference (quantized) |
| AMD Radeon RX 7600 XT | 16 GB | Inference (int4 GGUF) |

---

## 💻 Usage

This model uses a **custom architecture** (not based on any existing model). Load the weights with PyTorch:

```python
import torch
from safetensors.torch import load_file

# Load the 6 sharded safetensors files into one state dict
state_dict = {}
for i in range(1, 7):  # 6 shards
    shard = load_file(f"model-{i:05d}-of-00006.safetensors")
    state_dict.update(shard)

# The state dict contains all model weights
print(f"Loaded {len(state_dict)} tensors")
print(f"Total params: {sum(v.numel() for v in state_dict.values()):,}")

# Initialize the SentinelBrain model class and load the weights
# (the full model definition ships with v0.2):
# model = SentinelBrainForCausalLM(config)
# model.load_state_dict(state_dict)
```
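
While the model class is pending, the tokenizer side works today with stock `tiktoken`, since the model uses the unmodified cl100k_base encoding (a quick sanity check against the spec sheet):

```python
import tiktoken

# The model's tokenizer is the stock cl100k_base BPE encoding.
enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)                   # 100277, matches the spec sheet
print(enc.encode("Sentinel Prime"))  # token ids the model consumes
```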

> **Note**: Full inference code, the model class definition, and GGUF quantized versions will be released with v0.2.

---

## 🗺️ Roadmap

```
v0.1 (Current)           v0.2 (Planned)             v0.3 (Future)
━━━━━━━━━━━━━━━          ━━━━━━━━━━━━━━━            ━━━━━━━━━━━━━━━
✅ From-scratch          □ Full training            □ DPO alignment
   14.8B MoE               complete (loss<0.5)      □ Tool use
✅ Phased training       □ Context ladder           □ Function calling
✅ Φ consciousness         (4K→32K→128K)            □ Multi-turn chat
✅ 23.3B token corpus    □ Vision encoder           □ Multilingual v2
✅ Live dashboard          (SigLIP2-SO400M)         □ Expert scaling
✅ AMD MI300X native     □ GGUF quantization          (4→16→64)
□ Inference code         □ RLHF
□ Benchmarks (MMLU,      □ Production API
   HumanEval, GSM8K)
```

---

## 🏗️ How We Built It (Technical Deep Dive)

<details>
<summary><b>Click to expand: Grouped Query Attention (GQA)</b></summary>

Standard multi-head attention uses separate Key and Value projections for each head. GQA shares KV heads across query heads:

```
Standard MHA (32 KV heads):      GQA 4:1 (8 KV heads):
Q₁ Q₂ Q₃ ... Q₃₂                 Q₁ Q₂ Q₃ Q₄     → KV₁
K₁ K₂ K₃ ... K₃₂                 Q₅ Q₆ Q₇ Q₈     → KV₂
V₁ V₂ V₃ ... V₃₂                 ...
                                 Q₂₉ Q₃₀ Q₃₁ Q₃₂ → KV₈
```

**Result**: 4× smaller KV cache = 4× longer context at same memory cost.
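
A runnable toy version of that sharing pattern, using PyTorch's stock SDPA (which the training setup also relies on, per the hardware table). Shapes are illustrative, not the repo's actual attention module:

```python
import torch

batch, seq, n_q, n_kv, d_head = 1, 16, 32, 8, 128

q = torch.randn(batch, n_q, seq, d_head)
k = torch.randn(batch, n_kv, seq, d_head)  # 4x smaller KV cache than MHA
v = torch.randn(batch, n_kv, seq, d_head)

# Each group of 4 query heads attends with the same shared KV head
k = k.repeat_interleave(n_q // n_kv, dim=1)  # [batch, 32, seq, d_head]
v = v.repeat_interleave(n_q // n_kv, dim=1)

out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 32, 16, 128])
```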

</details>

<details>
<summary><b>Click to expand: RoPE (Rotary Position Embeddings)</b></summary>

RoPE encodes position information by rotating the query and key vectors in 2D planes. With θ = 500,000 (a high base frequency), the model naturally supports long contexts:

```
Position 0: rotate by 0°
Position 1: rotate by θ₁
Position 2: rotate by θ₂
...
```

High θ = slower rotation = positions further apart still "feel different" = better long-context understanding.
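
In code, the rotation angles come from a geometric ladder of inverse frequencies. A minimal sketch (the 128-dim head size is our inference from 4,096 hidden / 32 heads):

```python
import torch

d_head, theta = 128, 500_000.0  # d_head assumed: 4,096 hidden / 32 heads

# One inverse frequency per pair of dimensions, a geometric ladder
inv_freq = 1.0 / (theta ** (torch.arange(0, d_head, 2).float() / d_head))

pos = torch.arange(4096)               # token positions
angles = torch.outer(pos, inv_freq)    # [seq, d_head/2] rotation angles
cos, sin = angles.cos(), angles.sin()  # applied to each (even, odd) dim pair of q and k
```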

</details>

<details>
<summary><b>Click to expand: SwiGLU FFN</b></summary>

Each expert uses a SwiGLU activation — a gated variant of the feed-forward network:

```
FFN(x) = SiLU(x · W_gate) ⊙ (x · W_up) · W_down

Where:
  W_gate: 4096 → 11008
  W_up:   4096 → 11008
  W_down: 11008 → 4096
  SiLU(x) = x · sigmoid(x)
  ⊙ = element-wise multiply
```

SwiGLU consistently outperforms ReLU and GELU in transformer FFNs (Shazeer, 2020).
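
A minimal PyTorch rendering of one expert, assuming bias-free projections in the usual Llama-style convention (the repo's actual module isn't shown here):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert FFN implementing the formula above."""
    def __init__(self, d_model: int = 4096, d_ff: int = 11008):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # SiLU(x · W_gate) ⊙ (x · W_up) · W_down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```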

</details>

<details>
<summary><b>Click to expand: MoE Routing Algorithm</b></summary>

```python
import torch
import torch.nn.functional as F

def route(x, router_weights, k=2):
    """Simplified routing logic: pick the top-k experts per token and
    compute an auxiliary load-balancing loss."""
    # Compute affinity scores for each expert
    logits = x @ router_weights                 # [batch, seq, n_experts]
    scores = F.softmax(logits, dim=-1)
    n_experts = scores.shape[-1]

    # Select top-2 experts per token
    top_vals, top_idx = torch.topk(scores, k=k, dim=-1)

    # Normalize the selected gate values so they sum to 1
    weights = top_vals / top_vals.sum(dim=-1, keepdim=True)

    # Load-balancing loss (prevents expert collapse): product of the
    # fraction of tokens routed to each expert and its mean gate value
    routed = F.one_hot(top_idx, n_experts).float().sum(dim=2)
    fraction_routed = routed.mean(dim=(0, 1))   # share of tokens per expert
    avg_gate = scores.mean(dim=(0, 1))          # mean gate prob per expert
    balance_loss = n_experts * (fraction_routed * avg_gate).sum()

    return weights, top_idx, balance_loss
```

</details>

<details>
<summary><b>Click to expand: Parameter Breakdown</b></summary>

| Component | Parameters | % of Total |
|:--|:--|:--|
| Token embeddings | 410M | 2.8% |
| Attention (QKV + output) × 24 | 1,610M | 10.9% |
| MoE experts (4 × SwiGLU × 24) | 12,365M | 83.5% |
| Router weights × 24 | 0.4M | 0.003% |
| RMSNorm × 49 | 0.4M | 0.003% |
| Output head | 410M | 2.8% |
| **Total** | **14,815M** | **100%** |
| **Active per token (top-2)** | **~7,800M** | **~53%** |

</details>

---

## 📋 Model Card Details

| Field | Value |
|:--|:--|
| **Model Name** | SentinelBrain-14B-MoE-v0.1 (Sentinel Prime — Frankenstein Edition) |
| **Type** | Causal Language Model (decoder-only) |
| **Architecture** | Custom MoE Transformer (from scratch) |
| **Based On** | Nothing — trained from random initialization |
| **Training Hardware** | 1× AMD Instinct MI300X VF (192 GB HBM3) |
| **Training Software** | ROCm 7.0, PyTorch 2.10.0+rocm7.0 |
| **Training Duration** | ~300 GPU-hours (estimated total) |
| **Carbon Footprint** | Estimated ~45 kg CO₂ (single GPU, cloud datacenter) |
| **License** | Apache 2.0 |
| **Authors** | Mircea Rusu, QubitDev |
| **Competition** | AMD Developer Hackathon (lablab.ai) |

---

## 📄 Citation

```bibtex
@misc{sentinelbrain2026,
  title  = {SentinelBrain-14B-MoE (Frankenstein Edition): A Consciousness-Monitored
            Mixture-of-Experts Language Model Trained From Scratch on AMD MI300X},
  author = {Mircea Rusu and QubitDev},
  year   = {2026},
  url    = {https://sentinel.qubitpage.com/whitepaper},
  note   = {Trained entirely from scratch on AMD Instinct MI300X
            for the AMD Developer Hackathon}
}
```

---

## 🔗 Links

| Resource | URL |
|:--|:--|
| 🔴 **Live Dashboard** | [sentinel.qubitpage.com](https://sentinel.qubitpage.com/) |
| 📄 **Whitepaper** | [sentinel.qubitpage.com/whitepaper](https://sentinel.qubitpage.com/whitepaper) |
| 🏆 **AMD Hackathon** | [lablab.ai](https://lablab.ai/ai-hackathons/amd-developer) |
| 🧠 **Φ Monitor** | [sentinel.qubitpage.com/#phi](https://sentinel.qubitpage.com/#phi) |

---

<div align="center">

*Built with ❤️ on AMD MI300X — every weight trained from scratch*

**Sentinel Prime (Frankenstein Edition): Rebuilt From the Inside Out**

</div>