| # SpiderPortal v5 | |
| Recurrent-depth transformer with Multi-head Latent Attention (MLA), Engram memory, and Mixture-of-Experts (MoE) layers. | |
| ## Architecture | |
| - Dense: 250M params, 2 prelude + 6 recurrent + 2 coda layers | |
| - MoE: 5.3B params, 32 experts, top-2 routing, 1 shared expert per layer | |
| - MLA (DeepSeek-V2 style, 10.7x KV compression) | |
| - Engram memory @ layers 1,4 | |
| - LTI + ACT + LoRA | |
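The MoE bullet above (top-2 routing plus one shared expert per layer) can be sketched roughly as follows. This is a minimal illustration, not the repo's implementation: `route_top_k`, `moe_forward`, the scalar inputs, and the renormalization of the selected gate weights are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-probability experts.

    Assumption: weights are renormalized so the selected k sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, router_logits, experts, shared_expert, k=2):
    """Add the always-on shared expert to the weighted top-k expert outputs."""
    out = shared_expert(x)
    for idx, weight in route_top_k(router_logits, k):
        out += weight * experts[idx](x)
    return out
```

Only `k` of the routed experts run per token, so compute per token stays close to the dense model while total parameter count grows with the number of experts.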
| ## Training | |
| ### Dense | |
| ``` | |
| MICRO_BATCH=42 SEQ_LEN=2048 TARGET_TOKENS=12400000000 python mythos-fineweb-dense.py | |
| ``` | |
| ### MoE (from dense checkpoint) | |
| ``` | |
| MICRO_BATCH=28 SEQ_LEN=2048 TARGET_TOKENS=12400000000 TRITON_COMPILE=1 DENSE_CKPT=... python mythos-fineweb-moe.py | |
| ``` | |
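As a rough sanity check on the commands above, the sketch below estimates how many steps each run needs to reach `TARGET_TOKENS`. It assumes one step consumes `MICRO_BATCH * SEQ_LEN` tokens on a single device; `grad_accum` and `world_size` are hypothetical knobs, not confirmed options of the training scripts.

```python
def tokens_per_step(micro_batch, seq_len, grad_accum=1, world_size=1):
    """Tokens consumed by one optimizer step (grad_accum/world_size are assumed knobs)."""
    return micro_batch * seq_len * grad_accum * world_size


def steps_for_target(target_tokens, per_step):
    """Number of steps needed to see at least target_tokens (ceiling division)."""
    return -(-target_tokens // per_step)


TARGET_TOKENS = 12_400_000_000

dense_per_step = tokens_per_step(micro_batch=42, seq_len=2048)  # 86,016 tokens
moe_per_step = tokens_per_step(micro_batch=28, seq_len=2048)    # 57,344 tokens

dense_steps = steps_for_target(TARGET_TOKENS, dense_per_step)   # 144,160 steps
moe_steps = steps_for_target(TARGET_TOKENS, moe_per_step)       # 216,239 steps

# Passes over the 7.7B-token train_tokens.bin implied by the target: about 1.6.
epochs = TARGET_TOKENS / 7_700_000_000
```

If the scripts accumulate gradients or run on several GPUs, multiply the per-step token count accordingly and the step counts shrink by the same factor.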
| ## Dataset | |
| Tokenized FineWeb-Edu sample-10BT, stored as raw little-endian uint32 tokens | |
| - train_tokens.bin: 7.7B tokens, 29GB | |
| - metadata.json | |
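Because the token file is a flat array of little-endian uint32 values, it can be inspected without any project code. The following is a dependency-free sketch; `count_tokens` and `read_window` are hypothetical helper names, and nothing here relies on the contents of `metadata.json`.

```python
import os
import struct

BYTES_PER_TOKEN = 4  # uint32


def count_tokens(path):
    """Number of tokens in a raw uint32 file, derived from its size on disk."""
    size = os.path.getsize(path)
    if size % BYTES_PER_TOKEN:
        raise ValueError(f"{path}: size {size} is not a multiple of {BYTES_PER_TOKEN}")
    return size // BYTES_PER_TOKEN


def read_window(path, start, length):
    """Read up to `length` token ids beginning at token index `start`."""
    with open(path, "rb") as f:
        f.seek(start * BYTES_PER_TOKEN)
        raw = f.read(length * BYTES_PER_TOKEN)
    n = len(raw) // BYTES_PER_TOKEN  # may be short near the end of the file
    # "<" forces little-endian regardless of the host byte order.
    return list(struct.unpack(f"<{n}I", raw[: n * BYTES_PER_TOKEN]))
```

For example, `read_window("train_tokens.bin", 0, 2048)` would return the first sequence-length window. At 4 bytes per token, 7.7B tokens come to roughly 30.8e9 bytes, or about 28.7 GiB, which is consistent with the 29GB listed above.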
| ## Current Training (1B MoE) | |
| Config: 16 experts | top-1 routing | intermediate=1024 | 6 layers | n_loops=1 | |
| Params: 997M (18% Engram / 82% MoE) | |
| VRAM: 43GB | Throughput: 40K tok/s | |
| ### Run | |