CLIWorks/spiderportal-v5


CLIWorks committed 22 days ago

Commit 6805d16 (verified)

1 Parent(s): 2d6303e

Upload README.md with huggingface_hub


Files changed (1)

README.md +14 -45


@@ -1,58 +1,27 @@
 # SpiderPortal v5
-Recurrent Depth Transformer with MLA attention, Engram conditional memory, and MoE.
 ## Architecture
-- **Dense**: 250M params — 2 prelude + 6 recurrent + 2 coda layers
-- **MoE**: 5.3B params — 32 experts, top-2 routing, 1 shared expert per layer
-- **MLA**: Multi-Latent Attention (DeepSeek-V2 style, 10.7x KV cache compression)
-- **Engram**: Conditional memory at layers 1,4 (hash-based ngram lookup + conv1d gate)
-- **LTI Injection** + **ACT Halting** + **LoRA Adapter**
-- 32k context (extendable to 256k via YaRN)
 ## Training
-### Dense (Phase 1)
-```bash
-env MICRO_BATCH=42 SEQ_LEN=2048 TARGET_TOKENS=12400000000 CKPT_EVERY=5000 \
-    python mythos-fineweb-dense.py
 ```
-### MoE (Phase 2, from dense checkpoint)
-```bash
-env MICRO_BATCH=28 SEQ_LEN=2048 TARGET_TOKENS=12400000000 CKPT_EVERY=5000 \
-    TRITON_COMPILE=1 DENSE_CKPT=checkpoints-dense/spiderportal-v5-dense-ep1-step5000.pt \
-    python mythos-fineweb-moe.py
 ```
-### MoE (from scratch)
-```bash
-env MICRO_BATCH=28 SEQ_LEN=2048 TARGET_TOKENS=12400000000 CKPT_EVERY=5000 \
-    TRITON_COMPILE=1 \
-    python mythos-fineweb-moe.py
 ```
-## VRAM Usage
-| Config | Batch | VRAM | Tok/s |
-|--------|:-----:|:----:|:-----:|
-| Dense bf16 | 44 | 48.7GB | 42K |
-| Dense MXFP8 | 42 | 46.6GB | 40K |
-| MoE bf16 + compile | 28 | 40.6GB | 27K |
 ## Dataset
-Tokenized FineWeb-Edu sample-10BT. Format: raw uint32 little-endian tokens.
-- `data/train_tokens.bin` — 7.7B tokens, 29GB
-- `data/metadata.json` — tokenization metadata
-## Requirements
-- Python 3.10+
-- PyTorch 2.x with CUDA 12.0+
-- `torchtitan` (for MoE routing/experts)
-- `torchao` (optional, for MXFP8)
-- `transformers`, `datasets`, `loguru`
-- `triton`, `numba` (for custom kernels)

 # SpiderPortal v5
+Recurrent Depth Transformer with MLA attention, Engram memory, and MoE.
 ## Architecture
+- Dense: 250M params — 2 prelude + 6 recurrent + 2 coda
+- MoE: 5.3B params — 32 experts, top-2, 1 shared expert/layer
+- MLA (DeepSeek-V2 style, 10.7x KV compression)
+- Engram memory @ layers 1,4
+- LTI + ACT + LoRA
 ## Training
+### Dense
 ```
+MICRO_BATCH=42 SEQ_LEN=2048 TARGET_TOKENS=12400000000 python mythos-fineweb-dense.py
 ```
+### MoE (from dense checkpoint)
+```
+MICRO_BATCH=28 SEQ_LEN=2048 TARGET_TOKENS=12400000000 TRITON_COMPILE=1 DENSE_CKPT=... python mythos-fineweb-moe.py
 ```
 ## Dataset
+Tokenized FineWeb-Edu sample-10BT — raw uint32 LE tokens
+- train_tokens.bin: 7.7B tokens, 29GB
+- metadata.json
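
The Dataset section describes `train_tokens.bin` as raw uint32 little-endian tokens. At 4 bytes per token, 7.7B tokens come to roughly 30.8 GB (about 28.7 GiB), which is consistent with the stated 29GB. A minimal sketch of how such a file could be read follows, assuming the file is a flat array with no header and that NumPy is available. The helper names `open_tokens` and `get_batch` are hypothetical and do not come from the repository's training scripts.

```python
import numpy as np


def open_tokens(path):
    """Memory-map a flat file of little-endian uint32 token ids (read-only).

    Assumption: the file has no header, only packed 4-byte token ids.
    """
    return np.memmap(path, dtype="<u4", mode="r")


def get_batch(tokens, batch_size, seq_len, rng):
    """Sample random windows and return (inputs, targets).

    Targets are the inputs shifted by one token (next-token prediction).
    This sampling scheme is illustrative, not the repository's loader.
    """
    # Highest valid start leaves room for seq_len inputs plus one target token.
    starts = rng.integers(0, len(tokens) - seq_len - 1, size=batch_size)
    x = np.stack([tokens[s : s + seq_len] for s in starts]).astype(np.int64)
    y = np.stack([tokens[s + 1 : s + seq_len + 1] for s in starts]).astype(np.int64)
    return x, y
```

Memory-mapping avoids loading the whole file into RAM; only the sampled windows are copied. A call such as `get_batch(open_tokens("data/train_tokens.bin"), 42, 2048, np.random.default_rng(0))` would mirror the `MICRO_BATCH=42 SEQ_LEN=2048` settings of the dense run, with the path taken from the earlier version of the README.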