Committed by CLIWorks
Commit 2d6303e · verified · 1 parent: c35e255

Upload README.md with huggingface_hub

Files changed (1): README.md +58 -0
README.md ADDED
@@ -0,0 +1,58 @@
+ # SpiderPortal v5
+
+ Recurrent Depth Transformer with MLA attention, Engram conditional memory, and MoE.
+
+ ## Architecture
+
+ - **Dense**: 250M params, arranged as 2 prelude + 6 recurrent + 2 coda layers
+ - **MoE**: 5.3B params, with 32 experts, top-2 routing, and 1 shared expert per layer
+ - **MLA**: Multi-head Latent Attention (DeepSeek-V2 style, 10.7x KV cache compression)
+ - **Engram**: Conditional memory at layers 1 and 4 (hash-based n-gram lookup + conv1d gate)
+ - **LTI Injection** + **ACT Halting** + **LoRA Adapter**
+ - 32k context (extendable to 256k via YaRN)
+
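To make the MoE bullet concrete, here is a minimal sketch of top-2 routing with one always-on shared expert, written in plain Python on scalar activations. It is an illustration only: the names `top2_route` and `moe_layer` are hypothetical, and renormalizing the softmax over just the two selected logits is an assumption. The repository takes its routing and experts from `torchtitan`, which may normalize or balance load differently.

```python
import math

def top2_route(router_logits):
    """Pick the two highest-scoring experts and softmax-normalize their logits.

    Assumption: weights are renormalized over the selected pair only.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top2 = order[:2]
    peak = max(router_logits[i] for i in top2)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

def moe_layer(x, router_logits, experts, shared_expert):
    """Shared expert output plus the weighted outputs of the two routed experts."""
    out = shared_expert(x)  # the shared expert sees every token
    for idx, weight in top2_route(router_logits):
        out += weight * experts[idx](x)
    return out
```

With 32 experts, each token therefore runs through three expert networks per MoE layer (two routed, one shared), which is why the active parameter count per token is much smaller than the 5.3B total.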
+ ## Training
+
+ ### Dense (Phase 1)
+ ```bash
+ env MICRO_BATCH=42 SEQ_LEN=2048 TARGET_TOKENS=12400000000 CKPT_EVERY=5000 \
+     python mythos-fineweb-dense.py
+ ```
+
+ ### MoE (Phase 2, from dense checkpoint)
+ ```bash
+ env MICRO_BATCH=28 SEQ_LEN=2048 TARGET_TOKENS=12400000000 CKPT_EVERY=5000 \
+     TRITON_COMPILE=1 DENSE_CKPT=checkpoints-dense/spiderportal-v5-dense-ep1-step5000.pt \
+     python mythos-fineweb-moe.py
+ ```
+
+ ### MoE (from scratch)
+ ```bash
+ env MICRO_BATCH=28 SEQ_LEN=2048 TARGET_TOKENS=12400000000 CKPT_EVERY=5000 \
+     TRITON_COMPILE=1 \
+     python mythos-fineweb-moe.py
+ ```
+
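The environment variables in the commands above imply a step budget. The sketch below works it out under two assumptions that the README does not state: a single GPU and no gradient accumulation, so one step consumes `MICRO_BATCH * SEQ_LEN` tokens. The helper `optimizer_steps` and its `grad_accum` and `world_size` parameters are hypothetical and are not taken from the training scripts.

```python
def optimizer_steps(target_tokens: int, micro_batch: int, seq_len: int,
                    grad_accum: int = 1, world_size: int = 1) -> int:
    """Steps needed to consume target_tokens, rounded up to a whole step."""
    tokens_per_step = micro_batch * seq_len * grad_accum * world_size
    return -(-target_tokens // tokens_per_step)  # integer ceiling division

TARGET_TOKENS = 12_400_000_000
dense_steps = optimizer_steps(TARGET_TOKENS, micro_batch=42, seq_len=2048)
moe_steps = optimizer_steps(TARGET_TOKENS, micro_batch=28, seq_len=2048)

# Number of checkpoints written with CKPT_EVERY=5000.
dense_ckpts = dense_steps // 5000
moe_ckpts = moe_steps // 5000
```

Under those assumptions the dense command runs 144,160 steps (28 checkpoints) and either MoE command runs 216,239 steps (43 checkpoints). With multiple GPUs or gradient accumulation, the step count shrinks proportionally.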
+ ## VRAM Usage
+
+ | Config | Batch | VRAM | Tok/s |
+ |--------|:-----:|:----:|:-----:|
+ | Dense bf16 | 44 | 48.7 GB | 42K |
+ | Dense MXFP8 | 42 | 46.6 GB | 40K |
+ | MoE bf16 + compile | 28 | 40.6 GB | 27K |
+
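The throughput column can be turned into a rough wall-clock estimate. The sketch below assumes that "Tok/s" is sustained training throughput, that "K" means thousands, and that the target is the 12.4B tokens set by `TARGET_TOKENS` in the training commands. The function name `hours_to_target` is hypothetical, and the estimate ignores checkpointing and data-loading stalls.

```python
def hours_to_target(target_tokens: float, tokens_per_second: float) -> float:
    """Wall-clock hours needed to process target_tokens at a steady rate."""
    return target_tokens / tokens_per_second / 3600

TARGET_TOKENS = 12_400_000_000

# Throughput figures copied from the table above (tokens per second).
throughput = {
    "Dense bf16": 42_000,
    "Dense MXFP8": 40_000,
    "MoE bf16 + compile": 27_000,
}

estimates = {name: hours_to_target(TARGET_TOKENS, tps)
             for name, tps in throughput.items()}

for name, hours in estimates.items():
    print(f"{name}: about {hours:.0f} h")
```

That comes to roughly 82 hours for dense bf16, 86 hours for dense MXFP8, and 128 hours for the MoE configuration.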
+ ## Dataset
+
+ Tokenized FineWeb-Edu sample-10BT. Format: raw uint32 little-endian tokens.
+
+ - `data/train_tokens.bin`: 7.7B tokens, 29 GB
+ - `data/metadata.json`: tokenization metadata
+
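Because the token file is a flat array of little-endian uint32 values, it can be memory-mapped rather than loaded whole. The sketch below shows one way to do that with NumPy; it is not the loader used by the training scripts, and the helper name `open_tokens` and the choice to drop a trailing partial sequence are assumptions. As a sanity check on the format, 7.7B tokens at 4 bytes each is about 30.8e9 bytes, or roughly 28.7 GiB, which is consistent with the listed file size.

```python
import numpy as np

def open_tokens(path: str, seq_len: int) -> np.ndarray:
    """Memory-map a raw uint32 little-endian token file as (n_seqs, seq_len).

    Tokens that do not fill a complete final sequence are dropped.
    """
    tokens = np.memmap(path, dtype="<u4", mode="r")  # read-only, no full load
    n_seqs = len(tokens) // seq_len
    return tokens[: n_seqs * seq_len].reshape(n_seqs, seq_len)
```

For example, `open_tokens("data/train_tokens.bin", 2048)` would return a read-only view whose rows match the `SEQ_LEN=2048` used in the training commands; only the rows actually indexed are paged in from disk.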
+ ## Requirements
+
+ - Python 3.10+
+ - PyTorch 2.x with CUDA 12.0+
+ - `torchtitan` (for MoE routing/experts)
+ - `torchao` (optional, for MXFP8)
+ - `transformers`, `datasets`, `loguru`
+ - `triton`, `numba` (for custom kernels)
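
A quick pre-flight check can confirm that the packages listed above are importable before a long run starts. This is a hedged sketch rather than part of the repository: the function names are hypothetical, and it assumes each package's import name matches the name listed (with PyTorch imported as `torch`). It does not verify the PyTorch or CUDA versions.

```python
import importlib.util
import sys

REQUIRED = ["torch", "torchtitan", "transformers", "datasets",
            "loguru", "triton", "numba"]
OPTIONAL = ["torchao"]  # only needed for the MXFP8 configuration

def find_missing(names):
    """Return the top-level module names that cannot be located."""
    return [n for n in names if importlib.util.find_spec(n) is None]

def check_environment():
    """Return (problems, notes): blocking issues and optional gaps."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required")
    problems += [f"missing required package: {n}" for n in find_missing(REQUIRED)]
    notes = [f"optional package not installed: {n}" for n in find_missing(OPTIONAL)]
    return problems, notes
```

Calling `check_environment()` and printing both lists before launching either training script gives an early failure instead of an import error partway through setup.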