Commit 6805d16 (verified), committed by CLIWorks
1 Parent(s): 2d6303e

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +14 -45
README.md CHANGED
@@ -1,58 +1,27 @@
  # SpiderPortal v5

- Recurrent Depth Transformer with MLA attention, Engram conditional memory, and MoE.

  ## Architecture
-
- - **Dense**: 250M params — 2 prelude + 6 recurrent + 2 coda layers
- - **MoE**: 5.3B params — 32 experts, top-2 routing, 1 shared expert per layer
- - **MLA**: Multi-Latent Attention (DeepSeek-V2 style, 10.7x KV cache compression)
- - **Engram**: Conditional memory at layers 1,4 (hash-based ngram lookup + conv1d gate)
- - **LTI Injection** + **ACT Halting** + **LoRA Adapter**
- - 32k context (extendable to 256k via YaRN)

  ## Training

- ### Dense (Phase 1)
- ```bash
- env MICRO_BATCH=42 SEQ_LEN=2048 TARGET_TOKENS=12400000000 CKPT_EVERY=5000 \
- python mythos-fineweb-dense.py
  ```
-
- ### MoE (Phase 2, from dense checkpoint)
- ```bash
- env MICRO_BATCH=28 SEQ_LEN=2048 TARGET_TOKENS=12400000000 CKPT_EVERY=5000 \
- TRITON_COMPILE=1 DENSE_CKPT=checkpoints-dense/spiderportal-v5-dense-ep1-step5000.pt \
- python mythos-fineweb-moe.py
  ```

- ### MoE (from scratch)
- ```bash
- env MICRO_BATCH=28 SEQ_LEN=2048 TARGET_TOKENS=12400000000 CKPT_EVERY=5000 \
- TRITON_COMPILE=1 \
- python mythos-fineweb-moe.py
  ```
-
- ## VRAM Usage
-
- | Config | Batch | VRAM | Tok/s |
- |--------|:-----:|:----:|:-----:|
- | Dense bf16 | 44 | 48.7GB | 42K |
- | Dense MXFP8 | 42 | 46.6GB | 40K |
- | MoE bf16 + compile | 28 | 40.6GB | 27K |

  ## Dataset
-
- Tokenized FineWeb-Edu sample-10BT. Format: raw uint32 little-endian tokens.
-
- - `data/train_tokens.bin` — 7.7B tokens, 29GB
- - `data/metadata.json` — tokenization metadata
-
- ## Requirements
-
- - Python 3.10+
- - PyTorch 2.x with CUDA 12.0+
- - `torchtitan` (for MoE routing/experts)
- - `torchao` (optional, for MXFP8)
- - `transformers`, `datasets`, `loguru`
- - `triton`, `numba` (for custom kernels)
 
  # SpiderPortal v5

+ Recurrent Depth Transformer with MLA attention, Engram memory, and MoE.

  ## Architecture
+ - Dense: 250M params — 2 prelude + 6 recurrent + 2 coda
+ - MoE: 5.3B params — 32 experts, top-2, 1 shared expert/layer
+ - MLA (DeepSeek-V2 style, 10.7x KV compression)
+ - Engram memory @ layers 1,4
+ - LTI + ACT + LoRA

  ## Training

+ ### Dense
  ```
+ MICRO_BATCH=42 SEQ_LEN=2048 TARGET_TOKENS=12400000000 python mythos-fineweb-dense.py
  ```

+ ### MoE (from dense checkpoint)
+ ```
+ MICRO_BATCH=28 SEQ_LEN=2048 TARGET_TOKENS=12400000000 TRITON_COMPILE=1 DENSE_CKPT=... python mythos-fineweb-moe.py
  ```
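
As a rough sanity check on the training commands above, the following sketch estimates how many steps each run takes and how many passes over the token file `TARGET_TOKENS` implies. It assumes a single process with no gradient accumulation, so that one step consumes `MICRO_BATCH * SEQ_LEN` tokens; the actual scripts may batch differently. The helper `estimate_steps` is hypothetical and not part of the repository.

```python
def estimate_steps(micro_batch: int, seq_len: int, target_tokens: int) -> int:
    """Steps needed to consume target_tokens, assuming one micro-batch per step."""
    tokens_per_step = micro_batch * seq_len
    # Integer ceiling division avoids float rounding on large token counts.
    return -(-target_tokens // tokens_per_step)


TARGET_TOKENS = 12_400_000_000
DATASET_TOKENS = 7_700_000_000  # "7.7B tokens" from the Dataset section (rounded)

dense_steps = estimate_steps(42, 2048, TARGET_TOKENS)  # MICRO_BATCH=42
moe_steps = estimate_steps(28, 2048, TARGET_TOKENS)    # MICRO_BATCH=28
epochs = TARGET_TOKENS / DATASET_TOKENS                # passes over the file

print(f"dense: {dense_steps} steps, moe: {moe_steps} steps, ~{epochs:.2f} epochs")
```

Under these assumptions the dense run needs about 144k steps, the MoE run about 216k, and either run makes roughly 1.6 passes over the 7.7B-token file.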
 
 
 
 
 
 
 
 
23
 
24
  ## Dataset
25
+ Tokenized FineWeb-Edu sample-10BT — raw uint32 LE tokens
26
+ - train_tokens.bin: 7.7B tokens, 29GB
27
+ - metadata.json