rtferraz committed on
Commit c382430 · verified · 1 Parent(s): 5ef6515

Add analysis document

Files changed (1)
  1. ANALYSIS.md +141 -0
ANALYSIS.md ADDED
@@ -0,0 +1,141 @@
+ # Parameter Golf: Competitive Analysis & Implementation Plan
+
+ ## Executive Summary
+
+ Your original BitNet b1.58 submission has strong fundamentals (depth recurrence, Muon optimizer, sliding window eval) but is missing **5 critical techniques** used by the top entries. I've built an improved script incorporating all of them. Here's the path to the top of the leaderboard.
+
+ ## Gap Analysis: Your Submission vs. Top (1.0810 BPB)
+
+ | Technique | Your Code | Top Entries | Expected BPB Impact |
+ |---|---|---|---|
+ | **Tokenizer** | SP1024 (1024 vocab) | SP8192 (8192 vocab) | **~5-8% BPB improvement** |
+ | **Quantization** | Ternary QAT (BitNet) | Int6 GPTQ + SDClip | **~2-3% better quality** |
+ | **Architecture** | Serial residual | Parallel residual (PAF) | **~1-2% BPB** |
+ | **TTT** | None | Score-first TTT at eval | **~1-3% BPB** |
+ | **QK-Gain** | 1.5 | 5.0-5.25 | **~0.5-1%** |
+ | **Weight Decay** | 0.02 | 0.09 | **~0.3-0.5%** |
+ | **Warmdown** | 1200 steps | 3500 steps | **~0.2-0.5%** |
+ | **EMA** | None | EMA (decay 0.999) | **~0.3-0.5%** |
+ | **Depth Recurrence** | 4 unique × 6 loops ✅ | 3 unique × 8 loops | Similar |
+ | **Residual mixing** | x0 anchor ✅ | x0 anchor ✅ | Same |
+
+ **Estimated total improvement: your ~1.15-1.20 BPB → 1.08-1.10 BPB**
+
+ ## Key Technique Explanations
+
+ ### 1. SP8192 Vocabulary (Biggest Single Win)
+
+ The BPB metric is **bits per byte**, not bits per token. With a 1024-token vocabulary, each token covers fewer characters, so the model needs more tokens to represent the same text. With 8192 tokens, each token covers more text on average, and the model gets "credit" for compressing more bytes per correct prediction.
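+
+ To make the BPB arithmetic concrete, here is a minimal sketch of how bits per byte falls out of the summed token-level loss. The token counts and per-token losses below are hypothetical, chosen only to illustrate that a larger vocabulary can win on BPB even when its loss per token is higher:
+
+ ```python
+ import math
+
+ def bits_per_byte(sum_nll_nats: float, n_bytes: int) -> float:
+     """Convert a summed next-token NLL (in nats) into bits per byte."""
+     return sum_nll_nats / (math.log(2) * n_bytes)
+
+ # Same hypothetical 1,000-byte chunk of text:
+ # SP1024 splits it into ~330 tokens, SP8192 into ~220 tokens.
+ print(bits_per_byte(330 * 2.6, 1000))  # ~1.24 BPB
+ print(bits_per_byte(220 * 3.4, 1000))  # ~1.08 BPB, despite higher loss per token
+ ```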
+
+ The top entries all use SP8192 because:
+ - 8× more vocab = better text coverage = lower BPB
+ - The embedding table (8192 × 768 = 6.3M params) still fits in budget with int6/int8 quantization
+ - It is domain-tuned on FineWeb data for optimal subword splits
+
+ **Action**: Use the SP8192 tokenizer and matching data shards provided in the competition repo.
+
+ ### 2. Int6 GPTQ + SDClip (vs. Your BitNet Ternary)
+
+ Your ternary QAT approach is creative but suboptimal here because:
+ - **Ternary ≈ 1.58 bits/param**, while int6 = 6 bits/param → each parameter carries roughly 4× more information
+ - In a 16 MB budget, int6 fits ~29M params vs. ~64M params for ternary
+ - BUT: 29M params at int6 quality > 64M params at ternary quality for language modeling
+ - The effective "information per parameter" is much higher with int6
+
+ **SDClip** (std-based clipping): before quantizing, clip each row's values to `mean ± 2.5*std`. This removes outliers that would otherwise dominate the quantization grid, dramatically reducing quantization error.
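+
+ A minimal sketch of that clipping step, paired here with plain round-to-nearest int6 for illustration (the actual pipeline applies GPTQ on top of the clipped weights; the function names are mine):
+
+ ```python
+ import torch
+
+ def sdclip(w: torch.Tensor, n_std: float = 2.5) -> torch.Tensor:
+     """Clip each output row of a weight matrix to mean ± n_std * std."""
+     mean = w.mean(dim=1, keepdim=True)
+     std = w.std(dim=1, keepdim=True)
+     return w.clamp(min=mean - n_std * std, max=mean + n_std * std)
+
+ def quantize_int6_rowwise(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
+     """Symmetric per-row int6 quantization (levels in [-31, 31]) of a pre-clipped weight."""
+     w = sdclip(w)
+     scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 31.0
+     q = torch.round(w / scale).clamp(-31, 31).to(torch.int8)
+     return q, scale  # dequantize with q.float() * scale
+ ```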
+
+ ### 3. Parallel Residuals (PAF Architecture)
+
+ Standard transformer block (serial):
+ ```python
+ x = x + attn(norm1(x))  # step 1
+ x = x + mlp(norm2(x))   # step 2 (uses the updated x)
+ ```
+
+ Parallel (GPT-J/PaLM style):
+ ```python
+ h = norm(x)               # single norm
+ x = x + attn(h) + mlp(h)  # both branches use the same input
+ ```
+
+ Benefits: saves one norm (fewer params), both branches see the same input (wider information flow), and it is empirically ~1-2% better BPB at small scale.
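+
+ For reference, a self-contained sketch of such a parallel block; a stock nn.MultiheadAttention stands in for the submission's GQA attention, so treat this as illustrative rather than the actual layer:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class ParallelBlock(nn.Module):
+     """PAF block: one shared norm feeds the attention and MLP branches in parallel."""
+
+     def __init__(self, dim: int = 768, n_heads: int = 12, mlp_mult: int = 4):
+         super().__init__()
+         self.norm = nn.LayerNorm(dim)
+         self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
+         self.mlp = nn.Sequential(
+             nn.Linear(dim, mlp_mult * dim), nn.GELU(), nn.Linear(mlp_mult * dim, dim)
+         )
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         h = self.norm(x)  # single norm shared by both branches
+         causal = torch.triu(
+             torch.ones(x.size(1), x.size(1), dtype=torch.bool, device=x.device), diagonal=1
+         )
+         a, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
+         return x + a + self.mlp(h)  # both branch outputs added to the same residual
+
+ # Shape check: ParallelBlock()(torch.randn(2, 16, 768)).shape == (2, 16, 768)
+ ```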
+
+ ### 4. Score-First TTT (Test-Time Training)
+
+ At evaluation time only (free compute!):
+ 1. Process tokens in chunks of 64
+ 2. For each chunk: **score first** (compute the loss with the current weights)
+ 3. Then **update**: take a gradient step on the MLP.proj weights using the reconstruction loss
+ 4. The next chunk benefits from the updated weights
+
+ This is "legal" because it is strictly causal: predictions for chunk i depend only on chunks 0..i-1. The competition allows arbitrary test-time compute.
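+
+ A minimal sketch of the score-then-update loop. The chunk-local forward pass and the plain cross-entropy inner loss (the actual script's reconstruction loss may differ) are simplifications here:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def evaluate_with_ttt(model, token_ids: torch.Tensor, chunk: int = 64, ttt_lr: float = 0.01):
+     """Score each chunk with the current weights, then adapt on it before the next chunk."""
+     ttt_params = [p for n, p in model.named_parameters() if "mlp.proj" in n]
+     opt = torch.optim.SGD(ttt_params, lr=ttt_lr)
+     total_nll, total_tokens = 0.0, 0
+
+     for start in range(0, token_ids.size(1) - 1, chunk):
+         end = min(start + chunk, token_ids.size(1) - 1)
+         inp, tgt = token_ids[:, start:end], token_ids[:, start + 1:end + 1]
+
+         # 1) Score first: this chunk's loss only sees weights updated on earlier chunks.
+         with torch.no_grad():
+             logits = model(inp)
+             total_nll += F.cross_entropy(logits.flatten(0, 1), tgt.flatten(), reduction="sum").item()
+             total_tokens += tgt.numel()
+
+         # 2) Then update: one inner-loop gradient step on the chunk just scored.
+         loss = F.cross_entropy(model(inp).flatten(0, 1), tgt.flatten())
+         loss.backward()
+         opt.step()
+         opt.zero_grad()
+
+     return total_nll / total_tokens  # mean NLL in nats per token
+ ```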
+
+ ### 5. Higher QK-Gain (5.25 vs. 1.5)
+
+ At small model dimensions (768), the QK dot products are too small to produce sharp attention patterns. QK-Gain multiplies the queries by a learned scalar, effectively controlling the "temperature" of the attention softmax. 5.25 is the empirically optimal value found by the top entries.
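+
+ In code the change is tiny; a sketch of the score computation with the gain applied to the queries (the module and names are illustrative, not the training script's attention implementation):
+
+ ```python
+ import math
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ class GainedQKScores(nn.Module):
+     """Attention scores with a learned multiplicative gain on the queries."""
+
+     def __init__(self, head_dim: int = 64, qk_gain: float = 5.25):
+         super().__init__()
+         self.head_dim = head_dim
+         self.qk_gain = nn.Parameter(torch.tensor(qk_gain))  # learned scalar, init 5.25
+
+     def forward(self, q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
+         q = self.qk_gain * q  # sharper attention at small model dims
+         scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
+         return F.softmax(scores, dim=-1)
+ ```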
+
+ ## Architecture Config
+
+ ```
+ Vocab: 8192 (SP8192 BPE)
+ Model dim: 768
+ Heads: 12 (QKV)
+ KV heads: 4 (GQA)
+ MLP multiplier: 4× (hidden = 3072)
+ Unique layers: 3
+ Train recurrence: 8 (24 effective layers)
+ Eval recurrence: 16 (48 effective layers)
+ QK-Gain: 5.25
+ Logit softcap: 30.0
+ RoPE base: 10000
+
+ Unique params: ~25.2M
+ Compressed: ~13-14 MB (int6 + zlib)
+ ```
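+
+ A back-of-the-envelope check of the unique parameter count; the assumptions (tied input/output embedding, non-gated 4× MLP, 64-dim heads, norm/scale params ignored) are mine, not confirmed by the script:
+
+ ```python
+ vocab, dim, n_heads, n_kv_heads, mlp_mult, unique_layers = 8192, 768, 12, 4, 4, 3
+ head_dim = dim // n_heads          # 64
+ kv_dim = n_kv_heads * head_dim     # 256 with GQA
+
+ embed = vocab * dim                               # ~6.29M, shared with the LM head if tied
+ attn = dim * dim + 2 * dim * kv_dim + dim * dim   # Q + K + V + O projections
+ mlp = 2 * dim * (mlp_mult * dim)                  # up + down projections
+ per_layer = attn + mlp                            # ~6.29M per unique layer
+
+ total = embed + unique_layers * per_layer
+ print(f"{total / 1e6:.1f}M unique params")                       # ~25.2M
+ print(f"{total * 6 / 8 / 2**20:.1f} MiB raw at int6, pre-zlib")  # ~18 MiB before compression
+ ```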
+
+ ## Hyperparameter Choices
+
+ | Param | Value | Rationale |
+ |---|---|---|
+ | Matrix LR | 0.04 | Muon with NS5 orthogonalization |
+ | Embed LR | 0.05 | Adam for embeddings |
+ | Scalar LR | 0.04 | Adam for norms/scales |
+ | Weight Decay | 0.09 | High WD regularizes small models (PR #1285 showed 0.09 > 0.04) |
+ | Warmdown | 3500 steps | Longer warmdown preserves learned representations (PR #374) |
+ | EMA start | 40% through training | Only average later checkpoints |
+ | EMA decay | 0.999 | Standard for small models |
+ | TTT LR | 0.01 | Inner-loop learning rate for test-time adaptation |
+ | TTT chunk | 64 | Score-first TTT chunk size |
+ | SDClip n_std | 2.5 | Standard-deviation clipping range |
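+
+ The two EMA rows translate to roughly the following; the placement in the training loop is a sketch under the table's settings (start at 40% of training, decay 0.999), not the script's exact code:
+
+ ```python
+ import copy
+ import torch
+
+ @torch.no_grad()
+ def update_ema(ema_model, model, decay: float = 0.999):
+     """Exponential moving average of weights, used for the eval/export checkpoint."""
+     for ema_p, p in zip(ema_model.parameters(), model.parameters()):
+         ema_p.lerp_(p, 1.0 - decay)  # ema = decay * ema + (1 - decay) * p
+
+ # Inside the training loop (sketch):
+ # ema_model = copy.deepcopy(model).eval()
+ # for step in range(total_steps):
+ #     train_step(model)
+ #     if step >= int(0.4 * total_steps):   # only average later checkpoints
+ #         update_ema(ema_model, model)
+ ```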
+
+ ## How to Run
+
+ ```bash
+ # On 8×H100 (competition standard):
+ torchrun --standalone --nproc_per_node=8 train_final.py
+
+ # Override any hyperparameter via env vars:
+ V=8192 D=768 NUL=3 NR=8 QKG=5.25 MWD=0.09 torchrun --standalone --nproc_per_node=8 train_final.py
+
+ # Use SP8192 data (must match the tokenizer):
+ DATA_PATH=./data/datasets/fineweb10B_sp8192 TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model torchrun ...
+ ```
+
+ ## Next Steps to Push Further
+
+ ### Immediate (before submission)
+ 1. **Verify SP8192 data availability** in the competition repo
+ 2. **Run on 8×H100**: the 10-minute wall clock starts here
+ 3. **Tune TTT LR**: sweep [0.001, 0.005, 0.01, 0.02, 0.05] on the val set
+
+ ### If time permits (iterative improvements)
+ 4. **Self-generated GPTQ calibration data**: generate text from the trained model and use it as calibration data for GPTQ quantization (PR #1019 technique)
+ 5. **XSA (Cross-Sequence Attention)**: on the last 3-4 layers, attend across sequence boundaries in the sliding window, effectively increasing context length
+ 6. **Progressive recurrence**: start with fewer recurrences and increase during training, warming up the depth gradually
+ 7. **Hessian-aware SDClip**: use the actual Hessian diagonal (from Fisher information) to set per-row clip ranges instead of simple std-based clipping
+ 8. **BigramHash embeddings**: hash bigrams to augment the embedding table for more input information at no cost
+
+ ### Longer-term experiments
+ 9. **Increase vocab further**: if budget allows after int6 compression, try SP16384
+ 10. **Mixed quantization**: int4 for some layers, int6 for critical ones (first/last layers)
+ 11. **Depth-conditional scaling**: different attn_scale/mlp_scale for each recurrence step (not just each unique layer)