---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- research
- transformer
- attention-residuals
- muon-optimizer
- nca-pretraining
- geometric-monitoring
- causal-lm
datasets:
- allenai/peS2o
- open-web-math/open-web-math
- HuggingFaceTB/finemath
- bigcode/the-stack
- deepmind/pg19
- pile-of-law/pile-of-law
- OpenAssistant/oasst2
pipeline_tag: text-generation
model-index:
- name: kotodama-108m-base
  results:
  - task:
      type: text-generation
      name: Language Modeling
    dataset:
      type: wikitext
      name: WikiText-2
    metrics:
    - name: Word Perplexity (fc-base)
      type: perplexity
      value: 41.76
    - name: Word Perplexity (bcpt-base)
      type: perplexity
      value: 52.09
  - task:
      type: multiple-choice
      name: ARC-Easy
    dataset:
      type: ai2_arc
      name: ARC-Easy
    metrics:
    - name: Accuracy (fc-base)
      type: accuracy
      value: 0.455
    - name: Accuracy (bcpt-base)
      type: accuracy
      value: 0.445
---

# Kotodama 108M Base

A 108M parameter decoder-only transformer trained as a **proxy model** for validating architectural and optimizer choices before scaling to 3B parameters. This is a research artifact, not a production model.

The model combines three techniques not previously studied together at this scale:

- **Block Attention Residuals (AttnRes)** -- learned residual connections across transformer blocks that prevent BOS-sink attention collapse and reduce gradient-norm variance across depth by roughly 4x.
- **NCA pre-pretraining** -- bootstrapping attention circuits on Neural Cellular Automata trajectories before language training; this trains attention patterns (not MLPs) and creates an L14 attractor basin in the representation manifold.
- **Muon optimizer** -- spectral-norm steepest descent via Newton-Schulz orthogonalization with Gram-NS optimized coefficients, producing 2-4x higher stable rank than AdamW at matched loss.

**Organization:** [aethera-gp](https://huggingface.co/aethera-gp)
**Training code:** [github.com/aethera-gp/kotodama](https://github.com/aethera-gp/kotodama) (pretraining/)

## Architecture

The model uses a Llama-family architecture with QK-norm and Block Attention Residuals.

| Parameter | Value |
|-----------|-------|
| Parameters | 107.8M (+ 58.4K AttnRes) |
| Hidden size | 512 |
| Layers | 28 |
| Query heads | 4 |
| KV heads | 2 (GQA ratio 2:1) |
| Head dim | 128 |
| Intermediate size (SwiGLU) | 1408 |
| Vocabulary | 49,152 (SmolLM2 tokenizer) |
| Max context | 4,096 tokens |
| Positional encoding | RoPE (theta=500,000) |
| Normalization | Pre-RMSNorm + QK-norm |
| Embeddings | Tied input/output |
| Bias | None |
| z-loss | 1e-5 |
| AttnRes block boundaries | [0, 3, 7, 12, 21, 25] (DD-v1) |

### Block Attention Residuals (DD-v1)

AttnRes adds per-layer learned pseudo-queries and key norms that create residual connections between block boundaries. The DD-v1 configuration divides the 28-layer network into 6 variable-size blocks at layers [0, 3, 7, 12, 21, 25]. This adds only 58.4K parameters (0.05% overhead) but has substantial effects on training dynamics.

Each transformer block stores:
- `attn_res_query` / `attn_res_norm`: attention sub-block residual
- `mlp_res_query` / `mlp_res_norm`: MLP sub-block residual

A final `final_res_query` / `final_res_norm` aggregates block outputs before the LM head.

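The actual implementation lives in the training repo; as a rough, minimal sketch of the mechanism described above (a learned pseudo-query that softmax-weights normalized block-boundary outputs back into the residual stream), assuming hypothetical module and parameter names and PyTorch >= 2.4 for `nn.RMSNorm`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BlockAttnResidual(nn.Module):
    """Cross-block attention residual, roughly as described above.

    A learned pseudo-query scores the normalized hidden states saved at
    earlier block boundaries, and the softmax-weighted mix is added back to
    the residual stream. Names and details are illustrative only, not the
    kotodama implementation.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.res_query = nn.Parameter(torch.zeros(hidden_size))  # learned pseudo-query
        self.res_norm = nn.RMSNorm(hidden_size)                  # norm applied to boundary states

    def forward(self, hidden: torch.Tensor, boundary_states: list[torch.Tensor]) -> torch.Tensor:
        # boundary_states: hidden states captured at earlier block boundaries,
        # each of shape (batch, seq, hidden).
        keys = torch.stack([self.res_norm(s) for s in boundary_states])  # (n, batch, seq, hidden)
        scores = torch.einsum("h,nbsh->nbs", self.res_query, keys)      # one score per boundary
        weights = F.softmax(scores, dim=0)                               # softmax over boundaries
        mix = torch.einsum("nbs,nbsh->bsh", weights, keys)               # weighted mix of boundary states
        return hidden + mix
```
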
### Differences from stock Llama

- **QK-norm**: RMSNorm applied to the Q and K projections after the linear projection, enabling higher learning rates
- **z-loss**: LSE-squared regularization that prevents logit explosion (a minimal sketch follows this list)
- **Smaller vocab** (49K vs 128K): reduces the Godey gradient bottleneck (~94% of gradient signal destroyed at hidden size 3072 with a 49K vocab vs ~98% with 128K, for the 3B target)
- **Block AttnRes**: cross-block residual connections (see above)

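For reference, a minimal sketch of the z-loss term, assuming the standard LSE-squared auxiliary loss; the in-repo reduction and weighting may differ:

```python
import torch


def z_loss(logits: torch.Tensor, weight: float = 1e-5) -> torch.Tensor:
    """LSE-squared (z-loss) regularizer: penalize the squared log-partition
    function so logits stay close to a normalized scale.

    logits: (batch, seq, vocab)
    """
    lse = torch.logsumexp(logits.float(), dim=-1)  # log Z at each position
    return weight * (lse ** 2).mean()
```
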
## Training

### Optimizer Configuration

Hybrid Muon + AdamW: Muon handles 2D weight matrices (Q/K/V/O projections, FFN gate/up/down -- ~77% of parameters), AdamW handles everything else (embeddings, norms).

| Parameter | Muon (2D weights) | AdamW (embeddings, norms) |
|-----------|-------------------|---------------------------|
| Learning rate | 0.02 | 6e-4 |
| Momentum / betas | 0.95 (Nesterov) | (0.9, 0.95) |
| Weight decay | 0.01 | 0.1 |
| NS iterations | 5 (Gram-NS coefficients) | -- |

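A minimal sketch of how this parameter split might be expressed. The name-based predicate is an assumption, and the `Muon` constructor in the usage comment refers to the MoonshotAI/Muon reference implementation rather than this repo's code:

```python
import torch.nn as nn


def split_param_groups(model: nn.Module):
    """Partition parameters as the hybrid setup describes: 2-D weight matrices
    (attention/FFN projections) go to Muon, everything else (embeddings, norms,
    1-D params) goes to AdamW. Illustrative sketch only.
    """
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Embeddings are 2-D but are handled by AdamW per the table above.
        if p.ndim == 2 and "embed" not in name:
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return muon_params, adamw_params


# Hypothetical usage with the hyperparameters from the table:
# muon_params, adamw_params = split_param_groups(model)
# adamw = torch.optim.AdamW(adamw_params, lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1)
# muon = Muon(muon_params, lr=0.02, momentum=0.95, weight_decay=0.01)  # e.g. from MoonshotAI/Muon
```
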
**Schedule:** WSD (Warmup-Stable-Decay). 5,000-step warmup (~6%), stable plateau to 90% of training, cosine decay over the final 10%.

**Gradient clipping:** 1.0

**Precision:** BF16 autocast with FP8 compute (FP32 optimizer states).

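A minimal sketch of the WSD schedule described above; the function name is hypothetical, and the decay-to-zero floor is an assumption (the actual schedule may decay to a nonzero minimum LR):

```python
import math


def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_steps: int = 5000, decay_frac: float = 0.10) -> float:
    """Warmup-Stable-Decay: linear warmup, flat plateau, cosine decay over
    the final fraction of training. Illustrative sketch only.
    """
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)        # linear warmup
    if step < decay_start:
        return peak_lr                                       # stable plateau
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```
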
### NCA Pre-Pretraining

Before language training, attention weights were bootstrapped using NCA (Neural Cellular Automata) pre-pretraining following Han et al. (2026). An NCA checkpoint co-trained with AttnRes DD-v1 (seed-17, 852M tokens) was used as initialization. After NCA, embeddings were reinitialized to the language vocabulary while attention weights, MLPs, and norms from NCA training were preserved (embed-only reinit).

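As an illustration of the embed-only reinit step, a minimal sketch; attribute names such as `embed_tokens` and `lm_head`, and the init scale, are assumptions rather than the repo's actual API:

```python
import torch.nn as nn


def embed_only_reinit(model: nn.Module, vocab_size: int, hidden_size: int, std: float = 0.02) -> nn.Module:
    """Replace the token embedding (and the tied LM head) with a fresh init for
    the language vocabulary while leaving attention, MLP, and norm weights from
    NCA pre-pretraining untouched. Illustrative sketch only.
    """
    new_embed = nn.Embedding(vocab_size, hidden_size)
    nn.init.normal_(new_embed.weight, mean=0.0, std=std)
    model.embed_tokens = new_embed            # assumed attribute name
    model.lm_head.weight = new_embed.weight   # keep input/output embeddings tied
    return model
```
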
### Data Mix (Fullcorpus)

170.4B tokens from 13 sources, shuffled with seed 42, sequence length 4096.

| Source | Tokens | % | Category |
|--------|--------|---|----------|
| peS2o | 60.7B | 35.6% | Academic papers (Semantic Scholar) |
| OpenCoderReasoning | 35.7B | 21.0% | Code reasoning (R1 + QwQ, Python/C++) |
| Pile of Law | 18.8B | 11.0% | Legal (court opinions, congressional) |
| StackExchange | 15.7B | 9.2% | Q&A (22 high-value sites) |
| OpenWebMath | 14.1B | 8.2% | Math web pages |
| FineMath | 10.8B | 6.4% | Quality-scored math (4+ score) |
| PG-19 | 7.5B | 4.4% | Books (Project Gutenberg, 71K) |
| Wikipedia | 5.0B | 3.0% | English Wikipedia |
| SmolTalk | 0.9B | 0.6% | Synthetic multi-turn dialogue |
| WildChat | 0.5B | 0.3% | Real user-GPT conversations |
| SODA | 0.3B | 0.2% | Synthetic social dialogue |
| Enron | 0.3B | 0.2% | Corporate email |
| OASST2 | 0.01B | <0.1% | Human multi-turn conversations |

**Category breakdown:** Academic/knowledge 38.6%, code reasoning 21.0%, math 14.6%, legal 11.0%, Q&A 9.2%, books 4.4%, conversation 1.1%.

### Hardware and Compute

- **Hardware:** 8x NVIDIA B200 (single node, NVLink)
- **Parallelism:** DDP (DistributedDataParallel)
- **Throughput:** ~1.96M tokens/sec average
- **Micro batch size:** 16 sequences per GPU
- **Global batch size:** 2,097,152 tokens (16 sequences x 4,096 tokens x 8 GPUs x 4 gradient accumulation steps)
- **torch.compile:** enabled (4x throughput vs eager)

## Model Variants

This repository contains two checkpoints from the same model lineage:

### fc-base (fullcorpus)

**File:** `fc-base.pt.zst`

The primary pretraining run. 170.4B tokens over 81,252 steps on the full 13-source data mix described above. Initialized from the NCA+AttnRes checkpoint (seed-17, 852M NCA tokens). WSD schedule with cosine decay in the final 10%.

| Metric | Value |
|--------|-------|
| Final loss | 2.081 |
| Min loss | 1.982 (step 80,200) |
| Final perplexity | 8.01 |
| Tokens seen | 170.4B |
| Tokens/param ratio | ~1,581x |

### bcpt-base (books-CPT)

**File:** `bcpt-base.pt.zst`

Continued pretraining of the fullcorpus model on 36.2B tokens of book data from three Common Pile sources not present in the original data mix. Resumed from fullcorpus step 72,000 (pre-decay, 151B tokens seen) with fresh optimizer state and a new WSD schedule (500-step warmup, 90% stable, 10% cosine decay).

| Source | Tokens | % |
|--------|--------|---|
| Pre-1929 Books (Internet Archive/HathiTrust) | 19.1B | 52.8% |
| Library of Congress | 14.0B | 38.7% |
| DOAB (Open Access Books) | 3.1B | 8.6% |

OCR quality filter applied: documents with >5% garbage characters dropped.

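A minimal sketch of such a filter; what counts as a "garbage character" here (control codepoints, replacement characters) is an assumption, and the actual filter may use a different heuristic:

```python
import unicodedata


def garbage_ratio(text: str) -> float:
    """Fraction of characters treated as OCR garbage (illustrative definition)."""
    if not text:
        return 0.0
    bad = sum(
        1 for ch in text
        if ch == "\ufffd" or (unicodedata.category(ch).startswith("C") and ch not in "\n\t")
    )
    return bad / len(text)


def keep_document(text: str, threshold: float = 0.05) -> bool:
    # Drop documents whose garbage-character ratio exceeds 5%.
    return garbage_ratio(text) <= threshold
```
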
| Metric | Value |
|--------|-------|
| Final loss | 2.342 |
| Min loss | 2.230 (step 17,260) |
| Final perplexity | 10.40 |
| Additional tokens | 36.4B (17,337 steps) |
| Total tokens seen | ~187.4B (resumed from step 72K / 151B tokens) |

The higher loss/perplexity relative to fullcorpus reflects the domain shift to OCR book text, not a regression. The books-CPT variant trades general benchmark performance for improved performance on literary and long-form text.

## Evaluation

### LM-Eval Benchmarks

All benchmarks run zero-shot via lm-evaluation-harness.

| Benchmark | Metric | fc-base | bcpt-base |
|-----------|--------|---------|-----------|
| ARC-Easy | acc | 0.455 | 0.445 |
| ARC-Easy | acc_norm | 0.387 | 0.388 |
| BoolQ | acc | 0.559 | 0.499 |
| COPA | acc | 0.590 | 0.590 |
| HellaSwag | acc | 0.277 | 0.280 |
| HellaSwag | acc_norm | 0.297 | 0.295 |
| LAMBADA | acc | 0.281 | 0.297 |
| LAMBADA | ppl | 83.3 | 85.5 |
| PIQA | acc | 0.577 | 0.588 |
| PIQA | acc_norm | 0.569 | 0.571 |
| SciQ | acc | 0.783 | 0.779 |
| SciQ | acc_norm | 0.700 | 0.685 |
| WikiText | word_ppl | 41.76 | 52.09 |
| WikiText | bits/byte | 1.007 | 1.066 |
| Winogrande | acc | 0.508 | 0.515 |

**Notes:** These are proxy-scale (108M) results. Performance is in line with expectations for this scale -- the model was not designed to maximize benchmarks. The books-CPT variant shows slight improvements on commonsense/physical reasoning (PIQA, Winogrande, LAMBADA accuracy) and slight degradation on knowledge-heavy tasks (BoolQ, WikiText perplexity), consistent with the domain shift toward literary text.

## Analysis Highlights

The primary value of this model as a research artifact is the geometric monitoring data collected during training. The analysis packages in `fc-analysis/` and `bcpt-analysis/` contain activation geometry, concept geometry, and full metric histories.

### Geometric Health (Final Checkpoint)

Monitored at layers [0, 7, 14, 21, 27] throughout training.

| Metric | Value | Interpretation |
|--------|-------|----------------|
| RankMe (embedding) | 440.5 | High effective dimensionality (out of 512) |
| RankMe rebound ratio | 15.9x | Strong recovery from early collapse (min 27.7 at step 150) |
| WeightWatcher alpha | 7.71 | Within the Muon-healthy range (see Key Findings below) |
| TwoNN intrinsic dim | 5.76 | Representation manifold dimensionality |
| Dead units | 0.0% | No dead neurons at any monitored layer |

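For readers unfamiliar with RankMe: assuming the standard definition (exponentiated entropy of the normalized singular-value distribution, as in Garrido et al., 2023), a generic sketch of how it can be computed on an embedding or feature matrix follows. This is not the project's exact monitoring code.

```python
import torch


def rankme(features: torch.Tensor, eps: float = 1e-7) -> float:
    """RankMe effective rank of a (n_samples, dim) feature/embedding matrix."""
    s = torch.linalg.svdvals(features.float())   # singular values
    p = s / s.sum() + eps                        # normalized singular-value distribution
    return float(torch.exp(-(p * torch.log(p)).sum()))  # exp(entropy)
```
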
### Stable Rank Profiles Across Depth

Stable rank (effective rank of weight matrices) remains high across all layers throughout training, a signature of Muon's balanced spectral updates. Representative values from the final checkpoint (step 81,225):

| Layer | Q proj | K proj | O proj | Gate proj | Down proj |
|-------|--------|--------|--------|-----------|-----------|
| 0 | 18.7 | 15.7 | 46.3 | 127.0 | 56.8 |
| 7 | 42.5 | 40.0 | 87.9 | 76.8 | 140.4 |
| 14 | 49.1 | 41.5 | 43.1 | 70.2 | 125.0 |
| 21 | 39.4 | 30.0 | 67.9 | 62.9 | 49.2 |
| 27 | 43.8 | 32.3 | 115.3 | 76.2 | 127.8 |

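Assuming the standard definition of stable rank (squared Frobenius norm over squared spectral norm), a minimal sketch of the computation:

```python
import torch


def stable_rank(weight: torch.Tensor) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2 of a 2-D weight matrix."""
    w = weight.float()
    fro_sq = (w ** 2).sum()                          # squared Frobenius norm
    spec = torch.linalg.matrix_norm(w, ord=2)        # largest singular value
    return float(fro_sq / spec ** 2)
```
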
Key observations:
- **No low-rank collapse:** All weight matrices maintain high stable rank through 170B tokens. Under AdamW, these values would typically be 2-4x lower.
- **Depth utilization:** Non-monotonic stable rank profile indicates all layers are actively contributing (not degenerating into near-identity transformations).
- **Zero dead units:** No layer shows any dead neurons, even after extreme overtraining (1,581x tokens/parameter).

### Attention Entropy Across Depth

| Layer | Mean Entropy | Std | Interpretation |
|-------|-------------|-----|----------------|
| 0 | 6.13 | 0.43 | Broad attention (early feature mixing) |
| 7 | 4.64 | 0.77 | Selective attention with variance |
| 14 | 5.49 | 0.41 | Moderate selectivity |
| 21 | 5.68 | 0.29 | Moderate, low variance |
| 27 | 4.14 | 0.79 | Most selective (prediction heads) |

This gradient -- broad at the bottom, selective at the top -- is the healthy pattern. Crucially, **the deep layers (L27) maintain diverse attention patterns** (std=0.79) rather than collapsing to BOS-sink. In baseline models without AttnRes, layers 21-27 develop 89-90% BOS attention concentration by this training stage.

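A generic sketch of the attention-entropy computation, assuming entropy in nats over the key dimension of post-softmax attention probabilities; the project's aggregation over heads and positions may differ:

```python
import torch


def attention_entropy(attn_probs: torch.Tensor) -> torch.Tensor:
    """Mean attention entropy per (batch, head).

    attn_probs: post-softmax attention weights of shape
    (batch, heads, query_len, key_len), rows summing to 1.
    """
    eps = 1e-9
    ent = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)  # entropy per query position
    return ent.mean(dim=-1)                                     # average over query positions
```
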
### Anisotropy Profile

| Layer | Anisotropy |
|-------|-----------|
| 0 | 0.066 |
| 7 | 0.452 |
| 14 | 0.413 |
| 21 | 0.148 |
| 27 | 0.090 |

The inverted-U anisotropy profile (low at edges, peaking at middle layers) indicates structured representational geometry rather than isotropy collapse or extreme anisotropy.

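Assuming the common definition of anisotropy (expected cosine similarity between randomly drawn pairs of token representations), a minimal sketch; the project's estimator may differ:

```python
import torch
import torch.nn.functional as F


def anisotropy(hidden: torch.Tensor, n_pairs: int = 10_000) -> float:
    """Mean cosine similarity between random pairs of rows of a
    (n_tokens, dim) activation matrix."""
    n = hidden.shape[0]
    i = torch.randint(0, n, (n_pairs,))
    j = torch.randint(0, n, (n_pairs,))
    keep = i != j                                   # exclude self-pairs
    sims = F.cosine_similarity(hidden[i[keep]], hidden[j[keep]], dim=-1)
    return float(sims.mean())
```
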
### AttnRes Effects (from Proxy Phase Ablations)

These findings come from the 5-run optimizer sweep at 6B tokens and the full 170B run:

- **BOS-sink prevention:** Baseline models develop 89-90% BOS attention at deep layers by 6B tokens. DD-v1 AttnRes prevents this entirely, maintaining diverse attention patterns at all depths.
- **4x gradient uniformity:** Gradient norm variance across layers is ~4x lower with AttnRes, enabling more uniform learning across depth.
- **Full depth utilization:** Without AttnRes, deep layers tend toward near-identity transformations. With AttnRes, stable rank and attention entropy remain diverse at all depths.
- **DD-v2 fragility:** Shifting even one block boundary (L12 to L14) produced 12/16 geometric metrics outside the range of all other configurations; changes to variable-size block boundaries cascade nonlinearly.

### NCA Pre-Pretraining Effects

- **Trains attention, not MLPs:** NCA pre-pretraining primarily structures attention weight matrices. MLP weights show minimal structured change, confirming that MLP reinit after NCA is correct.
- **L14 attractor basin:** NCA creates a distinctive geometric signature at layer 14 that persists through full language training. This basin is present regardless of AttnRes configuration.
- **Sub-additive with AttnRes:** NCA + AttnRes produces only +0.008 nats over the better of either alone, but preserves geometric properties from both techniques everywhere in the network.

## Key Findings (Proxy Phase)

1. **Muon lr=0.02 is the Pareto optimum** for 108M: it matches AdamW's final loss while maintaining 2-4x higher stable rank across all weight matrices.
2. **torch.compile is the dominant throughput optimization**, providing a 4x improvement. Liger kernels without FusedLinearCE reduce compiled throughput by 13%.
3. **Extreme overtraining (1,581x tokens/param) does not cause geometric collapse** with Muon + AttnRes. Stable rank, attention entropy, and dead unit counts all remain healthy at 170B tokens.
4. **The WeightWatcher alpha healthy range is higher for Muon than for AdamW.** Alpha values of 7-8 are normal for Muon-trained models; do not apply AdamW-calibrated thresholds, which would flag these as unhealthy.

## Usage

The checkpoints are stored as compressed PyTorch state dicts (`.pt.zst`). To load:

```python
import torch
import zstandard as zstd
import io

# Decompress
with open("fc-base.pt.zst", "rb") as f:
    dctx = zstd.ZstdDecompressor()
    decompressed = dctx.decompress(f.read())

# Load state dict
state_dict = torch.load(io.BytesIO(decompressed), map_location="cpu", weights_only=True)

# Initialize model (requires the kotodama training code)
from src.model.llama import LuxiaBaseModel, LuxiaModelConfig

config = LuxiaModelConfig(
    hidden_size=512,
    num_layers=28,
    num_attention_heads=4,
    num_kv_heads=2,
    head_dim=128,
    intermediate_size=1408,
    vocab_size=49152,
    max_position_embeddings=4096,
    rope_theta=500000.0,
    qk_norm=True,
    tie_word_embeddings=True,
    z_loss_weight=1e-5,
    attn_res=True,
    attn_res_boundaries=[0, 3, 7, 12, 21, 25],
)

model = LuxiaBaseModel(config)
model.load_state_dict(state_dict)
```

**Tokenizer:** `HuggingFaceTB/SmolLM2-135M` (49,152 vocab, byte-fallback).

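The tokenizer can be loaded with `transformers` in the usual way; the prompt below is purely illustrative:

```python
from transformers import AutoTokenizer

# Load the SmolLM2 tokenizer used for training (49,152 vocab, byte-fallback).
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# Tokenize a prompt for the model loaded above.
input_ids = tokenizer("Block attention residuals connect block boundaries.", return_tensors="pt").input_ids
```
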
## Repository Contents

```
fc-base.pt.zst           # Fullcorpus final checkpoint (81,252 steps, 170.4B tokens)
bcpt-base.pt.zst         # Books-CPT checkpoint (17,337 additional steps, 36.4B tokens)
fc-analysis/             # Fullcorpus analysis package
  activation_geometry/   # Per-layer activation extractions
  concept_geometry/      # Concept-level geometric analysis
  lm_eval/               # Full lm-evaluation-harness results
  report.html            # Analysis report
bcpt-analysis/           # Books-CPT analysis package (same structure)
fc-metrics.jsonl         # Fullcorpus training metrics (loss, LR, throughput)
fc-geo_metrics.jsonl     # Fullcorpus geometric monitoring (stable rank, entropy, etc.)
bcpt-metrics.jsonl       # Books-CPT training metrics
bcpt-geo_metrics.jsonl   # Books-CPT geometric monitoring
```

## Limitations

- **108M proxy scale.** This model exists to validate architecture and optimizer choices, not to be useful for downstream tasks. Benchmark performance reflects this.
- **No raw code in training data.** The 645GB cleaned stack_v1 JSONL (~126B tokens, 130 languages) was never tokenized and is absent from the data mix. The model sees code only through reasoning traces (OpenCoderReasoning) and Q&A (StackExchange).
- **Conversational data < 1.2%.** The original spec targeted 25% conversational data. The actual mix is dominated by academic text (35.6%) and code reasoning (21.0%).
- **OCR noise in books-CPT.** Despite filtering out documents with >5% garbage characters, the books-CPT data (pre-1929 scans, Library of Congress) contains residual OCR artifacts.
- **No deduplication** was applied to the books-CPT data (cross-source overlap between the digitization projects is estimated to be minimal, but this was not verified).
- **Eval methodology.** Top-p sampling catastrophically degrades generation quality at 108M scale, so all evaluation uses pure temperature sampling.

## Citation

```bibtex
@misc{kotodama2026,
  title={Kotodama: Block Attention Residuals and NCA Pre-Pretraining for Transformer Language Models},
  author={Aethera GP},
  year={2026},
  url={https://huggingface.co/aethera-gp/kotodama-108m-base}
}
```

### References

- Block Attention Residuals: see `Attention_Residuals.pdf` in the training repo
- NCA Pre-Pretraining: [Han et al., 2026](https://arxiv.org/abs/2603.10055)
- Muon Optimizer: [MoonshotAI/Muon](https://github.com/MoonshotAI/Muon); [Moonlight: Muon is Scalable for LLM Training](https://arxiv.org/abs/2502.16982)
- Gram-Newton-Schulz: [Dao-AILab/Gram-Newton-Schulz](https://github.com/Dao-AILab/Gram-Newton-Schulz)
- WeightWatcher: [Martin et al.](https://arxiv.org/abs/2102.11258)