---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- research
- transformer
- attention-residuals
- muon-optimizer
- nca-pretraining
- geometric-monitoring
- causal-lm
datasets:
- allenai/peS2o
- open-web-math/open-web-math
- HuggingFaceTB/finemath
- bigcode/the-stack
- deepmind/pg19
- pile-of-law/pile-of-law
- OpenAssistant/oasst2
pipeline_tag: text-generation
model-index:
- name: kotodama-108m-base
  results:
  - task:
      type: text-generation
      name: Language Modeling
    dataset:
      type: wikitext
      name: WikiText-2
    metrics:
    - name: Word Perplexity (fc-base)
      type: perplexity
      value: 41.76
    - name: Word Perplexity (bcpt-base)
      type: perplexity
      value: 52.09
  - task:
      type: multiple-choice
      name: ARC-Easy
    dataset:
      type: ai2_arc
      name: ARC-Easy
    metrics:
    - name: Accuracy (fc-base)
      type: accuracy
      value: 0.455
    - name: Accuracy (bcpt-base)
      type: accuracy
      value: 0.445
---

# Kotodama 108M Base

A 108M-parameter decoder-only transformer trained as a **proxy model** for validating architectural and optimizer choices before scaling to 3B parameters. This is a research artifact, not a production model.

The model combines three techniques not previously studied together at this scale:

- **Block Attention Residuals (AttnRes)** -- learned residual connections across transformer blocks that prevent BOS-sink attention collapse and yield roughly 4x more uniform gradient norms across depth.
- **NCA pre-pretraining** -- bootstrapping attention circuits on Neural Cellular Automata trajectories before language training, which trains attention patterns (not MLPs) and creates an L14 attractor basin in the representation manifold.
- **Muon optimizer** -- spectral-norm steepest descent via Newton-Schulz orthogonalization, producing 2-4x higher stable rank than AdamW at matched loss, with Gram-NS optimized coefficients.

**Organization:** [aethera-gp](https://huggingface.co/aethera-gp)
**Training code:** [github.com/aethera-gp/kotodama](https://github.com/aethera-gp/kotodama) (pretraining/)

## Architecture

The model uses a Llama-family architecture with QK-norm and Block Attention Residuals.

| Parameter | Value |
|-----------|-------|
| Parameters | 107.8M (+ 58.4K AttnRes) |
| Hidden size | 512 |
| Layers | 28 |
| Query heads | 4 |
| KV heads | 2 (GQA ratio 2:1) |
| Head dim | 128 |
| Intermediate size (SwiGLU) | 1408 |
| Vocabulary | 49,152 (SmolLM2 tokenizer) |
| Max context | 4,096 tokens |
| Positional encoding | RoPE (theta=500,000) |
| Normalization | Pre-RMSNorm + QK-norm |
| Embeddings | Tied input/output |
| Bias | None |
| z-loss | 1e-5 |
| AttnRes block boundaries | [0, 3, 7, 12, 21, 25] (DD-v1) |

### Block Attention Residuals (DD-v1)

AttnRes adds per-layer learned pseudo-queries and key norms that create residual connections between block boundaries. The DD-v1 configuration divides the 28-layer network into 6 variable-size blocks at layers [0, 3, 7, 12, 21, 25]. This adds only 58.4K parameters (0.05% overhead) but has substantial effects on training dynamics.

Each transformer block stores:
- `attn_res_query` / `attn_res_norm`: attention sub-block residual
- `mlp_res_query` / `mlp_res_norm`: MLP sub-block residual

A final `final_res_query` / `final_res_norm` aggregates block outputs before the LM head.
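
The exact aggregation math is defined in the training repo rather than in this card; one plausible reading of the parameter names above is a learned pseudo-query that scores the normalized block-boundary outputs and mixes them back into the residual stream. A minimal PyTorch sketch under that assumption (the module name, zero-init, and softmax weighting are illustrative, not the repo's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FinalAttnResAggregator(nn.Module):
    """Hypothetical sketch: mix block-boundary hidden states back into the
    stream before the LM head, using a learned pseudo-query (final_res_query)
    and a shared RMSNorm (final_res_norm)."""

    def __init__(self, hidden_size: int = 512):
        super().__init__()
        self.final_res_query = nn.Parameter(torch.zeros(hidden_size))
        self.final_res_norm = nn.RMSNorm(hidden_size)  # requires PyTorch >= 2.4

    def forward(self, hidden: torch.Tensor, block_outputs: list[torch.Tensor]) -> torch.Tensor:
        # block_outputs: one (batch, seq, hidden) tensor captured at each block boundary
        stacked = torch.stack([self.final_res_norm(h) for h in block_outputs], dim=-2)
        # Score each boundary output against the pseudo-query, softmax over boundaries
        scores = torch.einsum("...kd,d->...k", stacked, self.final_res_query)
        weights = F.softmax(scores, dim=-1)
        residual = torch.einsum("...k,...kd->...d", weights, stacked)
        return hidden + residual
```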

### Differences from stock Llama

- **QK-norm**: RMSNorm applied to the Q and K projections after the linear projection, enabling higher learning rates
- **z-loss**: LSE-squared regularization that prevents logit explosion
- **Smaller vocabulary** (49K vs. 128K): reduces the Godey gradient bottleneck (~94% gradient destruction at 3072/49K vs. ~98% at 3072/128K for the 3B target)
- **Block AttnRes**: cross-block residual connections (see above)

## Training

### Optimizer Configuration

Hybrid Muon + AdamW: Muon handles the 2D weight matrices (Q/K/V/O projections and FFN gate/up/down, ~77% of parameters), while AdamW handles everything else (embeddings, norms).

| Parameter | Muon (2D weights) | AdamW (embeddings, norms) |
|-----------|-------------------|---------------------------|
| Learning rate | 0.02 | 6e-4 |
| Momentum / betas | 0.95 (Nesterov) | (0.9, 0.95) |
| Weight decay | 0.01 | 0.1 |
| NS iterations | 5 (Gram-NS coefficients) | -- |

**Schedule:** WSD (Warmup-Stable-Decay): 5,000-step warmup (~6% of training), a stable plateau to 90% of training, then cosine decay over the final 10%.

**Gradient clipping:** 1.0

**Precision:** BF16 autocast with FP8 compute (FP32 optimizer states).
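
For context, the core of a Muon step is an approximate orthogonalization of the momentum-accumulated gradient via a quintic Newton-Schulz iteration. A minimal sketch using the coefficients from the public Muon implementation (the Gram-NS coefficients used for this run are tuned variants and are not reproduced here):

```python
import torch


def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update matrix (quintic Newton-Schulz)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # standard coefficients; Gram-NS tunes these
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)          # scale so the spectral norm is <= 1
    transposed = X.shape[-2] > X.shape[-1]
    if transposed:                      # iterate on the wide orientation
        X = X.mT
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.mT
    return X.to(G.dtype)
```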

### NCA Pre-Pretraining

Before language training, attention weights were bootstrapped with NCA (Neural Cellular Automata) pre-pretraining following Han et al. (2026). An NCA checkpoint co-trained with AttnRes DD-v1 (seed-17, 852M tokens) was used as the initialization. After NCA training, embeddings were reinitialized to the language vocabulary while the attention weights, MLPs, and norms learned during NCA training were preserved (embed-only reinit).
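
As an illustration of the embed-only reinit step (the state-dict key name and init scale below are assumptions; the actual procedure lives in the training repo):

```python
import torch


def embed_only_reinit(nca_state: dict, vocab_size: int = 49152,
                      hidden_size: int = 512, std: float = 0.02) -> dict:
    """Keep NCA-trained attention/MLP/norm weights; re-draw only the embeddings."""
    state = dict(nca_state)
    new_embed = torch.empty(vocab_size, hidden_size)
    torch.nn.init.normal_(new_embed, mean=0.0, std=std)  # init scale is an assumption
    # "embed_tokens.weight" is a typical Llama-style key name, assumed here;
    # with tied embeddings, replacing the input table also resets the LM head.
    state["embed_tokens.weight"] = new_embed
    return state
```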

### Data Mix (Fullcorpus)

170.4B tokens from 13 sources, shuffled with seed 42, sequence length 4096.

| Source | Tokens | % | Category |
|--------|--------|---|----------|
| peS2o | 60.7B | 35.6% | Academic papers (Semantic Scholar) |
| OpenCoderReasoning | 35.7B | 21.0% | Code reasoning (R1 + QwQ, Python/C++) |
| Pile of Law | 18.8B | 11.0% | Legal (court opinions, congressional) |
| StackExchange | 15.7B | 9.2% | Q&A (22 high-value sites) |
| OpenWebMath | 14.1B | 8.2% | Math web pages |
| FineMath | 10.8B | 6.4% | Quality-scored math (4+ score) |
| PG-19 | 7.5B | 4.4% | Books (Project Gutenberg, 71K) |
| Wikipedia | 5.0B | 3.0% | English Wikipedia |
| SmolTalk | 0.9B | 0.6% | Synthetic multi-turn dialogue |
| WildChat | 0.5B | 0.3% | Real user-GPT conversations |
| SODA | 0.3B | 0.2% | Synthetic social dialogue |
| Enron | 0.3B | 0.2% | Corporate email |
| OASST2 | 0.01B | <0.1% | Human multi-turn conversations |

**Category breakdown:** academic/knowledge 38.6%, code reasoning 21.0%, math 14.6%, legal 11.0%, Q&A 9.2%, books 4.4%, conversation 1.1%.

### Hardware and Compute

- **Hardware:** 8x NVIDIA B200 (single node, NVLink)
- **Parallelism:** DDP (DistributedDataParallel)
- **Throughput:** ~1.96M tokens/sec average
- **Micro batch size:** 16 sequences per GPU
- **Global batch size:** 2,097,152 tokens (16 sequences x 4,096 tokens x 8 GPUs x 4 gradient-accumulation steps)
- **torch.compile:** enabled (4x throughput vs. eager)

## Model Variants

This repository contains two checkpoints from the same model lineage:

### fc-base (fullcorpus)

**File:** `fc-base.pt.zst`

The primary pretraining run: 170.4B tokens over 81,252 steps on the full 13-source data mix described above, initialized from the NCA+AttnRes checkpoint (seed-17, 852M NCA tokens), with a WSD schedule and cosine decay over the final 10%.

| Metric | Value |
|--------|-------|
| Final loss | 2.081 |
| Min loss | 1.982 (step 80,200) |
| Final perplexity | 8.01 |
| Tokens seen | 170.4B |
| Tokens/param ratio | ~1,581x |

### bcpt-base (books-CPT)

**File:** `bcpt-base.pt.zst`

Continued pretraining of the fullcorpus model on 36.2B tokens of book data from three Common Pile sources not present in the original data mix. Training resumed from fullcorpus step 72,000 (pre-decay, 151B tokens seen) with fresh optimizer state and a new WSD schedule (500-step warmup, 90% stable, 10% cosine decay).

| Source | Tokens | % |
|--------|--------|---|
| Pre-1929 Books (Internet Archive/HathiTrust) | 19.1B | 52.8% |
| Library of Congress | 14.0B | 38.7% |
| DOAB (Open Access Books) | 3.1B | 8.6% |

An OCR quality filter was applied: documents with >5% garbage characters were dropped.

| Metric | Value |
|--------|-------|
| Final loss | 2.342 |
| Min loss | 2.230 (step 17,260) |
| Final perplexity | 10.40 |
| Additional tokens | 36.4B (17,337 steps) |
| Total tokens seen | ~187.4B (resumed from step 72K / 151B tokens) |

The higher loss and perplexity relative to fullcorpus reflect the domain shift to OCR'd book text rather than a regression. The books-CPT variant trades general benchmark performance for improved performance on literary and long-form text.

## Evaluation

### LM-Eval Benchmarks

All benchmarks were run zero-shot via lm-evaluation-harness.

| Benchmark | Metric | fc-base | bcpt-base |
|-----------|--------|---------|-----------|
| ARC-Easy | acc | 0.455 | 0.445 |
| ARC-Easy | acc_norm | 0.387 | 0.388 |
| BoolQ | acc | 0.559 | 0.499 |
| COPA | acc | 0.590 | 0.590 |
| HellaSwag | acc | 0.277 | 0.280 |
| HellaSwag | acc_norm | 0.297 | 0.295 |
| LAMBADA | acc | 0.281 | 0.297 |
| LAMBADA | ppl | 83.3 | 85.5 |
| PIQA | acc | 0.577 | 0.588 |
| PIQA | acc_norm | 0.569 | 0.571 |
| SciQ | acc | 0.783 | 0.779 |
| SciQ | acc_norm | 0.700 | 0.685 |
| WikiText | word_ppl | 41.76 | 52.09 |
| WikiText | bits/byte | 1.007 | 1.066 |
| Winogrande | acc | 0.508 | 0.515 |

**Notes:** These are proxy-scale (108M) results; performance is in line with expectations at this scale, and the model was not designed to maximize benchmarks. The books-CPT variant shows slight improvements on commonsense/physical reasoning (PIQA, Winogrande, LAMBADA accuracy) and slight degradation on knowledge-heavy tasks (BoolQ, WikiText perplexity), consistent with the domain shift toward literary text.

## Analysis Highlights

The primary value of this model as a research artifact is the geometric monitoring data collected during training. The analysis packages in `fc-analysis/` and `bcpt-analysis/` contain activation geometry, concept geometry, and full metric histories.

### Geometric Health (Final Checkpoint)

Monitored at layers [0, 7, 14, 21, 27] throughout training.

| Metric | Value | Interpretation |
|--------|-------|----------------|
| RankMe (embedding) | 440.5 | High effective dimensionality (out of 512) |
| RankMe rebound ratio | 15.9x | Strong recovery from early collapse (min 27.7 at step 150) |
| WeightWatcher alpha | 7.71 | Within the Muon-healthy range (see notes) |
| TwoNN intrinsic dim | 5.76 | Representation manifold dimensionality |
| Dead units | 0.0% | No dead neurons at any monitored layer |

### Stable Rank Profiles Across Depth

Stable rank, the effective rank of a weight matrix (a computational sketch follows the list below), remains high across all layers throughout training, a signature of Muon's balanced spectral updates. Representative values from the final checkpoint (step 81,225):

| Layer | Q proj | K proj | O proj | Gate proj | Down proj |
|-------|--------|--------|--------|-----------|-----------|
| 0 | 18.7 | 15.7 | 46.3 | 127.0 | 56.8 |
| 7 | 42.5 | 40.0 | 87.9 | 76.8 | 140.4 |
| 14 | 49.1 | 41.5 | 43.1 | 70.2 | 125.0 |
| 21 | 39.4 | 30.0 | 67.9 | 62.9 | 49.2 |
| 27 | 43.8 | 32.3 | 115.3 | 76.2 | 127.8 |

Key observations:
- **No low-rank collapse:** all weight matrices maintain high stable rank through 170B tokens. Under AdamW, these values would typically be 2-4x lower.
- **Depth utilization:** the non-monotonic stable rank profile indicates that all layers are actively contributing rather than degenerating into near-identity transformations.
- **Zero dead units:** no layer shows any dead neurons, even after extreme overtraining (1,581x tokens per parameter).
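
Stable rank here refers to the standard spectral statistic; a minimal sketch of how it can be computed (the monitoring code in the training repo may differ in detail):

```python
import torch


def stable_rank(weight: torch.Tensor) -> float:
    """||W||_F^2 / ||W||_2^2: sum of squared singular values over the largest one."""
    s = torch.linalg.svdvals(weight.float())  # singular values, descending order
    return float((s ** 2).sum() / (s[0] ** 2))
```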

### Attention Entropy Across Depth

| Layer | Mean Entropy | Std | Interpretation |
|-------|-------------|-----|----------------|
| 0 | 6.13 | 0.43 | Broad attention (early feature mixing) |
| 7 | 4.64 | 0.77 | Selective attention with variance |
| 14 | 5.49 | 0.41 | Moderate selectivity |
| 21 | 5.68 | 0.29 | Moderate, low variance |
| 27 | 4.14 | 0.79 | Most selective (prediction heads) |

This gradient -- broad at the bottom, selective at the top -- is the healthy pattern. Crucially, **the deep layers (L27) maintain diverse attention patterns** (std = 0.79) rather than collapsing to a BOS sink. In baseline models without AttnRes, layers 21-27 develop 89-90% BOS attention concentration by this stage of training.
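
For reference, attention entropy is the Shannon entropy of each query position's attention distribution; a minimal sketch of one way to compute the per-layer mean (the analysis package may aggregate over heads and positions differently):

```python
import torch


def attention_entropy(attn_probs: torch.Tensor, eps: float = 1e-9) -> float:
    """Mean Shannon entropy (nats) of attention rows.

    attn_probs: (batch, heads, query_len, key_len), each row summing to 1.
    """
    entropy = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)
    return float(entropy.mean())
```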

### Anisotropy Profile

| Layer | Anisotropy |
|-------|-----------|
| 0 | 0.066 |
| 7 | 0.452 |
| 14 | 0.413 |
| 21 | 0.148 |
| 27 | 0.090 |

The inverted-U anisotropy profile (low at the edges, peaking at the middle layers) indicates structured representational geometry rather than isotropy collapse or extreme anisotropy.
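
Anisotropy is commonly estimated as the mean pairwise cosine similarity of token representations at a layer; a sketch under that assumption (the analysis package may use a different estimator):

```python
import torch
import torch.nn.functional as F


def anisotropy(hidden_states: torch.Tensor, num_pairs: int = 10_000) -> float:
    """Mean cosine similarity over random pairs of token representations.

    hidden_states: (num_tokens, hidden) activations sampled from one layer.
    """
    n = hidden_states.shape[0]
    i = torch.randint(0, n, (num_pairs,))
    j = torch.randint(0, n, (num_pairs,))
    keep = i != j  # exclude self-pairs, which would contribute similarity 1
    cos = F.cosine_similarity(hidden_states[i[keep]], hidden_states[j[keep]], dim=-1)
    return float(cos.mean())
```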

### AttnRes Effects (from Proxy Phase Ablations)

These findings come from the 5-run optimizer sweep at 6B tokens and the full 170B-token run:

- **BOS-sink prevention:** baseline models develop 89-90% BOS attention at deep layers by 6B tokens. DD-v1 AttnRes prevents this entirely, maintaining diverse attention patterns at all depths.
- **4x gradient uniformity:** gradient norm variance across layers is ~4x lower with AttnRes, enabling more uniform learning across depth.
- **Full depth utilization:** without AttnRes, deep layers tend toward near-identity transformations. With AttnRes, stable rank and attention entropy remain diverse at all depths.
- **DD-v2 fragility:** shifting even one block boundary (L12 to L14) pushed 12 of 16 geometric metrics outside the range of all other configurations; changes to variable-size block boundaries cascade nonlinearly.

### NCA Pre-Pretraining Effects

- **Trains attention, not MLPs:** NCA pre-pretraining primarily structures the attention weight matrices. MLP weights show minimal structured change, confirming that MLP reinit after NCA is correct.
- **L14 attractor basin:** NCA creates a distinctive geometric signature at layer 14 that persists through full language training. This basin is present regardless of AttnRes configuration.
- **Sub-additive with AttnRes:** combining NCA and AttnRes yields only a 0.008-nat gain over the better of either alone, but preserves the geometric properties of both techniques throughout the network.

## Key Findings (Proxy Phase)

1. **Muon lr=0.02 is the Pareto optimum** at 108M: it matches AdamW's final loss while maintaining 2-4x higher stable rank across all weight matrices.
2. **torch.compile is the dominant throughput optimization**, providing a 4x improvement. Liger kernels without FusedLinearCE reduce compiled throughput by 13%.
3. **Extreme overtraining (1,581x tokens/param) does not cause geometric collapse** with Muon + AttnRes. Stable rank, attention entropy, and dead-unit counts all remain healthy at 170B tokens.
4. **The healthy WeightWatcher alpha range is higher for Muon than for AdamW.** Alpha values of 7-8 are normal for Muon-trained models; do not apply AdamW-calibrated thresholds, which would flag these models as unhealthy.

## Usage

The checkpoints are stored as compressed PyTorch state dicts (`.pt.zst`). To load:

```python
import torch
import zstandard as zstd
import io

# Decompress
with open("fc-base.pt.zst", "rb") as f:
    dctx = zstd.ZstdDecompressor()
    decompressed = dctx.decompress(f.read())

# Load state dict
state_dict = torch.load(io.BytesIO(decompressed), map_location="cpu", weights_only=True)

# Initialize model (requires the kotodama training code)
from src.model.llama import LuxiaBaseModel, LuxiaModelConfig

config = LuxiaModelConfig(
    hidden_size=512,
    num_layers=28,
    num_attention_heads=4,
    num_kv_heads=2,
    head_dim=128,
    intermediate_size=1408,
    vocab_size=49152,
    max_position_embeddings=4096,
    rope_theta=500000.0,
    qk_norm=True,
    tie_word_embeddings=True,
    z_loss_weight=1e-5,
    attn_res=True,
    attn_res_boundaries=[0, 3, 7, 12, 21, 25],
)

model = LuxiaBaseModel(config)
model.load_state_dict(state_dict)
```

**Tokenizer:** `HuggingFaceTB/SmolLM2-135M` (49,152 vocab, byte-fallback).
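
The model's generation API is not documented in this card. Continuing from the snippet above, a pure temperature sampling loop (matching the eval note under Limitations), under the assumption that `model(input_ids)` returns logits of shape `(batch, seq, vocab)`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")


@torch.no_grad()
def sample(prompt: str, max_new_tokens: int = 64, temperature: float = 0.8) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(ids)  # assumed to return (batch, seq, vocab) logits
        probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```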

## Repository Contents

```
fc-base.pt.zst            # Fullcorpus final checkpoint (81,252 steps, 170.4B tokens)
bcpt-base.pt.zst          # Books-CPT checkpoint (17,337 additional steps, 36.4B tokens)
fc-analysis/              # Fullcorpus analysis package
  activation_geometry/    # Per-layer activation extractions
  concept_geometry/       # Concept-level geometric analysis
  lm_eval/                # Full lm-evaluation-harness results
  report.html             # Analysis report
bcpt-analysis/            # Books-CPT analysis package (same structure)
fc-metrics.jsonl          # Fullcorpus training metrics (loss, LR, throughput)
fc-geo_metrics.jsonl      # Fullcorpus geometric monitoring (stable rank, entropy, etc.)
bcpt-metrics.jsonl        # Books-CPT training metrics
bcpt-geo_metrics.jsonl    # Books-CPT geometric monitoring
```
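
The metric histories are newline-delimited JSON and can be inspected directly (the exact field names are defined by the training repo's logger, so treat them as assumptions):

```python
import json

# Read one record per logged step from the fullcorpus training history
with open("fc-metrics.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} logged steps")
print(records[-1])  # e.g. final step's loss, learning rate, throughput
```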

## Limitations

- **108M proxy scale.** This model exists to validate architecture and optimizer choices, not to be useful for downstream tasks. Benchmark performance reflects this.
- **No raw code in the training data.** The 645GB cleaned stack_v1 JSONL (~126B tokens, 130 languages) was never tokenized and is absent from the data mix. The model sees code only through reasoning traces (OpenCoderReasoning) and Q&A (StackExchange).
- **Conversational data under 1.2%.** The original spec targeted 25% conversational data; the actual mix is dominated by academic text (35.6%) and code reasoning (21.0%).
- **OCR noise in books-CPT.** Despite filtering out documents with >5% garbage characters, the books-CPT data (pre-1929 scans, Library of Congress) contains residual OCR artifacts.
- **No deduplication** was applied to the books-CPT data (cross-source overlap between digitization projects is estimated to be minimal, but was not verified).
- **Eval methodology.** Top-p sampling catastrophically degrades generation quality at 108M scale, so all generation-based evaluation uses pure temperature sampling.

## Citation

```bibtex
@misc{kotodama2026,
  title={Kotodama: Block Attention Residuals and NCA Pre-Pretraining for Transformer Language Models},
  author={Aethera GP},
  year={2026},
  url={https://huggingface.co/aethera-gp/kotodama-108m-base}
}
```

### References

- Block Attention Residuals: see `Attention_Residuals.pdf` in the training repo
- NCA Pre-Pretraining: [Han et al., 2026](https://arxiv.org/abs/2603.10055)
- Muon Optimizer: [MoonshotAI/Muon](https://github.com/MoonshotAI/Muon); [Moonlight: Muon is Scalable for LLM Training](https://arxiv.org/abs/2502.16982)
- Gram-Newton-Schulz: [Dao-AILab/Gram-Newton-Schulz](https://github.com/Dao-AILab/Gram-Newton-Schulz)
- WeightWatcher: [Martin et al.](https://arxiv.org/abs/2102.11258)