---
license: apache-2.0
language:
  - en
library_name: pytorch
tags:
  - research
  - transformer
  - attention-residuals
  - muon-optimizer
  - nca-pretraining
  - geometric-monitoring
  - causal-lm
datasets:
  - allenai/peS2o
  - open-web-math/open-web-math
  - HuggingFaceTB/finemath
  - bigcode/the-stack
  - deepmind/pg19
  - pile-of-law/pile-of-law
  - OpenAssistant/oasst2
pipeline_tag: text-generation
model-index:
  - name: kotodama-108m-base
    results:
      - task:
          type: text-generation
          name: Language Modeling
        dataset:
          type: wikitext
          name: WikiText-2
        metrics:
          - name: Word Perplexity (fc-base)
            type: perplexity
            value: 41.76
          - name: Word Perplexity (bcpt-base)
            type: perplexity
            value: 52.09
      - task:
          type: multiple-choice
          name: ARC-Easy
        dataset:
          type: ai2_arc
          name: ARC-Easy
        metrics:
          - name: Accuracy (fc-base)
            type: accuracy
            value: 0.455
          - name: Accuracy (bcpt-base)
            type: accuracy
            value: 0.445
---

# Kotodama 108M Base

A 108M parameter decoder-only transformer trained as a **proxy model** for validating architectural and optimizer choices before scaling to 3B parameters. This is a research artifact, not a production model.

The model combines three techniques not previously studied together at this scale:

- **Block Attention Residuals (AttnRes)** -- learned residual connections across transformer blocks that prevent BOS-sink attention collapse and yield ~4x lower gradient-norm variance across depth.
- **NCA pre-pretraining** -- bootstrapping attention circuits using Neural Cellular Automata trajectories before language training, which trains attention patterns (not MLPs) and creates an L14 attractor basin in the representation manifold.
- **Muon optimizer** -- spectral-norm steepest descent via Newton-Schulz orthogonalization, producing 2-4x higher stable rank than AdamW at matched loss, with Gram-NS optimized coefficients.

**Organization:** [aethera-gp](https://huggingface.co/aethera-gp)
**Training code:** [github.com/LuxiaSL/kotodama](https://github.com/LuxiaSL/kotodama)

## Architecture

The model uses a Llama-family architecture with QK-norm and Block Attention Residuals.

| Parameter | Value |
|-----------|-------|
| Parameters | 107.8M (+ 58.4K AttnRes) |
| Hidden size | 512 |
| Layers | 28 |
| Query heads | 4 |
| KV heads | 2 (GQA ratio 2:1) |
| Head dim | 128 |
| Intermediate size (SwiGLU) | 1408 |
| Vocabulary | 49,152 (SmolLM2 tokenizer) |
| Max context | 4,096 tokens |
| Positional encoding | RoPE (theta=500,000) |
| Normalization | Pre-RMSNorm + QK-norm |
| Embeddings | Tied input/output |
| Bias | None |
| z-loss | 1e-5 |
| AttnRes block boundaries | [0, 3, 7, 12, 21, 25] (DD-v1) |

### Block Attention Residuals (DD-v1)

AttnRes adds per-layer learned pseudo-queries and key norms that create residual connections between block boundaries. The DD-v1 configuration divides the 28-layer network into 6 variable-size blocks at layers [0, 3, 7, 12, 21, 25]. This adds only 58.4K parameters (0.05% overhead) but has substantial effects on training dynamics.

Each transformer block stores:
- `attn_res_query` / `attn_res_norm`: attention sub-block residual
- `mlp_res_query` / `mlp_res_norm`: MLP sub-block residual

A final `final_res_query` / `final_res_norm` aggregates block outputs before the LM head.
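
The exact wiring lives in the training repo; as a rough illustration of the mechanism only, here is a minimal sketch in which a learned pseudo-query softmax-mixes the normed outputs saved at block boundaries back into the residual stream. The tensor shapes and the mixing rule are assumptions, not the repo's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnResAggregator(nn.Module):
    """Illustrative sketch only: a learned pseudo-query scores the
    RMS-normed outputs saved at each block boundary, and the softmax-
    weighted mixture becomes a cross-block residual. The actual module
    (per-sub-block queries, final aggregation) differs in detail."""

    def __init__(self, hidden_size: int = 512):
        super().__init__()
        self.res_query = nn.Parameter(torch.zeros(hidden_size))  # pseudo-query
        self.res_norm = nn.RMSNorm(hidden_size)                  # key norm

    def forward(self, boundary_outputs: list[torch.Tensor]) -> torch.Tensor:
        # boundary_outputs: one [batch, seq, hidden] tensor per block boundary
        keys = torch.stack([self.res_norm(h) for h in boundary_outputs])  # [n, b, s, d]
        scores = torch.einsum("d,nbsd->nbs", self.res_query, keys)
        weights = F.softmax(scores / keys.shape[-1] ** 0.5, dim=0)
        return torch.einsum("nbs,nbsd->bsd", weights, keys)  # residual to add back
```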

### Differences from stock Llama

- **QK-norm**: RMSNorm on Q and K projections after the linear projection, enabling higher learning rates (placement sketched below)
- **z-loss**: LSE-squared regularization preventing logit explosion (sketched below)
- **Smaller vocab** (49K vs 128K): reduces the Godey gradient bottleneck (~94% destruction at 3072/49K vs ~98% at 3072/128K for the 3B target)
- **Block AttnRes**: cross-block residual connections (see above)
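
A minimal sketch of the first two items. The z-loss reduction and the projection stub are illustrative; only the placement (norm after projection, per head) and the LSE-squared form come from the description above:

```python
import torch
import torch.nn as nn

def z_loss(logits: torch.Tensor, weight: float = 1e-5) -> torch.Tensor:
    # LSE-squared regularizer: penalize the squared log-partition function
    # so logits cannot drift upward. Mean reduction is an assumption.
    lse = torch.logsumexp(logits.float(), dim=-1)
    return weight * lse.pow(2).mean()

class QKProjection(nn.Module):
    # QK-norm placement: RMSNorm applied per head to Q and K *after* the
    # linear projection (illustrative stub, not the repo's attention module).
    def __init__(self, hidden=512, n_heads=4, n_kv_heads=2, head_dim=128):
        super().__init__()
        self.q_proj = nn.Linear(hidden, n_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden, n_kv_heads * head_dim, bias=False)
        self.q_norm = nn.RMSNorm(head_dim)
        self.k_norm = nn.RMSNorm(head_dim)
        self.n_heads, self.n_kv_heads, self.head_dim = n_heads, n_kv_heads, head_dim

    def forward(self, x: torch.Tensor):
        B, S, _ = x.shape
        q = self.q_norm(self.q_proj(x).view(B, S, self.n_heads, self.head_dim))
        k = self.k_norm(self.k_proj(x).view(B, S, self.n_kv_heads, self.head_dim))
        return q, k
```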

## Training

### Optimizer Configuration

Hybrid Muon + AdamW: Muon handles the 2D weight matrices (Q/K/V/O projections, FFN gate/up/down -- ~77% of parameters), while AdamW handles everything else (embeddings, norms).

| Parameter | Muon (2D weights) | AdamW (embeddings, norms) |
|-----------|-------------------|---------------------------|
| Learning rate | 0.02 | 6e-4 |
| Momentum / betas | 0.95 (Nesterov) | (0.9, 0.95) |
| Weight decay | 0.01 | 0.1 |
| NS iterations | 5 (Gram-NS coefficients) | -- |
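
A sketch of the parameter split and the orthogonalization step. The name-based embedding check is an assumption, and the NS coefficients shown are the widely used quintic ones from the reference Muon implementation; the run used Gram-NS-optimized coefficients instead:

```python
import torch
import torch.nn as nn

def split_param_groups(model: nn.Module):
    # 2D weight matrices go to Muon; embeddings, norms, and AttnRes
    # vectors go to AdamW. Grouping-by-name is an assumption.
    muon, adamw = [], []
    for name, p in model.named_parameters():
        (muon if p.ndim == 2 and "embed" not in name else adamw).append(p)
    return muon, adamw

@torch.no_grad()
def orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Newton-Schulz iteration toward the nearest semi-orthogonal matrix,
    # applied to the momentum-averaged gradient before the Muon update.
    a, b, c = 3.4445, -4.7750, 2.0315  # reference quintic, not Gram-NS
    X = G / (G.norm() + 1e-7)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```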

**Schedule:** WSD (Warmup-Stable-Decay). 5,000 step warmup (~6%), stable plateau to 90% of training, cosine decay over final 10%.
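
As a sketch, the schedule as a multiplier on the base LR (step counts from the fullcorpus run; decaying to zero is an assumption):

```python
import math

def wsd_lr_scale(step: int, total_steps: int = 81_252,
                 warmup: int = 5_000, decay_frac: float = 0.10) -> float:
    """WSD: linear warmup, flat plateau, cosine decay over the final 10%."""
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup:
        return step / warmup
    if step < decay_start:
        return 1.0
    t = (step - decay_start) / (total_steps - decay_start)
    return 0.5 * (1.0 + math.cos(math.pi * min(t, 1.0)))
```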

**Gradient clipping:** 1.0

**Precision:** BF16 autocast with FP8 compute (FP32 optimizer states).

### NCA Pre-Pretraining

Before language training, attention weights were bootstrapped using NCA (Neural Cellular Automata) pre-pretraining following Han et al. (2026). An NCA checkpoint co-trained with AttnRes DD-v1 (seed-17, 852M tokens) was used as initialization. After NCA, embeddings were reinitialized to the language vocabulary while attention weights, MLPs, and norms from NCA training were preserved (embed-only reinit).
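
A sketch of the embed-only reinit step (the attribute name and init scale are assumptions; only "replace embeddings, keep everything else" comes from the description above):

```python
import torch.nn as nn

def embed_only_reinit(model: nn.Module, vocab_size: int = 49_152,
                      hidden: int = 512) -> nn.Module:
    # Attention, MLP, and norm weights from the NCA checkpoint are kept;
    # only the (tied) embedding is replaced for the language vocabulary.
    new_embed = nn.Embedding(vocab_size, hidden)
    nn.init.normal_(new_embed.weight, mean=0.0, std=0.02)
    model.embed_tokens = new_embed  # hypothetical name; output head shares this weight
    return model
```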

### Data Mix (Fullcorpus)

170.4B tokens from 13 sources, shuffled with seed 42, sequence length 4096.

| Source | Tokens | % | Category |
|--------|--------|---|----------|
| peS2o | 60.7B | 35.6% | Academic papers (Semantic Scholar) |
| OpenCoderReasoning | 35.7B | 21.0% | Code reasoning (R1 + QwQ, Python/C++) |
| Pile of Law | 18.8B | 11.0% | Legal (court opinions, congressional) |
| StackExchange | 15.7B | 9.2% | Q&A (22 high-value sites) |
| OpenWebMath | 14.1B | 8.2% | Math web pages |
| FineMath | 10.8B | 6.4% | Quality-scored math (4+ score) |
| PG-19 | 7.5B | 4.4% | Books (Project Gutenberg, 71K) |
| Wikipedia | 5.0B | 3.0% | English Wikipedia |
| SmolTalk | 0.9B | 0.6% | Synthetic multi-turn dialogue |
| WildChat | 0.5B | 0.3% | Real user-GPT conversations |
| SODA | 0.3B | 0.2% | Synthetic social dialogue |
| Enron | 0.3B | 0.2% | Corporate email |
| OASST2 | 0.01B | <0.1% | Human multi-turn conversations |

**Category breakdown:** Academic/knowledge 38.6%, code reasoning 21.0%, math 14.6%, legal 11.0%, Q&A 9.2%, books 4.4%, conversation 1.1%.

### Hardware and Compute

- **Hardware:** 8x NVIDIA B200 (single node, NVLink)
- **Parallelism:** DDP (DistributedDataParallel)
- **Throughput:** ~1.96M tokens/sec average
- **Micro batch size:** 16 per GPU
- **Global batch size:** 2,097,152 tokens (16 per GPU * 4096 seq len * 8 GPUs * 4 gradient-accumulation steps)
- **torch.compile:** enabled (4x throughput vs eager)

## Model Variants

This repository contains two checkpoints from the same model lineage:

### fc-base (fullcorpus)

**File:** `fc-base.pt.zst`

The primary pretraining run. 170.4B tokens over 81,252 steps on the full 13-source data mix described above. Initialized from NCA+AttnRes checkpoint (seed-17, 852M NCA tokens). WSD schedule with cosine decay in the final 10%.

| Metric | Value |
|--------|-------|
| Final loss | 2.081 |
| Min loss | 1.982 (step 80,200) |
| Final perplexity | 8.01 |
| Tokens seen | 170.4B |
| Tokens/param ratio | ~1,581x |

### bcpt-base (books-CPT)

**File:** `bcpt-base.pt.zst`

Continued pretraining of the fullcorpus model on 36.2B tokens of book data from three Common Pile sources not present in the original data mix. Resumed from fullcorpus step 72,000 (pre-decay, 151B tokens seen) with fresh optimizer state and a new WSD schedule (500-step warmup, 90% stable, 10% cosine decay).

| Source | Tokens | % |
|--------|--------|---|
| Pre-1929 Books (Internet Archive/HathiTrust) | 19.1B | 52.8% |
| Library of Congress | 14.0B | 38.7% |
| DOAB (Open Access Books) | 3.1B | 8.6% |

OCR quality filter applied: documents with >5% garbage characters dropped.

| Metric | Value |
|--------|-------|
| Final loss | 2.342 |
| Min loss | 2.230 (step 17,260) |
| Final perplexity | 10.40 |
| Additional tokens | 36.4B (17,337 steps) |
| Total tokens seen | ~187.4B (resumed from step 72K / 151B tokens) |

The higher loss/perplexity relative to fullcorpus reflects the domain shift to OCR book text, not regression. The books-CPT variant trades general benchmark performance for improved performance on literary and long-form text.

## Evaluation

### LM-Eval Benchmarks

All benchmarks run zero-shot via lm-evaluation-harness.

| Benchmark | Metric | fc-base | bcpt-base |
|-----------|--------|---------|-----------|
| ARC-Easy | acc | 0.455 | 0.445 |
| ARC-Easy | acc_norm | 0.387 | 0.388 |
| BoolQ | acc | 0.559 | 0.499 |
| COPA | acc | 0.590 | 0.590 |
| HellaSwag | acc | 0.277 | 0.280 |
| HellaSwag | acc_norm | 0.297 | 0.295 |
| LAMBADA | acc | 0.281 | 0.297 |
| LAMBADA | ppl | 83.3 | 85.5 |
| PIQA | acc | 0.577 | 0.588 |
| PIQA | acc_norm | 0.569 | 0.571 |
| SciQ | acc | 0.783 | 0.779 |
| SciQ | acc_norm | 0.700 | 0.685 |
| WikiText | word_ppl | 41.76 | 52.09 |
| WikiText | bits/byte | 1.007 | 1.066 |
| Winogrande | acc | 0.508 | 0.515 |

**Notes:** These are proxy-scale (108M) results. Performance is in line with this scale -- the model was not designed to maximize benchmarks. The books-CPT variant shows slight improvements on commonsense/physical reasoning (PIQA, Winogrande, LAMBADA accuracy) and slight degradation on knowledge-heavy tasks (BoolQ, WikiText perplexity), consistent with the domain shift toward literary text.

## Analysis Highlights

The primary value of this model as a research artifact is the geometric monitoring data collected during training. The analysis packages in `fc-analysis/` and `bcpt-analysis/` contain activation geometry, concept geometry, and full metric histories.

### Geometric Health (Final Checkpoint)

Monitored at layers [0, 7, 14, 21, 27] throughout training.

| Metric | Value | Interpretation |
|--------|-------|----------------|
| RankMe (embedding) | 440.5 | High effective dimensionality (out of 512) |
| RankMe rebound ratio | 15.9x | Strong recovery from early collapse (min 27.7 at step 150) |
| WeightWatcher alpha | 7.71 | Within Muon-healthy range (see notes) |
| TwoNN intrinsic dim | 5.76 | Representation manifold dimensionality |
| Dead units | 0.0% | No dead neurons at any monitored layer |
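
For reference, RankMe as published (Garrido et al., 2023) -- we assume the monitoring code matches this definition:

```python
import torch

def rankme(Z: torch.Tensor, eps: float = 1e-7) -> float:
    """RankMe: exp of the entropy of the normalized singular-value
    spectrum of an embedding matrix Z [n_samples, dim]."""
    s = torch.linalg.svdvals(Z.float())
    p = s / s.sum() + eps
    return torch.exp(-(p * p.log()).sum()).item()
```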

### Stable Rank Profiles Across Depth

Stable rank (a smooth measure of the effective rank of a weight matrix; sketched below) remains high across all layers throughout training, a signature of Muon's balanced spectral updates. Representative values from the final checkpoint (step 81,252):

| Layer | Q proj | K proj | O proj | Gate proj | Down proj |
|-------|--------|--------|--------|-----------|-----------|
| 0 | 18.7 | 15.7 | 46.3 | 127.0 | 56.8 |
| 7 | 42.5 | 40.0 | 87.9 | 76.8 | 140.4 |
| 14 | 49.1 | 41.5 | 43.1 | 70.2 | 125.0 |
| 21 | 39.4 | 30.0 | 67.9 | 62.9 | 49.2 |
| 27 | 43.8 | 32.3 | 115.3 | 76.2 | 127.8 |

Key observations:
- **No low-rank collapse:** All weight matrices maintain high stable rank through 170B tokens. Under AdamW, these values would typically be 2-4x lower.
- **Depth utilization:** Non-monotonic stable rank profile indicates all layers are actively contributing (not degenerating into near-identity transformations).
- **Zero dead units:** No layer shows any dead neurons, even after extreme overtraining (1,581x tokens/parameter).
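
The stable rank in the table above is, under the standard definition we assume here, the squared Frobenius norm over the squared spectral norm:

```python
import torch

def stable_rank(W: torch.Tensor) -> float:
    """||W||_F^2 / ||W||_2^2: ranges from 1 (rank-1) up to min(W.shape)."""
    sq_frobenius = W.float().pow(2).sum()
    top_singular = torch.linalg.matrix_norm(W.float(), ord=2)
    return (sq_frobenius / top_singular.pow(2)).item()
```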

### Attention Entropy Across Depth

| Layer | Mean Entropy | Std | Interpretation |
|-------|-------------|-----|----------------|
| 0 | 6.13 | 0.43 | Broad attention (early feature mixing) |
| 7 | 4.64 | 0.77 | Selective attention with variance |
| 14 | 5.49 | 0.41 | Moderate selectivity |
| 21 | 5.68 | 0.29 | Moderate, low variance |
| 27 | 4.14 | 0.79 | Most selective (prediction heads) |

This gradient -- broad at the bottom, selective at the top -- is the healthy pattern. Crucially, **the deep layers (L27) maintain diverse attention patterns** (std=0.79) rather than collapsing to BOS-sink. In baseline models without AttnRes, layers 21-27 develop 89-90% BOS attention concentration by this training stage.
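
A sketch of the entropy computation (natural-log units are our assumption, consistent with the magnitudes in the table):

```python
import torch

def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of each query's attention distribution, averaged
    over batch, heads, and query positions.
    attn: [batch, heads, q_len, k_len] with rows summing to 1."""
    p = attn.clamp_min(1e-9)
    return -(p * p.log()).sum(dim=-1).mean()
```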

### Anisotropy Profile

| Layer | Anisotropy |
|-------|-----------|
| 0 | 0.066 |
| 7 | 0.452 |
| 14 | 0.413 |
| 21 | 0.148 |
| 27 | 0.090 |

The inverted-U anisotropy profile (low at edges, peaking at middle layers) indicates structured representational geometry rather than isotropy collapse or extreme anisotropy.
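
One common anisotropy estimator (Ethayarajh-style expected cosine similarity between random token pairs) is sketched below; whether the monitoring code uses this exact estimator is an assumption:

```python
import torch
import torch.nn.functional as F

def anisotropy(hidden: torch.Tensor, n_pairs: int = 10_000) -> float:
    """Expected cosine similarity between random pairs of token
    representations. hidden: [n_tokens, dim] from a single layer."""
    n = hidden.shape[0]
    i = torch.randint(n, (n_pairs,))
    j = torch.randint(n, (n_pairs,))
    h = F.normalize(hidden.float(), dim=-1)
    return (h[i] * h[j]).sum(dim=-1).mean().item()
```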

### AttnRes Effects (from Proxy Phase Ablations)

These findings come from the 5-run optimizer sweep at 6B tokens and the full 170B run:

- **BOS-sink prevention:** Baseline models develop 89-90% BOS attention at deep layers by 6B tokens. DD-v1 AttnRes prevents this entirely, maintaining diverse attention patterns at all depths.
- **4x gradient uniformity:** Gradient norm variance across layers is ~4x lower with AttnRes, enabling more uniform learning across depth.
- **Full depth utilization:** Without AttnRes, deep layers tend toward near-identity transformations. With AttnRes, stable rank and attention entropy remain diverse at all depths.
- **DD-v2 fragility:** Shifting even one block boundary (L12 to L14) produced 12/16 geometric metrics outside the range of all other configurations. Variable-size blocks cascade nonlinearly.

### NCA Pre-Pretraining Effects

- **Trains attention, not MLPs:** NCA pre-pretraining primarily structures attention weight matrices. MLP weights show minimal structured change, confirming that MLP reinit after NCA is correct.
- **L14 attractor basin:** NCA creates a distinctive geometric signature at layer 14 that persists through full language training. This basin is present regardless of AttnRes configuration.
- **Sub-additive with AttnRes:** NCA + AttnRes produces only +0.008 nats over the better of either alone, but preserves geometric properties from both techniques everywhere in the network.

## Key Findings (Proxy Phase)

1. **Muon lr=0.02 is the Pareto optimum** for 108M: matches AdamW final loss while maintaining 2-4x higher stable rank across all weight matrices.
2. **torch.compile is the dominant throughput optimization**, providing a 4x improvement. Liger kernels without FusedLinearCE reduced compiled throughput by 13%.
3. **Extreme overtraining (1,581x tokens/param) does not cause geometric collapse** with Muon + AttnRes. Stable rank, attention entropy, and dead unit counts all remain healthy at 170B tokens.
4. **WW alpha healthy range is higher for Muon than AdamW.** Alpha values of 7-8 are normal for Muon-trained models; do not apply AdamW-calibrated thresholds (which would flag these as unhealthy).

## Usage

The checkpoints are stored as compressed PyTorch state dicts (`.pt.zst`). To load:

```python
import torch
import zstandard as zstd
import io

# Decompress
with open("fc-base.pt.zst", "rb") as f:
    dctx = zstd.ZstdDecompressor()
    decompressed = dctx.decompress(f.read())

# Load state dict
state_dict = torch.load(io.BytesIO(decompressed), map_location="cpu", weights_only=True)

# Initialize model (requires the kotodama training code)
from src.model.llama import LuxiaBaseModel, LuxiaModelConfig

config = LuxiaModelConfig(
    hidden_size=512,
    num_layers=28,
    num_attention_heads=4,
    num_kv_heads=2,
    head_dim=128,
    intermediate_size=1408,
    vocab_size=49152,
    max_position_embeddings=4096,
    rope_theta=500000.0,
    qk_norm=True,
    tie_word_embeddings=True,
    z_loss_weight=1e-5,
    attn_res=True,
    attn_res_boundaries=[0, 3, 7, 12, 21, 25],
)

model = LuxiaBaseModel(config)
model.load_state_dict(state_dict)
```

**Tokenizer:** `HuggingFaceTB/SmolLM2-135M` (49,152 vocab, byte-fallback).
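
A minimal generation loop using pure temperature sampling (per the note under Limitations). This assumes the model's forward pass returns `[batch, seq, vocab]` logits; the actual signature in the kotodama repo may differ:

```python
import torch
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

@torch.no_grad()
def sample(model, prompt: str, max_new: int = 50, temp: float = 0.8) -> str:
    # Pure temperature sampling -- no top-p, which degrades quality at
    # this scale (see Limitations).
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new):
        logits = model(ids)[:, -1, :] / temp
        next_id = torch.multinomial(logits.softmax(dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
    return tok.decode(ids[0])
```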

## Repository Contents

```
fc-base.pt.zst               # Fullcorpus final checkpoint (81,252 steps, 170.4B tokens)
bcpt-base.pt.zst             # Books-CPT checkpoint (17,337 additional steps, 36.4B tokens)
fc-analysis/                 # Fullcorpus analysis package
  activation_geometry/       # Per-layer activation extractions
  concept_geometry/          # Concept-level geometric analysis
  lm_eval/                   # Full lm-evaluation-harness results
  report.html                # Analysis report
bcpt-analysis/               # Books-CPT analysis package (same structure)
fc-metrics.jsonl             # Fullcorpus training metrics (loss, LR, throughput)
fc-geo_metrics.jsonl         # Fullcorpus geometric monitoring (stable rank, entropy, etc.)
bcpt-metrics.jsonl           # Books-CPT training metrics
bcpt-geo_metrics.jsonl       # Books-CPT geometric monitoring
```

## Limitations

- **108M proxy scale.** This model exists to validate architecture and optimizer choices, not to be useful for downstream tasks. Benchmark performance reflects this.
- **No raw code in training data.** The 645GB cleaned stack_v1 JSONL (~126B tokens, 130 languages) was never tokenized and is absent from the data mix. The model sees code only through reasoning traces (OpenCoderReasoning) and Q&A (StackExchange).
- **Conversational data < 1.2%.** The original spec targeted 25% conversational data. The actual mix is dominated by academic text (35.6%) and code reasoning (21.0%).
- **OCR noise in books-CPT.** Despite filtering documents with >5% garbage characters, the books-CPT data (pre-1929 scans, Library of Congress) contains residual OCR artifacts.
- **No deduplication** was applied to the books-CPT data (estimated minimal cross-source overlap between digitization projects, but not verified).
- **Eval methodology:** Top-p sampling catastrophically degrades generation quality at 108M scale, so all generation-based evaluation uses pure temperature sampling.

## Citation

```bibtex
@misc{kotodama2026,
  title={Kotodama: Block Attention Residuals and NCA Pre-Pretraining for Transformer Language Models},
  author={Aethera GP},
  year={2026},
  url={https://huggingface.co/aethera-gp/kotodama-108m-base}
}
```

### References

- Block Attention Residuals: see `Attention_Residuals.pdf` in the training repo
- NCA Pre-Pretraining: [Han et al., 2026](https://arxiv.org/abs/2603.10055)
- Muon Optimizer: [MoonshotAI/Muon](https://github.com/MoonshotAI/Muon); [Moonlight: Muon is Scalable for LLM Training](https://arxiv.org/abs/2502.16982)
- Gram-Newton-Schulz: [Dao-AILab/Gram-Newton-Schulz](https://github.com/Dao-AILab/Gram-Newton-Schulz)
- WeightWatcher: [Martin et al.](https://arxiv.org/abs/2102.11258)