---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- research
- transformer
- attention-residuals
- muon-optimizer
- nca-pretraining
- geometric-monitoring
- causal-lm
datasets:
- allenai/peS2o
- open-web-math/open-web-math
- HuggingFaceTB/finemath
- bigcode/the-stack
- deepmind/pg19
- pile-of-law/pile-of-law
- OpenAssistant/oasst2
pipeline_tag: text-generation
model-index:
- name: kotodama-108m-base
results:
- task:
type: text-generation
name: Language Modeling
dataset:
type: wikitext
name: WikiText-2
metrics:
- name: Word Perplexity (fc-base)
type: perplexity
value: 41.76
- name: Word Perplexity (bcpt-base)
type: perplexity
value: 52.09
- task:
type: multiple-choice
name: ARC-Easy
dataset:
type: ai2_arc
name: ARC-Easy
metrics:
- name: Accuracy (fc-base)
type: accuracy
value: 0.455
- name: Accuracy (bcpt-base)
type: accuracy
value: 0.445
---
# Kotodama 108M Base
A 108M-parameter decoder-only transformer trained as a **proxy model** for validating architectural and optimizer choices before scaling to 3B parameters. This is a research artifact, not a production model.
The model combines three techniques not previously studied together at this scale:
- **Block Attention Residuals (AttnRes)** -- learned residual connections across transformer blocks that prevent BOS-sink attention collapse and yield roughly 4x lower gradient-norm variance across depth.
- **NCA pre-pretraining** -- bootstrapping attention circuits using Neural Cellular Automata trajectories before language training, which trains attention patterns (not MLPs) and creates an L14 attractor basin in the representation manifold.
- **Muon optimizer** -- spectral-norm steepest descent via Newton-Schulz orthogonalization, producing 2-4x higher stable rank than AdamW at matched loss, with Gram-NS optimized coefficients.
**Organization:** [aethera-gp](https://huggingface.co/aethera-gp)
**Training code:** [github.com/LuxiaSL/kotodama](https://github.com/LuxiaSL/kotodama)
## Architecture
The model uses a Llama-family architecture with QK-norm and Block Attention Residuals.
| Parameter | Value |
|-----------|-------|
| Parameters | 107.8M (+ 58.4K AttnRes) |
| Hidden size | 512 |
| Layers | 28 |
| Query heads | 4 |
| KV heads | 2 (GQA ratio 2:1) |
| Head dim | 128 |
| Intermediate size (SwiGLU) | 1408 |
| Vocabulary | 49,152 (SmolLM2 tokenizer) |
| Max context | 4,096 tokens |
| Positional encoding | RoPE (theta=500,000) |
| Normalization | Pre-RMSNorm + QK-norm |
| Embeddings | Tied input/output |
| Bias | None |
| z-loss | 1e-5 |
| AttnRes block boundaries | [0, 3, 7, 12, 21, 25] (DD-v1) |
### Block Attention Residuals (DD-v1)
AttnRes adds per-layer learned pseudo-queries and key norms that create residual connections between block boundaries. The DD-v1 configuration divides the 28-layer network into 6 variable-size blocks at layers [0, 3, 7, 12, 21, 25]. This adds only 58.4K parameters (0.05% overhead) but has substantial effects on training dynamics.
Each transformer block stores:
- `attn_res_query` / `attn_res_norm`: attention sub-block residual
- `mlp_res_query` / `mlp_res_norm`: MLP sub-block residual
A final `final_res_query` / `final_res_norm` aggregates block outputs before the LM head.
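The 58.4K parameter figure follows directly from these shapes. A minimal sketch, assuming each pseudo-query and norm is a single hidden-size vector (the actual module layout and initialization live in the kotodama training code and may differ):

```python
import torch
import torch.nn as nn

HIDDEN, NUM_LAYERS = 512, 28

class AttnResParams(nn.Module):
    """Sketch of the extra per-layer AttnRes parameters (names from the checkpoint)."""
    def __init__(self, hidden_size: int = HIDDEN):
        super().__init__()
        # Assumption: each is a single hidden-size vector; init values are illustrative.
        self.attn_res_query = nn.Parameter(torch.zeros(hidden_size))
        self.attn_res_norm = nn.Parameter(torch.ones(hidden_size))
        self.mlp_res_query = nn.Parameter(torch.zeros(hidden_size))
        self.mlp_res_norm = nn.Parameter(torch.ones(hidden_size))

per_layer = 4 * HIDDEN                    # 2,048 parameters per transformer layer
final = 2 * HIDDEN                        # final_res_query + final_res_norm
print(NUM_LAYERS * per_layer + final)     # 28 * 2,048 + 1,024 = 58,368, i.e. ~58.4K
```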
### Differences from stock Llama
- **QK-norm**: RMSNorm applied to the Q and K activations after their linear projections, enabling higher learning rates
- **z-loss**: LSE-squared regularization that prevents logit explosion (both are sketched after this list)
- **Smaller vocab** (49K vs 128K): reduces the Godey gradient bottleneck (~94% destruction at 3072/49K vs ~98% at 3072/128K for the 3B target)
- **Block AttnRes**: cross-block residual connections (see above)
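A minimal sketch of the QK-norm and z-loss components mentioned above, assuming standard RMSNorm over the head dimension and the usual mean-squared-LSE form of z-loss (the exact placement inside the model may differ):

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm over the last dimension (applied per attention head for QK-norm)
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def qk_norm(q: torch.Tensor, k: torch.Tensor, q_w: torch.Tensor, k_w: torch.Tensor):
    # Normalize Q and K after their linear projections, before RoPE/attention
    return rms_norm(q, q_w), rms_norm(k, k_w)

def z_loss(logits: torch.Tensor, weight: float = 1e-5) -> torch.Tensor:
    # LSE-squared regularizer: penalizes the log-partition function to keep logits bounded
    lse = torch.logsumexp(logits.float(), dim=-1)
    return weight * lse.pow(2).mean()
```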
## Training
### Optimizer Configuration
Hybrid Muon + AdamW: Muon handles the 2D weight matrices (Q/K/V/O projections, FFN gate/up/down -- ~77% of parameters), while AdamW handles everything else (embeddings, norms).
| Parameter | Muon (2D weights) | AdamW (embeddings, norms) |
|-----------|-------------------|---------------------------|
| Learning rate | 0.02 | 6e-4 |
| Momentum / betas | 0.95 (Nesterov) | (0.9, 0.95) |
| Weight decay | 0.01 | 0.1 |
| NS iterations | 5 (Gram-NS coefficients) | -- |
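A sketch of the parameter grouping described above, using the hyperparameters from the table. The Muon constructor call is left as a comment because its signature depends on the implementation (e.g. MoonshotAI/Muon); the 2D-vs-rest split and the embedding exclusion by name are simplifications of the actual grouping logic:

```python
import torch

def build_optimizers(model: torch.nn.Module):
    """Split parameters into a Muon group (2D weight matrices) and an AdamW group."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Attention Q/K/V/O and FFN gate/up/down projections are 2D matrices;
        # embeddings, norms, and AttnRes vectors go to AdamW instead.
        if p.ndim == 2 and "embed" not in name:
            muon_params.append(p)
        else:
            adamw_params.append(p)

    adamw = torch.optim.AdamW(adamw_params, lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1)
    # The Muon group uses lr=0.02, Nesterov momentum 0.95, weight decay 0.01, and
    # 5 Newton-Schulz iterations. Constructor arguments depend on the Muon
    # implementation used, so the call below is illustrative only:
    # muon = Muon(muon_params, lr=0.02, momentum=0.95, nesterov=True,
    #             weight_decay=0.01, ns_steps=5)
    return muon_params, adamw
```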
**Schedule:** WSD (Warmup-Stable-Decay): 5,000-step warmup (~6% of training), stable plateau to 90% of training, cosine decay over the final 10% (sketched below).
**Gradient clipping:** 1.0
**Precision:** BF16 autocast with FP8 compute (FP32 optimizer states).
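A sketch of the WSD multiplier as described above, assuming a linear warmup (the training code's exact schedule implementation may differ):

```python
import math

def wsd_multiplier(step: int, total_steps: int, warmup_steps: int = 5000,
                   decay_frac: float = 0.10, min_mult: float = 0.0) -> float:
    """Warmup-Stable-Decay learning-rate multiplier in [min_mult, 1.0]."""
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        return step / max(1, warmup_steps)          # linear warmup (assumption)
    if step < decay_start:
        return 1.0                                  # stable plateau
    # cosine decay over the final 10% of training
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return min_mult + (1.0 - min_mult) * 0.5 * (1.0 + math.cos(math.pi * progress))

# Fullcorpus run: 81,252 total steps, so cosine decay covers roughly the final 8,125 steps.
```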
### NCA Pre-Pretraining
Before language training, attention weights were bootstrapped using NCA (Neural Cellular Automata) pre-pretraining following Han et al. (2026). An NCA checkpoint co-trained with AttnRes DD-v1 (seed-17, 852M tokens) was used as initialization. After NCA, embeddings were reinitialized to the language vocabulary while attention weights, MLPs, and norms from NCA training were preserved (embed-only reinit).
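A minimal sketch of the embed-only reinit step, assuming tied embeddings exposed as a standard `nn.Embedding`; the attribute and module names (`config.hidden_size`, `embed_tokens`) are illustrative, not the actual kotodama API:

```python
import torch.nn as nn

def embed_only_reinit(model: nn.Module, vocab_size: int = 49_152, std: float = 0.02):
    """Re-initialize tied embeddings for the language vocabulary, keeping all
    attention, MLP, and norm weights from NCA pre-pretraining intact."""
    hidden = model.config.hidden_size          # illustrative attribute name
    new_embed = nn.Embedding(vocab_size, hidden)
    nn.init.normal_(new_embed.weight, mean=0.0, std=std)
    model.embed_tokens = new_embed             # illustrative module name
    # With tied embeddings the LM head shares this weight, so no separate reinit
    # of the output projection is needed.
    return model
```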
### Data Mix (Fullcorpus)
170.4B tokens from 13 sources, shuffled with seed 42, sequence length 4096.
| Source | Tokens | % | Category |
|--------|--------|---|----------|
| peS2o | 60.7B | 35.6% | Academic papers (Semantic Scholar) |
| OpenCoderReasoning | 35.7B | 21.0% | Code reasoning (R1 + QwQ, Python/C++) |
| Pile of Law | 18.8B | 11.0% | Legal (court opinions, congressional) |
| StackExchange | 15.7B | 9.2% | Q&A (22 high-value sites) |
| OpenWebMath | 14.1B | 8.2% | Math web pages |
| FineMath | 10.8B | 6.4% | Quality-scored math (4+ score) |
| PG-19 | 7.5B | 4.4% | Books (Project Gutenberg, 71K) |
| Wikipedia | 5.0B | 3.0% | English Wikipedia |
| SmolTalk | 0.9B | 0.6% | Synthetic multi-turn dialogue |
| WildChat | 0.5B | 0.3% | Real user-GPT conversations |
| SODA | 0.3B | 0.2% | Synthetic social dialogue |
| Enron | 0.3B | 0.2% | Corporate email |
| OASST2 | 0.01B | <0.1% | Human multi-turn conversations |
**Category breakdown:** Academic/knowledge 38.6%, code reasoning 21.0%, math 14.6%, legal 11.0%, Q&A 9.2%, books 4.4%, conversation 1.1%.
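For reference, the same mix expressed as sampling weights (a sketch only; the data pipeline in the training repo may construct the mixture differently):

```python
import random

# Fullcorpus mixture weights (fraction of the 170.4B-token budget), from the table above
FULLCORPUS_MIX = {
    "peS2o": 0.356, "OpenCoderReasoning": 0.210, "pile_of_law": 0.110,
    "stackexchange": 0.092, "open_web_math": 0.082, "finemath": 0.064,
    "pg19": 0.044, "wikipedia": 0.030, "smoltalk": 0.006, "wildchat": 0.003,
    "soda": 0.002, "enron": 0.002, "oasst2": 0.0001,
}

def sample_source(rng: random.Random) -> str:
    """Draw one source name according to the mixture weights."""
    names, weights = zip(*FULLCORPUS_MIX.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)  # seed 42 as in training (used there for the shuffle, not this sampler)
print(sample_source(rng))
```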
### Hardware and Compute
- **Hardware:** 8x NVIDIA B200 (single node, NVLink)
- **Parallelism:** DDP (DistributedDataParallel)
- **Throughput:** ~1.96M tokens/sec average
- **Micro batch size:** 16 per GPU
- **Global batch size:** 2,097,152 tokens (16 * 4096 * 8 GPUs * 4 gradient-accumulation steps)
- **torch.compile:** enabled (4x throughput vs eager)
## Model Variants
This repository contains two checkpoints from the same model lineage:
### fc-base (fullcorpus)
**File:** `fc-base.pt.zst`
The primary pretraining run. 170.4B tokens over 81,252 steps on the full 13-source data mix described above. Initialized from NCA+AttnRes checkpoint (seed-17, 852M NCA tokens). WSD schedule with cosine decay in the final 10%.
| Metric | Value |
|--------|-------|
| Final loss | 2.081 |
| Min loss | 1.982 (step 80,200) |
| Final perplexity | 8.01 |
| Tokens seen | 170.4B |
| Tokens/param ratio | ~1,581x |
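For reference, the reported perplexity is the exponential of the loss: exp(2.081) ≈ 8.01 here, and exp(2.342) ≈ 10.40 for the books-CPT checkpoint below.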
### bcpt-base (books-CPT)
**File:** `bcpt-base.pt.zst`
Continued pretraining of the fullcorpus model on 36.2B tokens of book data from three Common Pile sources not present in the original data mix. Resumed from fullcorpus step 72,000 (pre-decay, 151B tokens seen) with fresh optimizer state and a new WSD schedule (500-step warmup, 90% stable, 10% cosine decay).
| Source | Tokens | % |
|--------|--------|---|
| Pre-1929 Books (Internet Archive/HathiTrust) | 19.1B | 52.8% |
| Library of Congress | 14.0B | 38.7% |
| DOAB (Open Access Books) | 3.1B | 8.6% |
OCR quality filter applied: documents with >5% garbage characters dropped.
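A sketch of that filter, assuming "garbage characters" means unprintable or Unicode replacement characters (the pipeline's actual heuristic may differ):

```python
def garbage_ratio(text: str) -> float:
    """Fraction of characters that are unprintable or Unicode replacement chars."""
    if not text:
        return 0.0
    bad = sum(1 for ch in text
              if ch == "\ufffd" or (not ch.isprintable() and ch not in "\n\t\r"))
    return bad / len(text)

def keep_document(text: str, threshold: float = 0.05) -> bool:
    """Drop documents whose garbage-character ratio exceeds 5%."""
    return garbage_ratio(text) <= threshold
```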
| Metric | Value |
|--------|-------|
| Final loss | 2.342 |
| Min loss | 2.230 (step 17,260) |
| Final perplexity | 10.40 |
| Additional tokens | 36.4B (17,337 steps) |
| Total tokens seen | ~187.4B (resumed from step 72K / 151B tokens) |
The higher loss/perplexity relative to fullcorpus reflects the domain shift to OCR book text, not regression. The books-CPT variant trades general benchmark performance for improved performance on literary and long-form text.
## Evaluation
### LM-Eval Benchmarks
All benchmarks run zero-shot via lm-evaluation-harness.
| Benchmark | Metric | fc-base | bcpt-base |
|-----------|--------|---------|-----------|
| ARC-Easy | acc | 0.455 | 0.445 |
| ARC-Easy | acc_norm | 0.387 | 0.388 |
| BoolQ | acc | 0.559 | 0.499 |
| COPA | acc | 0.590 | 0.590 |
| HellaSwag | acc | 0.277 | 0.280 |
| HellaSwag | acc_norm | 0.297 | 0.295 |
| LAMBADA | acc | 0.281 | 0.297 |
| LAMBADA | ppl | 83.3 | 85.5 |
| PIQA | acc | 0.577 | 0.588 |
| PIQA | acc_norm | 0.569 | 0.571 |
| SciQ | acc | 0.783 | 0.779 |
| SciQ | acc_norm | 0.700 | 0.685 |
| WikiText | word_ppl | 41.76 | 52.09 |
| WikiText | bits/byte | 1.007 | 1.066 |
| Winogrande | acc | 0.508 | 0.515 |
**Notes:** These are proxy-scale (108M) results. Performance is in line with expectations at this scale -- the model was not designed to maximize benchmarks. The books-CPT variant shows slight improvements on commonsense/physical reasoning (PIQA, Winogrande, LAMBADA accuracy) and slight degradation on knowledge-heavy tasks (BoolQ, WikiText perplexity), consistent with the domain shift toward literary text.
## Analysis Highlights
The primary value of this model as a research artifact is the geometric monitoring data collected during training. The analysis packages in `fc-analysis/` and `bcpt-analysis/` contain activation geometry, concept geometry, and full metric histories.
### Geometric Health (Final Checkpoint)
Monitored at layers [0, 7, 14, 21, 27] throughout training.
| Metric | Value | Interpretation |
|--------|-------|----------------|
| RankMe (embedding) | 440.5 | High effective dimensionality (out of 512) |
| RankMe rebound ratio | 15.9x | Strong recovery from early collapse (min 27.7 at step 150) |
| WeightWatcher alpha | 7.71 | Within Muon-healthy range (see notes) |
| TwoNN intrinsic dim | 5.76 | Representation manifold dimensionality |
| Dead units | 0.0% | No dead neurons at any monitored layer |
### Stable Rank Profiles Across Depth
Stable rank (effective rank of weight matrices) remains high across all layers throughout training, a signature of Muon's balanced spectral updates. Representative values from the final checkpoint (step 81,252):
| Layer | Q proj | K proj | O proj | Gate proj | Down proj |
|-------|--------|--------|--------|-----------|-----------|
| 0 | 18.7 | 15.7 | 46.3 | 127.0 | 56.8 |
| 7 | 42.5 | 40.0 | 87.9 | 76.8 | 140.4 |
| 14 | 49.1 | 41.5 | 43.1 | 70.2 | 125.0 |
| 21 | 39.4 | 30.0 | 67.9 | 62.9 | 49.2 |
| 27 | 43.8 | 32.3 | 115.3 | 76.2 | 127.8 |
Key observations:
- **No low-rank collapse:** All weight matrices maintain high stable rank through 170B tokens. Under AdamW, these values would typically be 2-4x lower.
- **Depth utilization:** Non-monotonic stable rank profile indicates all layers are actively contributing (not degenerating into near-identity transformations).
- **Zero dead units:** No layer shows any dead neurons, even after extreme overtraining (1,581x tokens/parameter).
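For reference, the stable rank values above are consistent with the conventional spectral definition, the squared Frobenius norm divided by the squared spectral norm; a minimal sketch under that assumption:

```python
import torch

def stable_rank(W: torch.Tensor) -> float:
    """Squared Frobenius norm over squared spectral norm of a 2D weight matrix."""
    W = W.float()
    fro_sq = W.pow(2).sum()
    spec = torch.linalg.matrix_norm(W, ord=2)   # largest singular value
    return (fro_sq / spec.pow(2)).item()

# e.g. stable_rank(model.layers[7].self_attn.q_proj.weight)  # module path illustrative
```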
### Attention Entropy Across Depth
| Layer | Mean Entropy | Std | Interpretation |
|-------|-------------|-----|----------------|
| 0 | 6.13 | 0.43 | Broad attention (early feature mixing) |
| 7 | 4.64 | 0.77 | Selective attention with variance |
| 14 | 5.49 | 0.41 | Moderate selectivity |
| 21 | 5.68 | 0.29 | Moderate, low variance |
| 27 | 4.14 | 0.79 | Most selective (prediction heads) |
This gradient -- broad at the bottom, selective at the top -- is the healthy pattern. Crucially, **the deep layers (L27) maintain diverse attention patterns** (std=0.79) rather than collapsing to BOS-sink. In baseline models without AttnRes, layers 21-27 develop 89-90% BOS attention concentration by this training stage.
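A sketch of the entropy statistic, assuming Shannon entropy in nats of each query position's post-softmax attention distribution, averaged over heads and positions (the monitoring code's exact convention is an assumption):

```python
import torch

def attention_entropy(attn_probs: torch.Tensor, eps: float = 1e-9):
    """Mean and std of per-query attention entropy.

    attn_probs: [batch, heads, query_len, key_len] post-softmax attention weights.
    Entropy is taken over the key dimension, in nats.
    """
    ent = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)  # [batch, heads, query_len]
    return ent.mean().item(), ent.std().item()
```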
### Anisotropy Profile
| Layer | Anisotropy |
|-------|-----------|
| 0 | 0.066 |
| 7 | 0.452 |
| 14 | 0.413 |
| 21 | 0.148 |
| 27 | 0.090 |
The inverted-U anisotropy profile (low at edges, peaking at middle layers) indicates structured representational geometry rather than isotropy collapse or extreme anisotropy.
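These values are consistent with the common convention of measuring anisotropy as the mean pairwise cosine similarity of hidden states at a layer; a sketch under that assumption (the monitoring code's actual estimator is not specified here):

```python
import torch

def anisotropy(hidden: torch.Tensor, max_tokens: int = 2048) -> float:
    """Mean pairwise cosine similarity of token representations at one layer.

    hidden: [num_tokens, hidden_size] activations; illustrative estimator only.
    """
    x = torch.nn.functional.normalize(hidden[:max_tokens].float(), dim=-1)
    sims = x @ x.T                                   # cosine-similarity matrix
    n = x.shape[0]
    off_diag = (sims.sum() - n) / (n * (n - 1))      # exclude self-similarity
    return off_diag.item()
```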
### AttnRes Effects (from Proxy Phase Ablations)
These findings come from the 5-run optimizer sweep at 6B tokens and the full 170B run:
- **BOS-sink prevention:** Baseline models develop 89-90% BOS attention at deep layers by 6B tokens. DD-v1 AttnRes prevents this entirely, maintaining diverse attention patterns at all depths.
- **4x gradient uniformity:** Gradient norm variance across layers is ~4x lower with AttnRes, enabling more uniform learning across depth.
- **Full depth utilization:** Without AttnRes, deep layers tend toward near-identity transformations. With AttnRes, stable rank and attention entropy remain diverse at all depths.
- **DD-v2 fragility:** Shifting even one block boundary (L12 to L14) produced 12/16 geometric metrics outside the range of all other configurations. Variable-size blocks cascade nonlinearly.
### NCA Pre-Pretraining Effects
- **Trains attention, not MLPs:** NCA pre-pretraining primarily structures attention weight matrices. MLP weights show minimal structured change, confirming that MLP reinit after NCA is correct.
- **L14 attractor basin:** NCA creates a distinctive geometric signature at layer 14 that persists through full language training. This basin is present regardless of AttnRes configuration.
- **Sub-additive with AttnRes:** NCA + AttnRes produces only +0.008 nats over the better of either alone, but preserves geometric properties from both techniques everywhere in the network.
## Key Findings (Proxy Phase)
1. **Muon lr=0.02 is the Pareto optimum** for 108M: matches AdamW final loss while maintaining 2-4x higher stable rank across all weight matrices.
2. **torch.compile is the dominant throughput optimization**, providing a 4x improvement. Liger kernels without FusedLinearCE reduced compiled throughput by 13%.
3. **Extreme overtraining (1,581x tokens/param) does not cause geometric collapse** with Muon + AttnRes. Stable rank, attention entropy, and dead unit counts all remain healthy at 170B tokens.
4. **WW alpha healthy range is higher for Muon than AdamW.** Alpha values of 7-8 are normal for Muon-trained models; do not apply AdamW-calibrated thresholds (which would flag these as unhealthy).
## Usage
The checkpoints are stored as compressed PyTorch state dicts (`.pt.zst`). To load:
```python
import torch
import zstandard as zstd
import io
# Decompress
with open("fc-base.pt.zst", "rb") as f:
dctx = zstd.ZstdDecompressor()
decompressed = dctx.decompress(f.read())
# Load state dict
state_dict = torch.load(io.BytesIO(decompressed), map_location="cpu", weights_only=True)
# Initialize model (requires the kotodama training code)
from src.model.llama import LuxiaBaseModel, LuxiaModelConfig
config = LuxiaModelConfig(
hidden_size=512,
num_layers=28,
num_attention_heads=4,
num_kv_heads=2,
head_dim=128,
intermediate_size=1408,
vocab_size=49152,
max_position_embeddings=4096,
rope_theta=500000.0,
qk_norm=True,
tie_word_embeddings=True,
z_loss_weight=1e-5,
attn_res=True,
attn_res_boundaries=[0, 3, 7, 12, 21, 25],
)
model = LuxiaBaseModel(config)
model.load_state_dict(state_dict)
```
**Tokenizer:** `HuggingFaceTB/SmolLM2-135M` (49,152 vocab, byte-fallback).
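The tokenizer loads directly from the SmolLM2 repository with standard `transformers` usage and does not require the kotodama code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
enc = tokenizer("Kotodama is a 108M-parameter proxy model.", return_tensors="pt")
print(enc["input_ids"].shape)  # token IDs drawn from the 49,152-entry vocabulary
```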
## Repository Contents
```
fc-base.pt.zst # Fullcorpus final checkpoint (81,252 steps, 170.4B tokens)
bcpt-base.pt.zst # Books-CPT checkpoint (17,337 additional steps, 36.4B tokens)
fc-analysis/ # Fullcorpus analysis package
activation_geometry/ # Per-layer activation extractions
concept_geometry/ # Concept-level geometric analysis
lm_eval/ # Full lm-evaluation-harness results
report.html # Analysis report
bcpt-analysis/ # Books-CPT analysis package (same structure)
fc-metrics.jsonl # Fullcorpus training metrics (loss, LR, throughput)
fc-geo_metrics.jsonl # Fullcorpus geometric monitoring (stable rank, entropy, etc.)
bcpt-metrics.jsonl # Books-CPT training metrics
bcpt-geo_metrics.jsonl # Books-CPT geometric monitoring
```
## Limitations
- **108M proxy scale.** This model exists to validate architecture and optimizer choices, not to be useful for downstream tasks. Benchmark performance reflects this.
- **No raw code in training data.** The 645GB cleaned stack_v1 JSONL (~126B tokens, 130 languages) was never tokenized and is absent from the data mix. The model sees code only through reasoning traces (OpenCoderReasoning) and Q&A (StackExchange).
- **Conversational data < 1.2%.** The original spec targeted 25% conversational data. The actual mix is dominated by academic text (35.6%) and code reasoning (21.0%).
- **OCR noise in books-CPT.** Despite filtering documents with >5% garbage characters, the books-CPT data (pre-1929 scans, Library of Congress) contains residual OCR artifacts.
- **No deduplication** was applied to the books-CPT data (estimated minimal cross-source overlap between digitization projects, but not verified).
- **Eval methodology:** Top-p sampling catastrophically degrades generation quality at 108M scale; all evaluation therefore uses pure temperature sampling.
## Citation
```bibtex
@misc{kotodama2026,
title={Kotodama: Block Attention Residuals and NCA Pre-Pretraining for Transformer Language Models},
author={Aethera GP},
year={2026},
url={https://huggingface.co/aethera-gp/kotodama-108m-base}
}
```
### References
- Block Attention Residuals: see `Attention_Residuals.pdf` in the training repo
- NCA Pre-Pretraining: [Han et al., 2026](https://arxiv.org/abs/2603.10055)
- Muon Optimizer: [MoonshotAI/Muon](https://github.com/MoonshotAI/Muon); [Moonlight: Muon is Scalable for LLM Training](https://arxiv.org/abs/2502.16982)
- Gram-Newton-Schulz: [Dao-AILab/Gram-Newton-Schulz](https://github.com/Dao-AILab/Gram-Newton-Schulz)
- WeightWatcher: [Martin et al.](https://arxiv.org/abs/2102.11258)