---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- causal-lm
- sft
- chatml
- attention-residuals
- muon
- research
base_model: aethera-gp/kotodama-108m-base
pipeline_tag: text-generation
---
# Kotodama 108M Instruct
A 108M parameter instruction-tuned transformer trained with full-parameter SFT on freeform dialogue data. This is **not** an assistant-shaped model -- the SFT objective is learning turn structure (ChatML delimiters, turn-taking cadence) rather than instruction-following or helpfulness optimization.
Two instruct variants are provided, corresponding to SFT on each of the two base model checkpoints:
| Variant | File | Base | Eval Loss | im_end@1 | Overfit Gap |
|---------|------|------|-----------|----------|-------------|
| **Fullcorpus Instruct** | `fc-instruct.pt` | fullcorpus-ddv1 step 81252 | **2.401** | **0.446** | 0.070 |
| **Books-CPT Instruct** | `bcpt-instruct.pt` | books-cpt step 17336 | 2.517 | 0.397 | 0.102 |
Both were trained with identical SFT data and hyperparameters. The fullcorpus variant wins on absolute metrics; the books-CPT variant exhibits superior gradient properties during training.
## Base Model
Both variants build on [kotodama-108m-base](https://huggingface.co/aethera-gp/kotodama-108m-base), a from-scratch Llama-family transformer trained with the Muon optimizer and Block Attention Residuals (AttnRes).
**Proxy architecture (108M):**
| Parameter | Value |
|-----------|-------|
| d_model | 512 |
| n_layers | 28 |
| Query heads | 4 |
| KV heads | 2 (GQA 2:1) |
| head_dim | 128 |
| FFN intermediate | 1408 (SwiGLU) |
| Vocab size | 49152 (SmolLM2 tokenizer) |
| Max position | 4096 (RoPE, theta=500K) |
| Normalization | RMSNorm + QK-norm |
| Tied embeddings | Yes |
| Bias | None |
| z-loss | 1e-5 |
| AttnRes | DD-v1, boundaries [0, 3, 7, 12, 21, 25] |
The fullcorpus base was pretrained on 170.4B tokens (13 sources, academic/code-reasoning/math/legal/books/conversation). The books-CPT variant continued pretraining on 36.4B tokens of public domain books (Common Pile: Internet Archive, Library of Congress, DOAB).
## SFT Data
8.1M total tokens (5.5M trainable, 68.4% trainable ratio), 6,187 conversations split 90/10 train/eval. Pretokenized with ChatML template, chunked at turn boundaries to fit 4096 seq_len, packed with first-fit-decreasing bin packing into 1,976 fixed-length bins.
| Source | Conversations | Est. Tokens | Description |
|--------|--------------|-------------|-------------|
| Infinite Backrooms | 821 | ~5.7M | Model-to-model freeform dialogue between Claude instances |
| OASST2 top-ranked | 5,366 | ~2.8M | Human multi-turn conversations (rank==0, English only) |
**Data philosophy.** The SFT corpus is deliberately composed of freeform dialogue rather than instruction-following data. Infinite Backrooms conversations (scraped from dreams-of-an-electric-mind.webflow.io) capture two Claude instances in unstructured, extended conversation across 19 scenario types and multiple model generations (Opus 3, Sonnet 3.5, Opus 4, Sonnet 4, Sonnet 4.5). OASST2 contributes genuine human conversational patterns via the highest-ranked response path through each conversation tree.
Explicitly excluded: Alpaca, SlimOrca, UltraChat (too assistant-shaped), ShareGPT/WildChat (noisy, refusal artifacts), SODA/SmolTalk (already in pretraining data).
**Processing details:**
- Backrooms: actor names discovered dynamically per conversation; first speaker mapped to `user`, second to `assistant`. OOC preamble stripped, ANSI escape codes stripped, conversations with fewer than 3 turns dropped.
- OASST2: re-extracted from HuggingFace raw data (not the curation pipeline output, which lost rank metadata). Follows rank==0 path at each tree branch.
- Chunking: 571 conversations exceeded 4096 tokens and were split at turn boundaries into 1,705 non-overlapping chunks. Only 1/6,703 examples was truncated after chunking.
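The first-fit-decreasing packing step can be sketched as follows. This is a minimal illustration, not the project's actual packing code; `pack_ffd` and the example chunk lengths are hypothetical:

```python
def pack_ffd(lengths, capacity=4096):
    """First-fit-decreasing bin packing: sort chunks longest-first,
    then place each chunk into the first bin with enough free space."""
    bins = []        # list of lists of chunk lengths
    remaining = []   # free space left in each bin
    for length in sorted(lengths, reverse=True):
        for i, free in enumerate(remaining):
            if length <= free:
                bins[i].append(length)
                remaining[i] -= length
                break
        else:  # no existing bin fits: open a new one
            bins.append([length])
            remaining.append(capacity - length)
    return bins

# Example: six chunks packed into 4096-token bins
bins = pack_ffd([3000, 2500, 1500, 1000, 600, 500])
```

Sorting longest-first is what makes FFD effective here: large chunks claim bins early, and small chunks backfill the leftover space, keeping the bin count (and padding waste) low.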
## Training
### Hyperparameter Sweep
An 18-config sweep was run across both base models: 12 configs for fullcorpus (4 learning rates x 3 epoch counts) and 6 for books-CPT (3 learning rates x 2 epoch counts). All configs used a flat LR schedule (warmup 5%, `wsd_decay_start: 1.0` -- no decay phase).
**Winner for both bases:** Muon lr=3e-3 (AdamW lr=3e-4), 2 epochs.
Selection criteria: lowest eval loss with overfit ratio below 1.05 and highest im_end@1 among non-overfitting configs.
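One way to encode that selection rule is a filter-then-rank over the sweep results. This is a sketch; `select_config` and the dict layout are illustrative, not the sweep harness's actual records:

```python
def select_config(results, max_overfit_ratio=1.05):
    """Drop configs at or above the overfit-ratio threshold, then rank
    by eval loss (lower is better), breaking ties on im_end@1."""
    ok = [r for r in results if r["overfit_ratio"] < max_overfit_ratio]
    return min(ok, key=lambda r: (r["eval_loss"], -r["im_end@1"]))

# Illustrative rows taken from the fullcorpus sweep table below
sweep = [
    {"name": "lr1e-02-ep2", "eval_loss": 2.358, "im_end@1": 0.418, "overfit_ratio": 1.121},
    {"name": "lr3e-03-ep2", "eval_loss": 2.401, "im_end@1": 0.424, "overfit_ratio": 1.030},
    {"name": "lr3e-04-ep3", "eval_loss": 2.556, "im_end@1": 0.163, "overfit_ratio": 0.992},
]
best = select_config(sweep)  # lr1e-02-ep2 is filtered out despite its lower loss
```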
### Winning Config
```yaml
# Shared across both variants
muon_lr: 0.003
adamw_lr: 0.0003
num_epochs: 2
batch_size: 4 # per GPU
gradient_accumulation: 1
max_seq_len: 4096
bf16: true
max_grad_norm: 1.0
warmup_ratio: 0.05
wsd_decay_start: 1.0 # flat LR, no decay
muon_momentum: 0.95
muon_weight_decay: 0.01
muon_ns_iterations: 5
muon_ns_coefficients: gram_ns
adamw_betas: [0.9, 0.95]
adamw_weight_decay: 0.1
packed: true # FFD bin-packed with block-diagonal SDPA masks
attn_res: true
attn_res_boundaries: [0, 3, 7, 12, 21, 25]
```
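The block-diagonal SDPA masks mentioned in the config keep packed documents from attending to each other. A minimal sketch of the mask construction (boolean lists for clarity; causality would be applied separately, e.g. via SDPA's `is_causal` flag; `block_diagonal_mask` is a hypothetical helper, not the trainer's code):

```python
def block_diagonal_mask(doc_lengths):
    """Boolean attention mask for one packed row: position i may attend
    to position j only if both fall inside the same packed document."""
    total = sum(doc_lengths)
    mask = [[False] * total for _ in range(total)]
    start = 0
    for n in doc_lengths:
        for i in range(start, start + n):
            for j in range(start, start + n):
                mask[i][j] = True
        start += n
    return mask

mask = block_diagonal_mask([2, 3])  # two docs packed into one 5-token row
```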
AttnRes routing weights were **frozen** during SFT -- only the base model parameters were updated. The Muon optimizer handles all 2D weight matrices; AdamW handles embeddings, layer norms, and AttnRes parameters.
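The Muon/AdamW split amounts to partitioning parameters by shape and role. A pure-Python sketch over `(name, ndim)` pairs; the parameter names are illustrative, and exactly which name substrings the real trainer matches on is an assumption:

```python
def split_param_groups(named_ndims):
    """Route 2D weight matrices to Muon; embeddings, norms, and
    AttnRes routing weights (frozen during SFT) to AdamW."""
    muon, adamw = [], []
    for name, ndim in named_ndims:
        if ndim == 2 and "embed" not in name and "attn_res" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

params = [
    ("embed_tokens.weight", 2),          # tied embedding -> AdamW
    ("layers.0.attn.q_proj.weight", 2),  # 2D matrix -> Muon
    ("layers.0.norm.weight", 1),         # RMSNorm gain -> AdamW
    ("attn_res.route_weights", 2),       # AttnRes routing -> AdamW (frozen)
]
muon, adamw = split_param_groups(params)
```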
### Hardware
- 8x GPU (B200), single node
- HuggingFace Trainer (not DDP torchrun)
- ~90K tokens/sec throughput
- ~98 GiB GPU memory allocated
- Async checkpoints with SHM staging
## Variants
The 2x2 comparison (2 base checkpoints x 2 training stages, base and SFT) reveals a clear tradeoff:
### Fullcorpus Instruct (`fc-instruct.pt`)
- **Lower eval loss** (2.401 vs 2.517) -- 4.6% advantage
- **Higher im_end@1** (0.446 vs 0.397) -- better turn boundary prediction
- **Less overfit** (gap 0.070 vs 0.102, ratio 1.030 vs 1.042)
- Recommended as the primary instruct variant
### Books-CPT Instruct (`bcpt-instruct.pt`)
- **Substantially lower gradient norm variance** -- more stable and uniform gradient flow across layers throughout training
- **1.41x faster eval loss descent** (mean slope -0.00321 vs -0.00227) -- learns the SFT objective more efficiently per step
- Higher absolute loss reflects the books-CPT base starting from a different loss surface (books domain shift), not SFT quality
- The gradient uniformity advantage from books continued pretraining survives SFT intact
## Evaluation
### Fullcorpus Instruct — Training Trajectory
| Step | Eval Loss | Eval PPL | Overfit Gap | Overfit Ratio | im_end@1 | im_end@5 |
|------|-----------|----------|-------------|---------------|----------|----------|
| 25 | 2.557 | 12.90 | -0.158 | 0.942 | 0.092 | 0.337 |
| 50 | 2.477 | 11.91 | -0.122 | 0.953 | 0.337 | 0.538 |
| 75 | 2.433 | 11.39 | 0.057 | 1.024 | 0.370 | 0.543 |
| 100 | **2.401** | **11.04** | 0.070 | 1.030 | 0.386 | 0.543 |
| 110 | -- | -- | -- | -- | **0.446** | 0.554 |
| 120 | -- | -- | -- | -- | 0.424 | 0.560 |
Best eval loss: **2.401** at step 100. Best im_end@1: **0.446** at step 110.
### Books-CPT Instruct — Training Trajectory
| Step | Eval Loss | Eval PPL | Overfit Gap | Overfit Ratio | im_end@1 | im_end@5 |
|------|-----------|----------|-------------|---------------|----------|----------|
| 25 | 2.737 | 15.44 | -0.126 | 0.956 | 0.033 | 0.245 |
| 50 | 2.625 | 13.80 | -0.076 | 0.972 | 0.304 | 0.500 |
| 75 | 2.561 | 12.95 | 0.091 | 1.037 | 0.348 | 0.522 |
| 100 | **2.517** | **12.39** | 0.102 | 1.042 | 0.359 | 0.543 |
| 110 | -- | -- | -- | -- | 0.391 | 0.560 |
| 120 | -- | -- | -- | -- | **0.397** | **0.576** |
Best eval loss: **2.517** at step 100. Best im_end@1: **0.397** at step 120.
### Metric Definitions
- **im_end@1 / im_end@5:** Top-1 / top-5 accuracy of predicting the `<|im_end|>` token at actual turn boundaries in the eval set. Measures whether the model has learned when to stop generating within a turn.
- **im_start@1:** Top-1 accuracy for `<|im_start|>` prediction. Near-zero for both variants (0.0 in most checkpoints), indicating the model has not learned to predict turn-start tokens -- expected given the small SFT corpus and the fact that turn starts are mostly predictable from context.
- **Overfit gap:** eval_loss - train_loss. Positive values (eval above train) indicate overfitting.
- **Overfit ratio:** eval_loss / train_loss. Values above 1.0 indicate overfitting.
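The metrics above can be computed roughly as below. This is a sketch with toy inputs; `im_end_at_k` is a hypothetical helper, using `<|im_end|>` = token 2 as in this model's tokenizer:

```python
def im_end_at_k(logits_at_boundaries, im_end_id=2, k=1):
    """Fraction of true turn boundaries where <|im_end|> appears
    among the model's top-k predicted next tokens."""
    hits = 0
    for logits in logits_at_boundaries:
        topk = sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)[:k]
        hits += im_end_id in topk
    return hits / len(logits_at_boundaries)

def overfit_stats(train_loss, eval_loss):
    """Gap (eval - train) and ratio (eval / train); a positive gap and a
    ratio above 1.0 both indicate overfitting."""
    return eval_loss - train_loss, eval_loss / train_loss

# Toy example: two boundary positions over a 4-token vocab
boundaries = [[0.1, 0.2, 0.9, 0.0],   # argmax is token 2 -> hit
              [0.5, 0.1, 0.2, 0.9]]   # argmax is token 3 -> miss at k=1
acc1 = im_end_at_k(boundaries, k=1)
gap, ratio = overfit_stats(train_loss=2.0, eval_loss=2.2)
```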
## SFT Sweep Results
### Fullcorpus Base (12 configs)
| Config | LR (Muon) | Epochs | Steps | Eval Loss | im_end@1 | Overfit Ratio |
|--------|-----------|--------|-------|-----------|----------|---------------|
| lr1e-02-ep1 | 0.01 | 1 | 62 | 2.409 | 0.429 | 0.971 |
| lr1e-02-ep2 | 0.01 | 2 | 124 | 2.358 | 0.418 | 1.121 |
| lr1e-02-ep3 | 0.01 | 3 | 186 | 2.373 | 0.391 | 1.292 |
| lr1e-03-ep1 | 0.001 | 1 | 62 | 2.569 | 0.168 | 0.942 |
| lr1e-03-ep2 | 0.001 | 2 | 124 | 2.496 | 0.332 | 0.992 |
| lr1e-03-ep3 | 0.001 | 3 | 186 | 2.440 | 0.397 | 1.023 |
| **lr3e-03-ep1** | **0.003** | **1** | **62** | 2.474 | 0.364 | 0.954 |
| **lr3e-03-ep2** | **0.003** | **2** | **124** | **2.401** | **0.424** | **1.030** |
| lr3e-03-ep3 | 0.003 | 3 | 186 | 2.352 | 0.413 | 1.094 |
| lr3e-04-ep1 | 0.0003 | 1 | 62 | 2.679 | 0.022 | 0.939 |
| lr3e-04-ep2 | 0.0003 | 2 | 124 | 2.615 | 0.071 | 0.974 |
| lr3e-04-ep3 | 0.0003 | 3 | 186 | 2.556 | 0.163 | 0.992 |
lr=0.01 achieves the lowest absolute eval loss (2.358 at 2 epochs) but with severe overfitting (ratio 1.12). lr=3e-4 learns too slowly -- im_end@1 barely reaches 0.16 even at 3 epochs. **lr=3e-3 at 2 epochs** is the Pareto optimum: strong eval loss (2.401) and im_end accuracy (0.424) with controlled overfitting (1.03).
### Books-CPT Base (6 configs)
| Config | LR (Muon) | Epochs | Steps | Eval Loss | im_end@1 | Overfit Ratio |
|--------|-----------|--------|-------|-----------|----------|---------------|
| lr1e-02-ep1 | 0.01 | 1 | 62 | 2.506 | 0.424 | 0.986 |
| lr1e-02-ep2 | 0.01 | 2 | 124 | 2.426 | 0.424 | 1.124 |
| lr1e-03-ep1 | 0.001 | 1 | 62 | 2.758 | 0.076 | 0.964 |
| lr1e-03-ep2 | 0.001 | 2 | 124 | 2.657 | 0.293 | 1.010 |
| **lr3e-03-ep1** | **0.003** | **1** | **62** | 2.620 | 0.353 | 0.972 |
| **lr3e-03-ep2** | **0.003** | **2** | **124** | **2.517** | **0.397** | **1.042** |
Same pattern as fullcorpus: lr=3e-3 at 2 epochs is the best balance. lr=0.01 overfits aggressively by epoch 2.
## Usage
### Loading
```python
import torch

from src.model.llama import LuxiaBaseModel, LuxiaModelConfig

# Build config (proxy architecture)
config = LuxiaModelConfig(
    hidden_size=512,
    num_layers=28,
    num_attention_heads=4,
    num_kv_heads=2,
    head_dim=128,
    intermediate_size=1408,
    vocab_size=49152,
    max_position_embeddings=4096,
    rope_theta=500000.0,
    tie_word_embeddings=True,
    attn_res=True,
    attn_res_boundaries=[0, 3, 7, 12, 21, 25],
)

model = LuxiaBaseModel(config)
state_dict = torch.load("fc-instruct.pt", map_location="cpu")
model.load_state_dict(state_dict["model"])
model.eval()
```
### Chat Template (ChatML)
```
<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
```
The model uses SmolLM2's tokenizer (`HuggingFaceTB/SmolLM2-135M`) with ChatML special tokens:
- `<|im_start|>` = token 1
- `<|im_end|>` = token 2
- `<|endoftext|>` = token 0 (pad token)
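With those token conventions, the generation prompt for a conversation can also be assembled by hand. This sketch is equivalent in intent to `tokenizer.apply_chat_template(..., add_generation_prompt=True)`, though the exact whitespace of SmolLM2's template is an assumption; `build_chatml_prompt` is a hypothetical helper:

```python
def build_chatml_prompt(messages):
    """Render messages into ChatML and open an assistant turn at the end."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # generation prompt
    return "".join(parts)

prompt = build_chatml_prompt([{"role": "user", "content": "Hello, how are you?"}])
```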
### Inference Example
```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

messages = [
    {"role": "user", "content": "Tell me about the nature of consciousness."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Autoregressive generation loop (LuxiaBaseModel has no .generate())
generated = input_ids.to(model.embed_tokens.weight.device)
with torch.no_grad():
    for _ in range(512):
        out = model(input_ids=generated)
        logits = out["logits"][:, -1, :] / 0.8  # temperature 0.8
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        generated = torch.cat([generated, next_token], dim=1)
        if next_token.item() == 2:  # <|im_end|>
            break

print(tokenizer.decode(generated[0], skip_special_tokens=False))
```
**Sampling note:** At 108M scale, avoid top-p sampling -- it catastrophically degrades output quality. Use pure temperature sampling only.
## Limitations
- **108M scale.** This is a proxy-scale research model. It demonstrates that the architecture and training pipeline work, but the model's generation quality is fundamentally limited by parameter count. It is not suitable for production use.
- **Not assistant-shaped.** The model has learned turn-taking structure but has not been trained to be helpful, harmless, or honest. It may produce incoherent, offensive, or factually incorrect outputs.
- **HF Trainer throughput.** The SFT sweep used HuggingFace Trainer rather than the pretraining DDP pipeline. This was a pragmatic choice for sweep automation but means throughput (~90K tok/s) is below what the custom DDP trainer achieves.
- **No geometric probes in sweep.** The pretraining pipeline includes geometric monitoring (intrinsic dimension, stable rank, attention entropy). These were not instrumented in the SFT sweep, so we cannot directly measure whether SFT preserves the geometric properties of the base models.
- **im_start accuracy near zero.** The model reliably learns to predict turn-end tokens but not turn-start tokens. This is likely a consequence of the small SFT corpus size and the high predictability of turn-start positions from context.
- **Small SFT corpus.** At 8.1M tokens (5.5M trainable), the SFT dataset is deliberately minimal. This is sufficient for learning turn structure but not for deep behavioral fine-tuning.
## Links
- **Base model:** [aethera-gp/kotodama-108m-base](https://huggingface.co/aethera-gp/kotodama-108m-base)
- **Training code:** [github.com/aethera-gp/kotodama](https://github.com/aethera-gp/kotodama) (posttraining/)
- **Wandb project:** [aethera/kotodama-sft-sweep](https://wandb.ai/aethera/kotodama-sft-sweep)
- **SFT data sources:**
- [Infinite Backrooms](https://dreams-of-an-electric-mind.webflow.io/) by @andyayrey
- [OpenAssistant/oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2)