---
license: apache-2.0
language:
  - en
library_name: pytorch
tags:
  - causal-lm
  - sft
  - chatml
  - attention-residuals
  - muon
  - research
base_model: aethera-gp/kotodama-108m-base
pipeline_tag: text-generation
---

# Kotodama 108M Instruct

A 108M parameter instruction-tuned transformer trained with full-parameter SFT on freeform dialogue data. This is **not** an assistant-shaped model -- the SFT objective is learning turn structure (ChatML delimiters, turn-taking cadence) rather than instruction-following or helpfulness optimization.

Two instruct variants are provided, corresponding to SFT on each of the two base model checkpoints:

| Variant | File | Base | Eval Loss | im_end@1 | Overfit Gap |
|---------|------|------|-----------|----------|-------------|
| **Fullcorpus Instruct** | `fc-instruct.pt` | fullcorpus-ddv1 step 81252 | **2.401** | **0.446** | 0.070 |
| **Books-CPT Instruct** | `bcpt-instruct.pt` | books-cpt step 17336 | 2.517 | 0.397 | 0.102 |

Both were trained with identical SFT data and hyperparameters. The fullcorpus variant wins on absolute metrics; the books-CPT variant exhibits superior gradient properties during training.

## Base Model

Both variants build on [kotodama-108m-base](https://huggingface.co/aethera-gp/kotodama-108m-base), a from-scratch Llama-family transformer trained with the Muon optimizer and Block Attention Residuals (AttnRes).

**Proxy architecture (108M):**

| Parameter | Value |
|-----------|-------|
| d_model | 512 |
| n_layers | 28 |
| Query heads | 4 |
| KV heads | 2 (GQA 2:1) |
| head_dim | 128 |
| FFN intermediate | 1408 (SwiGLU) |
| Vocab size | 49152 (SmolLM2 tokenizer) |
| Max position | 4096 (RoPE, theta=500K) |
| Normalization | RMSNorm + QK-norm |
| Tied embeddings | Yes |
| Bias | None |
| z-loss | 1e-5 |
| AttnRes | DD-v1, boundaries [0, 3, 7, 12, 21, 25] |

The fullcorpus base was pretrained on 170.4B tokens (13 sources, academic/code-reasoning/math/legal/books/conversation). The books-CPT variant continued pretraining on 36.4B tokens of public domain books (Common Pile: Internet Archive, Library of Congress, DOAB).

## SFT Data

8.1M total tokens (5.5M trainable, 68.4% trainable ratio), 6,187 conversations split 90/10 train/eval. Pretokenized with ChatML template, chunked at turn boundaries to fit 4096 seq_len, packed with first-fit-decreasing bin packing into 1,976 fixed-length bins.
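The packing step above (first-fit-decreasing into fixed-length bins) can be sketched as follows; the function and variable names are illustrative, not the repo's actual API:

```python
def pack_ffd(lengths, bin_capacity=4096):
    """First-fit-decreasing bin packing: sort items by length (descending),
    place each into the first bin with enough free space, else open a new bin.
    Returns a list of bins, each a list of original item indices."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, free = [], []
    for i in order:
        for b in range(len(bins)):
            if lengths[i] <= free[b]:
                bins[b].append(i)
                free[b] -= lengths[i]
                break
        else:
            bins.append([i])
            free.append(bin_capacity - lengths[i])
    return bins

# Example: six chunk lengths packed into 4096-token bins
lengths = [3000, 2500, 1500, 1000, 900, 90]
bins = pack_ffd(lengths)
```

FFD is a standard heuristic here: sorting descending first makes large chunks claim bins early, so small chunks backfill the leftover space and the trainable-token density per bin stays high.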

| Source | Conversations | Est. Tokens | Description |
|--------|--------------|-------------|-------------|
| Infinite Backrooms | 821 | ~5.7M | Model-to-model freeform dialogue between Claude instances |
| OASST2 top-ranked | 5,366 | ~2.8M | Human multi-turn conversations (rank==0, English only) |

**Data philosophy.** The SFT corpus is deliberately composed of freeform dialogue rather than instruction-following data. Infinite Backrooms conversations (scraped from dreams-of-an-electric-mind.webflow.io) capture two Claude instances in unstructured, extended conversation across 19 scenario types and multiple model generations (Opus 3, Sonnet 3.5, Opus 4, Sonnet 4, Sonnet 4.5). OASST2 contributes genuine human conversational patterns via the highest-ranked response path through each conversation tree.

Explicitly excluded: Alpaca, SlimOrca, UltraChat (too assistant-shaped), ShareGPT/WildChat (noisy, refusal artifacts), SODA/SmolTalk (already in pretraining data).

**Processing details:**
- Backrooms: actor names discovered dynamically per conversation; first speaker mapped to `user`, second to `assistant`. OOC preamble stripped, ANSI escape codes stripped, conversations with fewer than 3 turns dropped.
- OASST2: re-extracted from HuggingFace raw data (not the curation pipeline output, which lost rank metadata). Follows rank==0 path at each tree branch.
- Chunking: 571 conversations exceeded 4096 tokens and were split at turn boundaries into 1,705 non-overlapping chunks. Only 1/6,703 examples was truncated after chunking.
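The turn-boundary chunking rule can be sketched with a hypothetical helper (the actual pipeline lives in the repo): accumulate whole turns until adding the next one would overflow the limit, then start a new chunk, never splitting inside a turn.

```python
def chunk_at_turn_boundaries(turn_lengths, max_len=4096):
    """Split a conversation (list of per-turn token counts) into
    non-overlapping chunks that each fit max_len, cutting only at
    turn boundaries."""
    chunks, current, used = [], [], 0
    for n in turn_lengths:
        if current and used + n > max_len:
            chunks.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_at_turn_boundaries([1500, 1500, 1500, 500])
```

Note that a single turn longer than `max_len` still lands in a chunk by itself and must be truncated downstream, which matches the one truncated example reported above.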

## Training

### Hyperparameter Sweep

An 18-config sweep was run across both base models: 12 configs for fullcorpus (4 learning rates x 3 epoch counts) and 6 for books-CPT (3 learning rates x 2 epoch counts). All configs used a flat LR schedule (warmup 5%, `wsd_decay_start: 1.0` -- no decay phase).

**Winner for both bases:** Muon lr=3e-3 (AdamW lr=3e-4), 2 epochs.

Selection criteria: lowest eval loss with overfit ratio below 1.05 and highest im_end@1 among non-overfitting configs.
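The selection rule reduces to a filter-and-sort over the sweep results; the field names below are illustrative, not the sweep harness's schema:

```python
def select_config(results, max_overfit_ratio=1.05):
    """Discard overfitting configs, then take the lowest eval loss,
    breaking ties by highest im_end@1."""
    candidates = [r for r in results if r["overfit_ratio"] < max_overfit_ratio]
    return min(candidates, key=lambda r: (r["eval_loss"], -r["im_end@1"]))

# Three fullcorpus sweep rows from the table below
results = [
    {"name": "lr1e-02-ep2", "eval_loss": 2.358, "im_end@1": 0.418, "overfit_ratio": 1.121},
    {"name": "lr3e-03-ep2", "eval_loss": 2.401, "im_end@1": 0.424, "overfit_ratio": 1.030},
    {"name": "lr3e-04-ep2", "eval_loss": 2.615, "im_end@1": 0.071, "overfit_ratio": 0.974},
]
best = select_config(results)
```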

### Winning Config

```yaml
# Shared across both variants
muon_lr: 0.003
adamw_lr: 0.0003
num_epochs: 2
batch_size: 4           # per GPU
gradient_accumulation: 1
max_seq_len: 4096
bf16: true
max_grad_norm: 1.0
warmup_ratio: 0.05
wsd_decay_start: 1.0    # flat LR, no decay
muon_momentum: 0.95
muon_weight_decay: 0.01
muon_ns_iterations: 5
muon_ns_coefficients: gram_ns
adamw_betas: [0.9, 0.95]
adamw_weight_decay: 0.1
packed: true             # FFD bin-packed with block-diagonal SDPA masks
attn_res: true
attn_res_boundaries: [0, 3, 7, 12, 21, 25]
```
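The `packed: true` setting relies on block-diagonal attention masks so that tokens from different packed conversations cannot attend to each other. A minimal sketch of building such a boolean mask from segment lengths (plain Python for clarity; in practice this would be a tensor passed to SDPA):

```python
def block_diagonal_causal_mask(segment_lengths):
    """Boolean mask for a packed sequence: position i may attend to
    position j only if both belong to the same segment and j <= i
    (causal)."""
    seg_id = []
    for s, n in enumerate(segment_lengths):
        seg_id.extend([s] * n)
    total = len(seg_id)
    return [[seg_id[i] == seg_id[j] and j <= i for j in range(total)]
            for i in range(total)]

# Two conversations of length 2 and 3 packed into one sequence of 5
mask = block_diagonal_causal_mask([2, 3])
```

Without this mask, packing would leak context across conversation boundaries and corrupt the SFT loss.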

AttnRes routing weights were **frozen** during SFT -- only the base model parameters were updated. The Muon optimizer handles all 2D weight matrices; AdamW handles embeddings, layer norms, and AttnRes parameters.
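The parameter partition described above can be sketched as a routine over `named_parameters()`-style pairs; the name filters and the stand-in objects are illustrative, not the repo's actual logic:

```python
from types import SimpleNamespace

def split_params_for_muon(named_params):
    """Partition parameters per the scheme above: 2D hidden weight
    matrices go to Muon; embeddings, norms, and other 1D tensors go
    to AdamW; AttnRes routing weights are frozen during SFT."""
    muon, adamw, frozen = [], [], []
    for name, p in named_params:
        if "attn_res" in name:
            frozen.append(name)                  # frozen during SFT
        elif p.ndim == 2 and "embed" not in name:
            muon.append(name)                    # hidden weight matrices
        else:
            adamw.append(name)                   # embeddings, norms, 1D params
    return muon, adamw, frozen

# Stand-in parameters; ndim is all the partition needs to inspect
params = [
    ("embed_tokens.weight", SimpleNamespace(ndim=2)),
    ("layers.0.attn.q_proj.weight", SimpleNamespace(ndim=2)),
    ("layers.0.input_norm.weight", SimpleNamespace(ndim=1)),
    ("attn_res.routing.weight", SimpleNamespace(ndim=2)),
]
muon, adamw, frozen = split_params_for_muon(params)
```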

### Hardware

- 8x GPU (B200), single node
- HuggingFace Trainer (not DDP torchrun)
- ~90K tokens/sec throughput
- ~98 GiB GPU memory allocated
- Async checkpoints with SHM staging

## Variants

Comparing SFT outcomes across the two base checkpoints reveals a clear tradeoff:

### Fullcorpus Instruct (`fc-instruct.pt`)

- **Lower eval loss** (2.401 vs 2.517) -- 4.6% advantage
- **Higher im_end@1** (0.446 vs 0.397) -- better turn boundary prediction
- **Less overfit** (gap 0.070 vs 0.102, ratio 1.030 vs 1.042)
- Recommended as the primary instruct variant

### Books-CPT Instruct (`bcpt-instruct.pt`)

- **Substantially lower gradient norm variance** -- more stable and uniform gradient flow across layers throughout training
- **1.41x faster eval loss descent** (mean slope -0.00321 vs -0.00227) -- learns the SFT objective more efficiently per step
- Higher absolute loss reflects the books-CPT base starting from a different loss surface (books domain shift), not SFT quality
- The gradient uniformity advantage from books continued pretraining survives SFT intact

## Evaluation

### Fullcorpus Instruct — Training Trajectory

| Step | Eval Loss | Eval PPL | Overfit Gap | Overfit Ratio | im_end@1 | im_end@5 |
|------|-----------|----------|-------------|---------------|----------|----------|
| 25 | 2.557 | 12.90 | -0.158 | 0.942 | 0.092 | 0.337 |
| 50 | 2.477 | 11.91 | -0.122 | 0.953 | 0.337 | 0.538 |
| 75 | 2.433 | 11.39 | 0.057 | 1.024 | 0.370 | 0.543 |
| 100 | **2.401** | **11.04** | 0.070 | 1.030 | 0.386 | 0.543 |
| 110 | -- | -- | -- | -- | **0.446** | 0.554 |
| 120 | -- | -- | -- | -- | 0.424 | 0.560 |

Best eval loss: **2.401** at step 100. Best im_end@1: **0.446** at step 110.

### Books-CPT Instruct — Training Trajectory

| Step | Eval Loss | Eval PPL | Overfit Gap | Overfit Ratio | im_end@1 | im_end@5 |
|------|-----------|----------|-------------|---------------|----------|----------|
| 25 | 2.737 | 15.44 | -0.126 | 0.956 | 0.033 | 0.245 |
| 50 | 2.625 | 13.80 | -0.076 | 0.972 | 0.304 | 0.500 |
| 75 | 2.561 | 12.95 | 0.091 | 1.037 | 0.348 | 0.522 |
| 100 | **2.517** | **12.39** | 0.102 | 1.042 | 0.359 | 0.543 |
| 110 | -- | -- | -- | -- | 0.391 | 0.560 |
| 120 | -- | -- | -- | -- | **0.397** | **0.576** |

Best eval loss: **2.517** at step 100. Best im_end@1: **0.397** at step 120.

### Metric Definitions

- **im_end@1 / im_end@5:** Top-1 / top-5 accuracy of predicting the `<|im_end|>` token at actual turn boundaries in the eval set. Measures whether the model has learned when to stop generating within a turn.
- **im_start@1:** Top-1 accuracy for `<|im_start|>` prediction. Near-zero for both variants (0.0 in most checkpoints), indicating the model has not learned to predict turn-start tokens -- expected given the small SFT corpus and the fact that turn starts are mostly predictable from context.
- **Overfit gap:** train_loss - eval_loss. Positive values indicate overfitting.
- **Overfit ratio:** train_loss / eval_loss. Values above 1.0 indicate overfitting.
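These metrics reduce to simple computations; a hypothetical helper, assuming per-position logit rows at turn boundaries are available:

```python
def topk_accuracy(logit_rows, target_id, k):
    """Fraction of boundary positions where target_id ranks in the
    top-k token scores (im_end@k with target_id = <|im_end|>)."""
    hits = 0
    for logits in logit_rows:
        topk = sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)[:k]
        hits += target_id in topk
    return hits / len(logit_rows)

def overfit_metrics(train_loss, eval_loss):
    """Overfit gap (difference) and overfit ratio (quotient)."""
    return train_loss - eval_loss, train_loss / eval_loss

# Toy vocab of 5 tokens; <|im_end|> = token id 2
rows = [[0.1, 0.2, 0.9, 0.0, 0.0],   # im_end is the top-1 prediction
        [0.5, 0.4, 0.3, 0.2, 0.1]]   # im_end ranks third
acc1 = topk_accuracy(rows, target_id=2, k=1)
acc5 = topk_accuracy(rows, target_id=2, k=5)
gap, ratio = overfit_metrics(2.471, 2.401)
```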

## SFT Sweep Results

### Fullcorpus Base (12 configs)

| Config | LR (Muon) | Epochs | Steps | Eval Loss | im_end@1 | Overfit Ratio |
|--------|-----------|--------|-------|-----------|----------|---------------|
| lr1e-02-ep1 | 0.01 | 1 | 62 | 2.409 | 0.429 | 0.971 |
| lr1e-02-ep2 | 0.01 | 2 | 124 | 2.358 | 0.418 | 1.121 |
| lr1e-02-ep3 | 0.01 | 3 | 186 | 2.373 | 0.391 | 1.292 |
| lr1e-03-ep1 | 0.001 | 1 | 62 | 2.569 | 0.168 | 0.942 |
| lr1e-03-ep2 | 0.001 | 2 | 124 | 2.496 | 0.332 | 0.992 |
| lr1e-03-ep3 | 0.001 | 3 | 186 | 2.440 | 0.397 | 1.023 |
| **lr3e-03-ep1** | **0.003** | **1** | **62** | 2.474 | 0.364 | 0.954 |
| **lr3e-03-ep2** | **0.003** | **2** | **124** | **2.401** | **0.424** | **1.030** |
| lr3e-03-ep3 | 0.003 | 3 | 186 | 2.352 | 0.413 | 1.094 |
| lr3e-04-ep1 | 0.0003 | 1 | 62 | 2.679 | 0.022 | 0.939 |
| lr3e-04-ep2 | 0.0003 | 2 | 124 | 2.615 | 0.071 | 0.974 |
| lr3e-04-ep3 | 0.0003 | 3 | 186 | 2.556 | 0.163 | 0.992 |

lr=0.01 reaches a low eval loss (2.358 at 2 epochs) but with severe overfitting (ratio 1.12), and lr=3e-3 at 3 epochs dips slightly lower still (2.352) at the cost of a 1.09 ratio. lr=3e-4 learns too slowly -- im_end@1 barely reaches 0.16 even at 3 epochs. **lr=3e-3 at 2 epochs** is the Pareto optimum: strong eval loss (2.401) and im_end accuracy (0.424) with controlled overfitting (1.03).

### Books-CPT Base (6 configs)

| Config | LR (Muon) | Epochs | Steps | Eval Loss | im_end@1 | Overfit Ratio |
|--------|-----------|--------|-------|-----------|----------|---------------|
| lr1e-02-ep1 | 0.01 | 1 | 62 | 2.506 | 0.424 | 0.986 |
| lr1e-02-ep2 | 0.01 | 2 | 124 | 2.426 | 0.424 | 1.124 |
| lr1e-03-ep1 | 0.001 | 1 | 62 | 2.758 | 0.076 | 0.964 |
| lr1e-03-ep2 | 0.001 | 2 | 124 | 2.657 | 0.293 | 1.010 |
| **lr3e-03-ep1** | **0.003** | **1** | **62** | 2.620 | 0.353 | 0.972 |
| **lr3e-03-ep2** | **0.003** | **2** | **124** | **2.517** | **0.397** | **1.042** |

Same pattern as fullcorpus: lr=3e-3 at 2 epochs is the best balance. lr=0.01 overfits aggressively by epoch 2.

## Usage

### Loading

```python
import torch
from src.model.llama import LuxiaBaseModel, LuxiaModelConfig

# Build config (proxy architecture)
config = LuxiaModelConfig(
    hidden_size=512,
    num_layers=28,
    num_attention_heads=4,
    num_kv_heads=2,
    head_dim=128,
    intermediate_size=1408,
    vocab_size=49152,
    max_position_embeddings=4096,
    rope_theta=500000.0,
    tie_word_embeddings=True,
    attn_res=True,
    attn_res_boundaries=[0, 3, 7, 12, 21, 25],
)

model = LuxiaBaseModel(config)
state_dict = torch.load("fc-instruct.pt", map_location="cpu")
model.load_state_dict(state_dict["model"])
model.eval()
```

### Chat Template (ChatML)

```
<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
```

The model uses SmolLM2's tokenizer (`HuggingFaceTB/SmolLM2-135M`) with ChatML special tokens:
- `<|im_start|>` = token 1
- `<|im_end|>` = token 2
- `<|endoftext|>` = token 0 (pad token)

### Inference Example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

messages = [
    {"role": "user", "content": "Tell me about the nature of consciousness."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Autoregressive generation (LuxiaBaseModel has no .generate())
generated = input_ids.to(model.embed_tokens.weight.device)
with torch.no_grad():
    for _ in range(512):
        out = model(input_ids=generated)
        logits = out["logits"][:, -1, :] / 0.8
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        generated = torch.cat([generated, next_token], dim=1)
        if next_token.item() == 2:  # <|im_end|>
            break

print(tokenizer.decode(generated[0], skip_special_tokens=False))
```

**Sampling note:** At 108M scale, avoid top-p sampling -- it catastrophically degrades output quality. Use pure temperature sampling only.

## Limitations

- **108M scale.** This is a proxy-scale research model. It demonstrates that the architecture and training pipeline work, but the model's generation quality is fundamentally limited by parameter count. It is not suitable for production use.
- **Not assistant-shaped.** The model has learned turn-taking structure but has not been trained to be helpful, harmless, or honest. It may produce incoherent, offensive, or factually incorrect outputs.
- **HF Trainer throughput.** The SFT sweep used HuggingFace Trainer rather than the pretraining DDP pipeline. This was a pragmatic choice for sweep automation but means throughput (~90K tok/s) is below what the custom DDP trainer achieves.
- **No geometric probes in sweep.** The pretraining pipeline includes geometric monitoring (intrinsic dimension, stable rank, attention entropy). These were not instrumented in the SFT sweep, so we cannot directly measure whether SFT preserves the geometric properties of the base models.
- **im_start accuracy near zero.** The model reliably learns to predict turn-end tokens but not turn-start tokens. This is likely a consequence of the small SFT corpus size and the high predictability of turn-start positions from context.
- **Small SFT corpus.** At 8.1M tokens (5.5M trainable), the SFT dataset is deliberately minimal. This is sufficient for learning turn structure but not for deep behavioral fine-tuning.

## Links

- **Base model:** [aethera-gp/kotodama-108m-base](https://huggingface.co/aethera-gp/kotodama-108m-base)
- **Training code:** [github.com/aethera-gp/kotodama](https://github.com/aethera-gp/kotodama) (posttraining/)
- **Wandb project:** [aethera/kotodama-sft-sweep](https://wandb.ai/aethera/kotodama-sft-sweep)
- **SFT data sources:**
  - [Infinite Backrooms](https://dreams-of-an-electric-mind.webflow.io/) by @andyayrey
  - [OpenAssistant/oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2)