---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- causal-lm
- sft
- chatml
- attention-residuals
- muon
- research
base_model: aethera-gp/kotodama-108m-base
pipeline_tag: text-generation
---
# Kotodama 108M Instruct
A 108M parameter instruction-tuned transformer trained with full-parameter SFT on freeform dialogue data. This is **not** an assistant-shaped model -- the SFT objective is learning turn structure (ChatML delimiters, turn-taking cadence) rather than instruction-following or helpfulness optimization.
Two instruct variants are provided, corresponding to SFT on each of the two base model checkpoints:
| Variant | File | Base | Eval Loss | im_end@1 | Overfit Gap |
|---------|------|------|-----------|----------|-------------|
| **Fullcorpus Instruct** | `fc-instruct.pt` | fullcorpus-ddv1 step 81252 | **2.401** | **0.446** | 0.070 |
| **Books-CPT Instruct** | `bcpt-instruct.pt` | books-cpt step 17336 | 2.517 | 0.397 | 0.102 |
Both were trained with identical SFT data and hyperparameters. The fullcorpus variant wins on absolute metrics; the books-CPT variant exhibits superior gradient properties during training.
## Base Model
Both variants build on [kotodama-108m-base](https://huggingface.co/aethera-gp/kotodama-108m-base), a from-scratch Llama-family transformer trained with the Muon optimizer and Block Attention Residuals (AttnRes).
**Proxy architecture (108M):**
| Parameter | Value |
|-----------|-------|
| d_model | 512 |
| n_layers | 28 |
| Query heads | 4 |
| KV heads | 2 (GQA 2:1) |
| head_dim | 128 |
| FFN intermediate | 1408 (SwiGLU) |
| Vocab size | 49152 (SmolLM2 tokenizer) |
| Max position | 4096 (RoPE, theta=500K) |
| Normalization | RMSNorm + QK-norm |
| Tied embeddings | Yes |
| Bias | None |
| z-loss | 1e-5 |
| AttnRes | DD-v1, boundaries [0, 3, 7, 12, 21, 25] |
The fullcorpus base was pretrained on 170.4B tokens (13 sources, academic/code-reasoning/math/legal/books/conversation). The books-CPT variant continued pretraining on 36.4B tokens of public domain books (Common Pile: Internet Archive, Library of Congress, DOAB).
## SFT Data
8.1M total tokens (5.5M trainable, 68.4% trainable ratio), 6,187 conversations split 90/10 train/eval. Pretokenized with ChatML template, chunked at turn boundaries to fit 4096 seq_len, packed with first-fit-decreasing bin packing into 1,976 fixed-length bins.
| Source | Conversations | Est. Tokens | Description |
|--------|--------------|-------------|-------------|
| Infinite Backrooms | 821 | ~5.7M | Model-to-model freeform dialogue between Claude instances |
| OASST2 top-ranked | 5,366 | ~2.8M | Human multi-turn conversations (rank==0, English only) |
**Data philosophy.** The SFT corpus is deliberately composed of freeform dialogue rather than instruction-following data. Infinite Backrooms conversations (scraped from dreams-of-an-electric-mind.webflow.io) capture two Claude instances in unstructured, extended conversation across 19 scenario types and multiple model generations (Opus 3, Sonnet 3.5, Opus 4, Sonnet 4, Sonnet 4.5). OASST2 contributes genuine human conversational patterns via the highest-ranked response path through each conversation tree.
Explicitly excluded: Alpaca, SlimOrca, UltraChat (too assistant-shaped), ShareGPT/WildChat (noisy, refusal artifacts), SODA/SmolTalk (already in pretraining data).
**Processing details:**
- Backrooms: actor names discovered dynamically per conversation; first speaker mapped to `user`, second to `assistant`. OOC preamble stripped, ANSI escape codes stripped, conversations with fewer than 3 turns dropped.
- OASST2: re-extracted from HuggingFace raw data (not the curation pipeline output, which lost rank metadata). Follows rank==0 path at each tree branch.
- Chunking: 571 conversations exceeded 4096 tokens and were split at turn boundaries into 1,705 non-overlapping chunks. Only 1/6,703 examples was truncated after chunking.
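The first-fit-decreasing packing step can be sketched as follows. This is a minimal illustration, not the project's actual packing code; `pack_ffd` and the example chunk lengths are hypothetical:

```python
def pack_ffd(lengths, capacity=4096):
    """First-fit-decreasing bin packing: sort chunks longest-first,
    then place each chunk into the first bin with enough free space."""
    bins = []        # list of lists of chunk lengths
    remaining = []   # free space left in each bin
    for length in sorted(lengths, reverse=True):
        for i, free in enumerate(remaining):
            if length <= free:
                bins[i].append(length)
                remaining[i] -= length
                break
        else:  # no existing bin fits: open a new one
            bins.append([length])
            remaining.append(capacity - length)
    return bins

# Example: six chunks packed into 4096-token bins
bins = pack_ffd([3000, 2500, 1500, 1000, 600, 500])
```

Sorting longest-first is what makes FFD effective here: large chunks claim bins early, and small chunks backfill the leftover space, keeping the bin count (and padding waste) low.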
## Training
### Hyperparameter Sweep
An 18-config sweep was run across both base models: 12 configs for fullcorpus (4 learning rates x 3 epoch counts) and 6 for books-CPT (3 learning rates x 2 epoch counts). All configs used a flat LR schedule (warmup 5%, `wsd_decay_start: 1.0` -- no decay phase).
**Winner for both bases:** Muon lr=3e-3 (AdamW lr=3e-4), 2 epochs.
Selection criteria: lowest eval loss with overfit ratio below 1.05 and highest im_end@1 among non-overfitting configs.
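One way to encode that selection rule is a filter-then-rank over the sweep results. This is a sketch; `select_config` and the dict layout are illustrative, not the sweep harness's actual records:

```python
def select_config(results, max_overfit_ratio=1.05):
    """Drop configs at or above the overfit-ratio threshold, then rank
    by eval loss (lower is better), breaking ties on im_end@1."""
    ok = [r for r in results if r["overfit_ratio"] < max_overfit_ratio]
    return min(ok, key=lambda r: (r["eval_loss"], -r["im_end@1"]))

# Illustrative rows taken from the fullcorpus sweep table below
sweep = [
    {"name": "lr1e-02-ep2", "eval_loss": 2.358, "im_end@1": 0.418, "overfit_ratio": 1.121},
    {"name": "lr3e-03-ep2", "eval_loss": 2.401, "im_end@1": 0.424, "overfit_ratio": 1.030},
    {"name": "lr3e-04-ep3", "eval_loss": 2.556, "im_end@1": 0.163, "overfit_ratio": 0.992},
]
best = select_config(sweep)  # lr1e-02-ep2 is filtered out despite its lower loss
```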
### Winning Config
```yaml
# Shared across both variants
muon_lr: 0.003
adamw_lr: 0.0003
num_epochs: 2
batch_size: 4 # per GPU
gradient_accumulation: 1
max_seq_len: 4096
bf16: true
max_grad_norm: 1.0
warmup_ratio: 0.05
wsd_decay_start: 1.0 # flat LR, no decay
muon_momentum: 0.95
muon_weight_decay: 0.01
muon_ns_iterations: 5
muon_ns_coefficients: gram_ns
adamw_betas: [0.9, 0.95]
adamw_weight_decay: 0.1
packed: true # FFD bin-packed with block-diagonal SDPA masks
attn_res: true
attn_res_boundaries: [0, 3, 7, 12, 21, 25]
```
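The block-diagonal SDPA masks mentioned in the config keep packed documents from attending to each other. A minimal sketch of the mask construction (boolean lists for clarity; causality would be applied separately, e.g. via SDPA's `is_causal` flag; `block_diagonal_mask` is a hypothetical helper, not the trainer's code):

```python
def block_diagonal_mask(doc_lengths):
    """Boolean attention mask for one packed row: position i may attend
    to position j only if both fall inside the same packed document."""
    total = sum(doc_lengths)
    mask = [[False] * total for _ in range(total)]
    start = 0
    for n in doc_lengths:
        for i in range(start, start + n):
            for j in range(start, start + n):
                mask[i][j] = True
        start += n
    return mask

mask = block_diagonal_mask([2, 3])  # two docs packed into one 5-token row
```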
AttnRes routing weights were **frozen** during SFT -- only the base model parameters were updated. The Muon optimizer handles all 2D weight matrices; AdamW handles embeddings, layer norms, and AttnRes parameters.
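The Muon/AdamW split amounts to partitioning parameters by shape and role. A pure-Python sketch over `(name, ndim)` pairs; the parameter names are illustrative, and exactly which name substrings the real trainer matches on is an assumption:

```python
def split_param_groups(named_ndims):
    """Route 2D weight matrices to Muon; embeddings, norms, and
    AttnRes routing weights (frozen during SFT) to AdamW."""
    muon, adamw = [], []
    for name, ndim in named_ndims:
        if ndim == 2 and "embed" not in name and "attn_res" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

params = [
    ("embed_tokens.weight", 2),          # tied embedding -> AdamW
    ("layers.0.attn.q_proj.weight", 2),  # 2D matrix -> Muon
    ("layers.0.norm.weight", 1),         # RMSNorm gain -> AdamW
    ("attn_res.route_weights", 2),       # AttnRes routing -> AdamW (frozen)
]
muon, adamw = split_param_groups(params)
```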
### Hardware
- 8x GPU (B200), single node
- HuggingFace Trainer (not DDP torchrun)
- ~90K tokens/sec throughput
- ~98 GiB GPU memory allocated
- Async checkpoints with SHM staging
## Variants
The 2x2 comparison (2 base checkpoints x 2 training stages, base and SFT) reveals a clear tradeoff:
### Fullcorpus Instruct (`fc-instruct.pt`)
- **Lower eval loss** (2.401 vs 2.517) -- 4.6% advantage
- **Higher im_end@1** (0.446 vs 0.397) -- better turn boundary prediction
- **Less overfit** (gap 0.070 vs 0.102, ratio 1.030 vs 1.042)
- Recommended as the primary instruct variant
### Books-CPT Instruct (`bcpt-instruct.pt`)
- **Substantially lower gradient norm variance** -- more stable and uniform gradient flow across layers throughout training
- **1.41x faster eval loss descent** (mean slope -0.00321 vs -0.00227) -- learns the SFT objective more efficiently per step
- Higher absolute loss reflects the books-CPT base starting from a different loss surface (books domain shift), not SFT quality
- The gradient uniformity advantage from books continued pretraining survives SFT intact
## Evaluation
### Fullcorpus Instruct — Training Trajectory
| Step | Eval Loss | Eval PPL | Overfit Gap | Overfit Ratio | im_end@1 | im_end@5 |
|------|-----------|----------|-------------|---------------|----------|----------|
| 25 | 2.557 | 12.90 | -0.158 | 0.942 | 0.092 | 0.337 |
| 50 | 2.477 | 11.91 | -0.122 | 0.953 | 0.337 | 0.538 |
| 75 | 2.433 | 11.39 | 0.057 | 1.024 | 0.370 | 0.543 |
| 100 | **2.401** | **11.04** | 0.070 | 1.030 | 0.386 | 0.543 |
| 110 | -- | -- | -- | -- | **0.446** | 0.554 |
| 120 | -- | -- | -- | -- | 0.424 | 0.560 |
Best eval loss: **2.401** at step 100. Best im_end@1: **0.446** at step 110.
### Books-CPT Instruct — Training Trajectory
| Step | Eval Loss | Eval PPL | Overfit Gap | Overfit Ratio | im_end@1 | im_end@5 |
|------|-----------|----------|-------------|---------------|----------|----------|
| 25 | 2.737 | 15.44 | -0.126 | 0.956 | 0.033 | 0.245 |
| 50 | 2.625 | 13.80 | -0.076 | 0.972 | 0.304 | 0.500 |
| 75 | 2.561 | 12.95 | 0.091 | 1.037 | 0.348 | 0.522 |
| 100 | **2.517** | **12.39** | 0.102 | 1.042 | 0.359 | 0.543 |
| 110 | -- | -- | -- | -- | 0.391 | 0.560 |
| 120 | -- | -- | -- | -- | **0.397** | **0.576** |
Best eval loss: **2.517** at step 100. Best im_end@1: **0.397** at step 120.
### Metric Definitions
- **im_end@1 / im_end@5:** Top-1 / top-5 accuracy of predicting the `<|im_end|>` token at actual turn boundaries in the eval set. Measures whether the model has learned when to stop generating within a turn.
- **im_start@1:** Top-1 accuracy for `<|im_start|>` prediction. Near-zero for both variants (0.0 in most checkpoints), indicating the model has not learned to predict turn-start tokens -- expected given the small SFT corpus and the fact that turn starts are mostly predictable from context.
- **Overfit gap:** eval_loss - train_loss. Positive values (eval above train) indicate overfitting.
- **Overfit ratio:** eval_loss / train_loss. Values above 1.0 indicate overfitting.
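The metrics above can be computed roughly as below. This is a sketch with toy inputs; `im_end_at_k` is a hypothetical helper, using `<|im_end|>` = token 2 as in this model's tokenizer:

```python
def im_end_at_k(logits_at_boundaries, im_end_id=2, k=1):
    """Fraction of true turn boundaries where <|im_end|> appears
    among the model's top-k predicted next tokens."""
    hits = 0
    for logits in logits_at_boundaries:
        topk = sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)[:k]
        hits += im_end_id in topk
    return hits / len(logits_at_boundaries)

def overfit_stats(train_loss, eval_loss):
    """Gap (eval - train) and ratio (eval / train); a positive gap and a
    ratio above 1.0 both indicate overfitting."""
    return eval_loss - train_loss, eval_loss / train_loss

# Toy example: two boundary positions over a 4-token vocab
boundaries = [[0.1, 0.2, 0.9, 0.0],   # argmax is token 2 -> hit
              [0.5, 0.1, 0.2, 0.9]]   # argmax is token 3 -> miss at k=1
acc1 = im_end_at_k(boundaries, k=1)
gap, ratio = overfit_stats(train_loss=2.0, eval_loss=2.2)
```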
## SFT Sweep Results
### Fullcorpus Base (12 configs)
| Config | LR (Muon) | Epochs | Steps | Eval Loss | im_end@1 | Overfit Ratio |
|--------|-----------|--------|-------|-----------|----------|---------------|
| lr1e-02-ep1 | 0.01 | 1 | 62 | 2.409 | 0.429 | 0.971 |
| lr1e-02-ep2 | 0.01 | 2 | 124 | 2.358 | 0.418 | 1.121 |
| lr1e-02-ep3 | 0.01 | 3 | 186 | 2.373 | 0.391 | 1.292 |
| lr1e-03-ep1 | 0.001 | 1 | 62 | 2.569 | 0.168 | 0.942 |
| lr1e-03-ep2 | 0.001 | 2 | 124 | 2.496 | 0.332 | 0.992 |
| lr1e-03-ep3 | 0.001 | 3 | 186 | 2.440 | 0.397 | 1.023 |
| **lr3e-03-ep1** | **0.003** | **1** | **62** | 2.474 | 0.364 | 0.954 |
| **lr3e-03-ep2** | **0.003** | **2** | **124** | **2.401** | **0.424** | **1.030** |
| lr3e-03-ep3 | 0.003 | 3 | 186 | 2.352 | 0.413 | 1.094 |
| lr3e-04-ep1 | 0.0003 | 1 | 62 | 2.679 | 0.022 | 0.939 |
| lr3e-04-ep2 | 0.0003 | 2 | 124 | 2.615 | 0.071 | 0.974 |
| lr3e-04-ep3 | 0.0003 | 3 | 186 | 2.556 | 0.163 | 0.992 |
lr=0.01 achieves the lowest absolute eval loss (2.358 at 2 epochs) but with severe overfitting (ratio 1.12). lr=3e-4 learns too slowly -- im_end@1 barely reaches 0.16 even at 3 epochs. **lr=3e-3 at 2 epochs** is the Pareto optimum: strong eval loss (2.401) and im_end accuracy (0.424) with controlled overfitting (1.03).
### Books-CPT Base (6 configs)
| Config | LR (Muon) | Epochs | Steps | Eval Loss | im_end@1 | Overfit Ratio |
|--------|-----------|--------|-------|-----------|----------|---------------|
| lr1e-02-ep1 | 0.01 | 1 | 62 | 2.506 | 0.424 | 0.986 |
| lr1e-02-ep2 | 0.01 | 2 | 124 | 2.426 | 0.424 | 1.124 |
| lr1e-03-ep1 | 0.001 | 1 | 62 | 2.758 | 0.076 | 0.964 |
| lr1e-03-ep2 | 0.001 | 2 | 124 | 2.657 | 0.293 | 1.010 |
| **lr3e-03-ep1** | **0.003** | **1** | **62** | 2.620 | 0.353 | 0.972 |
| **lr3e-03-ep2** | **0.003** | **2** | **124** | **2.517** | **0.397** | **1.042** |
Same pattern as fullcorpus: lr=3e-3 at 2 epochs is the best balance. lr=0.01 overfits aggressively by epoch 2.
## Usage
### Loading
```python
import torch

from src.model.llama import LuxiaBaseModel, LuxiaModelConfig

# Build config (proxy architecture)
config = LuxiaModelConfig(
    hidden_size=512,
    num_layers=28,
    num_attention_heads=4,
    num_kv_heads=2,
    head_dim=128,
    intermediate_size=1408,
    vocab_size=49152,
    max_position_embeddings=4096,
    rope_theta=500000.0,
    tie_word_embeddings=True,
    attn_res=True,
    attn_res_boundaries=[0, 3, 7, 12, 21, 25],
)

model = LuxiaBaseModel(config)
state_dict = torch.load("fc-instruct.pt", map_location="cpu")
model.load_state_dict(state_dict["model"])
model.eval()
```
### Chat Template (ChatML)
```
<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
```
The model uses SmolLM2's tokenizer (`HuggingFaceTB/SmolLM2-135M`) with ChatML special tokens:
- `<|im_start|>` = token 1
- `<|im_end|>` = token 2
- `<|endoftext|>` = token 0 (pad token)
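With those token conventions, the generation prompt for a conversation can also be assembled by hand. This sketch is equivalent in intent to `tokenizer.apply_chat_template(..., add_generation_prompt=True)`, though the exact whitespace of SmolLM2's template is an assumption; `build_chatml_prompt` is a hypothetical helper:

```python
def build_chatml_prompt(messages):
    """Render messages into ChatML and open an assistant turn at the end."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # generation prompt
    return "".join(parts)

prompt = build_chatml_prompt([{"role": "user", "content": "Hello, how are you?"}])
```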
### Inference Example
```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

messages = [
    {"role": "user", "content": "Tell me about the nature of consciousness."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Autoregressive generation loop (LuxiaBaseModel has no .generate())
generated = input_ids.to(model.embed_tokens.weight.device)
with torch.no_grad():
    for _ in range(512):
        out = model(input_ids=generated)
        logits = out["logits"][:, -1, :] / 0.8  # temperature 0.8
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        generated = torch.cat([generated, next_token], dim=1)
        if next_token.item() == 2:  # <|im_end|>
            break

print(tokenizer.decode(generated[0], skip_special_tokens=False))
```
**Sampling note:** At 108M scale, avoid top-p sampling -- it catastrophically degrades output quality. Use pure temperature sampling only.
## Limitations
- **108M scale.** This is a proxy-scale research model. It demonstrates that the architecture and training pipeline work, but the model's generation quality is fundamentally limited by parameter count. It is not suitable for production use.
- **Not assistant-shaped.** The model has learned turn-taking structure but has not been trained to be helpful, harmless, or honest. It may produce incoherent, offensive, or factually incorrect outputs.
- **HF Trainer throughput.** The SFT sweep used HuggingFace Trainer rather than the pretraining DDP pipeline. This was a pragmatic choice for sweep automation but means throughput (~90K tok/s) is below what the custom DDP trainer achieves.
- **No geometric probes in sweep.** The pretraining pipeline includes geometric monitoring (intrinsic dimension, stable rank, attention entropy). These were not instrumented in the SFT sweep, so we cannot directly measure whether SFT preserves the geometric properties of the base models.
- **im_start accuracy near zero.** The model reliably learns to predict turn-end tokens but not turn-start tokens. This is likely a consequence of the small SFT corpus size and the high predictability of turn-start positions from context.
- **Small SFT corpus.** At 8.1M tokens (5.5M trainable), the SFT dataset is deliberately minimal. This is sufficient for learning turn structure but not for deep behavioral fine-tuning.
## Links
- **Base model:** [aethera-gp/kotodama-108m-base](https://huggingface.co/aethera-gp/kotodama-108m-base)
- **Training code:** [github.com/aethera-gp/kotodama](https://github.com/aethera-gp/kotodama) (posttraining/)
- **Wandb project:** [aethera/kotodama-sft-sweep](https://wandb.ai/aethera/kotodama-sft-sweep)
- **SFT data sources:**
- [Infinite Backrooms](https://dreams-of-an-electric-mind.webflow.io/) by @andyayrey
- [OpenAssistant/oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2)