---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- causal-lm
- sft
- chatml
- attention-residuals
- muon
- research
base_model: aethera-gp/kotodama-108m-base
pipeline_tag: text-generation
---

# Kotodama 108M Instruct

A 108M parameter instruction-tuned transformer trained with full-parameter SFT on freeform dialogue data. This is **not** an assistant-shaped model -- the SFT objective is learning turn structure (ChatML delimiters, turn-taking cadence) rather than instruction-following or helpfulness optimization.

Two instruct variants are provided, corresponding to SFT on each of the two base model checkpoints:

| Variant | File | Base | Eval Loss | im_end@1 | Overfit Gap |
|---------|------|------|-----------|----------|-------------|
| **Fullcorpus Instruct** | `fc-instruct.pt` | fullcorpus-ddv1 step 81252 | **2.401** | **0.446** | 0.070 |
| **Books-CPT Instruct** | `bcpt-instruct.pt` | books-cpt step 17336 | 2.517 | 0.397 | 0.102 |

Both were trained with identical SFT data and hyperparameters. The fullcorpus variant wins on absolute metrics; the books-CPT variant exhibits superior gradient properties during training.
## Base Model

Both variants build on [kotodama-108m-base](https://huggingface.co/aethera-gp/kotodama-108m-base), a from-scratch Llama-family transformer trained with the Muon optimizer and Block Attention Residuals (AttnRes).

**Proxy architecture (108M):**

| Parameter | Value |
|-----------|-------|
| d_model | 512 |
| n_layers | 28 |
| Query heads | 4 |
| KV heads | 2 (GQA 2:1) |
| head_dim | 128 |
| FFN intermediate | 1408 (SwiGLU) |
| Vocab size | 49152 (SmolLM2 tokenizer) |
| Max position | 4096 (RoPE, theta=500K) |
| Normalization | RMSNorm + QK-norm |
| Tied embeddings | Yes |
| Bias | None |
| z-loss | 1e-5 |
| AttnRes | DD-v1, boundaries [0, 3, 7, 12, 21, 25] |

The fullcorpus base was pretrained on 170.4B tokens (13 sources, academic/code-reasoning/math/legal/books/conversation). The books-CPT variant continued pretraining on 36.4B tokens of public domain books (Common Pile: Internet Archive, Library of Congress, DOAB).

## SFT Data

8.1M total tokens (5.5M trainable, 68.4% trainable ratio), 6,187 conversations split 90/10 train/eval. Pretokenized with ChatML template, chunked at turn boundaries to fit 4096 seq_len, packed with first-fit-decreasing bin packing into 1,976 fixed-length bins.

| Source | Conversations | Est. Tokens | Description |
|--------|--------------|-------------|-------------|
| Infinite Backrooms | 821 | ~5.7M | Model-to-model freeform dialogue between Claude instances |
| OASST2 top-ranked | 5,366 | ~2.8M | Human multi-turn conversations (rank==0, English only) |

**Data philosophy.** The SFT corpus is deliberately composed of freeform dialogue rather than instruction-following data. Infinite Backrooms conversations (scraped from dreams-of-an-electric-mind.webflow.io) capture two Claude instances in unstructured, extended conversation across 19 scenario types and multiple model generations (Opus 3, Sonnet 3.5, Opus 4, Sonnet 4, Sonnet 4.5). OASST2 contributes genuine human conversational patterns via the highest-ranked response path through each conversation tree.

Explicitly excluded: Alpaca, SlimOrca, UltraChat (too assistant-shaped), ShareGPT/WildChat (noisy, refusal artifacts), SODA/SmolTalk (already in pretraining data).

**Processing details:**
- Backrooms: actor names discovered dynamically per conversation; first speaker mapped to `user`, second to `assistant`. OOC preamble stripped, ANSI escape codes stripped, conversations with fewer than 3 turns dropped.
- OASST2: re-extracted from HuggingFace raw data (not the curation pipeline output, which lost rank metadata). Follows rank==0 path at each tree branch.
- Chunking: 571 conversations exceeded 4096 tokens and were split at turn boundaries into 1,705 non-overlapping chunks. Only 1/6,703 examples was truncated after chunking.
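
The first-fit-decreasing packing step above can be illustrated in a few lines. This is a minimal sketch under assumed names (`pack_ffd`, token-length inputs), not the pipeline's actual packing code, and it omits the block-diagonal attention masks the real packer also has to emit:

```python
def pack_ffd(example_lengths, bin_size=4096):
    """First-fit-decreasing: take examples longest-first and drop each into
    the first bin with enough remaining room, opening a new bin otherwise."""
    order = sorted(range(len(example_lengths)), key=lambda i: -example_lengths[i])
    bins, free = [], []  # bins[b] = example indices, free[b] = remaining tokens
    for i in order:
        length = example_lengths[i]
        for b in range(len(bins)):
            if length <= free[b]:
                bins[b].append(i)
                free[b] -= length
                break
        else:
            bins.append([i])
            free.append(bin_size - length)
    return bins

# Three pretokenized chunks of 3000, 2500, and 1000 tokens -> two 4096-token bins
print(pack_ffd([3000, 2500, 1000]))  # [[0, 2], [1]]
```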

## Training

### Hyperparameter Sweep

An 18-config sweep was run across both base models: 12 configs for fullcorpus (4 learning rates x 3 epoch counts) and 6 for books-CPT (3 learning rates x 2 epoch counts). All configs used a flat LR schedule (warmup 5%, `wsd_decay_start: 1.0` -- no decay phase).

**Winner for both bases:** Muon lr=3e-3 (AdamW lr=3e-4), 2 epochs.

Selection criteria: lowest eval loss with overfit ratio below 1.05 and highest im_end@1 among non-overfitting configs.
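
A minimal sketch of that selection rule, assuming each config is summarized as a dict with `eval_loss`, `im_end_at_1`, and `overfit_ratio` keys (field names are illustrative, not from the sweep code); eval loss and im_end@1 are combined lexicographically here, which is one reasonable reading of the criteria:

```python
def pick_winner(results, max_overfit_ratio=1.05):
    """Drop overfitting configs, then prefer lowest eval loss,
    breaking ties by highest im_end@1."""
    kept = [r for r in results if r["overfit_ratio"] < max_overfit_ratio]
    return min(kept, key=lambda r: (r["eval_loss"], -r["im_end_at_1"]))

# A few rows from the fullcorpus sweep table further below
sweep = [
    {"name": "lr1e-02-ep2", "eval_loss": 2.358, "im_end_at_1": 0.418, "overfit_ratio": 1.121},
    {"name": "lr3e-03-ep2", "eval_loss": 2.401, "im_end_at_1": 0.424, "overfit_ratio": 1.030},
    {"name": "lr3e-04-ep3", "eval_loss": 2.556, "im_end_at_1": 0.163, "overfit_ratio": 0.992},
]
print(pick_winner(sweep)["name"])  # lr3e-03-ep2
```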

### Winning Config

```yaml
# Shared across both variants
muon_lr: 0.003
adamw_lr: 0.0003
num_epochs: 2
batch_size: 4 # per GPU
gradient_accumulation: 1
max_seq_len: 4096
bf16: true
max_grad_norm: 1.0
warmup_ratio: 0.05
wsd_decay_start: 1.0 # flat LR, no decay
muon_momentum: 0.95
muon_weight_decay: 0.01
muon_ns_iterations: 5
muon_ns_coefficients: gram_ns
adamw_betas: [0.9, 0.95]
adamw_weight_decay: 0.1
packed: true # FFD bin-packed with block-diagonal SDPA masks
attn_res: true
attn_res_boundaries: [0, 3, 7, 12, 21, 25]
```

AttnRes routing weights were **frozen** during SFT -- only the base model parameters were updated. The Muon optimizer handles all 2D weight matrices; AdamW handles embeddings, layer norms, and AttnRes parameters.
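
A minimal sketch of that parameter split, assuming parameters are routed by name and dimensionality (the `attn_res`/`embed` substrings are assumed names; the actual grouping lives in the training code):

```python
def sft_param_groups(model):
    """Freeze AttnRes routing weights, send hidden 2D weight matrices to Muon,
    and leave everything else (embeddings, norm gains, other 1D params) to AdamW."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if "attn_res" in name:
            p.requires_grad_(False)   # frozen for SFT, so excluded from both optimizers
        elif p.ndim == 2 and "embed" not in name:
            muon_params.append(p)     # attention / MLP matrices -> Muon
        else:
            adamw_params.append(p)    # embeddings, RMSNorm gains, other 1D params -> AdamW
    return muon_params, adamw_params
```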

### Hardware

- 8x GPU (B200), single node
- HuggingFace Trainer (not DDP torchrun)
- ~90K tokens/sec throughput
- ~98 GiB GPU memory allocated
- Async checkpoints with SHM staging

## Variants

The 2x2 comparison (two base checkpoints, each with and without SFT) reveals a clear tradeoff:

### Fullcorpus Instruct (`fc-instruct.pt`)

- **Lower eval loss** (2.401 vs 2.517) -- a 4.6% advantage
- **Higher im_end@1** (0.446 vs 0.397) -- better turn boundary prediction
- **Less overfit** (gap 0.070 vs 0.102, ratio 1.030 vs 1.042)
- Recommended as the primary instruct variant

### Books-CPT Instruct (`bcpt-instruct.pt`)

- **Substantially lower gradient norm variance** -- more stable and uniform gradient flow across layers throughout training
- **1.41x faster eval loss descent** (mean slope -0.00321 vs -0.00227; a computation sketch follows this list) -- learns the SFT objective more efficiently per step
- Higher absolute loss reflects the books-CPT base starting from a different loss surface (books domain shift), not SFT quality
- The gradient uniformity advantage from continued pretraining on books survives SFT intact
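
A sketch of how a mean eval-loss slope like the one quoted above can be computed. The exact method is not specified, and the reported values presumably come from denser logging than the four tabulated checkpoints, so this will not reproduce them exactly:

```python
import numpy as np

def mean_eval_slope(steps, eval_losses):
    """Mean per-step slope of the eval-loss curve via a least-squares line fit."""
    slope, _intercept = np.polyfit(steps, eval_losses, deg=1)
    return slope

# Fullcorpus instruct checkpoints from the trajectory table below
print(mean_eval_slope([25, 50, 75, 100], [2.557, 2.477, 2.433, 2.401]))  # about -0.002
```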

## Evaluation

### Fullcorpus Instruct -- Training Trajectory

| Step | Eval Loss | Eval PPL | Overfit Gap | Overfit Ratio | im_end@1 | im_end@5 |
|------|-----------|----------|-------------|---------------|----------|----------|
| 25 | 2.557 | 12.90 | -0.158 | 0.942 | 0.092 | 0.337 |
| 50 | 2.477 | 11.91 | -0.122 | 0.953 | 0.337 | 0.538 |
| 75 | 2.433 | 11.39 | 0.057 | 1.024 | 0.370 | 0.543 |
| 100 | **2.401** | **11.04** | 0.070 | 1.030 | 0.386 | 0.543 |
| 110 | -- | -- | -- | -- | **0.446** | 0.554 |
| 120 | -- | -- | -- | -- | 0.424 | 0.560 |

Best eval loss: **2.401** at step 100. Best im_end@1: **0.446** at step 110.

### Books-CPT Instruct -- Training Trajectory

| Step | Eval Loss | Eval PPL | Overfit Gap | Overfit Ratio | im_end@1 | im_end@5 |
|------|-----------|----------|-------------|---------------|----------|----------|
| 25 | 2.737 | 15.44 | -0.126 | 0.956 | 0.033 | 0.245 |
| 50 | 2.625 | 13.80 | -0.076 | 0.972 | 0.304 | 0.500 |
| 75 | 2.561 | 12.95 | 0.091 | 1.037 | 0.348 | 0.522 |
| 100 | **2.517** | **12.39** | 0.102 | 1.042 | 0.359 | 0.543 |
| 110 | -- | -- | -- | -- | 0.391 | 0.560 |
| 120 | -- | -- | -- | -- | **0.397** | **0.576** |

Best eval loss: **2.517** at step 100. Best im_end@1: **0.397** at step 120.

### Metric Definitions

- **im_end@1 / im_end@5:** Top-1 / top-5 accuracy of predicting the `<|im_end|>` token at actual turn boundaries in the eval set. Measures whether the model has learned when to stop generating within a turn. (A computation sketch follows this list.)
- **im_start@1:** Top-1 accuracy for `<|im_start|>` prediction. Near-zero for both variants (0.0 in most checkpoints), indicating the model has not learned to predict turn-start tokens -- expected given the small SFT corpus and the fact that turn starts are mostly predictable from context.
- **Overfit gap:** eval_loss - train_loss. Positive values (eval loss above train loss) indicate overfitting.
- **Overfit ratio:** eval_loss / train_loss. Values above 1.0 indicate overfitting.
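
A minimal sketch of the im_end@k computation, assuming `<|im_end|>` is token id 2 and that `boundary_positions` indexes the positions whose next-token target is a turn end (names and tensor layout are illustrative, not the actual evaluation harness):

```python
import torch

def im_end_at_k(logits, boundary_positions, im_end_id=2, k=5):
    """Top-k accuracy of predicting <|im_end|> at turn boundaries.

    logits: [seq_len, vocab_size] next-token logits for one sequence.
    boundary_positions: positions whose prediction target is <|im_end|>.
    """
    hits = 0
    for pos in boundary_positions:
        topk_ids = logits[pos].topk(k).indices
        hits += int(im_end_id in topk_ids)
    return hits / max(len(boundary_positions), 1)

# im_end@1 is the same quantity with k=1
```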

## SFT Sweep Results

### Fullcorpus Base (12 configs)

| Config | LR (Muon) | Epochs | Steps | Eval Loss | im_end@1 | Overfit Ratio |
|--------|-----------|--------|-------|-----------|----------|---------------|
| lr1e-02-ep1 | 0.01 | 1 | 62 | 2.409 | 0.429 | 0.971 |
| lr1e-02-ep2 | 0.01 | 2 | 124 | 2.358 | 0.418 | 1.121 |
| lr1e-02-ep3 | 0.01 | 3 | 186 | 2.373 | 0.391 | 1.292 |
| lr1e-03-ep1 | 0.001 | 1 | 62 | 2.569 | 0.168 | 0.942 |
| lr1e-03-ep2 | 0.001 | 2 | 124 | 2.496 | 0.332 | 0.992 |
| lr1e-03-ep3 | 0.001 | 3 | 186 | 2.440 | 0.397 | 1.023 |
| **lr3e-03-ep1** | **0.003** | **1** | **62** | 2.474 | 0.364 | 0.954 |
| **lr3e-03-ep2** | **0.003** | **2** | **124** | **2.401** | **0.424** | **1.030** |
| lr3e-03-ep3 | 0.003 | 3 | 186 | 2.352 | 0.413 | 1.094 |
| lr3e-04-ep1 | 0.0003 | 1 | 62 | 2.679 | 0.022 | 0.939 |
| lr3e-04-ep2 | 0.0003 | 2 | 124 | 2.615 | 0.071 | 0.974 |
| lr3e-04-ep3 | 0.0003 | 3 | 186 | 2.556 | 0.163 | 0.992 |
lr=0.01 drives eval loss down quickly (2.358 at 2 epochs) but with severe overfitting (ratio 1.12), and lr=3e-3 at 3 epochs reaches the lowest absolute loss in the sweep (2.352) while also overfitting (ratio 1.09). lr=3e-4 learns too slowly -- im_end@1 barely reaches 0.16 even at 3 epochs. **lr=3e-3 at 2 epochs** is the Pareto optimum: strong eval loss (2.401) and im_end accuracy (0.424) with controlled overfitting (1.03).

### Books-CPT Base (6 configs)

| Config | LR (Muon) | Epochs | Steps | Eval Loss | im_end@1 | Overfit Ratio |
|--------|-----------|--------|-------|-----------|----------|---------------|
| lr1e-02-ep1 | 0.01 | 1 | 62 | 2.506 | 0.424 | 0.986 |
| lr1e-02-ep2 | 0.01 | 2 | 124 | 2.426 | 0.424 | 1.124 |
| lr1e-03-ep1 | 0.001 | 1 | 62 | 2.758 | 0.076 | 0.964 |
| lr1e-03-ep2 | 0.001 | 2 | 124 | 2.657 | 0.293 | 1.010 |
| **lr3e-03-ep1** | **0.003** | **1** | **62** | 2.620 | 0.353 | 0.972 |
| **lr3e-03-ep2** | **0.003** | **2** | **124** | **2.517** | **0.397** | **1.042** |

Same pattern as fullcorpus: lr=3e-3 at 2 epochs is the best balance. lr=0.01 overfits aggressively by epoch 2.

## Usage

### Loading

```python
import torch
from src.model.llama import LuxiaBaseModel, LuxiaModelConfig

# Build config (proxy architecture)
config = LuxiaModelConfig(
    hidden_size=512,
    num_layers=28,
    num_attention_heads=4,
    num_kv_heads=2,
    head_dim=128,
    intermediate_size=1408,
    vocab_size=49152,
    max_position_embeddings=4096,
    rope_theta=500000.0,
    tie_word_embeddings=True,
    attn_res=True,
    attn_res_boundaries=[0, 3, 7, 12, 21, 25],
)

model = LuxiaBaseModel(config)
state_dict = torch.load("fc-instruct.pt", map_location="cpu")
model.load_state_dict(state_dict["model"])
model.eval()
```

### Chat Template (ChatML)

```
<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
```

The model uses SmolLM2's tokenizer (`HuggingFaceTB/SmolLM2-135M`) with ChatML special tokens:
- `<|im_start|>` = token 1
- `<|im_end|>` = token 2
- `<|endoftext|>` = token 0 (pad token)
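
If the base tokenizer does not ship a ChatML chat template, the prompt can also be assembled by hand. A minimal sketch (the newline placement is assumed to match the template shown above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

def to_chatml(messages, add_generation_prompt=True):
    """Render messages in the ChatML format shown above."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text

prompt = to_chatml([{"role": "user", "content": "Hello, how are you?"}])
ids = tokenizer.encode(prompt)
print(ids[0])  # expected to be 1 if <|im_start|> maps to token id 1
```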

### Inference Example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

messages = [
    {"role": "user", "content": "Tell me about the nature of consciousness."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Autoregressive generation (LuxiaBaseModel has no .generate())
generated = input_ids.to(model.embed_tokens.weight.device)
with torch.no_grad():
    for _ in range(512):
        out = model(input_ids=generated)
        logits = out["logits"][:, -1, :] / 0.8  # pure temperature sampling, T=0.8
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        generated = torch.cat([generated, next_token], dim=1)
        if next_token.item() == 2:  # <|im_end|>
            break

print(tokenizer.decode(generated[0], skip_special_tokens=False))
```

**Sampling note:** At 108M scale, avoid top-p sampling -- it catastrophically degrades output quality. Use pure temperature sampling only.

## Limitations

- **108M scale.** This is a proxy-scale research model. It demonstrates that the architecture and training pipeline work, but the model's generation quality is fundamentally limited by parameter count. It is not suitable for production use.
- **Not assistant-shaped.** The model has learned turn-taking structure but has not been trained to be helpful, harmless, or honest. It may produce incoherent, offensive, or factually incorrect outputs.
- **HF Trainer throughput.** The SFT sweep used HuggingFace Trainer rather than the pretraining DDP pipeline. This was a pragmatic choice for sweep automation but means throughput (~90K tok/s) is below what the custom DDP trainer achieves.
- **No geometric probes in sweep.** The pretraining pipeline includes geometric monitoring (intrinsic dimension, stable rank, attention entropy). These were not instrumented in the SFT sweep, so we cannot directly measure whether SFT preserves the geometric properties of the base models.
- **im_start accuracy near zero.** The model reliably learns to predict turn-end tokens but not turn-start tokens. This is likely a consequence of the small SFT corpus size and the high predictability of turn-start positions from context.
- **Small SFT corpus.** At 8.1M tokens (5.5M trainable), the SFT dataset is deliberately minimal. This is sufficient for learning turn structure but not for deep behavioral fine-tuning.

## Links

- **Base model:** [aethera-gp/kotodama-108m-base](https://huggingface.co/aethera-gp/kotodama-108m-base)
- **Training code:** [github.com/aethera-gp/kotodama](https://github.com/aethera-gp/kotodama) (posttraining/)
- **Wandb project:** [aethera/kotodama-sft-sweep](https://wandb.ai/aethera/kotodama-sft-sweep)
- **SFT data sources:**
  - [Infinite Backrooms](https://dreams-of-an-electric-mind.webflow.io/) by @andyayrey
  - [OpenAssistant/oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2)