# YatNMN-Softplus d=12 + Fineweb-Edu continued pretraining (261M)
A 261M-parameter nanochat-architecture GPT with a YatNMN-Softplus MLP, continued-pretrained on HuggingFaceFW/fineweb-edu for 5.24 B additional tokens, starting from mlnomad/yatnmn-softplus-d12-chinchilla-261M (C4 Chinchilla).
## Held-out loss evaluation (n=30 docs × 1024 tokens)
Compared against the base model (C4 Chinchilla only):
| Dataset | Base (C4) | This model | Δ |
|---|---|---|---|
| wikitext-103 | 4.2207 | 3.7283 | −0.49 |
| C4 | 2.9948 | 3.1443 | +0.15 |
| fineweb-edu | 3.1751 | 2.8385 | −0.34 |
- −0.49 on wikitext-103: a large improvement on formal/encyclopedic text this model was not directly trained on (a generalization win)
- −0.34 on fineweb-edu: direct improvement on the continuation distribution
- +0.15 on C4: mild regression on the base distribution (the classic fine-tuning trade-off; the model gave up some web-scrape boilerplate to specialize toward cleaner text)
## Qualitative comparison: generation samples
Prompt: "Photosynthesis is the process by which" (temp=0.8, 48 tokens)
| Model | Generation |
|---|---|
| Base (C4) | "all the images come to life. This is done when the light reflecting off the image is applied on the surface of the lens..." ❌ |
| This model | "plants convert sunlight into chemical energy. This energy is stored as long term, stored in the form of a chemical reaction..." ✅ |
The continued model has learned what photosynthesis actually is.
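The samples above were drawn at temperature 0.8. A generic temperature-sampling step looks like this (illustrative sketch, not the flaxchat generation loop):

```python
import numpy as np

def sample_token(logits, temperature=0.8, rng=np.random.default_rng(0)):
    """Sample one token id from a logit vector at the given temperature."""
    z = logits / temperature   # temperature < 1 sharpens, > 1 flattens
    z -= z.max()               # numerical stability before exp
    p = np.exp(z) / np.exp(z).sum()  # softmax
    return int(rng.choice(len(p), p=p))

# At temp=0.8 the highest logit is favoured but not deterministic;
# as temperature -> 0 this approaches greedy argmax decoding.
tok = sample_token(np.array([2.0, 1.0, 0.1]))
```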
## Training
| Architecture | Nanochat-style GPT with YatNMN-Softplus MLP (softplus bias, learnable epsilon, learnable alpha) |
| Parameters | 261,133,214 |
| Config | d=12, n_embd=768, n_head=12, n_kv_head=12, seq_len=1024, tied embeddings, SSSL window |
| Base checkpoint | mlnomad/yatnmn-softplus-d12-chinchilla-261M (loss 2.98 on C4 after Chinchilla-20× = 5.22 B tokens) |
| Continuation data | HuggingFaceFW/fineweb-edu (sample-10BT) |
| Continuation length | 20,000 steps × 262,144 tokens = 5.24 B extra tokens |
| Optimizer | plain AdamW, weight_decay=0.01, grad_clip_global_norm=1.0 |
| LR schedule | warmup-cosine-decay, peak lr=3e-3 (10× smaller than base 0.03), end lr=3e-4, warmup=200 |
| Batch | 32/device × 8 devices = 256 global |
| Seq length | 1024 |
| Hardware | TPU v6e-8 (TRC), europe-west4-a |
| Wall time | 2.2 h |
| Seed | 0 |
| Wandb | irf-sic/flaxchat, run yatnmn-softplus-d12-continue-fineweb-edu-seed0 |
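The LR schedule row can be sketched in pure Python, assuming linear warmup from zero and a standard cosine to the end value (the exact shape is defined in `continue_yatnmn_softplus_fineweb.py`):

```python
import math

def lr_at(step, peak=3e-3, end=3e-4, warmup=200, total=20_000):
    """Warmup-cosine-decay schedule with the values from the table above."""
    if step < warmup:
        return peak * step / warmup              # linear warmup: 0 -> peak
    t = (step - warmup) / (total - warmup)       # cosine phase in [0, 1]
    return end + 0.5 * (peak - end) * (1 + math.cos(math.pi * t))

lrs = [lr_at(s) for s in (0, 200, 20_000)]  # approximately [0.0, 3e-3, 3e-4]
```

optax ships an equivalent built-in (`optax.warmup_cosine_decay_schedule`), which is the likely choice in a flax/optax training loop.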
The optimizer state (Adam m/v moments + step count) is preserved in this checkpoint, so training can be resumed exactly.
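Why restoring m/v and the step count matters for exact resumption can be seen from a scalar sketch of the decoupled-weight-decay Adam update (illustrative, not the optax implementation):

```python
import math

def adamw_step(p, g, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update on a scalar parameter p with gradient g."""
    m = beta1 * m + (1 - beta1) * g          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
    t += 1
    mhat = m / (1 - beta1 ** t)              # bias correction uses the step count
    vhat = v / (1 - beta2 ** t)
    p = p - lr * (mhat / (math.sqrt(vhat) + eps) + wd * p)  # decoupled decay
    return p, m, v, t

# Continuing with restored moments vs. restarting with zeroed moments
# produces different updates from the very same gradient.
p1, m1, v1, t1 = adamw_step(1.0, 0.5, 0.0, 0.0, 0, 3e-3)
p_cont, *_ = adamw_step(p1, -0.5, m1, v1, t1, 3e-3)
p_fresh, *_ = adamw_step(p1, -0.5, 0.0, 0.0, 0, 3e-3)
```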
## Contents
```
.
├── 20000/                 # final Orbax checkpoint
│   ├── _CHECKPOINT_METADATA
│   ├── metadata/          # JSON: {"loss": 2.9378, "smooth": 2.8993, "resumed_from": ...}
│   ├── model/             # nnx.Param state (architecture weights)
│   └── optimizer/         # optax AdamW state (m/v + step count + LR state)
├── config.json            # full architecture + training config
├── README.md              # this file
└── code/                  # reference snapshots of flaxchat code (source of truth: github repo)
    ├── gpt.py             # GPT architecture
    ├── checkpoint.py      # Orbax save/restore (includes optimizer state)
    ├── config.py          # GPTConfig + FlaxChatConfig
    ├── train_d12_chinchilla.py              # original pretraining script
    ├── continue_yatnmn_softplus_fineweb.py  # the continuation script used here
    └── load_model.py      # minimal loader to run after cloning flaxchat
```
## Loading
Exactly the same pattern as the base model: clone flaxchat and use its code.
```bash
git clone https://github.com/mlnomadpy/flaxchat && cd flaxchat
pixi install   # or pip install as in the base model's README
huggingface-cli download mlnomad/yatnmn-softplus-d12-fineweb-edu-261M --local-dir ./hf_model
python code/load_model.py ./hf_model/20000
```
See the base model's README loading section for full details. The only difference is the path: point at ./hf_model/20000/ instead of ./hf_model/19922/.
## License
Apache 2.0.