# YatNMN-Softplus d=12 + Fineweb-Edu continued pretraining (261M)
A 261M-parameter nanochat-architecture GPT with a YatNMN-Softplus MLP, continued-pretrained on HuggingFaceFW/fineweb-edu for 5.24 B additional tokens, starting from mlnomad/yatnmn-softplus-d12-chinchilla-261M (C4 Chinchilla).
## Held-out loss evaluation (n=30 docs × 1024 tokens)
Compared against the base model (C4 Chinchilla only):
| Dataset | Base (C4) | This model | Δ |
|---|---|---|---|
| wikitext-103 | 4.2207 | 3.7283 | −0.49 |
| C4 | 2.9948 | 3.1443 | +0.15 |
| fineweb-edu | 3.1751 | 2.8385 | −0.34 |
- −0.49 on wikitext-103: a large improvement on formal/encyclopedic text this model was not directly trained on (a generalization win)
- −0.34 on fineweb-edu: direct improvement on the continuation distribution
- +0.15 on C4: mild regression on the base distribution (the classic fine-tuning trade-off; the model gave up some web-scrape boilerplate to specialize toward cleaner text)
## Qualitative comparison: generation samples
Prompt: "Photosynthesis is the process by which" (temp=0.8, 48 tokens)
| Model | Generation |
|---|---|
| Base (C4) | "all the images come to life. This is done when the light reflecting off the image is applied on the surface of the lens..." ❌ |
| This model | "plants convert sunlight into chemical energy. This energy is stored as long term, stored in the form of a chemical reaction..." ✅ |
The continued model has learned what photosynthesis actually is.
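The samples above were drawn at temperature 0.8. A generic temperature-sampling step looks like this (illustrative sketch, not the flaxchat generation loop):

```python
import numpy as np

def sample_token(logits, temperature=0.8, rng=np.random.default_rng(0)):
    """Sample one token id from a logit vector at the given temperature."""
    z = logits / temperature   # temperature < 1 sharpens, > 1 flattens
    z -= z.max()               # numerical stability before exp
    p = np.exp(z) / np.exp(z).sum()  # softmax
    return int(rng.choice(len(p), p=p))

# At temp=0.8 the highest logit is favoured but not deterministic;
# as temperature -> 0 this approaches greedy argmax decoding.
tok = sample_token(np.array([2.0, 1.0, 0.1]))
```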
## Training
| Architecture | Nanochat-style GPT with YatNMN-Softplus MLP (softplus bias, learnable epsilon, learnable alpha) |
| Parameters | 261,133,214 |
| Config | d=12, n_embd=768, n_head=12, n_kv_head=12, seq_len=1024, tied embeddings, SSSL window |
| Base checkpoint | mlnomad/yatnmn-softplus-d12-chinchilla-261M (loss 2.98 on C4 after Chinchilla-20× = 5.22 B tokens) |
| Continuation data | HuggingFaceFW/fineweb-edu (sample-10BT) |
| Continuation length | 20,000 steps × 262,144 tokens = 5.24 B extra tokens |
| Optimizer | plain AdamW, weight_decay=0.01, grad_clip_global_norm=1.0 |
| LR schedule | warmup-cosine-decay, peak lr=3e-3 (10× smaller than base 0.03), end lr=3e-4, warmup=200 |
| Batch | 32/device × 8 devices = 256 global |
| Seq length | 1024 |
| Hardware | TPU v6e-8 (TRC), europe-west4-a |
| Wall time | 2.2 h |
| Seed | 0 |
| Wandb | irf-sic/flaxchat, run yatnmn-softplus-d12-continue-fineweb-edu-seed0 |
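The LR schedule row can be sketched in pure Python, assuming linear warmup from zero and a standard cosine to the end value (the exact shape is defined in `continue_yatnmn_softplus_fineweb.py`):

```python
import math

def lr_at(step, peak=3e-3, end=3e-4, warmup=200, total=20_000):
    """Warmup-cosine-decay schedule with the values from the table above."""
    if step < warmup:
        return peak * step / warmup              # linear warmup: 0 -> peak
    t = (step - warmup) / (total - warmup)       # cosine phase in [0, 1]
    return end + 0.5 * (peak - end) * (1 + math.cos(math.pi * t))

lrs = [lr_at(s) for s in (0, 200, 20_000)]  # approximately [0.0, 3e-3, 3e-4]
```

optax ships an equivalent built-in (`optax.warmup_cosine_decay_schedule`), which is the likely choice in a flax/optax training loop.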
The optimizer state (Adam m/v moments + step count) is preserved in this checkpoint, so training can be resumed exactly.
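Why restoring m/v and the step count matters for exact resumption can be seen from a scalar sketch of the decoupled-weight-decay Adam update (illustrative, not the optax implementation):

```python
import math

def adamw_step(p, g, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update on a scalar parameter p with gradient g."""
    m = beta1 * m + (1 - beta1) * g          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
    t += 1
    mhat = m / (1 - beta1 ** t)              # bias correction uses the step count
    vhat = v / (1 - beta2 ** t)
    p = p - lr * (mhat / (math.sqrt(vhat) + eps) + wd * p)  # decoupled decay
    return p, m, v, t

# Continuing with restored moments vs. restarting with zeroed moments
# produces different updates from the very same gradient.
p1, m1, v1, t1 = adamw_step(1.0, 0.5, 0.0, 0.0, 0, 3e-3)
p_cont, *_ = adamw_step(p1, -0.5, m1, v1, t1, 3e-3)
p_fresh, *_ = adamw_step(p1, -0.5, 0.0, 0.0, 0, 3e-3)
```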
## Contents
```
.
├── 20000/                 # final Orbax checkpoint
│   ├── _CHECKPOINT_METADATA
│   ├── metadata/          # JSON: {"loss": 2.9378, "smooth": 2.8993, "resumed_from": ...}
│   ├── model/             # nnx.Param state (architecture weights)
│   └── optimizer/         # optax AdamW state (m/v + step count + LR state)
├── config.json            # full architecture + training config
├── README.md              # this file
└── code/                  # reference snapshots of flaxchat code (source of truth: github repo)
    ├── gpt.py             # GPT architecture
    ├── checkpoint.py      # Orbax save/restore (includes optimizer state)
    ├── config.py          # GPTConfig + FlaxChatConfig
    ├── train_d12_chinchilla.py              # original pretraining script
    ├── continue_yatnmn_softplus_fineweb.py  # the continuation script used here
    └── load_model.py      # minimal loader to run after cloning flaxchat
```
## Loading
Exactly the same pattern as the base model: clone flaxchat and use its code.
```bash
git clone https://github.com/mlnomadpy/flaxchat && cd flaxchat
pixi install   # or pip install as in the base model's README
huggingface-cli download mlnomad/yatnmn-softplus-d12-fineweb-edu-261M --local-dir ./hf_model
python code/load_model.py ./hf_model/20000
```
See the base model's README loading section for full details. The only difference is the path: point at ./hf_model/20000/ instead of ./hf_model/19922/.
## License
Apache 2.0.