# YatNMN-Softplus d=12 + FineWeb-Edu continued pretraining (261M)

A 261M-parameter nanochat-architecture GPT with a YatNMN-Softplus MLP, continued-pretrained on HuggingFaceFW/fineweb-edu for 5.2B additional tokens, starting from mlnomad/yatnmn-softplus-d12-chinchilla-261M (C4 Chinchilla).

## Held-out loss evaluation (n=30 docs × 1024 tokens)

Compared against the base model (C4 Chinchilla only):

| Dataset | Base (C4) | This model | Δ |
|---|---|---|---|
| wikitext-103 | 4.2207 | 3.7283 | −0.49 🏆 |
| C4 | 2.9948 | 3.1443 | +0.15 |
| fineweb-edu | 3.1751 | 2.8385 | −0.34 🏆 |

- **−0.49 on wikitext-103**: a large improvement on formal/encyclopedic text the model was not directly trained on (a generalization win)
- **−0.34 on fineweb-edu**: a direct improvement on the continuation distribution
- **+0.15 on C4**: a mild regression on the base distribution (the classic fine-tuning trade-off: the model gave up a bit of web-scrape boilerplate to specialize toward cleaner text)
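Since these are mean per-token cross-entropy losses, each Δ translates directly into a perplexity ratio (ppl = exp(loss)); a quick sanity check on the numbers in the table:

```python
import math

# Held-out losses (base, continued) from the table above; ppl = exp(mean token NLL)
losses = {
    "wikitext-103": (4.2207, 3.7283),
    "C4":           (2.9948, 3.1443),
    "fineweb-edu":  (3.1751, 2.8385),
}
ratios = {name: math.exp(cont - base) for name, (base, cont) in losses.items()}
for name, r in ratios.items():
    print(f"{name}: perplexity x{r:.2f} vs base")
```

So the −0.49 delta on wikitext-103 is roughly a 39% perplexity reduction, while the +0.15 on C4 costs about 16% perplexity there.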

## Qualitative comparison: generation samples

Prompt: "Photosynthesis is the process by which" (temp=0.8, 48 tokens)

| Model | Generation |
|---|---|
| Base (C4) | "all the images come to life. This is done when the light reflecting off the image is applied on the surface of the lens..." ❌ |
| This model | "plants convert sunlight into chemical energy. This energy is stored as long term, stored in the form of a chemical reaction..." ✅ |

The continuation model learned actual photosynthesis.
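For reference, temp=0.8 means the logits are divided by 0.8 before the softmax, mildly sharpening the distribution. A minimal sampler sketch of that procedure (illustrative only, not flaxchat's actual generation code):

```python
import math
import random

def sample_token(logits, temperature=0.8, rng=random):
    # Divide logits by the temperature, softmax, then draw one index.
    scaled = [l / temperature for l in logits]
    mx = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - mx) for s in scaled]
    r = rng.random() * sum(exps)               # inverse-CDF draw over unnormalized weights
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e
        if acc >= r:
            return i
    return len(exps) - 1

random.seed(0)
# With temp=0.8 and logits [5.0, 0.0], p(index 0) = exp(6.25)/(exp(6.25)+1) ~ 0.998,
# so nearly every draw returns index 0.
tokens = [sample_token([5.0, 0.0]) for _ in range(200)]
```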

## Training

| Setting | Value |
|---|---|
| Architecture | Nanochat-style GPT with YatNMN-Softplus MLP (softplus bias, learnable epsilon, learnable alpha) |
| Parameters | 261,133,214 |
| Config | d=12, n_embd=768, n_head=12, n_kv_head=12, seq_len=1024, tied embeddings, SSSL window |
| Base checkpoint | mlnomad/yatnmn-softplus-d12-chinchilla-261M (loss 2.98 on C4 after Chinchilla 20× = 5.22B tokens) |
| Continuation data | HuggingFaceFW/fineweb-edu (sample-10BT) |
| Continuation length | 20,000 steps × 262,144 tokens/step = 5.24B extra tokens |
| Optimizer | plain AdamW, weight_decay=0.01, grad_clip_global_norm=1.0 |
| LR schedule | warmup-cosine-decay, peak lr=3e-3 (10× lower than the base run's 0.03), end lr=3e-4, warmup=200 steps |
| Batch | 32/device × 8 devices = 256 global |
| Seq length | 1024 |
| Hardware | TPU v6e-8 (TRC), europe-west4-a |
| Wall time | 2.2 h |
| Seed | 0 |
| Wandb | irf-sic/flaxchat, run yatnmn-softplus-d12-continue-fineweb-edu-seed0 |
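The schedule above is a standard warmup-cosine-decay; a plain-Python sketch with this run's numbers (the actual training uses the schedule implemented in the flaxchat code):

```python
import math

def lr_at(step, peak=3e-3, end=3e-4, warmup=200, total=20000):
    """Warmup-cosine-decay with the continuation run's hyperparameters."""
    if step < warmup:
        return peak * step / warmup                      # linear warmup from 0
    frac = (step - warmup) / (total - warmup)            # 0 -> 1 over the decay phase
    return end + 0.5 * (peak - end) * (1 + math.cos(math.pi * frac))

# lr_at(0) == 0.0, lr_at(200) == 3e-3 (peak), lr_at(20000) == 3e-4 (end)
```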

The optimizer state (Adam m/v moments + step count) is preserved in this checkpoint, so training can be resumed exactly.
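Why all three parts matter: AdamW's bias correction depends on the step count, and the update depends on the m/v moments, so restoring all of them reproduces the uninterrupted trajectory bit-for-bit. A toy scalar demonstration of the principle (not the flaxchat/optax code):

```python
import math

def adamw_step(w, m, v, t, g, lr=3e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    # One AdamW update on a scalar weight; returns the new (w, m, v, t).
    t += 1
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    mhat = m / (1 - b1 ** t)       # bias correction needs the step count t
    vhat = v / (1 - b2 ** t)
    w -= lr * (mhat / (math.sqrt(vhat) + eps) + wd * w)  # decoupled weight decay
    return w, m, v, t

grads = [0.5, -0.2, 0.1, 0.3, -0.4, 0.2, 0.1, -0.1]

straight = (1.0, 0.0, 0.0, 0)      # uninterrupted run
for g in grads:
    straight = adamw_step(*straight, g)

resumed = (1.0, 0.0, 0.0, 0)       # run 4 steps, "checkpoint", then resume
for g in grads[:4]:
    resumed = adamw_step(*resumed, g)
saved = resumed                    # m, v and t all saved, as in this checkpoint
for g in grads[4:]:
    saved = adamw_step(*saved, g)

assert saved == straight           # bit-exact resume
```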

## Contents

```
.
├── 20000/                    # final Orbax checkpoint
│   ├── _CHECKPOINT_METADATA
│   ├── metadata/             # JSON: {"loss": 2.9378, "smooth": 2.8993, "resumed_from": ...}
│   ├── model/                # nnx.Param state: architecture weights
│   └── optimizer/            # optax AdamW state (m/v + step count + LR state)
├── config.json               # full architecture + training config
├── README.md                 # this file
└── code/                     # reference snapshots of flaxchat code (source of truth: github repo)
    ├── gpt.py                  # GPT architecture
    ├── checkpoint.py           # Orbax save/restore (includes optimizer state)
    ├── config.py               # GPTConfig + FlaxChatConfig
    ├── train_d12_chinchilla.py # original pretraining script
    ├── continue_yatnmn_softplus_fineweb.py  # the continuation script used here
    └── load_model.py           # minimal loader to run after cloning flaxchat
```

## Loading

Loading follows exactly the same pattern as the base model: clone flaxchat and use its code.

```bash
git clone https://github.com/mlnomadpy/flaxchat && cd flaxchat
pixi install   # or pip install as in the base model's README
huggingface-cli download mlnomad/yatnmn-softplus-d12-fineweb-edu-261M --local-dir ./hf_model
python code/load_model.py ./hf_model/20000
```

See the base model's README loading section for full detail. The only difference is the path: point at ./hf_model/20000/ instead of ./hf_model/19922/.

## License

Apache 2.0.
