# YatNMN-Softplus + scalar_bias + constant α=1 d=12 Chinchilla (261M) — PyTorch
A 261M-parameter nanochat-architecture GPT with a YatNMN-Softplus MLP in which α is fixed at 1 and the bias is a single shared scalar of shape (1,). This is the minimal YatNMN ablation: the pure (x·W+b)²/(||x−W||²+ε) formula without learnable per-layer scaling or per-neuron bias.
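The minimal unit described above can be sketched in a few lines of PyTorch. This is an illustrative reimplementation under stated assumptions, not the model's actual code; the class name, initialization, and the fixed ε value are hypothetical.

```python
import torch
import torch.nn as nn

class YatNMNScalarBias(nn.Module):
    """Illustrative sketch: y_j = (x·w_j + b)^2 / (||x - w_j||^2 + eps),
    with one shared scalar bias b and alpha fixed at 1 (no learnable scale)."""
    def __init__(self, d_in, d_out, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5)
        self.bias = nn.Parameter(torch.zeros(1))  # single shared scalar (1,)
        self.eps = eps

    def forward(self, x):
        # Numerator: squared affine response to each neuron's weight row.
        num = (x @ self.weight.t() + self.bias) ** 2
        # Denominator: squared distance, expanded as
        # ||x - w||^2 = ||x||^2 - 2 x·w + ||w||^2 (broadcasts to (B, d_out)).
        dist_sq = (x * x).sum(-1, keepdim=True) \
            - 2 * (x @ self.weight.t()) \
            + (self.weight * self.weight).sum(-1)
        return num / (dist_sq.clamp_min(0) + self.eps)

x = torch.randn(4, 12)
print(YatNMNScalarBias(12, 48)(x).shape)  # torch.Size([4, 48])
```

Note the output is nonnegative by construction: the numerator is a square and the denominator is a distance plus a positive ε.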
Result: the bare YatNMN formula alone barely beats GELU (−0.03 nats on C4). The 0.14-nat advantage of full YatNMN-Softplus comes from the synergy between per-neuron bias and learnable α, not from the formula itself.
## Ablation table (d=12, 261M, Chinchilla 20×, 3-seed mean)
| Variant | C4 smooth loss | WikiText-103 PPL | Δ vs GELU (nats) |
|---|---|---|---|
| YatNMN per-neuron + learnable α | 2.98 | 40.15 | −0.14 |
| YatNMN scalar_bias + learnable α | 3.06 | 39.53 | −0.06 |
| YatNMN per-neuron + constant α=1 | 3.10 | 67.09 | −0.02 |
| YatNMN scalar_bias + constant α=1 (this model) | 3.09 | 78.34 | −0.03 |
| GELU | 3.12 | 46.52 | baseline |
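The last column can be recomputed directly from the C4 column (delta = variant loss minus the 3.12 GELU baseline):

```python
# Recompute the "vs GELU" deltas from the C4 smooth-loss column above.
gelu = 3.12
variants = {
    "per-neuron + learnable alpha": 2.98,
    "scalar_bias + learnable alpha": 3.06,
    "per-neuron + constant alpha=1": 3.10,
    "scalar_bias + constant alpha=1": 3.09,
}
for name, loss in variants.items():
    print(f"{name}: {loss - gelu:+.2f}")  # -0.14, -0.06, -0.02, -0.03
```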
## Quick start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mlnomad/yatnmn-softplus-sb-ca-d12-chinchilla-261M-pytorch",
    trust_remote_code=True, dtype=torch.float32,
).eval()
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

ids = tokenizer("The meaning of life is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=50, do_sample=False,
                         use_cache=True, pad_token_id=tokenizer.eos_token_id or 0)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
## Config

Scalar (1,) bias + softplus_bias + learnable_epsilon + constant_alpha=True (α fixed at 1).

| Setting | Value |
|---|---|
| Parameters | 261,096,362 |
| Final smooth loss | 3.08 (3-seed mean 3.09 ± 0.01) |
| WikiText-103 PPL | 78.34 |
| Training data | allenai/c4, 5.22 B tokens (Chinchilla 20×) |
| Hardware | TPU v6e-8, europe-west4-a |
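One plausible reading of these flags, sketched below under stated assumptions (the parameterization is hypothetical, not the model's actual code): softplus_bias keeps the shared scalar bias positive via a softplus reparameterization, learnable_epsilon trains ε through the same trick, and constant_alpha=True drops any learnable output scale.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class YatNMNConfigSketch(nn.Module):
    """Hypothetical parameterization of the config flags above."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5)
        self.raw_bias = nn.Parameter(torch.zeros(1))     # shared scalar (1,) bias
        self.raw_eps = nn.Parameter(torch.tensor(-6.0))  # learnable_epsilon
        # constant_alpha=True: no learnable scale, alpha is fixed at 1.

    def forward(self, x):
        b = F.softplus(self.raw_bias)   # softplus_bias: bias stays positive
        eps = F.softplus(self.raw_eps)  # epsilon stays positive as it trains
        num = (x @ self.weight.t() + b) ** 2
        dist_sq = torch.cdist(x, self.weight) ** 2
        return num / (dist_sq + eps)    # alpha = 1, no extra scaling

print(YatNMNConfigSketch(12, 48)(torch.randn(4, 12)).shape)  # torch.Size([4, 48])
```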
## Related

- mlnomad/yatnmn-softplus-sb-d12-chinchilla-261M-pytorch — with learnable α (loss 3.06, PPL 39.5)
- mlnomad/yatnmn-softplus-ca-d12-chinchilla-261M-pytorch — per-neuron bias + constant α
- mlnomad/gelu-d12-chinchilla-261M-pytorch — GELU baseline
## License

Apache 2.0.