# YatNMN-Softplus + scalar_bias + constant α=1 d=12 Chinchilla (261M) — PyTorch
A 261M-parameter nanochat-architecture GPT with a YatNMN-Softplus MLP in which α is fixed at 1 and the bias is a single shared scalar of shape (1,). This is the minimal YatNMN ablation: the pure (x·W+b)²/(||x−W||²+ε) formula without learnable per-layer scaling or per-neuron bias.
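The minimal unit described above can be sketched in a few lines of PyTorch. This is an illustrative reimplementation under stated assumptions, not the model's actual code; the class name, initialization, and the fixed ε value are hypothetical.

```python
import torch
import torch.nn as nn

class YatNMNScalarBias(nn.Module):
    """Illustrative sketch: y_j = (x·w_j + b)^2 / (||x - w_j||^2 + eps),
    with one shared scalar bias b and alpha fixed at 1 (no learnable scale)."""
    def __init__(self, d_in, d_out, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5)
        self.bias = nn.Parameter(torch.zeros(1))  # single shared scalar (1,)
        self.eps = eps

    def forward(self, x):
        # Numerator: squared affine response to each neuron's weight row.
        num = (x @ self.weight.t() + self.bias) ** 2
        # Denominator: squared distance, expanded as
        # ||x - w||^2 = ||x||^2 - 2 x·w + ||w||^2 (broadcasts to (B, d_out)).
        dist_sq = (x * x).sum(-1, keepdim=True) \
            - 2 * (x @ self.weight.t()) \
            + (self.weight * self.weight).sum(-1)
        return num / (dist_sq.clamp_min(0) + self.eps)

x = torch.randn(4, 12)
print(YatNMNScalarBias(12, 48)(x).shape)  # torch.Size([4, 48])
```

Note the output is nonnegative by construction: the numerator is a square and the denominator is a distance plus a positive ε.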
Result: the bare YatNMN formula alone barely beats GELU (−0.03 nats on C4). The 0.14-nat advantage of full YatNMN-Softplus comes from the synergy between per-neuron bias and learnable α, not from the formula itself.
## Ablation table (d=12, 261M, Chinchilla 20×, 3-seed mean)
| Variant | C4 smooth loss | WikiText-103 PPL | Δ vs GELU (nats) |
|---|---|---|---|
| YatNMN per-neuron + learnable α | 2.98 | 40.15 | −0.14 |
| YatNMN scalar_bias + learnable α | 3.06 | 39.53 | −0.06 |
| YatNMN per-neuron + constant α=1 | 3.10 | 67.09 | −0.02 |
| YatNMN scalar_bias + constant α=1 (this model) | 3.09 | 78.34 | −0.03 |
| GELU | 3.12 | 46.52 | baseline |
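The last column can be recomputed directly from the C4 column (delta = variant loss minus the 3.12 GELU baseline):

```python
# Recompute the "vs GELU" deltas from the C4 smooth-loss column above.
gelu = 3.12
variants = {
    "per-neuron + learnable alpha": 2.98,
    "scalar_bias + learnable alpha": 3.06,
    "per-neuron + constant alpha=1": 3.10,
    "scalar_bias + constant alpha=1": 3.09,
}
for name, loss in variants.items():
    print(f"{name}: {loss - gelu:+.2f}")  # -0.14, -0.06, -0.02, -0.03
```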
## Quick start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mlnomad/yatnmn-softplus-sb-ca-d12-chinchilla-261M-pytorch",
    trust_remote_code=True, dtype=torch.float32,
).eval()
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

ids = tokenizer("The meaning of life is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=50, do_sample=False,
                         use_cache=True, pad_token_id=tokenizer.eos_token_id or 0)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
## Config

Scalar (1,) bias + softplus_bias + learnable_epsilon + constant_alpha=True (α fixed at 1).

| Setting | Value |
|---|---|
| Parameters | 261,096,362 |
| Final smooth loss | 3.08 (3-seed mean 3.09 ± 0.01) |
| WikiText-103 PPL | 78.34 |
| Training data | allenai/c4, 5.22 B tokens (Chinchilla 20×) |
| Hardware | TPU v6e-8, europe-west4-a |
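One plausible reading of these flags, sketched below under stated assumptions (the parameterization is hypothetical, not the model's actual code): softplus_bias keeps the shared scalar bias positive via a softplus reparameterization, learnable_epsilon trains ε through the same trick, and constant_alpha=True drops any learnable output scale.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class YatNMNConfigSketch(nn.Module):
    """Hypothetical parameterization of the config flags above."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5)
        self.raw_bias = nn.Parameter(torch.zeros(1))     # shared scalar (1,) bias
        self.raw_eps = nn.Parameter(torch.tensor(-6.0))  # learnable_epsilon
        # constant_alpha=True: no learnable scale, alpha is fixed at 1.

    def forward(self, x):
        b = F.softplus(self.raw_bias)   # softplus_bias: bias stays positive
        eps = F.softplus(self.raw_eps)  # epsilon stays positive as it trains
        num = (x @ self.weight.t() + b) ** 2
        dist_sq = torch.cdist(x, self.weight) ** 2
        return num / (dist_sq + eps)    # alpha = 1, no extra scaling

print(YatNMNConfigSketch(12, 48)(torch.randn(4, 12)).shape)  # torch.Size([4, 48])
```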
## Related

- mlnomad/yatnmn-softplus-sb-d12-chinchilla-261M-pytorch — with learnable α (loss 3.06, PPL 39.5)
- mlnomad/yatnmn-softplus-ca-d12-chinchilla-261M-pytorch — per-neuron bias + constant α
- mlnomad/gelu-d12-chinchilla-261M-pytorch — GELU baseline
## License

Apache 2.0.