---
language: en
license: apache-2.0
tags:
  - causal-lm
  - research
  - fp8
  - attention
  - normalization
  - neollm
datasets:
  - HuggingFaceFW/fineweb-edu
---

# NeoLLM

NeoLLM is a **113 M parameter** decoder-only language model trained from scratch on [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) in **FP8** precision, completing training in approximately **1 hour 15 minutes** on a single NVIDIA RTX 5090. It integrates a collection of recently published attention and normalization techniques into a single architecture, with the goal of studying how they interact during pretraining. The model is under active development, and the current checkpoint is an intermediate training state.

> **Author / contact:** [@Kyokopom](https://x.com/Kyokopom) on X
>
> **Repository:** [KitsuVp/NeoLLM](https://huggingface.co/KitsuVp/NeoLLM)

---

## Architecture

NeoLLM is a decoder-only transformer with the following configuration:

| Parameter | Value |
|---|---|
| Hidden size | 512 |
| Layers | 12 |
| Attention heads | 8 |
| KV heads (GQA) | 4 |
| Head dim | 64 |
| Intermediate size | 1536 |
| Vocabulary | Qwen3 tokenizer (64,402 tokens) |
| Context length | 512 tokens |

### Parameter breakdown

| Parameter bucket | Count |
|---|---|
| **Total parameters** | 113.11M (113,112,016) |
| **Embedding parameters** (tied) | 32.97M (32,973,824) |
| **Non-embedding parameters** | 80.14M (80,138,192) |
| **Effective trainable parameters** | 113.11M (113,112,016) |

> Weight tying is **enabled**: the input embedding matrix and the language-model head
> share the same parameters, so the embedding matrix is counted once in the total.
> Excluding it leaves `total − embed = 80.14M` non-embedding parameters.

### Integrated techniques

Each layer combines the following mechanisms simultaneously.

**Normalization and residual stream**

- **SeeDNorm** ([arXiv:2510.22777](https://arxiv.org/abs/2510.22777)): Applied to Q and K projections.
  Dynamically rescales the normalization based on the input's own statistics, making the attention geometry more stable across varying input distributions.
- **PolyNorm** ([arXiv:2602.04902](https://arxiv.org/abs/2602.04902)): Replaces the standard MLP activation with three branches, linear (x), quadratic (x²), and cubic (x³), each normalized and combined with learned weights. This allows the MLP to express both linear and non-linear relationships simultaneously.
- **GPAS** ([arXiv:2506.22049](https://arxiv.org/abs/2506.22049)): Gradient-Preserving Activation Scaling. Applied to residual connections between sublayers; helps gradients flow more cleanly during training without distorting the residual stream.
- **LayerNorm Scaling / LNS** ([arXiv:2502.05795](https://arxiv.org/abs/2502.05795)): Each layer's output is scaled by 1/√ℓ, where ℓ is the layer index. Directly addresses the "Curse of Depth" in Pre-LN transformers.

**Attention mechanisms**

- **FAN** ([arXiv:2502.21309](https://arxiv.org/abs/2502.21309)): Fourier Analysis Networks. A portion of the input projection channels is dedicated to representing periodic patterns (cosine/sine pairs), while the remainder handles standard linear content.
- **MEA** ([arXiv:2601.19611](https://arxiv.org/abs/2601.19611)): Explicit Multi-head Attention. Adds small learnable interaction matrices between attention heads for K and V.
- **LUCID** ([arXiv:2602.10410](https://arxiv.org/abs/2602.10410)): Applies a learned lower-triangular preconditioner to V before attention, decorrelating value representations across positions.
- **Affine-Scaled Attention** ([arXiv:2602.23057](https://arxiv.org/abs/2602.23057)): Adds two learnable per-head scalars (α and β) to the softmax weights: `[α·softmax(QKᵀ) + β]·V`.
- **XSA** ([arXiv:2603.09078](https://arxiv.org/abs/2603.09078)): Exclusive Self Attention. After computing attention, removes the component of the output aligned with the token's own value vector.
- **Directional Routing** ([arXiv:2603.14923](https://arxiv.org/abs/2603.14923)): Each head learns K=4 directions in the output space; a learned router suppresses the attention output along each direction per input.
- **Gated Attention** ([arXiv:2505.06708](https://arxiv.org/abs/2505.06708)): A sigmoid gate is applied to the attention output before the output projection, introducing non-linearity and mitigating attention sinks.
- **Momentum Attention** ([arXiv:2411.03884](https://arxiv.org/abs/2411.03884)): Modifies Q and K by subtracting a fraction of the previous position's Q and K values (causal first-difference).

**MLP**

- **Learnable Multipliers** ([arXiv:2601.04890](https://arxiv.org/abs/2601.04890)): Adds per-row and per-column learnable scalar parameters to each linear layer.
- **SimpleGPT** ([arXiv:2602.01212](https://arxiv.org/abs/2602.01212)): A normalization strategy derived from second-order geometry analysis, applied inside MLP projections to improve optimization stability.

---

## Training

| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu (sample-10BT) |
| Tokens seen | ~0.51B (15,625 steps × batch 64 × length 512) |
| Precision | FP8 native (E4M3 weights/activations, E5M2 gradients) + BF16 fallback |
| Optimizer | Conda (Column-Normalized Adam) + GPA |
| Learning rate | 6e-04 with linear warmup (10% of steps) |
| Weight decay | 0.1 |
| Training time | ~1h 15m |
| Hardware | NVIDIA RTX 5090 (single GPU) |

### Training curve

| Step | Train Loss | Val Loss |
|---|---|---|
| 5,000 | 4.098 | 4.042 |
| 10,000 | 3.793 | 3.750 |
| 15,000 | 3.665 | 3.614 |
| 15,625 | n/a | 3.607 |

---

## Limitations

- **Token budget**: ~0.51 B tokens seen, which is below the estimated optimum. Performance on knowledge-intensive tasks is expected to improve with more training.
- **Gradient spike at step 40k**: The spike reorganized the attention pattern in layer 9, which previously captured long-range token correlations. A checkpoint from ~step 38k is expected to have better aggregate benchmark scores.
- **PolyNorm exclusivity**: The quadratic branch has become partially redundant with the linear branch. Will be corrected in the next training run.
- **Base model only**: Not instruction-tuned or aligned; purely a next-token-prediction base model.

---

## References

All papers whose techniques are integrated into NeoLLM's architecture:

| Technique | Paper title | arXiv |
|---|---|---|
| SeeDNorm | Self-Rescaled Dynamic Normalization | [2510.22777](https://arxiv.org/abs/2510.22777) |
| MEA | Explicit Multi-head Attention | [2601.19611](https://arxiv.org/abs/2601.19611) |
| Learnable Multipliers | Freeing the Scale of Language Model Matrix Layers | [2601.04890](https://arxiv.org/abs/2601.04890) |
| Directional Routing | Directional Routing in Transformers | [2603.14923](https://arxiv.org/abs/2603.14923) |
| XSA | Exclusive Self Attention | [2603.09078](https://arxiv.org/abs/2603.09078) |
| Gated Attention | Gated Attention for LLMs | [2505.06708](https://arxiv.org/abs/2505.06708) |
| Affine-Scaled Attention | Affine-Scaled Attention | [2602.23057](https://arxiv.org/abs/2602.23057) |
| LNS | The Curse of Depth in LLMs | [2502.05795](https://arxiv.org/abs/2502.05795) |
| LUCID | Attention with Preconditioned Representations | [2602.10410](https://arxiv.org/abs/2602.10410) |
| FAN | Fourier Analysis Networks | [2502.21309](https://arxiv.org/abs/2502.21309) |
| SimpleGPT | SimpleGPT | [2602.01212](https://arxiv.org/abs/2602.01212) |
| GPAS | Gradient-Preserving Activation Scaling | [2506.22049](https://arxiv.org/abs/2506.22049) |
| PolyNorm | PolyNorm / PolyCom | [2602.04902](https://arxiv.org/abs/2602.04902) |
| Momentum Attention | Momentum Attention | [2411.03884](https://arxiv.org/abs/2411.03884) |
| TWEO (analysis ref.) | Transformers Without Extreme Outliers | [2511.23225](https://arxiv.org/abs/2511.23225) |

---

## Citation

```bibtex
@misc{neollm2026,
  title  = {NeoLLM: A Research Language Model Integrating Recent Attention and Normalization Techniques},
  author = {KitsuVp},
  year   = {2026},
  url    = {https://huggingface.co/KitsuVp/NeoLLM}
}
```

---

## Author

[@Kyokopom](https://x.com/Kyokopom) on X

---

## License

Apache 2.0
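## Appendix: checking the reported counts

The figures in the Architecture and Training tables can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the NeoLLM code base: the constants are copied from the tables in this card, the helper names `embedding_params` and `tokens_seen` are hypothetical, and it assumes that "batch 64" means 64 sequences of 512 tokens per optimizer step.

```python
# Values copied from the Architecture, Parameter breakdown, and Training tables.
VOCAB_SIZE = 64_402          # Qwen3 tokenizer size as reported in this card
HIDDEN_SIZE = 512
NUM_HEADS = 8
TOTAL_PARAMS = 113_112_016
STEPS, BATCH_SIZE, SEQ_LEN = 15_625, 64, 512


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Size of the tied embedding / LM-head matrix, counted once."""
    return vocab_size * hidden_size


def tokens_seen(steps: int, batch_size: int, seq_len: int) -> int:
    """Tokens processed, assuming every step uses a full batch of full-length sequences."""
    return steps * batch_size * seq_len


embed = embedding_params(VOCAB_SIZE, HIDDEN_SIZE)
non_embed = TOTAL_PARAMS - embed          # "total - embed" from the weight-tying note
head_dim = HIDDEN_SIZE // NUM_HEADS       # per-head width implied by the config
tokens = tokens_seen(STEPS, BATCH_SIZE, SEQ_LEN)

print(f"embedding params:     {embed:,}")
print(f"non-embedding params: {non_embed:,}")
print(f"head dim:             {head_dim}")
print(f"tokens seen:          {tokens:,}")
```

Under these assumptions the script reproduces the table entries: 32,973,824 embedding parameters, 80,138,192 non-embedding parameters, a head dimension of 64, and 512,000,000 tokens (about 0.51B).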