---
language: en
license: apache-2.0
tags:
- causal-lm
- research
- fp8
- attention
- normalization
- neollm
datasets:
- HuggingFaceFW/fineweb-edu
---

# NeoLLM

NeoLLM is a **113 M parameter** decoder-only language model trained from scratch on
[FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) in **FP8**
precision, completing training in approximately **1 hour 15 minutes** on a single NVIDIA RTX 5090.
It integrates a collection of recently published attention and normalization techniques
into a single architecture, with the goal of studying how they interact during
pretraining. The model is actively being developed and the current checkpoint represents
an intermediate training state.

> **Author / contact:** [@Kyokopom](https://x.com/Kyokopom) on X
> **Repository:** [KitsuVp/NeoLLM](https://huggingface.co/KitsuVp/NeoLLM)

---

## Architecture

NeoLLM is a decoder-only transformer with the following configuration:

| Parameter | Value |
|---|---|
| Hidden size | 512 |
| Layers | 12 |
| Attention heads | 8 |
| KV heads (GQA) | 4 |
| Head dim | 64 |
| Intermediate size | 1536 |
| Vocabulary | Qwen3 tokenizer (64,402 tokens) |
| Context length | 512 tokens |

### Parameter breakdown

| Parameter bucket | Count |
|---|---|
| **Total parameters** | 113.11M (113,112,016) |
| **Embedding parameters** (tied) | 32.97M (32,973,824) |
| **Non-embedding parameters** | 80.14M (80,138,192) |
| **Effective trainable parameters** | 113.11M (113,112,016) |

> Weight tying is **enabled**: the input embedding matrix and the language-model head
> share the same parameters, so the 32.97M embedding parameters are counted once and all
> 113.11M parameters are trainable. Excluding the embedding leaves `total − embed = 80.14M`.

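The counts above can be cross-checked from the configuration table. The snippet below is only a sanity check, assuming the tied embedding matrix has shape vocabulary × hidden size; the variable names are illustrative and not taken from the NeoLLM code.

```python
# Cross-check of the parameter table (values copied from this card).
vocab_size = 64_402        # tokenizer size listed in the configuration table
hidden_size = 512
total_params = 113_112_016

# With weight tying, the embedding matrix is stored and counted once.
embedding_params = vocab_size * hidden_size
non_embedding_params = total_params - embedding_params

print(f"embedding:     {embedding_params:,}")      # 32,973,824
print(f"non-embedding: {non_embedding_params:,}")  # 80,138,192
```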
### Integrated techniques

Each layer combines the following mechanisms simultaneously.

**Normalization and residual stream**

- **SeeDNorm** ([arXiv:2510.22777](https://arxiv.org/abs/2510.22777)) — Applied to Q and K
  projections. Dynamically rescales the normalization based on the input's own statistics,
  making the attention geometry more stable across varying input distributions.
- **PolyNorm** ([arXiv:2411.03884](https://arxiv.org/abs/2411.03884)) — Replaces the standard
  MLP activation with three branches, linear (x), quadratic (x²), and cubic (x³), each
  normalized and combined with learned weights. This allows the MLP to express both linear
  and non-linear relationships simultaneously.
- **GPAS** ([arXiv:2506.22049](https://arxiv.org/abs/2506.22049)) — Gradient-Preserving
  Activation Scaling. Applied to residual connections between sublayers; helps gradients
  flow more cleanly during training without distorting the residual stream.
- **LayerNorm Scaling / LNS** ([arXiv:2502.05795](https://arxiv.org/abs/2502.05795)) — Each
  layer's output is scaled by 1/√ℓ where ℓ is the layer index. Directly addresses the
  "Curse of Depth" in Pre-LN transformers.

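As a rough illustration of two of the mechanisms above, the NumPy sketch below implements a PolyNorm-style activation and the LNS scale factor. It assumes RMS normalization for each polynomial branch, scalar branch weights, and a 1-based layer index; the function names are hypothetical and the actual NeoLLM modules may differ.

```python
import numpy as np

def rms_normalize(x, eps=1e-6):
    """Scale x so its root-mean-square over the last axis is close to 1."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def polynorm(x, weights, bias=0.0):
    """Weighted sum of the normalized x, x**2 and x**3 branches (assumed form)."""
    branches = [rms_normalize(x ** p) for p in (1, 2, 3)]
    return sum(w * b for w, b in zip(weights, branches)) + bias

def lns_scale(h, layer_index):
    """LayerNorm Scaling factor: multiply by 1/sqrt(layer_index), layer_index >= 1."""
    return h / np.sqrt(layer_index)

# Usage on a toy activation vector.
x = np.array([[1.0, -2.0, 3.0, -4.0]])
y = polynorm(x, weights=(1 / 3, 1 / 3, 1 / 3))
print(y.shape)                               # (1, 4)
print(lns_scale(np.ones(4), layer_index=4))  # [0.5 0.5 0.5 0.5]
```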
**Attention mechanisms**

- **FAN** ([arXiv:2502.21309](https://arxiv.org/abs/2502.21309)) — Fourier Analysis Networks.
  A portion of the input projection channels are dedicated to representing periodic patterns
  (cosine/sine pairs), while the remainder handle standard linear content.
- **MEA** ([arXiv:2601.19611](https://arxiv.org/abs/2601.19611)) — Explicit Multi-head
  Attention. Adds small learnable interaction matrices between attention heads for K and V.
- **LUCID** ([arXiv:2602.10410](https://arxiv.org/abs/2602.10410)) — Applies a learned
  lower-triangular preconditioner to V before attention, decorrelating value representations
  across positions.
- **Affine-Scaled Attention** ([arXiv:2602.23057](https://arxiv.org/abs/2602.23057)) — Adds
  two learnable per-head scalars (α and β) to the softmax weights:
  `[α·softmax(QKᵀ) + β]·V`.
- **XSA** ([arXiv:2603.09078](https://arxiv.org/abs/2603.09078)) — Exclusive Self Attention.
  After computing attention, removes the component of the output aligned with the token's
  own value vector.
- **Directional Routing** ([arXiv:2603.14923](https://arxiv.org/abs/2603.14923)) — Each head
  learns K=4 directions in the output space; a learned router suppresses the attention output
  along each direction per input.
- **Gated Attention** ([arXiv:2505.06708](https://arxiv.org/abs/2505.06708)) — A sigmoid gate
  is applied to the attention output before the output projection, introducing non-linearity
  and preventing attention sinks.
- **Momentum Attention** ([arXiv:2602.04902](https://arxiv.org/abs/2602.04902)) — Modifies Q
  and K by subtracting a fraction of the previous position's Q and K values (causal
  first-difference).

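The affine-scaled formula and the XSA projection step can be sketched for a single head as follows. This is a minimal NumPy illustration, not the NeoLLM implementation: it assumes the usual 1/√d score scaling and a causal mask, applies β only at visible positions, and uses hypothetical function names.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax over positions j <= i; future positions get weight 0."""
    t = scores.shape[-1]
    mask = np.tril(np.ones((t, t), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True), mask

def affine_scaled_attention(q, k, v, alpha, beta):
    """Single head: [alpha * softmax(q k^T / sqrt(d)) + beta] v."""
    d = q.shape[-1]
    weights, mask = causal_softmax(q @ k.T / np.sqrt(d))
    # Assumption: beta is added only at visible (causal) positions.
    weights = np.where(mask, alpha * weights + beta, 0.0)
    return weights @ v

def exclude_self(out, v, eps=1e-6):
    """XSA-style step: remove the component of out[i] parallel to v[i]."""
    coeff = (out * v).sum(-1, keepdims=True) / ((v * v).sum(-1, keepdims=True) + eps)
    return out - coeff * v

# Usage: alpha=1, beta=0 reduces to plain causal attention.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 64)) for _ in range(3))
plain = affine_scaled_attention(q, k, v, alpha=1.0, beta=0.0)
out = exclude_self(plain, v)
print(np.abs((out * v).sum(-1)).max())  # close to 0: nothing left along v[i]
```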
**MLP**

- **Learnable Multipliers** ([arXiv:2601.04890](https://arxiv.org/abs/2601.04890)) — Adds
  per-row and per-column learnable scalar parameters to each linear layer.
- **SimpleGPT** ([arXiv:2602.01212](https://arxiv.org/abs/2602.01212)) — A normalization
  strategy derived from second-order geometry analysis, applied inside MLP projections to
  improve optimization stability.

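One way to read the Learnable Multipliers description is as a diagonal rescaling on each side of the weight matrix. The sketch below assumes a weight of shape (out_features, in_features) with one scalar per row and one per column; the names `scaled_linear`, `row_scale`, and `col_scale` are illustrative, and how the multipliers are initialized or regularized in NeoLLM is not specified here.

```python
import numpy as np

def scaled_linear(x, weight, row_scale, col_scale):
    """Linear layer with per-row and per-column multipliers (assumed form)."""
    # Effective weight: diag(row_scale) @ weight @ diag(col_scale), via broadcasting.
    effective = row_scale[:, None] * weight * col_scale[None, :]
    return x @ effective.T

rng = np.random.default_rng(0)
hidden, intermediate = 512, 1536  # sizes from the configuration table
weight = rng.standard_normal((intermediate, hidden)) * 0.02
x = rng.standard_normal((4, hidden))

# With all multipliers equal to 1, the layer is an ordinary linear map.
ones_out = scaled_linear(x, weight, np.ones(intermediate), np.ones(hidden))
print(np.allclose(ones_out, x @ weight.T))  # True

extra_params = intermediate + hidden  # one scalar per row plus one per column
print(extra_params)                   # 2048
```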
---

## Training

| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu (sample-10BT) |
| Tokens seen | ~0.51B (15,625 steps × batch 64 × length 512) |
| Precision | FP8 native (E4M3 weights/activations, E5M2 gradients) + BF16 fallback |
| Optimizer | Conda (Column-Normalized Adam) + GPA |
| Learning rate | 6e-04 with linear warmup (10 % of steps) |
| Weight decay | 0.1 |
| Training time | ~1h 15m |
| Hardware | NVIDIA RTX 5090 (single GPU) |

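The token count and warmup length follow directly from the table. The sketch below reproduces that arithmetic and a linear warmup schedule; the card does not state what happens after warmup, so holding the rate constant is purely an assumption, and `lr_at` is a hypothetical helper.

```python
steps = 15_625
batch_size = 64              # sequences per step
seq_len = 512                # tokens per sequence
peak_lr = 6e-4
warmup_steps = steps // 10   # 10 % of steps, rounded down

tokens_seen = steps * batch_size * seq_len

def lr_at(step):
    """Linear warmup to peak_lr, then constant (post-warmup shape is an assumption)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

print(f"{tokens_seen:,}")  # 512,000,000
print(warmup_steps)        # 1562
```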
### Training curve

| Step | Train Loss | Val Loss |
|---|---|---|
| 5,000 | 4.098 | 4.042 |
| 10,000 | 3.793 | 3.750 |
| 15,000 | 3.665 | 3.614 |
| 15,625 | — | 3.607 |

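For readers who prefer perplexity, the validation losses can be converted with an exponential. This assumes the reported loss is the mean per-token cross-entropy in nats, which the card does not state explicitly.

```python
import math

# Validation losses copied from the training-curve table.
val_loss = {5_000: 4.042, 10_000: 3.750, 15_000: 3.614, 15_625: 3.607}

for step, loss in val_loss.items():
    # Perplexity is exp(mean cross-entropy) when the loss is measured in nats.
    print(f"step {step:>6,}: val perplexity ~ {math.exp(loss):.1f}")
```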
---

## Limitations

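- **Token budget** — ~0.51 B tokens seen, below the estimated optimum (a common rule of thumb
  of about 20 tokens per parameter would suggest roughly 2.3 B tokens for this model size).
  Knowledge-intensive tasks are expected to improve with more training.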
- **Gradient spike at step 40k** — The spike reorganized the attention pattern in layer 9 that
  previously captured long-range token correlations. A checkpoint from ~step 38k is expected
  to have better aggregate benchmark scores.
- **PolyNorm exclusivity** — The quadratic branch has become partially redundant with the
  linear branch. This will be corrected in the next training run.
- **Base model only** — Not instruction-tuned or aligned; purely a next-token-prediction
  base model.

---

## References

All papers whose techniques are integrated into NeoLLM's architecture:

| Technique | Paper title | arXiv |
|---|---|---|
| SeeDNorm | Self-Rescaled Dynamic Normalization | [2510.22777](https://arxiv.org/abs/2510.22777) |
| MEA | Explicit Multi-head Attention | [2601.19611](https://arxiv.org/abs/2601.19611) |
| Learnable Multipliers | Freeing the Scale of Language Model Matrix Layers | [2601.04890](https://arxiv.org/abs/2601.04890) |
| Directional Routing | Directional Routing in Transformers | [2603.14923](https://arxiv.org/abs/2603.14923) |
| XSA | Exclusive Self Attention | [2603.09078](https://arxiv.org/abs/2603.09078) |
| Gated Attention | Gated Attention for LLMs | [2505.06708](https://arxiv.org/abs/2505.06708) |
| Affine-Scaled Attention | Affine-Scaled Attention | [2602.23057](https://arxiv.org/abs/2602.23057) |
| LNS | The Curse of Depth in LLMs | [2502.05795](https://arxiv.org/abs/2502.05795) |
| LUCID | Attention with Preconditioned Representations | [2602.10410](https://arxiv.org/abs/2602.10410) |
| FAN | Fourier Analysis Networks | [2502.21309](https://arxiv.org/abs/2502.21309) |
| SimpleGPT | SimpleGPT | [2602.01212](https://arxiv.org/abs/2602.01212) |
| GPAS | Gradient-Preserving Activation Scaling | [2506.22049](https://arxiv.org/abs/2506.22049) |
| PolyNorm | PolyNorm / PolyCom | [2411.03884](https://arxiv.org/abs/2411.03884) |
| Momentum Attention | Momentum Attention | [2602.04902](https://arxiv.org/abs/2602.04902) |
| TWEO (analysis ref.) | Transformers Without Extreme Outliers | [2511.23225](https://arxiv.org/abs/2511.23225) |

---

## Citation

```bibtex
@misc{neollm2026,
  title  = {NeoLLM: A Research Language Model Integrating Recent Attention and Normalization Techniques},
  author = {KitsuVp},
  year   = {2026},
  url    = {https://huggingface.co/KitsuVp/NeoLLM}
}
```

---

## Author

[@Kyokopom](https://x.com/Kyokopom) on X

---

## License

Apache 2.0