GMT v7 Base
Graph Memory Transformer (GMT) v7 is a decoder-only language model in which every feed-forward network (FFN) sublayer is replaced by a graph-structured memory cell. Causal self-attention is preserved; the per-token FFN transformation is substituted with memory navigation over a learned bank of centroids connected by a directed transition matrix. This substitution aims to make the relations inside the transformer more explicit and easier to interpret.
Relevance & Potential.
Transformers dominate modern AI, yet their internal computations remain opaque — FFN sublayers act as unstructured function approximators with no clear semantic structure. By replacing FFNs with a navigable graph of memory states, GMT turns each layer's transformation into a traceable path over discrete, inspectable nodes and edges. This opens the door to mechanistic interpretability at scale: understanding what a layer computes by observing where it routes information. Beyond interpretability, graph-structured memory offers a principled route toward more data-efficient learning, structured knowledge retrieval, and models whose reasoning can be audited rather than reverse-engineered.
Paper (arXiv:2604.23862) | Code
Model Details
| Property | Value |
|---|---|
| Architecture | Decoder-only Transformer, FFN replaced by memory cells |
| Parameters | 82.2M |
| Layers / Heads / Hidden | 16 / 12 / 768 |
| Memory slots per layer | 128 (2,048 total across network) |
| Navigation dimension | 128 |
| Vocabulary size | 50,257 (GPT-2 tokenizer) |
| Context length | 1,024 |
| Tied embeddings | Yes |
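For orientation, the table above can be collected into a plain configuration object. This is a hypothetical sketch; the field names (`n_layers`, `n_slots`, `d_nav`, and so on) are ours, not identifiers from the released code.

```python
from dataclasses import dataclass

@dataclass
class GMTConfig:
    # Values copied from the Model Details table; field names are hypothetical.
    n_layers: int = 16
    n_heads: int = 12
    d_model: int = 768
    n_slots: int = 128         # memory slots per layer (16 layers x 128 = 2,048 total)
    d_nav: int = 128           # navigation dimension
    vocab_size: int = 50_257   # GPT-2 tokenizer
    context_length: int = 1_024
    tied_embeddings: bool = True
```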
How it works. Each block's memory cell performs three operations on a normalized token state:
- Source routing — gravitational soft assignment over 128 layer-local centroids (inverse-distance weighting)
- Graph traversal + target selection — one-hop diffusion through a learned 128×128 directed edge matrix, refined by token-conditioned query-key scores
- Gated displacement readout — the cell returns σ(g) · LayerNorm(target − source), i.e. movement from the source toward the target memory state, not a retrieved value
The model has zero dense FFN sublayers. Centroids are maintained online via EMA write-back and periodic dead-centroid reset / similarity merging.
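A minimal PyTorch sketch of one memory cell, following the three operations above, is given below. The class and attribute names (`GraphMemoryCell`, `to_nav`, `edges`, `gate`), the exact inverse-distance exponent, the way the query-key scores are combined with the one-hop prior, and the EMA decay value are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphMemoryCell(nn.Module):
    """Sketch of a GMT-style memory cell; details beyond the card's description are assumptions."""

    def __init__(self, d_model=768, n_slots=128, d_nav=128, eps=1e-6):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_slots, d_nav))  # layer-local memory bank
        self.edges = nn.Parameter(torch.zeros(n_slots, n_slots))    # learned directed transition logits
        self.to_nav = nn.Linear(d_model, d_nav)                     # project token state into navigation space
        self.query = nn.Linear(d_model, d_nav)                      # token-conditioned query for target scoring
        self.gate = nn.Linear(d_model, 1)                           # scalar gate g
        self.from_nav = nn.Linear(d_nav, d_model)                   # map the displacement back to model width
        self.norm = nn.LayerNorm(d_nav)
        self.eps = eps

    def forward(self, x):  # x: (batch, seq, d_model), already normalized by the block
        nav = self.to_nav(x)                                           # (B, T, d_nav)
        # 1) Source routing: inverse-distance soft assignment over the layer-local centroids
        dist = torch.cdist(nav, self.centroids.expand(nav.size(0), -1, -1))
        w_src = 1.0 / (dist + self.eps)
        w_src = w_src / w_src.sum(-1, keepdim=True)                    # (B, T, n_slots)
        source = w_src @ self.centroids                                # soft source memory state
        # 2) One-hop diffusion through the directed edge matrix, refined by query-key scores
        hop = w_src @ F.softmax(self.edges, dim=-1)                    # distribution after one hop
        qk = self.query(x) @ self.centroids.t() / self.centroids.size(-1) ** 0.5
        w_tgt = F.softmax(torch.log(hop + self.eps) + qk, dim=-1)      # combine diffusion prior with token scores
        target = w_tgt @ self.centroids                                # soft target memory state
        # 3) Gated displacement readout: movement from source toward target, not a retrieved value
        disp = self.norm(target - source)
        return self.from_nav(torch.sigmoid(self.gate(x)) * disp)

    @torch.no_grad()
    def ema_write_back(self, nav, w_src, decay=0.99):
        # Online EMA write-back of centroids toward the states that routed to them
        # (the decay value and weighting scheme are assumptions).
        flat_w, flat_nav = w_src.flatten(0, 1), nav.flatten(0, 1)
        batch_mean = (flat_w.t() @ flat_nav) / (flat_w.sum(0, keepdim=True).t() + self.eps)
        self.centroids.mul_(decay).add_((1.0 - decay) * batch_mean)
```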
Training
| Property | Value |
|---|---|
| Dataset | OpenWebText (~3B tokens, 2 epochs) |
| Optimizer | AdamW (lr=3e-4, weight decay=0.1, β=(0.9, 0.95)) |
| Scheduler | Cosine decay, 2,000 warmup steps |
| Effective batch | 270,336 tokens (8 × 33 accum × 1,024 seq) |
| Precision | bfloat16 mixed precision |
| Gradient clipping | Norm 1.0 |
| Auxiliary losses | Tracking, centroid orthogonality, usage clustering, edge entropy, edge contrast |
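For concreteness, a hedged sketch of the optimizer and schedule implied by the table is shown below; the hyperparameters are taken from the table, while the helper name `build_optimizer` and the linear warmup shape are assumptions.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, total_steps, warmup_steps=2_000):
    # Hyperparameters from the training table; the warmup shape (linear) is an assumption.
    opt = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1, betas=(0.9, 0.95))

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                      # warmup phase
        t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(t, 1.0)))        # cosine decay

    return opt, LambdaLR(opt, lr_lambda)

# Effective batch: 8 sequences x 33 gradient-accumulation steps x 1,024 tokens = 270,336 tokens per update.
```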
Results
Evaluated at best validation-loss checkpoint (step 18,310). Baseline is a 103.0M-parameter dense GPT-2 with matched depth/hidden/heads.
| Benchmark | Metric | GMT v7 | Baseline GPT-2 | Δ |
|---|---|---|---|---|
| OpenWebText (val) | Perplexity | 36.58 | 26.85 | +9.73 |
| ARC-Easy (0-shot) | Accuracy (raw) | 37.0% | 38.9% | −1.9 pp |
| HellaSwag (0-shot) | Accuracy | 26.7% | 26.9% | −0.2 pp |
| PIQA (0-shot) | Accuracy | 57.8% | 59.5% | −1.7 pp |
| WinoGrande (0-shot) | Accuracy | 51.5% | 50.5% | +1.0 pp |
GMT v7 operates at a roughly 20% parameter disadvantage (82.2M vs. 103.0M). The benchmark gaps are consistent with this disadvantage, and the WinoGrande result (a pronoun-resolution task requiring commonsense referent disambiguation) suggests the memory graph may be well suited to structured association retrieval.
Intended Use & Limitations
- Research prototype. This model demonstrates that FFN sublayers can be replaced by inspectable graph-memory cells without breaking training stability. It is not intended as a production language model.
- The model remains behind the larger dense baseline in validation loss and most benchmarks.
- Results are from a single run on a single checkpoint; broader scaling and more extensive evaluation are left for future work.
Citation
@article{zanarini2026graphmemorytransformer,
  title={Graph Memory Transformer},
  author={Zanarini, Nicola and Ferrari, Niccol{\`o}},
  journal={arXiv preprint arXiv:2604.23862},
  year={2026}
}
License
MIT — Copyright (c) 2026 Nicola Zanarini and Niccolò Ferrari.
Authors
Nicola Zanarini & Niccolò Ferrari