GMT v7 Base
Graph Memory Transformer (GMT) v7 is a decoder-only language model in which every feed-forward network (FFN) sublayer is replaced by a graph-structured memory cell. Causal self-attention is preserved; the per-token FFN transformation is substituted with memory navigation over a learned bank of centroids connected by a directed transition matrix. This substitution aims to make the relations inside the transformer more explicit and easier to interpret.
Relevance & Potential.
Transformers dominate modern AI, yet their internal computations remain opaque — FFN sublayers act as unstructured function approximators with no clear semantic structure. By replacing FFNs with a navigable graph of memory states, GMT turns each layer's transformation into a traceable path over discrete, inspectable nodes and edges. This opens the door to mechanistic interpretability at scale: understanding what a layer computes by observing where it routes information. Beyond interpretability, graph-structured memory offers a principled route toward more data-efficient learning, structured knowledge retrieval, and models whose reasoning can be audited rather than reverse-engineered.
Paper (arXiv:2604.23862) | Code
Model Details
| Property | Value |
|---|---|
| Architecture | Decoder-only Transformer, FFN replaced by memory cells |
| Parameters | 82.2M |
| Layers / Heads / Hidden | 16 / 12 / 768 |
| Memory slots per layer | 128 (2,048 total across network) |
| Navigation dimension | 128 |
| Vocabulary size | 50,257 (GPT-2 tokenizer) |
| Context length | 1,024 |
| Tied embeddings | Yes |
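For orientation, the table above can be collected into a plain configuration object. This is a hypothetical sketch; the field names (`n_layers`, `n_slots`, `d_nav`, and so on) are ours, not identifiers from the released code.

```python
from dataclasses import dataclass

@dataclass
class GMTConfig:
    # Values copied from the Model Details table; field names are hypothetical.
    n_layers: int = 16
    n_heads: int = 12
    d_model: int = 768
    n_slots: int = 128         # memory slots per layer (16 layers x 128 = 2,048 total)
    d_nav: int = 128           # navigation dimension
    vocab_size: int = 50_257   # GPT-2 tokenizer
    context_length: int = 1_024
    tied_embeddings: bool = True
```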
How it works. Each block's memory cell performs three operations on a normalized token state:
- Source routing — gravitational soft assignment over 128 layer-local centroids (inverse-distance weighting)
- Graph traversal + target selection — one-hop diffusion through a learned 128×128 directed edge matrix, refined by token-conditioned query-key scores
- Gated displacement readout — the cell returns σ(g) · LayerNorm(target − source), i.e. movement from the source toward the target memory state, not a retrieved value
The model has zero dense FFN sublayers. Centroids are maintained online via EMA write-back and periodic dead-centroid reset / similarity merging.
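A minimal PyTorch sketch of one memory cell, following the three operations above, is given below. The class and attribute names (`GraphMemoryCell`, `to_nav`, `edges`, `gate`), the exact inverse-distance exponent, the way the query-key scores are combined with the one-hop prior, and the EMA decay value are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphMemoryCell(nn.Module):
    """Sketch of a GMT-style memory cell; details beyond the card's description are assumptions."""

    def __init__(self, d_model=768, n_slots=128, d_nav=128, eps=1e-6):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_slots, d_nav))  # layer-local memory bank
        self.edges = nn.Parameter(torch.zeros(n_slots, n_slots))    # learned directed transition logits
        self.to_nav = nn.Linear(d_model, d_nav)                     # project token state into navigation space
        self.query = nn.Linear(d_model, d_nav)                      # token-conditioned query for target scoring
        self.gate = nn.Linear(d_model, 1)                           # scalar gate g
        self.from_nav = nn.Linear(d_nav, d_model)                   # map the displacement back to model width
        self.norm = nn.LayerNorm(d_nav)
        self.eps = eps

    def forward(self, x):  # x: (batch, seq, d_model), already normalized by the block
        nav = self.to_nav(x)                                           # (B, T, d_nav)
        # 1) Source routing: inverse-distance soft assignment over the layer-local centroids
        dist = torch.cdist(nav, self.centroids.expand(nav.size(0), -1, -1))
        w_src = 1.0 / (dist + self.eps)
        w_src = w_src / w_src.sum(-1, keepdim=True)                    # (B, T, n_slots)
        source = w_src @ self.centroids                                # soft source memory state
        # 2) One-hop diffusion through the directed edge matrix, refined by query-key scores
        hop = w_src @ F.softmax(self.edges, dim=-1)                    # distribution after one hop
        qk = self.query(x) @ self.centroids.t() / self.centroids.size(-1) ** 0.5
        w_tgt = F.softmax(torch.log(hop + self.eps) + qk, dim=-1)      # combine diffusion prior with token scores
        target = w_tgt @ self.centroids                                # soft target memory state
        # 3) Gated displacement readout: movement from source toward target, not a retrieved value
        disp = self.norm(target - source)
        return self.from_nav(torch.sigmoid(self.gate(x)) * disp)

    @torch.no_grad()
    def ema_write_back(self, nav, w_src, decay=0.99):
        # Online EMA write-back of centroids toward the states that routed to them
        # (the decay value and weighting scheme are assumptions).
        flat_w, flat_nav = w_src.flatten(0, 1), nav.flatten(0, 1)
        batch_mean = (flat_w.t() @ flat_nav) / (flat_w.sum(0, keepdim=True).t() + self.eps)
        self.centroids.mul_(decay).add_((1.0 - decay) * batch_mean)
```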
Training
| Property | Value |
|---|---|
| Dataset | OpenWebText (~3B tokens, 2 epochs) |
| Optimizer | AdamW (lr=3e-4, weight decay=0.1, β=(0.9, 0.95)) |
| Scheduler | Cosine decay, 2,000 warmup steps |
| Effective batch | 270,336 tokens (8 × 33 accum × 1,024 seq) |
| Precision | bfloat16 mixed precision |
| Gradient clipping | Norm 1.0 |
| Auxiliary losses | Tracking, centroid orthogonality, usage clustering, edge entropy, edge contrast |
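For concreteness, a hedged sketch of the optimizer and schedule implied by the table is shown below; the hyperparameters are taken from the table, while the helper name `build_optimizer` and the linear warmup shape are assumptions.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, total_steps, warmup_steps=2_000):
    # Hyperparameters from the training table; the warmup shape (linear) is an assumption.
    opt = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1, betas=(0.9, 0.95))

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                      # warmup phase
        t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(t, 1.0)))        # cosine decay

    return opt, LambdaLR(opt, lr_lambda)

# Effective batch: 8 sequences x 33 gradient-accumulation steps x 1,024 tokens = 270,336 tokens per update.
```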
Results
Evaluated at best validation-loss checkpoint (step 18,310). Baseline is a 103.0M-parameter dense GPT-2 with matched depth/hidden/heads.
| Benchmark | Metric | GMT v7 | Baseline GPT-2 | Δ |
|---|---|---|---|---|
| OpenWebText (val) | Perplexity | 36.58 | 26.85 | +9.73 |
| ARC-Easy (0-shot) | Accuracy (raw) | 37.0% | 38.9% | −1.9 pp |
| HellaSwag (0-shot) | Accuracy | 26.7% | 26.9% | −0.2 pp |
| PIQA (0-shot) | Accuracy | 57.8% | 59.5% | −1.7 pp |
| WinoGrande (0-shot) | Accuracy | 51.5% | 50.5% | +1.0 pp |
GMT v7 operates at a roughly 20% parameter disadvantage (82.2M vs. 103.0M). The benchmark gaps are consistent with this disadvantage, and the WinoGrande result (a pronoun-resolution task requiring commonsense referent disambiguation) suggests the memory graph may be well suited to structured association retrieval.
Intended Use & Limitations
- Research prototype. This model demonstrates that FFN sublayers can be replaced by inspectable graph-memory cells without breaking training stability. It is not intended as a production language model.
- The model remains behind the larger dense baseline in validation loss and most benchmarks.
- Results are from a single run on a single checkpoint; broader scaling and more extensive evaluation are left for future work.
Citation
@article{zanarini2026graphmemorytransformer,
  title={Graph Memory Transformer},
  author={Zanarini, Nicola and Ferrari, Niccol{\`o}},
  journal={arXiv preprint arXiv:2604.23862},
  year={2026}
}
License
MIT — Copyright (c) 2026 Nicola Zanarini and Niccolò Ferrari.
Authors
Nicola Zanarini & Niccolò Ferrari