---
language: en
license: mit
library_name: transformers
tags:
- english
- graph-memory-transformer
- memory-augmented
- open-webtext
- decoder-only
metrics:
- perplexity
- accuracy
model-index:
- name: gmt-v7-base
  results:
  - task:
      type: language-modeling
      name: Language Modeling
    dataset:
      name: OpenWebText (validation)
      type: open-webtext
    metrics:
    - type: perplexity
      value: 36.58
      name: Perplexity
  - task:
      type: text-classification
      name: ARC-Easy
    dataset:
      name: AI2 Reasoning Challenge (Easy)
      type: allenai/ai2_arc
    metrics:
    - type: accuracy
      value: 0.3704
      name: Accuracy (raw, 0-shot)
  - task:
      type: text-classification
      name: WinoGrande
    dataset:
      name: WinoGrande
      type: winogrande
    metrics:
    - type: accuracy
      value: 0.5146
      name: Accuracy (0-shot)
datasets:
- Skylion007/openwebtext
---
# GMT v7 Base

**Graph Memory Transformer (GMT) v7** is a decoder-only language model in which every feed-forward network (FFN) sublayer is replaced by a **graph-structured memory cell**. Causal self-attention is preserved; the per-token FFN transformation is replaced by memory navigation over a learned bank of centroids connected by a directed transition matrix. The substitution aims to make the relations inside the transformer more explicit and easier to interpret.

**Relevance & Potential.** Transformers dominate modern AI, yet their internal computations remain opaque: FFN sublayers act as unstructured function approximators with no clear semantic structure. By replacing FFNs with a navigable graph of memory states, GMT turns each layer's transformation into a traceable path over discrete, inspectable nodes and edges. This opens the door to mechanistic interpretability at scale: understanding *what* a layer computes by observing *where* it routes information. Beyond interpretability, graph-structured memory may offer a principled route toward more data-efficient learning, structured knowledge retrieval, and models whose reasoning can be audited rather than reverse-engineered.

[Paper (arXiv:2604.23862)](https://arxiv.org/abs/2604.23862) | [Code](https://github.com/Nemesis533/GMT-GraphMemoryTransformer)
## Model Details

| Property | Value |
|---|---|
| **Architecture** | Decoder-only Transformer, FFN replaced by memory cells |
| **Parameters** | 82.2M |
| **Layers / Heads / Hidden** | 16 / 12 / 768 |
| **Memory slots per layer** | 128 (2,048 total across network) |
| **Navigation dimension** | 128 |
| **Vocabulary size** | 50,257 (GPT-2 tokenizer) |
| **Context length** | 1,024 |
| **Tied embeddings** | Yes |
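
The table can be mirrored as a small configuration object to sanity-check the derived numbers. This is only an illustrative sketch: the class `GMTConfig` and its field names are hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GMTConfig:
    """Hypothetical container for the hyperparameters listed in the table."""

    n_layers: int = 16
    n_heads: int = 12
    hidden_size: int = 768
    memory_slots_per_layer: int = 128
    navigation_dim: int = 128
    vocab_size: int = 50_257
    context_length: int = 1_024
    tie_embeddings: bool = True

    @property
    def head_dim(self) -> int:
        # The hidden size must split evenly across the attention heads.
        assert self.hidden_size % self.n_heads == 0
        return self.hidden_size // self.n_heads

    @property
    def total_memory_slots(self) -> int:
        # One bank of layer-local centroids per block.
        return self.n_layers * self.memory_slots_per_layer

    @property
    def token_embedding_params(self) -> int:
        # Counted once because input and output embeddings are tied.
        return self.vocab_size * self.hidden_size


cfg = GMTConfig()
print(cfg.head_dim, cfg.total_memory_slots, cfg.token_embedding_params)
```

The derived slot count matches the 2,048 quoted in the table. If the 82.2M figure includes the tied token embedding, that matrix alone would account for roughly 38.6M of the parameters.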

**How it works.** Each block's memory cell performs three operations on a normalized token state:

1. **Source routing**: gravitational soft assignment over 128 layer-local centroids (inverse-distance weighting)
2. **Graph traversal + target selection**: one-hop diffusion through a learned 128×128 directed edge matrix, refined by token-conditioned query-key scores
3. **Gated displacement readout**: the cell returns `σ(g) · LayerNorm(target − source)`, i.e. movement from source toward target memory state, not a retrieved value
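
The three steps above can be sketched in plain Python for a single token. This is a rough illustration under stated assumptions, not the released implementation: the function `memory_cell`, the row-normalized edge matrix, the way the diffusion prior and the query-key scores are combined (log-prior plus score, then softmax), the projection `w_q`, and the scalar `gate_logit` are all hypothetical choices made for the example.

```python
import math
import random


def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix(weights, vectors):
    """Weighted sum of vectors, one weight per vector."""
    return [sum(w * v[d] for w, v in zip(weights, vectors))
            for d in range(len(vectors[0]))]


def memory_cell(h, centroids, edges, w_q, keys, gate_logit, eps=1e-6):
    x = layer_norm(h)  # the cell operates on a normalized token state
    n = len(centroids)

    # 1. Source routing: inverse-distance soft assignment over the centroids.
    inv_dist = [1.0 / (math.dist(x, c) + eps) for c in centroids]
    total = sum(inv_dist)
    src_w = [w / total for w in inv_dist]

    # 2a. One-hop diffusion of the source weights along directed edges.
    hop = [sum(src_w[i] * edges[i][j] for i in range(n)) for j in range(n)]

    # 2b. Refinement with token-conditioned query-key scores (ASSUMED form).
    query = [sum(w * xi for w, xi in zip(row, x)) for row in w_q]
    scale = math.sqrt(len(query))
    qk = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    tgt_w = softmax([math.log(p + eps) + s for p, s in zip(hop, qk)])

    # 3. Gated displacement readout: sigmoid(g) * LayerNorm(target - source).
    source = mix(src_w, centroids)
    target = mix(tgt_w, centroids)
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    displacement = layer_norm([t - s for t, s in zip(target, source)])
    return [gate * d for d in displacement]


# Toy sizes for readability; the model card lists 128 slots per layer.
random.seed(0)
dim, slots = 8, 4
centroids = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(slots)]
edges = []
for _ in range(slots):
    row = [random.random() for _ in range(slots)]
    edges.append([r / sum(row) for r in row])  # ASSUMED row-stochastic edges
w_q = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(slots)]
h = [random.gauss(0, 1) for _ in range(dim)]

out = memory_cell(h, centroids, edges, w_q, keys, gate_logit=0.0)
print(len(out))
```

Note that the output is a direction of movement between two mixtures of memory states, scaled by the gate, which is what distinguishes this readout from a key-value lookup that returns a stored value.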

The model has **zero dense FFN sublayers**. Centroids are maintained online via exponential moving average (EMA) write-back, with periodic dead-centroid resets and similarity-based merging.
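
A minimal sketch of what EMA write-back and dead-centroid reset could look like is given below. It is an assumption-laden illustration rather than the authors' procedure: the function names, the `decay` value, the `min_usage` threshold, and the choice to re-seed dead centroids from a random token state are all hypothetical, and similarity merging is left out.

```python
import random


def ema_write_back(centroids, states, assignments, decay=0.99):
    """Move each centroid toward the weighted mean of the states routed to it.

    assignments[t][k] is the soft routing weight of token t on centroid k.
    Returns the per-centroid usage mass so callers can detect dead slots.
    """
    dim = len(centroids[0])
    usage = []
    for k, centroid in enumerate(centroids):
        mass = sum(a[k] for a in assignments)
        usage.append(mass)
        if mass == 0.0:
            continue  # nothing was routed here; leave the centroid untouched
        mean = [sum(a[k] * s[d] for a, s in zip(assignments, states)) / mass
                for d in range(dim)]
        for d in range(dim):
            centroid[d] = decay * centroid[d] + (1.0 - decay) * mean[d]
    return usage


def reset_dead_centroids(centroids, usage, states, min_usage=1e-3, rng=random):
    """Re-seed centroids whose usage fell below min_usage (ASSUMED policy)."""
    reset = []
    for k, mass in enumerate(usage):
        if mass < min_usage:
            centroids[k] = list(rng.choice(states))
            reset.append(k)
    return reset


# Toy example: both tokens route entirely to centroid 0, so centroid 1 is dead.
centroids = [[0.0, 0.0], [5.0, 5.0]]
states = [[2.0, 2.0], [4.0, 4.0]]
assignments = [[1.0, 0.0], [1.0, 0.0]]

usage = ema_write_back(centroids, states, assignments, decay=0.5)
reset = reset_dead_centroids(centroids, usage, states)
print(centroids[0], usage, reset)
```

With `decay=0.5`, centroid 0 moves halfway from the origin toward the mean of the two routed states, while centroid 1 receives no mass and is re-seeded.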
## Training

| Property | Value |
|---|---|
| **Dataset** | OpenWebText (~3B tokens, 2 epochs) |
| **Optimizer** | AdamW (lr=3e-4, weight decay=0.1, β=(0.9, 0.95)) |
| **Scheduler** | Cosine decay, 2,000 warmup steps |
| **Effective batch** | 270,336 tokens (8 × 33 accum × 1,024 seq) |
| **Precision** | bfloat16 mixed precision |
| **Gradient clipping** | Norm 1.0 |
| **Auxiliary losses** | Tracking, centroid orthogonality, usage clustering, edge entropy, edge contrast |
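
The effective batch size and the learning-rate schedule in the table can be reproduced with a few lines of Python. The sketch below makes several assumptions that the table does not state: warmup is taken to be linear, the schedule decays to a minimum learning rate of zero, and `TOTAL_STEPS` is a rough estimate derived from the approximate "~3B tokens, 2 epochs" figure.

```python
import math

MICRO_BATCH = 8       # sequences per forward pass (from the table)
GRAD_ACCUM = 33       # gradient accumulation steps (from the table)
SEQ_LEN = 1_024       # tokens per sequence (from the table)
PEAK_LR = 3e-4        # from the table
WARMUP_STEPS = 2_000  # from the table
MIN_LR = 0.0          # ASSUMPTION: the card does not give a floor

tokens_per_step = MICRO_BATCH * GRAD_ACCUM * SEQ_LEN

# ASSUMPTION: two passes over roughly 3B tokens, one optimizer update per step.
TOTAL_STEPS = round(2 * 3e9 / tokens_per_step)


def lr_at(step: int) -> float:
    """Linear warmup (assumed) followed by cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    span = max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(1.0, (step - WARMUP_STEPS) / span)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


print(tokens_per_step, TOTAL_STEPS)
for step in (0, WARMUP_STEPS, 10_000, TOTAL_STEPS):
    print(step, lr_at(step))
```

The product 8 × 33 × 1,024 gives the 270,336 tokens per step quoted in the table. Under the same one-update-per-step assumption, step 18,310 would correspond to about 4.95B tokens seen.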
## Results

Evaluated at the best validation-loss checkpoint (step 18,310). The baseline is a 103.0M-parameter dense GPT-2 with matched depth, hidden size, and head count.

| Benchmark | Metric | GMT v7 | Baseline GPT-2 | Δ |
|---|---|---|---|---|
| OpenWebText (val) | Perplexity | 36.58 | 26.85 | +9.73 |
| ARC-Easy (0-shot) | Accuracy (raw) | 37.0% | 38.9% | −1.9 pp |
| HellaSwag (0-shot) | Accuracy | 26.7% | 26.9% | −0.3 pp |
| PIQA (0-shot) | Accuracy | 57.8% | 59.5% | −1.7 pp |
| **WinoGrande (0-shot)** | **Accuracy** | **51.5%** | **50.5%** | **+1.0 pp** |

GMT v7 operates at a **20% parameter disadvantage** (82.2M vs 103.0M). The gaps are consistent with this disadvantage, while the result on WinoGrande, a pronoun-resolution task requiring commonsense referent disambiguation, suggests the memory graph may be particularly suited to structured association retrieval.
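
The parameter disadvantage and the size of the perplexity gap can be checked with a short calculation. The sketch assumes that the reported perplexity is the exponential of the mean per-token cross-entropy in nats, which is the usual convention but is not stated explicitly above; the variable names are illustrative.

```python
import math

gmt_params, baseline_params = 82.2e6, 103.0e6  # from the text above
gmt_ppl, baseline_ppl = 36.58, 26.85           # from the results table

# Fraction of the baseline's parameters that GMT v7 does without.
param_deficit = 1.0 - gmt_params / baseline_params

# ASSUMPTION: perplexity = exp(mean cross-entropy in nats per token).
gmt_loss = math.log(gmt_ppl)
baseline_loss = math.log(baseline_ppl)
loss_gap = gmt_loss - baseline_loss

print(f"parameter deficit: {param_deficit:.1%}")
print(f"validation loss gap: {loss_gap:.3f} nats/token")
```

Expressed as a loss difference rather than a perplexity difference, the gap is easier to compare across models with different absolute perplexities.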
## Intended Use & Limitations

- **Research prototype.** This model demonstrates that FFN sublayers can be replaced by inspectable graph-memory cells without breaking training stability. It is not intended as a production language model.
- The model remains behind the larger dense baseline in validation loss and most benchmarks.
- Results are from a single run on a single checkpoint; broader scaling and more extensive evaluation are left for future work.
## Citation

```bibtex
@article{zanarini2026graphmemorytransformer,
  title={Graph Memory Transformer},
  author={Zanarini, Nicola and Ferrari, Niccol{\`o}},
  journal={arXiv preprint arXiv:2604.23862},
  year={2026}
}
```
## License

[MIT](https://opensource.org/licenses/MIT). Copyright (c) 2026 Nicola Zanarini and Niccolò Ferrari.

## Authors

Nicola Zanarini & Niccolò Ferrari