---
language: en
license: mit
library_name: transformers
tags:
- english
- graph-memory-transformer
- memory-augmented
- open-webtext
- decoder-only
metrics:
- perplexity
- accuracy
model-index:
- name: gmt-v7-base
  results:
  - task:
      type: language-modeling
      name: Language Modeling
    dataset:
      name: OpenWebText (validation)
      type: open-webtext
    metrics:
    - type: perplexity
      value: 36.58
      name: Perplexity
  - task:
      type: text-classification
      name: ARC-Easy
    dataset:
      name: AI2 Reasoning Challenge (Easy)
      type: allenai/ai2_arc
    metrics:
    - type: accuracy
      value: 0.3704
      name: Accuracy (raw, 0-shot)
  - task:
      type: text-classification
      name: WinoGrande
    dataset:
      name: WinoGrande
      type: winogrande
    metrics:
    - type: accuracy
      value: 0.5146
      name: Accuracy (0-shot)
datasets:
- Skylion007/openwebtext
---
# GMT v7 Base
**Graph Memory Transformer (GMT) v7** is a decoder-only language model in which every feed-forward network (FFN) sublayer is replaced by a **graph-structured memory cell**. Causal self-attention is preserved; the per-token FFN transformation is replaced by memory navigation over a learned bank of centroids connected by a directed transition matrix. The aim is to make the relations inside the transformer more explicit and easier to interpret.
**Relevance & Potential.**
Transformers dominate modern AI, yet their internal computations remain opaque: FFN sublayers act as unstructured function approximators with no clear semantic structure. By replacing FFNs with a navigable graph of memory states, GMT turns each layer's transformation into a traceable path over discrete, inspectable nodes and edges. This opens the door to mechanistic interpretability at scale: understanding *what* a layer computes by observing *where* it routes information. Beyond interpretability, graph-structured memory may offer a route toward more data-efficient learning, structured knowledge retrieval, and models whose reasoning can be audited rather than reverse-engineered.
[Paper (arXiv:2604.23862)](https://arxiv.org/abs/2604.23862) | [Code](https://github.com/Nemesis533/GMT-GraphMemoryTransformer)
## Model Details
| Property | Value |
|---|---|
| **Architecture** | Decoder-only Transformer, FFN replaced by memory cells |
| **Parameters** | 82.2M |
| **Layers / Heads / Hidden** | 16 / 12 / 768 |
| **Memory slots per layer** | 128 (2,048 total across network) |
| **Navigation dimension** | 128 |
| **Vocabulary size** | 50,257 (GPT-2 tokenizer) |
| **Context length** | 1,024 |
| **Tied embeddings** | Yes |
**How it works.** Each block's memory cell performs three operations on a normalized token state:
1. **Source routing** — gravitational soft assignment over 128 layer-local centroids (inverse-distance weighting)
2. **Graph traversal + target selection** — one-hop diffusion through a learned 128×128 directed edge matrix, refined by token-conditioned query-key scores
3. **Gated displacement readout** — the cell returns `σ(g) · LayerNorm(target − source)`, i.e. movement from source toward target memory state, not a retrieved value
The model has **zero dense FFN sublayers**. Centroids are maintained online via EMA write-back and periodic dead-centroid reset / similarity merging.
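The three steps above can be sketched in NumPy. This is an illustrative sketch, not the released implementation: the projection matrices `W_nav`, `W_q`, `W_k`, the scalar per-token gate vector `w_gate`, the softmax normalization of the edge matrix, and the way the diffusion result is combined with query-key scores (log-probabilities added to scaled scores) are all assumptions made for illustration. EMA write-back and centroid reset/merging are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance (no learned scale/shift).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def memory_cell(h, centroids, edge_logits, W_nav, W_q, W_k, w_gate, eps=1e-6):
    """Hypothetical memory cell.

    h: (T, D) token states, centroids: (M, D), edge_logits: (M, M),
    W_nav / W_q / W_k: (D, N) assumed projections, w_gate: (D,) assumed gate.
    """
    x = layer_norm(h)

    # 1. Source routing: inverse-distance soft assignment over the centroids,
    #    computed in an assumed navigation space of dimension N.
    x_nav = x @ W_nav                                   # (T, N)
    c_nav = centroids @ W_nav                           # (M, N)
    dist2 = ((x_nav[:, None, :] - c_nav[None, :, :]) ** 2).sum(-1)  # (T, M)
    inv = 1.0 / (dist2 + eps)
    src_w = inv / inv.sum(-1, keepdims=True)            # rows sum to 1
    source = src_w @ centroids                          # (T, D)

    # 2. Graph traversal: one hop through a row-stochastic transition matrix,
    #    then refinement by token-conditioned query-key scores (assumed form).
    P = softmax(edge_logits, axis=-1)                   # (M, M)
    hop = src_w @ P                                     # (T, M)
    scores = (x @ W_q) @ (centroids @ W_k).T / np.sqrt(W_q.shape[1])
    tgt_w = softmax(np.log(hop + eps) + scores, axis=-1)
    target = tgt_w @ centroids                          # (T, D)

    # 3. Gated displacement readout: sigma(g) * LayerNorm(target - source).
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))          # (T,), scalar per token
    return gate[:, None] * layer_norm(target - source)

# Toy sizes for a quick check; the card's values are D=768, M=128, N=128.
rng = np.random.default_rng(0)
T, D, M, N = 5, 16, 8, 4
out = memory_cell(
    rng.normal(size=(T, D)),
    rng.normal(size=(M, D)),
    rng.normal(size=(M, M)),
    rng.normal(size=(D, N)),
    rng.normal(size=(D, N)),
    rng.normal(size=(D, N)),
    rng.normal(size=(D,)),
)
print(out.shape)  # (5, 16)
```

Note that the output is a displacement (a direction from the source mixture toward the target mixture), so the centroids act as waypoints rather than as retrieved values.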
## Training
| Property | Value |
|---|---|
| **Dataset** | OpenWebText (~3B tokens, 2 epochs) |
| **Optimizer** | AdamW (lr=3e-4, weight decay=0.1, β=(0.9, 0.95)) |
| **Scheduler** | Cosine decay, 2,000 warmup steps |
| **Effective batch** | 270,336 tokens (8 × 33 accum × 1,024 seq) |
| **Precision** | bfloat16 mixed precision |
| **Gradient clipping** | Norm 1.0 |
| **Auxiliary losses** | Tracking, centroid orthogonality, usage clustering, edge entropy, edge contrast |
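The effective batch size and the learning-rate schedule in the table can be checked with a short sketch. The leading factor 8 is read here as sequences per micro-batch, the warmup is assumed to be linear, and `total_steps` and `min_lr` are hypothetical parameters that the card does not specify.

```python
import math

MICRO_BATCH = 8     # assumed: sequences per micro-batch
GRAD_ACCUM = 33     # gradient accumulation steps (from the table)
SEQ_LEN = 1024      # tokens per sequence (from the table)

# Tokens consumed per optimizer step.
TOKENS_PER_STEP = MICRO_BATCH * GRAD_ACCUM * SEQ_LEN
print(TOKENS_PER_STEP)  # 270336

def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=2000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    total_steps and min_lr are illustrative assumptions, not card values.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```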
## Results
Evaluated at the best validation-loss checkpoint (step 18,310). The baseline is a 103.0M-parameter dense GPT-2 with matched depth/hidden/heads.
| Benchmark | Metric | GMT v7 | Baseline GPT-2 | Δ |
|---|---|---|---|---|
| OpenWebText (val) | Perplexity | 36.58 | 26.85 | +9.73 |
| ARC-Easy (0-shot) | Accuracy (raw) | 37.0% | 38.9% | −1.9 pp |
| HellaSwag (0-shot) | Accuracy | 26.7% | 26.9% | −0.3 pp |
| PIQA (0-shot) | Accuracy | 57.8% | 59.5% | −1.7 pp |
| **WinoGrande (0-shot)** | **Accuracy** | **51.5%** | **50.5%** | **+1.0 pp** |
GMT v7 operates at a **20% parameter disadvantage** (82.2M vs 103.0M). The gaps on perplexity, ARC-Easy, HellaSwag, and PIQA are consistent with this disadvantage. WinoGrande, a pronoun-resolution task requiring commonsense referent disambiguation, is the one benchmark where GMT leads (+1.0 pp). Although the margin is small, it suggests the memory graph may be particularly suited to structured association retrieval.
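The parameter gap and the perplexity gap can be put on a common footing with a few lines of arithmetic. This sketch assumes the reported perplexity is the exponential of the mean per-token cross-entropy in nats, which is the usual definition but is not stated explicitly in this card.

```python
import math

gmt_params, base_params = 82.2e6, 103.0e6
# Fraction of parameters GMT v7 lacks relative to the dense baseline (about 20%).
param_gap = 1.0 - gmt_params / base_params

gmt_ppl, base_ppl = 36.58, 26.85
# Under the assumption above, loss = ln(perplexity), in nats per token.
gmt_loss = math.log(gmt_ppl)
base_loss = math.log(base_ppl)
loss_gap = gmt_loss - base_loss

print(f"parameter gap: {param_gap:.1%}")
print(f"validation loss: GMT {gmt_loss:.3f} vs baseline {base_loss:.3f} nats/token")
print(f"loss gap: {loss_gap:.3f} nats/token")
```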
## Intended Use & Limitations
- **Research prototype.** This model demonstrates that FFN sublayers can be replaced by inspectable graph-memory cells without breaking training stability. It is not intended as a production language model.
- The model remains behind the larger dense baseline in validation loss and most benchmarks.
- Results are from a single run on a single checkpoint; broader scaling and more extensive evaluation are left for future work.
## Citation
```bibtex
@article{zanarini2026graphmemorytransformer,
title={Graph Memory Transformer},
author={Zanarini, Nicola and Ferrari, Niccol{\`o}},
journal={arXiv preprint arXiv:2604.23862},
year={2026}
}
```
## License
[MIT](https://opensource.org/licenses/MIT) — Copyright (c) 2026 Nicola Zanarini and Niccolò Ferrari.
## Authors
Nicola Zanarini & Niccolò Ferrari