---
language: en
license: mit
library_name: transformers
tags:
- english
- graph-memory-transformer
- memory-augmented
- open-webtext
- decoder-only
metrics:
- perplexity
- accuracy
model-index:
- name: gmt-v7-base
  results:
  - task:
      type: language-modeling
      name: Language Modeling
    dataset:
      name: OpenWebText (validation)
      type: open-webtext
    metrics:
    - type: perplexity
      value: 36.58
      name: Perplexity
  - task:
      type: text-classification
      name: ARC-Easy
    dataset:
      name: AI2 Reasoning Challenge (Easy)
      type: allenai/ai2_arc
    metrics:
    - type: accuracy
      value: 0.3704
      name: Accuracy (raw, 0-shot)
  - task:
      type: text-classification
      name: WinoGrande
    dataset:
      name: WinoGrande
      type: winogrande
    metrics:
    - type: accuracy
      value: 0.5146
      name: Accuracy (0-shot)
datasets:
- Skylion007/openwebtext
---

	# GMT v7 Base

Graph Memory Transformer (GMT) v7 is a decoder-only language model in which every feed-forward network (FFN) sublayer is replaced by a graph-structured memory cell. Causal self-attention is preserved; the per-token FFN transformation is replaced by memory navigation over a learned bank of centroids connected by a directed transition matrix. This substitution aims to make the relations inside the transformer more explicit and easier to interpret.

	Relevance & Potential.

	Transformers dominate modern AI, yet their internal computations remain opaque — FFN sublayers act as unstructured function approximators with no clear semantic structure. By replacing FFNs with a navigable graph of memory states, GMT turns each layer's transformation into a traceable path over discrete, inspectable nodes and edges. This opens the door to mechanistic interpretability at scale: understanding what a layer computes by observing where it routes information. Beyond interpretability, graph-structured memory offers a principled route toward more data-efficient learning, structured knowledge retrieval, and models whose reasoning can be audited rather than reverse-engineered.

[Paper (arXiv:2604.23862)](https://arxiv.org/abs/2604.23862) | [Code](https://github.com/Nemesis533/GMT-GraphMemoryTransformer)

	## Model Details

| Property | Value |
|---|---|
| Architecture | Decoder-only Transformer, FFN replaced by memory cells |
| Parameters | 82.2M |
| Layers / Heads / Hidden | 16 / 12 / 768 |
| Memory slots per layer | 128 (2,048 total across network) |
| Navigation dimension | 128 |
| Vocabulary size | 50,257 (GPT-2 tokenizer) |
| Context length | 1,024 |
| Tied embeddings | Yes |

	How it works. Each block's memory cell performs three operations on a normalized token state:
	1. Source routing — gravitational soft assignment over 128 layer-local centroids (inverse-distance weighting)
	2. Graph traversal + target selection — one-hop diffusion through a learned 128×128 directed edge matrix, refined by token-conditioned query-key scores
	3. Gated displacement readout — the cell returns `σ(g) · LayerNorm(target − source)`, i.e. movement from source toward target memory state, not a retrieved value
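The three operations above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the released implementation: the names `memory_cell`, `edge_logits`, `w_q`, `w_k` and `w_g` are hypothetical, centroids are placed directly in the hidden space (the separate 128-dimensional navigation projection is omitted), distances are Euclidean, the edge matrix is row-normalized with a softmax, the graph prior and query-key scores are combined additively in log space, and the gate is a single scalar per token.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(v, eps=1e-5):
    # LayerNorm without learned scale/shift (assumption).
    mu = v.mean(axis=-1, keepdims=True)
    var = v.var(axis=-1, keepdims=True)
    return (v - mu) / np.sqrt(var + eps)

def memory_cell(x, centroids, edge_logits, w_q, w_k, w_g, eps=1e-6):
    """x: (T, d) normalized token states; centroids: (K, d); edge_logits: (K, K)."""
    # 1. Source routing: inverse-distance ("gravitational") soft assignment.
    dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)  # (T, K)
    inv = 1.0 / (dist + eps)
    src_w = inv / inv.sum(axis=-1, keepdims=True)
    source = src_w @ centroids                                             # (T, d)

    # 2. Graph traversal + target selection: one hop through the directed
    #    edge matrix, refined by token-conditioned query-key scores.
    edges = softmax(edge_logits, axis=-1)          # row-stochastic (assumption)
    prior = src_w @ edges                          # (T, K)
    scores = (x @ w_q) @ (centroids @ w_k).T / np.sqrt(w_q.shape[1])
    tgt_w = softmax(np.log(prior + eps) + scores, axis=-1)
    target = tgt_w @ centroids                                             # (T, d)

    # 3. Gated displacement readout: sigma(g) * LayerNorm(target - source).
    gate = 1.0 / (1.0 + np.exp(-(x @ w_g)))        # (T, 1), scalar gate (assumption)
    return gate * layer_norm(target - source), src_w, tgt_w

# Toy sizes for illustration; the model itself uses K=128 slots and hidden size 768.
rng = np.random.default_rng(0)
T, K, d, d_qk = 4, 128, 64, 16
x = layer_norm(rng.normal(size=(T, d)))
out, src_w, tgt_w = memory_cell(
    x,
    centroids=rng.normal(size=(K, d)),
    edge_logits=rng.normal(size=(K, K)),
    w_q=rng.normal(size=(d, d_qk)),
    w_k=rng.normal(size=(d, d_qk)),
    w_g=rng.normal(size=(d, 1)),
)
print(out.shape)  # (4, 64)
```

Returning `src_w` and `tgt_w` alongside the output is what makes the path inspectable: for each token they give the distribution over source and target memory slots.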

	The model has zero dense FFN sublayers. Centroids are maintained online via EMA write-back and periodic dead-centroid reset / similarity merging.
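As a rough illustration of the centroid maintenance described above, the sketch below shows one possible EMA write-back step with a dead-centroid reset. Everything specific here is an assumption made for the example: the function name `ema_write_back`, the decay and threshold values, the usage statistic, and the choice to re-seed dead centroids from random token states. Similarity merging is left out.

```python
import numpy as np

def ema_write_back(centroids, usage, x, assign_w, decay=0.99,
                   dead_threshold=1e-3, rng=None):
    """centroids: (K, d); usage: (K,); x: (T, d); assign_w: (T, K) routing weights."""
    rng = np.random.default_rng() if rng is None else rng
    mass = assign_w.sum(axis=0)                                   # (K,)
    # Assignment-weighted mean of the token states routed to each centroid.
    batch_mean = (assign_w.T @ x) / np.maximum(mass, 1e-8)[:, None]

    # EMA write-back, applied only to centroids that received any mass.
    hit = mass > 0
    new_c = centroids.copy()
    new_c[hit] = decay * centroids[hit] + (1.0 - decay) * batch_mean[hit]

    # Running usage estimate; centroids below the threshold count as dead.
    new_usage = decay * usage + (1.0 - decay) * mass / max(mass.sum(), 1e-8)
    dead = new_usage < dead_threshold
    if dead.any():
        # Hypothetical reset rule: re-seed dead centroids from random token states.
        idx = rng.integers(0, x.shape[0], size=int(dead.sum()))
        new_c[dead] = x[idx]
        new_usage[dead] = 1.0 / len(usage)
    return new_c, new_usage, dead

# Tiny example: both tokens route to centroid 0, so centroids 1 and 2 decay and are reset.
c0 = np.zeros((3, 2))
u0 = np.full(3, 1.0 / 3.0)
tokens = np.array([[1.0, 0.0], [1.0, 0.0]])
routing = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
c1, u1, dead = ema_write_back(c0, u0, tokens, routing, decay=0.5,
                              dead_threshold=0.2, rng=np.random.default_rng(0))
print(dead)  # [False  True  True]
```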

	## Training

| Property | Value |
|---|---|
| Dataset | OpenWebText (~3B tokens, 2 epochs) |
| Optimizer | AdamW (lr=3e-4, weight decay=0.1, β=(0.9, 0.95)) |
| Scheduler | Cosine decay, 2,000 warmup steps |
| Effective batch | 270,336 tokens (8 × 33 accum × 1,024 seq) |
| Precision | bfloat16 mixed precision |
| Gradient clipping | Norm 1.0 |
| Auxiliary losses | Tracking, centroid orthogonality, usage clustering, edge entropy, edge contrast |
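The effective batch size and the learning-rate schedule in the table can be reproduced with a few lines of Python. The peak learning rate and warmup length come from the table; the linear warmup shape, the decay floor `min_lr=0.0`, and the `total_steps` value used in the example are assumptions, since the card does not state them.

```python
import math

# Effective batch: micro-batch 8 x 33 accumulation steps x 1,024 tokens per sequence.
tokens_per_step = 8 * 33 * 1024
print(tokens_per_step)  # 270336

def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=2000, min_lr=0.0):
    """Linear warmup (assumed shape) followed by cosine decay to min_lr (assumed 0)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# total_steps is a hypothetical value chosen only for this example.
total_steps = 22_000
peak = lr_at(1_999, total_steps)         # last warmup step reaches the peak rate
final = lr_at(total_steps, total_steps)  # end of the cosine decay
```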

	## Results

Evaluated at the best validation-loss checkpoint (step 18,310). The baseline is a 103.0M-parameter dense GPT-2 with matched depth, hidden size, and number of heads.

| Benchmark | Metric | GMT v7 | Baseline GPT-2 | Δ |
|---|---|---|---|---|
| OpenWebText (val) | Perplexity | 36.58 | 26.85 | +9.73 |
| ARC-Easy (0-shot) | Accuracy (raw) | 37.0% | 38.9% | −1.9 pp |
| HellaSwag (0-shot) | Accuracy | 26.7% | 26.9% | −0.3 pp |
| PIQA (0-shot) | Accuracy | 57.8% | 59.5% | −1.7 pp |
| WinoGrande (0-shot) | Accuracy | 51.5% | 50.5% | +1.0 pp |

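GMT v7 operates at a 20% parameter disadvantage (82.2M vs 103.0M), and the gaps in perplexity, ARC-Easy, HellaSwag, and PIQA are consistent with it. WinoGrande, a pronoun-resolution task requiring commonsense referent disambiguation, is the one benchmark where GMT v7 leads (+1.0 pp). This suggests the memory graph may be suited to structured association retrieval, although a one-point margin from a single run is tentative evidence.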
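The parameter gap and the perplexity figures can be checked with simple arithmetic. The conversion below assumes perplexity is the exponential of the mean per-token cross-entropy in nats, which is the usual convention but is not stated explicitly on this card.

```python
import math

gmt_params, base_params = 82.2e6, 103.0e6
param_gap = (base_params - gmt_params) / base_params
print(f"parameter disadvantage: {param_gap:.1%}")  # parameter disadvantage: 20.2%

# Perplexity = exp(loss), so loss = ln(perplexity) (assumed: nats per token).
gmt_loss = math.log(36.58)    # roughly 3.60
base_loss = math.log(26.85)   # roughly 3.29
loss_gap = gmt_loss - base_loss
```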

	## Intended Use & Limitations

	- Research prototype. This model demonstrates that FFN sublayers can be replaced by inspectable graph-memory cells without breaking training stability. It is not intended as a production language model.
	- The model remains behind the larger dense baseline in validation loss and most benchmarks.
	- Results are from a single run on a single checkpoint; broader scaling and more extensive evaluation are left for future work.

	## Citation

	```bibtex
	@article{zanarini2026graphmemorytransformer,
	title={Graph Memory Transformer},
	author={Zanarini, Nicola and Ferrari, Niccol{\`o}},
	journal={arXiv preprint arXiv:2604.23862},
	year={2026}
	}
	```

	## License

	[MIT](https://opensource.org/licenses/MIT) — Copyright (c) 2026 Nicola Zanarini and Niccolò Ferrari.

	## Authors

	Nicola Zanarini & Niccolò Ferrari