GatedDeltaNet-360M-15B-SlimPajama

This model is for research purposes only and is not intended for production use.

A GatedDeltaNet language model (357.8M parameters) pretrained from scratch on 15B tokens from SlimPajama.

GatedDeltaNet combines scalar decay gating with the delta rule for selective memory writing. See Yang et al., 2024.
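The combined recurrent state update can be sketched as follows (notation follows the paper; this is a summary, not taken from this repository — alpha_t is the scalar decay gate, beta_t the writing strength):

```latex
S_t = S_{t-1}\,\bigl(\alpha_t\,(I - \beta_t\, k_t k_t^{\top})\bigr) + \beta_t\, v_t k_t^{\top}
```

Intuitively, alpha_t uniformly decays the memory (as in Mamba2) while the delta-rule term overwrites only the slot addressed by k_t.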

Trained with flash-linear-attention and Flame.

Usage

Requires: pip install flash-linear-attention

import torch
import fla.models  # registers GatedDeltaNet with HuggingFace Auto classes

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "puigde/gated-deltanet-360M-15B-slimpajama",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()
tokenizer = AutoTokenizer.from_pretrained(
    "puigde/gated-deltanet-360M-15B-slimpajama"
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

The import fla.models line registers the GatedDeltaNet architecture with HuggingFace's Auto classes. Without it, from_pretrained will fail with an unknown model type error.

Architecture

Based on the Flame gated_deltanet_340M.json config with two modifications: expand_v=1 (vs 2) and num_heads=4 (vs 6), which reduces the parameter count from ~460M to 358M.
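For reference, the modified fields would look roughly like this in the model config (field names assume fla's GatedDeltaNetConfig; a hypothetical sketch, not the actual config.json):

```json
{
  "hidden_size": 1024,
  "num_hidden_layers": 21,
  "num_heads": 4,
  "head_dim": 256,
  "expand_v": 1
}
```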

Parameters 357,781,928
Layers 21
Hidden size 1,024
Heads 4
Head dim 256
expand_v 1
FFN SwiGLU, 4x (intermediate 4,096)
Vocab size 32,000
Context length 2,048

Training

Dataset cerebras/SlimPajama-627B, train split
Tokens 15,032,385,536
Steps 28,672
Batch size 256 sequences (16/GPU x 8 GPUs x 2 grad accum)
Sequence length 2,048
Optimizer AdamW (fused), betas=(0.9, 0.95), eps=1e-15
Learning rate 4e-4 peak, cosine to 4e-5
Warmup 1,024 steps
Weight decay 0.1
Gradient clipping 1.0
Precision bfloat16 compute, float32 reduce
Hardware 8x NVIDIA A100-SXM4-40GB
Training time ~9.3 hours
Final loss 2.509
Seed 42
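As a sanity check, the reported token count is exactly steps x global batch x sequence length, using only numbers from the table above:

```python
# Recompute the training token count from the hyperparameters above.
per_gpu_batch = 16   # sequences per GPU
num_gpus = 8
grad_accum = 2
seq_len = 2048
steps = 28_672

global_batch = per_gpu_batch * num_gpus * grad_accum  # 256 sequences per step
tokens = global_batch * seq_len * steps
print(f"{tokens:,}")  # 15,032,385,536
```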

Tokenizer: LlamaTokenizer (from fla-hub/gla-1.3B-100B), vocab 32,000.

Evaluation

Zero-shot, lm-evaluation-harness:

Benchmark Metric Score
HellaSwag acc_norm 35.3
PIQA acc_norm 65.3
ARC-Easy acc 46.3
ARC-Challenge acc_norm 23.3
WinoGrande acc 49.5
LAMBADA acc 32.3
BoolQ acc 57.9
COPA acc 70.0
SciQ acc 78.8
WikiText-2 word_ppl 27.6

RULER (2k context): S1=1.00, S2=1.00, S3=0.66, MK1=0.32, avg=0.75.

Citation

@article{yang2024gateddeltanet,
  title={Gated Delta Networks: Improving Mamba2 with Delta Rule},
  author={Yang, Songlin and Kautz, Jan and Hatamizadeh, Ali},
  journal={arXiv preprint arXiv:2412.06464},
  year={2024}
}