GatedDeltaNet-360M-15B-SlimPajama

This model is for research purposes only and is not intended for production use.

A GatedDeltaNet language model (357.8M parameters) pretrained from scratch on 15B tokens from SlimPajama.

GatedDeltaNet combines scalar decay gating with the delta rule for selective memory writing. See Yang et al., 2024.
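The combined recurrent state update can be sketched as follows (notation follows the paper; this is a summary, not taken from this repository — alpha_t is the scalar decay gate, beta_t the writing strength):

```latex
S_t = S_{t-1}\,\bigl(\alpha_t\,(I - \beta_t\, k_t k_t^{\top})\bigr) + \beta_t\, v_t k_t^{\top}
```

Intuitively, alpha_t uniformly decays the memory (as in Mamba2) while the delta-rule term overwrites only the slot addressed by k_t.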

Trained with flash-linear-attention and Flame.

Usage

Requires: pip install flash-linear-attention

import torch
import fla.models  # registers GatedDeltaNet with HuggingFace Auto classes

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "puigde/gated-deltanet-360M-15B-slimpajama",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()
tokenizer = AutoTokenizer.from_pretrained(
    "puigde/gated-deltanet-360M-15B-slimpajama"
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

The import fla.models line registers the GatedDeltaNet architecture with HuggingFace's Auto classes. Without it, from_pretrained will fail with an unknown model type error.

Architecture

Based on the Flame gated_deltanet_340M.json config with two modifications: expand_v=1 (vs 2) and num_heads=4 (vs 6), which reduces the parameter count from ~460M to 358M.
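For reference, the modified fields would look roughly like this in the model config (field names assume fla's GatedDeltaNetConfig; a hypothetical sketch, not the actual config.json):

```json
{
  "hidden_size": 1024,
  "num_hidden_layers": 21,
  "num_heads": 4,
  "head_dim": 256,
  "expand_v": 1
}
```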

Parameters 357,781,928
Layers 21
Hidden size 1,024
Heads 4
Head dim 256
expand_v 1
FFN SwiGLU, 4x (intermediate 4,096)
Vocab size 32,000
Context length 2,048

Training

Dataset cerebras/SlimPajama-627B, train split
Tokens 15,032,385,536
Steps 28,672
Batch size 256 sequences (16/GPU x 8 GPUs x 2 grad accum)
Sequence length 2,048
Optimizer AdamW (fused), betas=(0.9, 0.95), eps=1e-15
Learning rate 4e-4 peak, cosine to 4e-5
Warmup 1,024 steps
Weight decay 0.1
Gradient clipping 1.0
Precision bfloat16 compute, float32 reduce
Hardware 8x NVIDIA A100-SXM4-40GB
Training time ~9.3 hours
Final loss 2.509
Seed 42
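As a sanity check, the reported token count is exactly steps x global batch x sequence length, using only numbers from the table above:

```python
# Recompute the training token count from the hyperparameters above.
per_gpu_batch = 16   # sequences per GPU
num_gpus = 8
grad_accum = 2
seq_len = 2048
steps = 28_672

global_batch = per_gpu_batch * num_gpus * grad_accum  # 256 sequences per step
tokens = global_batch * seq_len * steps
print(f"{tokens:,}")  # 15,032,385,536
```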

Tokenizer: LlamaTokenizer (from fla-hub/gla-1.3B-100B), vocab 32,000.

Evaluation

Zero-shot, lm-evaluation-harness:

Benchmark Metric Score
HellaSwag acc_norm 35.3
PIQA acc_norm 65.3
ARC-Easy acc 46.3
ARC-Challenge acc_norm 23.3
WinoGrande acc 49.5
LAMBADA acc 32.3
BoolQ acc 57.9
COPA acc 70.0
SciQ acc 78.8
WikiText-2 word_ppl 27.6

RULER (2k context): S1=1.00, S2=1.00, S3=0.66, MK1=0.32, avg=0.75.

Citation

@article{yang2024gateddeltanet,
  title={Gated Delta Networks: Improving Mamba2 with Delta Rule},
  author={Yang, Songlin and Kautz, Jan and Hatamizadeh, Ali},
  journal={arXiv preprint arXiv:2412.06464},
  year={2024}
}