# GatedDeltaNet-360M-15B-SlimPajama
This model is for research purposes only and is not intended for production use.
A GatedDeltaNet language model (357.8M parameters) pretrained from scratch on 15B tokens from SlimPajama.
GatedDeltaNet combines scalar decay gating with the delta rule for selective memory writing. See Yang et al., 2024.
Trained with flash-linear-attention and Flame.
## Usage

Requires: `pip install flash-linear-attention`
```python
import torch
import fla.models  # registers GatedDeltaNet with HuggingFace Auto classes
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "puigde/gated-deltanet-360M-15B-slimpajama",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()
tokenizer = AutoTokenizer.from_pretrained(
    "puigde/gated-deltanet-360M-15B-slimpajama"
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
The `import fla.models` line registers the GatedDeltaNet architecture with HuggingFace's `Auto` classes. Without it, `from_pretrained` will fail with an unknown model type error.
## Architecture
Based on Flame's `gated_deltanet_340M.json` config with two modifications: `expand_v=1` (vs. 2) and `num_heads=4` (vs. 6), which together reduce the parameter count from ~460M to ~358M.
| Hyperparameter | Value |
|---|---|
| Total parameters | 357,781,928 |
| Layers | 21 |
| Hidden size | 1,024 |
| Heads | 4 |
| Head dim | 256 |
| expand_v | 1 |
| FFN | SwiGLU, 4x (intermediate 4,096) |
| Vocab size | 32,000 |
| Context length | 2,048 |
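The two deviations from the stock config can be summarized as a small config fragment. This is a sketch only: the field names below mirror the conventions of Flame's `gated_deltanet_340M.json` and have not been verified against the released checkpoint, so treat them as assumptions.

```python
# Hypothetical config fragment mirroring the table above.
# Field names are assumed, not verified against the released checkpoint.
config = {
    "hidden_size": 1024,
    "num_hidden_layers": 21,
    "num_heads": 4,             # stock config uses 6
    "expand_v": 1,              # stock config uses 2
    "intermediate_size": 4096,  # SwiGLU FFN, 4x hidden
    "vocab_size": 32000,
    "max_position_embeddings": 2048,
}

# Sanity check: per-head dimension = hidden_size / num_heads = 1024 / 4 = 256,
# matching the "Head dim" row in the table.
assert config["hidden_size"] // config["num_heads"] == 256
```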
## Training
| Setting | Value |
|---|---|
| Dataset | cerebras/SlimPajama-627B, train split |
| Tokens | 15,032,385,536 |
| Steps | 28,672 |
| Batch size | 256 sequences (16/GPU x 8 GPUs x 2 grad accum) |
| Sequence length | 2,048 |
| Optimizer | AdamW (fused), betas=(0.9, 0.95), eps=1e-15 |
| Learning rate | 4e-4 peak, cosine to 4e-5 |
| Warmup | 1,024 steps |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Precision | bfloat16 compute, float32 reduce |
| Hardware | 8x NVIDIA A100-SXM4-40GB |
| Training time | ~9.3 hours |
| Final loss | 2.509 |
| Seed | 42 |
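The token count follows directly from the schedule: steps x batch size x sequence length. A quick sanity check of the numbers in the table:

```python
steps = 28_672
batch_size = 256   # 16 per GPU x 8 GPUs x 2 grad accum
seq_len = 2_048

# Effective batch size decomposition matches the table.
assert 16 * 8 * 2 == batch_size

tokens = steps * batch_size * seq_len
print(tokens)  # 15032385536 -- matches the "Tokens" row
```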
Tokenizer: `LlamaTokenizer` (from `fla-hub/gla-1.3B-100B`), vocab size 32,000.
## Evaluation

Zero-shot results via lm-evaluation-harness:
| Benchmark | Metric | Score |
|---|---|---|
| HellaSwag | acc_norm | 35.3 |
| PIQA | acc_norm | 65.3 |
| ARC-Easy | acc | 46.3 |
| ARC-Challenge | acc_norm | 23.3 |
| WinoGrande | acc | 49.5 |
| LAMBADA | acc | 32.3 |
| BoolQ | acc | 57.9 |
| COPA | acc | 70.0 |
| SciQ | acc | 78.8 |
| WikiText-2 | word_ppl (lower is better) | 27.6 |
RULER (2k context): S1=1.00, S2=1.00, S3=0.66, MK1=0.32, avg=0.75.
## Citation

```bibtex
@article{yang2024gateddeltanet,
  title={Gated Delta Networks: Improving Mamba2 with Delta Rule},
  author={Yang, Songlin and Kautz, Jan and Hatamizadeh, Ali},
  journal={arXiv preprint arXiv:2412.06464},
  year={2024}
}
```