# Modern-Transformer-GQA-370M-15B-SlimPajama
*This model is for research purposes only and is not intended for production use.*
A modern transformer language model (373.6M parameters) with Qwen3-Next-inspired architectural features, pretrained from scratch on 15B tokens from SlimPajama.
This model incorporates several design choices from recent efficient transformer architectures: grouped-query attention (GQA), partial rotary position embeddings, output gating, QK-normalization, and zero-centered RMSNorm. It serves as a modernized attention baseline for comparison with linear attention and state-space models in the Sequence Modeling Baselines collection.
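The grouped-query attention pattern named above (4 query heads sharing a single KV head, per the architecture table below) can be sketched numerically. This is an illustrative sketch, not the library's actual implementation; `gqa_scores` and the tensor layout are assumptions for exposition:

```python
import numpy as np

def gqa_scores(q, k, num_q_heads, num_kv_heads):
    """Illustrative grouped-query attention scores: each group of query
    heads shares one KV head (4:1 here), so K is repeated across the group."""
    group = num_q_heads // num_kv_heads
    # q: (num_q_heads, seq, head_dim); k: (num_kv_heads, seq, head_dim)
    k_expanded = np.repeat(k, group, axis=0)  # broadcast each KV head to its query group
    d = q.shape[-1]
    return q @ k_expanded.transpose(0, 2, 1) / np.sqrt(d)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8, 256))  # 4 query heads, head dim 256
k = rng.standard_normal((1, 8, 256))  # 1 shared KV head (4:1 GQA)
scores = gqa_scores(q, k, num_q_heads=4, num_kv_heads=1)
print(scores.shape)  # (4, 8, 8)
```

Sharing one KV head across the query group shrinks the KV cache by 4x relative to MHA at the same head dimension, which is the main efficiency motivation for GQA.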
Trained with flash-linear-attention and Flame.
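Zero-centered RMSNorm, another feature listed above, stores its learnable scale centered at zero so a freshly initialized scale is an identity. A minimal sketch, assuming the common `(1 + gamma)` parameterization; the exact convention in flash-linear-attention may differ:

```python
import numpy as np

def zero_centered_rmsnorm(x, gamma, eps=1e-6):
    """RMSNorm whose learnable scale is stored zero-centered and applied
    as (1 + gamma): gamma initialized to zeros means pure normalization."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * (1.0 + gamma)

x = np.array([[3.0, 4.0]])
gamma = np.zeros(2)  # zero init => identity scale, output has unit RMS
y = zero_centered_rmsnorm(x, gamma)
print(y)
```

Keeping the stored weight near zero (rather than near one) plays better with weight decay, since decay then pulls the scale toward identity instead of toward zero.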
## Usage

Requires: `pip install flash-linear-attention`
```python
import torch
import fla.models  # registers the Transformer architecture with HuggingFace Auto classes
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "puigde/modern-transformer-gqa-370M-15B-slimpajama",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()
tokenizer = AutoTokenizer.from_pretrained(
    "puigde/modern-transformer-gqa-370M-15B-slimpajama"
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
The `import fla.models` line registers the Transformer architecture with HuggingFace's Auto classes. Without it, `from_pretrained` will fail with an unknown model type error.
## Architecture

| Parameter | Value |
|---|---|
| Parameters | 373,620,224 |
| Layers | 25 |
| Hidden size | 1,024 |
| Q heads | 4 |
| KV heads | 1 (GQA, 4:1 ratio) |
| Head dim | 256 |
| Rotary dim | 64 (25% partial RoPE) |
| Output gate | Sigmoid |
| QK norm | Yes |
| Normalization | Zero-centered RMSNorm |
| FFN | SwiGLU, 4x hidden ratio |
| Vocab size | 32,000 |
| Context length | 2,048 |
| Tied embeddings | No |
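The "25% partial RoPE" row means rotary embeddings touch only the first 64 of the 256 head dimensions; the rest pass through unrotated. A minimal numpy sketch; the interleaved-pair split used here is an assumption, and the library's exact layout may differ:

```python
import numpy as np

def partial_rope(x, rotary_dim=64, base=10000.0):
    """Apply rotary position embeddings to the first `rotary_dim` dims of
    each head vector (64 of 256 here, i.e. 25% partial RoPE); the remaining
    dims are returned unchanged."""
    seq, head_dim = x.shape
    rot, rest = x[:, :rotary_dim], x[:, rotary_dim:]
    inv_freq = 1.0 / base ** (np.arange(0, rotary_dim, 2) / rotary_dim)
    angles = np.outer(np.arange(seq), inv_freq)  # (seq, rotary_dim / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[:, ::2], rot[:, 1::2]           # interleaved 2D pairs
    rot_out = np.empty_like(rot)
    rot_out[:, ::2] = x1 * cos - x2 * sin        # rotate each pair by its angle
    rot_out[:, 1::2] = x1 * sin + x2 * cos
    return np.concatenate([rot_out, rest], axis=-1)

x = np.random.default_rng(0).standard_normal((8, 256))
y = partial_rope(x)
assert np.allclose(y[:, 64:], x[:, 64:])  # un-rotated 75% is untouched
assert np.allclose(y[0], x[0])            # position 0 gets zero rotation
```

Because rotation is norm-preserving, partial RoPE injects position information into a quarter of each head while leaving the remaining dimensions position-free.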
## Training

| Setting | Value |
|---|---|
| Dataset | cerebras/SlimPajama-627B, train split |
| Tokens | 15,032,385,536 |
| Steps | 28,610 |
| Batch size | 256 sequences (8/GPU x 8 GPUs x 4 grad accum) |
| Sequence length | 2,048 |
| Optimizer | AdamW (fused), betas=(0.9, 0.95), eps=1e-15 |
| Learning rate | 4e-4 peak, cosine decay to 4e-5 |
| Warmup | 1,024 steps |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Precision | bfloat16 compute, float32 reduce |
| Hardware | 8x NVIDIA A100-SXM4-40GB |
| Training time | ~16.5 hours |
| Final loss | 2.521 |
| Seed | 42 |
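As a sanity check on the batch-size row above, the effective tokens per optimizer step work out as:

```python
# Effective batch: 8 sequences/GPU x 8 GPUs x 4 grad-accum steps = 256 sequences
seqs_per_step = 8 * 8 * 4
tokens_per_step = seqs_per_step * 2048  # 2,048-token sequences
print(seqs_per_step, tokens_per_step)   # 256 524288
```

At ~524K tokens per optimizer step, the ~15B-token budget corresponds to roughly 28.7K steps, in line with the step count reported above.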
Tokenizer: `LlamaTokenizer` (from `fla-hub/gla-1.3B-100B`), vocab 32,000.
## Evaluation

Zero-shot, evaluated with lm-evaluation-harness:
| Benchmark | Metric | Score |
|---|---|---|
| HellaSwag | acc_norm | 35.9 |
| PIQA | acc_norm | 65.5 |
| ARC-Easy | acc | 46.9 |
| ARC-Challenge | acc_norm | 24.7 |
| WinoGrande | acc | 51.9 |
| LAMBADA | acc | 32.4 |
| BoolQ | acc | 60.6 |
| COPA | acc | 73.0 |
| SciQ | acc | 75.5 |
| OpenBookQA | acc_norm | 30.6 |
| WikiText-2 | word_ppl | 26.2 |
RULER (needle-in-a-haystack):
| Task | 1K | 2K | 4K |
|---|---|---|---|
| Single-1 | 1.00 | 1.00 | 0.00 |
| Single-2 | 1.00 | 1.00 | 0.00 |
| Single-3 | 0.83 | 0.78 | 0.00 |
| Multi-key-1 | 0.76 | 0.73 | 0.01 |
Note: 4K RULER contexts exceed the 2,048-token training context, and performance collapses for this GQA model. The MHA variant (modern-transformer-mha-370M-15B-slimpajama) retains partial 4K performance (S1 = 0.68, MK1 = 0.22).
Recall-intensive tasks:
| Task | Score |
|---|---|
| SWDE | 0.60 |
| FDA | 0.25 |
## Citation

```bibtex
@article{yang2024fla,
  title={Gated Linear Attention Transformers with Hardware-Efficient Training},
  author={Yang, Songlin and Wang, Bailin and Shen, Yikang and Panda, Rameswar and Kim, Yoon},
  journal={arXiv preprint arXiv:2312.06635},
  year={2024}
}
```