# Modern-Transformer-GQA-370M-15B-SlimPajama

This model is for research purposes only and is not intended for production use.

A modern transformer language model (373.6M parameters) with Qwen3-Next-inspired architectural features, pretrained from scratch on 15B tokens from SlimPajama.

This model incorporates several design choices from recent efficient transformer architectures: grouped-query attention (GQA), partial rotary position embeddings, output gating, QK-normalization, and zero-centered RMSNorm. It serves as a modernized attention baseline for comparison with linear attention and state-space models in the Sequence Modeling Baselines collection.
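The GQA layout can be sketched in a few lines of PyTorch (illustrative only; the actual kernels live in flash-linear-attention): the 4 query heads all attend against a single shared key/value head.

```python
import torch

def grouped_query_attention(q, k, v):
    """Minimal GQA sketch: a group of query heads shares one KV head.

    q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    """
    n_q, n_kv = q.shape[1], k.shape[1]
    # Replicate each KV head across its group of query heads (4:1 for this model)
    k = k.repeat_interleave(n_q // n_kv, dim=1)
    v = v.repeat_interleave(n_q // n_kv, dim=1)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    # Causal mask: each position may only attend to itself and earlier positions
    mask = torch.triu(torch.ones(q.shape[2], q.shape[2], dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

# Shapes matching this model: 4 query heads, 1 KV head, head dim 256
q = torch.randn(1, 4, 8, 256)
k = torch.randn(1, 1, 8, 256)
v = torch.randn(1, 1, 8, 256)
out = grouped_query_attention(q, k, v)  # shape (1, 4, 8, 256)
```

Sharing a single KV head shrinks the KV cache by 4x relative to MHA at the same query-head count, which is the main motivation for GQA at inference time.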

Trained with flash-linear-attention and Flame.

## Usage

Requires: `pip install flash-linear-attention`

```python
import torch
import fla.models  # registers the Transformer architecture with HuggingFace Auto classes
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "puigde/modern-transformer-gqa-370M-15B-slimpajama",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()
tokenizer = AutoTokenizer.from_pretrained(
    "puigde/modern-transformer-gqa-370M-15B-slimpajama"
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The `import fla.models` line registers the Transformer architecture with HuggingFace's Auto classes. Without it, `from_pretrained` will fail with an unknown model type error.

## Architecture

| Parameter | Value |
|---|---|
| Parameters | 373,620,224 |
| Layers | 25 |
| Hidden size | 1,024 |
| Q heads | 4 |
| KV heads | 1 (GQA, 4:1 ratio) |
| Head dim | 256 |
| Rotary dim | 64 (25% partial RoPE) |
| Output gate | Sigmoid |
| QK norm | Yes |
| Normalization | Zero-centered RMSNorm |
| FFN | SwiGLU, 4x hidden ratio |
| Vocab size | 32,000 |
| Context length | 2,048 |
| Tied embeddings | No |
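"Partial RoPE" here means only the first 64 of the 256 head dimensions receive the rotary position encoding; the remaining 192 dimensions pass through unrotated. A minimal sketch (illustrative; not the exact flash-linear-attention implementation, and `partial_rope` is our name):

```python
import torch

def partial_rope(x, rotary_dim=64, base=10000.0):
    """Apply rotary embeddings to only the first `rotary_dim` dims of each head.

    x: (batch, heads, seq, head_dim); dims beyond rotary_dim are untouched.
    """
    seq = x.shape[2]
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
    # Standard RoPE frequency ladder over the rotated slice only
    inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim))
    pos = torch.arange(seq, dtype=torch.float32)
    angles = torch.outer(pos, inv_freq)            # (seq, rotary_dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]    # interleaved pairs
    rotated = torch.stack(
        [x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1
    ).flatten(-2)
    return torch.cat([rotated, x_pass], dim=-1)

x = torch.randn(1, 4, 16, 256)
y = partial_rope(x)  # same shape; dims 64..255 are identical to the input
```

Rotating only a quarter of each head leaves most dimensions position-agnostic, a trade-off used by several recent architectures to cut RoPE cost while keeping positional sensitivity.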

## Training

| Setting | Value |
|---|---|
| Dataset | cerebras/SlimPajama-627B, train split |
| Tokens | 15,032,385,536 |
| Steps | 28,610 |
| Batch size | 256 sequences (8/GPU × 8 GPUs × 4 grad accum) |
| Sequence length | 2,048 |
| Optimizer | AdamW (fused), betas=(0.9, 0.95), eps=1e-15 |
| Learning rate | 4e-4 peak, cosine to 4e-5 |
| Warmup | 1,024 steps |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Precision | bfloat16 compute, float32 reduce |
| Hardware | 8× NVIDIA A100-SXM4-40GB |
| Training time | ~16.5 hours |
| Final loss | 2.521 |
| Seed | 42 |
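The learning-rate schedule above (linear warmup to 4e-4 over 1,024 steps, then cosine decay to a 4e-5 floor) can be reproduced with a small helper. This is a sketch under the hyperparameters in the table; the function name `lr_at` is ours, not from the training code.

```python
import math

def lr_at(step, peak=4e-4, final=4e-5, warmup=1024, total=28610):
    """LR at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)   # 0 at end of warmup, 1 at end
    return final + 0.5 * (peak - final) * (1 + math.cos(math.pi * progress))

lr_at(0)       # 0.0
lr_at(1024)    # ≈ 4e-4 (peak)
lr_at(28610)   # ≈ 4e-5 (floor)
```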

Tokenizer: LlamaTokenizer (from fla-hub/gla-1.3B-100B), vocab 32,000.

## Evaluation

Zero-shot, with lm-evaluation-harness:

| Benchmark | Metric | Score |
|---|---|---|
| HellaSwag | acc_norm | 35.9 |
| PIQA | acc_norm | 65.5 |
| ARC-Easy | acc | 46.9 |
| ARC-Challenge | acc_norm | 24.7 |
| WinoGrande | acc | 51.9 |
| LAMBADA | acc | 32.4 |
| BoolQ | acc | 60.6 |
| COPA | acc | 73.0 |
| SciQ | acc | 75.5 |
| OpenBookQA | acc_norm | 30.6 |
| WikiText-2 | word_ppl | 26.2 |

RULER (needle-in-a-haystack):

| Task | 1K | 2K | 4K |
|---|---|---|---|
| Single-1 | 1.00 | 1.00 | 0.00 |
| Single-2 | 1.00 | 1.00 | 0.00 |
| Single-3 | 0.83 | 0.78 | 0.00 |
| Multi-key-1 | 0.76 | 0.73 | 0.01 |

Note: 4K RULER evaluation is beyond the 2,048-token training context, and performance collapses for this GQA model. The MHA variant (modern-transformer-mha-370M-15B-slimpajama) retains partial 4K performance (S1=0.68, MK1=0.22).

Recall:

| Task | Score |
|---|---|
| SWDE | 0.60 |
| FDA | 0.25 |

## Citation

```bibtex
@article{yang2024fla,
  title={Gated Linear Attention Transformers with Hardware-Efficient Training},
  author={Yang, Songlin and Wang, Bailin and Shen, Yikang and Panda, Rameswar and Kim, Yoon},
  journal={arXiv preprint arXiv:2312.06635},
  year={2024}
}
```