# Modern-Transformer-GQA-370M-15B-SlimPajama
*This model is for research purposes only and is not intended for production use.*
A modern transformer language model (373.6M parameters) with Qwen3-Next-inspired architectural features, pretrained from scratch on 15B tokens from SlimPajama.
This model incorporates several design choices from recent efficient transformer architectures: grouped-query attention (GQA), partial rotary position embeddings, output gating, QK-normalization, and zero-centered RMSNorm. It serves as a modernized attention baseline for comparison with linear attention and state-space models in the Sequence Modeling Baselines collection.
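The grouped-query attention pattern named above (4 query heads sharing a single KV head, per the architecture table below) can be sketched numerically. This is an illustrative sketch, not the library's actual implementation; `gqa_scores` and the tensor layout are assumptions for exposition:

```python
import numpy as np

def gqa_scores(q, k, num_q_heads, num_kv_heads):
    """Illustrative grouped-query attention scores: each group of query
    heads shares one KV head (4:1 here), so K is repeated across the group."""
    group = num_q_heads // num_kv_heads
    # q: (num_q_heads, seq, head_dim); k: (num_kv_heads, seq, head_dim)
    k_expanded = np.repeat(k, group, axis=0)  # broadcast each KV head to its query group
    d = q.shape[-1]
    return q @ k_expanded.transpose(0, 2, 1) / np.sqrt(d)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8, 256))  # 4 query heads, head dim 256
k = rng.standard_normal((1, 8, 256))  # 1 shared KV head (4:1 GQA)
scores = gqa_scores(q, k, num_q_heads=4, num_kv_heads=1)
print(scores.shape)  # (4, 8, 8)
```

Sharing one KV head across the query group shrinks the KV cache by 4x relative to MHA at the same head dimension, which is the main efficiency motivation for GQA.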
Trained with flash-linear-attention and Flame.
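Zero-centered RMSNorm, another feature listed above, stores its learnable scale centered at zero so a freshly initialized scale is an identity. A minimal sketch, assuming the common `(1 + gamma)` parameterization; the exact convention in flash-linear-attention may differ:

```python
import numpy as np

def zero_centered_rmsnorm(x, gamma, eps=1e-6):
    """RMSNorm whose learnable scale is stored zero-centered and applied
    as (1 + gamma): gamma initialized to zeros means pure normalization."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * (1.0 + gamma)

x = np.array([[3.0, 4.0]])
gamma = np.zeros(2)  # zero init => identity scale, output has unit RMS
y = zero_centered_rmsnorm(x, gamma)
print(y)
```

Keeping the stored weight near zero (rather than near one) plays better with weight decay, since decay then pulls the scale toward identity instead of toward zero.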
## Usage

Requires: `pip install flash-linear-attention`
```python
import torch
import fla.models  # registers the Transformer architecture with HuggingFace Auto classes
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "puigde/modern-transformer-gqa-370M-15B-slimpajama",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()
tokenizer = AutoTokenizer.from_pretrained(
    "puigde/modern-transformer-gqa-370M-15B-slimpajama"
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
The `import fla.models` line registers the Transformer architecture with HuggingFace's Auto classes. Without it, `from_pretrained` will fail with an unknown model type error.
## Architecture

| Parameter | Value |
|---|---|
| Parameters | 373,620,224 |
| Layers | 25 |
| Hidden size | 1,024 |
| Q heads | 4 |
| KV heads | 1 (GQA, 4:1 ratio) |
| Head dim | 256 |
| Rotary dim | 64 (25% partial RoPE) |
| Output gate | Sigmoid |
| QK norm | Yes |
| Normalization | Zero-centered RMSNorm |
| FFN | SwiGLU, 4x hidden ratio |
| Vocab size | 32,000 |
| Context length | 2,048 |
| Tied embeddings | No |
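The "25% partial RoPE" row means rotary embeddings touch only the first 64 of the 256 head dimensions; the rest pass through unrotated. A minimal numpy sketch; the interleaved-pair split used here is an assumption, and the library's exact layout may differ:

```python
import numpy as np

def partial_rope(x, rotary_dim=64, base=10000.0):
    """Apply rotary position embeddings to the first `rotary_dim` dims of
    each head vector (64 of 256 here, i.e. 25% partial RoPE); the remaining
    dims are returned unchanged."""
    seq, head_dim = x.shape
    rot, rest = x[:, :rotary_dim], x[:, rotary_dim:]
    inv_freq = 1.0 / base ** (np.arange(0, rotary_dim, 2) / rotary_dim)
    angles = np.outer(np.arange(seq), inv_freq)  # (seq, rotary_dim / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[:, ::2], rot[:, 1::2]           # interleaved 2D pairs
    rot_out = np.empty_like(rot)
    rot_out[:, ::2] = x1 * cos - x2 * sin        # rotate each pair by its angle
    rot_out[:, 1::2] = x1 * sin + x2 * cos
    return np.concatenate([rot_out, rest], axis=-1)

x = np.random.default_rng(0).standard_normal((8, 256))
y = partial_rope(x)
assert np.allclose(y[:, 64:], x[:, 64:])  # un-rotated 75% is untouched
assert np.allclose(y[0], x[0])            # position 0 gets zero rotation
```

Because rotation is norm-preserving, partial RoPE injects position information into a quarter of each head while leaving the remaining dimensions position-free.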
## Training

| Setting | Value |
|---|---|
| Dataset | cerebras/SlimPajama-627B, train split |
| Tokens | 15,032,385,536 |
| Steps | 28,610 |
| Batch size | 256 sequences (8/GPU x 8 GPUs x 4 grad accum) |
| Sequence length | 2,048 |
| Optimizer | AdamW (fused), betas=(0.9, 0.95), eps=1e-15 |
| Learning rate | 4e-4 peak, cosine decay to 4e-5 |
| Warmup | 1,024 steps |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Precision | bfloat16 compute, float32 reduce |
| Hardware | 8x NVIDIA A100-SXM4-40GB |
| Training time | ~16.5 hours |
| Final loss | 2.521 |
| Seed | 42 |
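As a sanity check on the batch-size row above, the effective tokens per optimizer step work out as:

```python
# Effective batch: 8 sequences/GPU x 8 GPUs x 4 grad-accum steps = 256 sequences
seqs_per_step = 8 * 8 * 4
tokens_per_step = seqs_per_step * 2048  # 2,048-token sequences
print(seqs_per_step, tokens_per_step)   # 256 524288
```

At ~524K tokens per optimizer step, the ~15B-token budget corresponds to roughly 28.7K steps, in line with the step count reported above.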
Tokenizer: `LlamaTokenizer` (from `fla-hub/gla-1.3B-100B`), vocab 32,000.
## Evaluation

Zero-shot, evaluated with lm-evaluation-harness:
| Benchmark | Metric | Score |
|---|---|---|
| HellaSwag | acc_norm | 35.9 |
| PIQA | acc_norm | 65.5 |
| ARC-Easy | acc | 46.9 |
| ARC-Challenge | acc_norm | 24.7 |
| WinoGrande | acc | 51.9 |
| LAMBADA | acc | 32.4 |
| BoolQ | acc | 60.6 |
| COPA | acc | 73.0 |
| SciQ | acc | 75.5 |
| OpenBookQA | acc_norm | 30.6 |
| WikiText-2 | word_ppl | 26.2 |
RULER (needle-in-a-haystack):
| Task | 1K | 2K | 4K |
|---|---|---|---|
| Single-1 | 1.00 | 1.00 | 0.00 |
| Single-2 | 1.00 | 1.00 | 0.00 |
| Single-3 | 0.83 | 0.78 | 0.00 |
| Multi-key-1 | 0.76 | 0.73 | 0.01 |
Note: 4K RULER contexts exceed the 2,048-token training context, and performance collapses for this GQA model. The MHA variant (modern-transformer-mha-370M-15B-slimpajama) retains partial 4K performance (S1 = 0.68, MK1 = 0.22).
Recall-intensive tasks:
| Task | Score |
|---|---|
| SWDE | 0.60 |
| FDA | 0.25 |
## Citation

```bibtex
@article{yang2024fla,
  title={Gated Linear Attention Transformers with Hardware-Efficient Training},
  author={Yang, Songlin and Wang, Bailin and Shen, Yikang and Panda, Rameswar and Kim, Yoon},
  journal={arXiv preprint arXiv:2312.06635},
  year={2024}
}
```