Sentinel Prime 350M - Sparse MoE Language Model
Sentinel Prime 350M is a from-scratch sparse Mixture of Experts (MoE) transformer built by QubitPage Research.
Architecture
| Parameter | Value |
|---|---|
| Total Parameters | 471,231,488 |
| Active Parameters | ~471,231,488 per token |
| Hidden Dimension | 1024 |
| Layers | 24 |
| Attention Heads | 16 (Q) / 4 (KV) |
| FFN Dimension | 2752 |
| Experts | 1 total, top-1 active |
| Vocab Size | 100,277 (tiktoken cl100k_base) |
| Max Sequence Length | 2048 |
| Position Encoding | RoPE (theta=500000.0) |
| Normalization | RMSNorm |
| FFN Type | SwiGLU |
| Attention | Grouped Query Attention (GQA) |
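For orientation, here is the same hyperparameter table as a minimal Python sketch; the field names are illustrative and are not guaranteed to match the actual `SentinelBrainConfig` attributes:

```python
from dataclasses import dataclass

@dataclass
class SentinelPrimeSpec:
    """Illustrative hyperparameter summary; field names are hypothetical
    and may differ from the real SentinelBrainConfig."""
    hidden_size: int = 1024
    num_layers: int = 24
    num_query_heads: int = 16        # GQA: 16 query heads ...
    num_key_value_heads: int = 4     # ... share 4 KV heads
    ffn_dim: int = 2752              # SwiGLU intermediate size
    num_experts: int = 1             # MoE expert pool size
    experts_per_token: int = 1       # top-1 routing
    vocab_size: int = 100_277        # tiktoken cl100k_base
    max_seq_len: int = 2048
    rope_theta: float = 500_000.0
```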
Key Features
- Sparse MoE: Only 1/1 experts active per token
- GQA: Memory-efficient grouped query attention
- SwiGLU: LLaMA/Mistral-style feed-forward (see the sketch after this list)
- RoPE: Rotary position embeddings for length generalization
- From Scratch: No pretrained weights, trained from random initialization
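A minimal sketch of a LLaMA/Mistral-style SwiGLU feed-forward block with this model's dimensions; it illustrates the pattern only and is not the repository's actual module code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Generic SwiGLU feed-forward block (sketch, not the actual implementation)."""
    def __init__(self, hidden_size: int = 1024, ffn_dim: int = 2752):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, ffn_dim, bias=False)
        self.up_proj = nn.Linear(hidden_size, ffn_dim, bias=False)
        self.down_proj = nn.Linear(ffn_dim, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(gate(x)) gates up(x), then project back to hidden_size
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```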
Training
- Data: FineWeb-Edu (educational web text)
- Tokens Seen: 3,113,287,680
- Best Validation Loss: 3.0578
- Hardware: NVIDIA GH200 96GB HBM3e
- Framework: PyTorch 2.5.1
Usage
# Custom config, model, and tokenizer classes ship with the repository (hf_model.py, hf_tokenizer.py)
from hf_model import SentinelBrainConfig, SentinelBrainForCausalLM
from hf_tokenizer import SentinelBrainTokenizer
model = SentinelBrainForCausalLM.from_pretrained("qubitpage/sentinel-prime-350m", trust_remote_code=True)
tokenizer = SentinelBrainTokenizer()
input_ids = tokenizer("The meaning of life is", return_tensors="pt")["input_ids"]
output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0]))
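Because the checkpoint ships its custom code (hence `trust_remote_code=True` above), loading through the generic Auto classes should also work; a hedged sketch, assuming the repo's `auto_map` registers the custom config, model, and tokenizer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes the repo's auto_map exposes the custom classes; if it does not,
# fall back to the explicit hf_model / hf_tokenizer imports shown above.
model = AutoModelForCausalLM.from_pretrained(
    "qubitpage/sentinel-prime-350m", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "qubitpage/sentinel-prime-350m", trust_remote_code=True
)

inputs = tokenizer("The meaning of life is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```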
License
Apache 2.0
Benchmarks
Results are from the EleutherAI lm-evaluation-harness (latest) run on a single NVIDIA GH200 96GB. Full results, configs, and per-sample logs are published in the companion dataset:
qubitpage/sentinel-prime-350m-evals
| Model | Params | Train Tokens | arc_challenge | arc_easy | hellaswag | lambada_openai | openbookqa | piqa | sciq | winogrande | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Sentinel-Prime-350M (ours) | 471M | 3.1B | 0.194 | 0.352 | 0.264 | 0.001 | 0.120 | 0.566 | 0.481 | 0.501 | 0.310 |
| Pythia-410M | 410M | 300B | 0.240 | 0.520 | 0.400 | 0.510 | 0.300 | 0.670 | 0.810 | 0.530 | 0.498 |
| GPT-Neo-125M | 125M | 300B | 0.190 | 0.430 | 0.300 | 0.370 | 0.260 | 0.630 | 0.760 | 0.520 | 0.433 |
| SmolLM-360M | 360M | 600B | 0.340 | 0.660 | 0.520 | 0.460 | 0.370 | 0.720 | 0.910 | 0.570 | 0.569 |
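The Avg column is the unweighted mean of the eight task accuracies; for the Sentinel-Prime-350M row:

```python
scores = {
    "arc_challenge": 0.194, "arc_easy": 0.352, "hellaswag": 0.264,
    "lambada_openai": 0.001, "openbookqa": 0.120, "piqa": 0.566,
    "sciq": 0.481, "winogrande": 0.501,
}
# Unweighted macro-average over the eight tasks
print(f"{sum(scores.values()) / len(scores):.3f}")  # 0.310
```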
Training-Compute Context
| Model | Hardware | Tokens Seen | Compute Multiplier vs ours |
|---|---|---|---|
| Sentinel-Prime-350M | 1x NVIDIA GH200 96GB | 3.1B | 1x (baseline) |
| Pythia-410M | TPU v4 cluster | 300B | 97x |
| SmolLM-360M | 64x NVIDIA H100 | 600B | 194x |
Sentinel-Prime-350M was trained on 3.1B tokens, versus 300B for Pythia-410M (97x more) and 600B for SmolLM-360M (194x more). The current average of 0.310 vs Pythia's 0.498 and SmolLM's 0.569 therefore reflects an early-checkpoint snapshot at roughly 1% or less of the training budget typical for this size class.
Per Chinchilla (Hoffmann et al., 2022), a 471M-parameter dense model is compute-optimal at roughly 9.4B training tokens (20x its parameter count); the current checkpoint sits at about 0.33x that budget. Continued pretraining on the same architecture and data is expected to scale predictably toward the reference band.
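The multipliers and the Chinchilla ratio above follow directly from the token counts; a quick arithmetic check:

```python
tokens_ours   = 3.1e9   # Sentinel-Prime-350M
tokens_pythia = 300e9   # Pythia-410M
tokens_smol   = 600e9   # SmolLM-360M
params        = 471e6   # ~471M total parameters

print(f"{tokens_pythia / tokens_ours:.0f}x")  # 97x  more training tokens than ours
print(f"{tokens_smol / tokens_ours:.0f}x")    # 194x more training tokens than ours

# Chinchilla rule of thumb: ~20 training tokens per parameter is compute-optimal
chinchilla_optimal = 20 * params              # ~9.4B tokens
print(f"{tokens_ours / chinchilla_optimal:.2f}x")  # 0.33x of compute-optimal
```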
Reference scores are taken from the EleutherAI Pythia paper and the Hugging Face SmolLM model card. All models were evaluated 0-shot under identical prompt formats.
Reproduce locally:
pip install "lm_eval[hf]"
lm_eval run --model hf \
--model_args pretrained=qubitpage/sentinel-prime-350m,trust_remote_code=True,dtype=float32 \
--tasks arc_challenge,arc_easy,hellaswag,lambada_openai,openbookqa,piqa,sciq,winogrande \
--device cuda:0 --batch_size auto:4
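The same evaluation can also be launched from Python through the harness's `simple_evaluate` entry point (a sketch; exact argument handling may vary between harness versions):

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=qubitpage/sentinel-prime-350m,trust_remote_code=True,dtype=float32",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "lambada_openai",
           "openbookqa", "piqa", "sciq", "winogrande"],
    device="cuda:0",
    batch_size="auto",
)
print(results["results"])
```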
Support This Project
Sentinel-Prime is being trained on a single GH200 against models that used hundreds of GPUs. If these results interest you and you want to help us close the 97x-194x compute gap, you can back the project here:
Support Sentinel-Prime on Surge
Every contribution funds more GH200 hours and brings the next checkpoint closer to (and past) the Pythia / SmolLM reference band.