Sentinel Prime 350M – Sparse MoE Language Model

Sentinel Prime 350M is a from-scratch sparse Mixture of Experts (MoE) transformer built by QubitPage Research.

Architecture

| Parameter | Value |
|---|---|
| Total Parameters | 471,231,488 |
| Active Parameters | ~471,231,488 per token |
| Hidden Dimension | 1024 |
| Layers | 24 |
| Attention Heads | 16 (Q) / 4 (KV) |
| FFN Dimension | 2752 |
| Experts | 1 total, top-1 active |
| Vocab Size | 100,277 (tiktoken cl100k_base) |
| Max Sequence Length | 2048 |
| Position Encoding | RoPE (theta=500000.0) |
| Normalization | RMSNorm |
| FFN Type | SwiGLU |
| Attention | Grouped Query Attention (GQA) |
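
The total parameter count can be reproduced from the table above. The sketch below is a back-of-the-envelope decomposition, assuming untied input/output embeddings, bias-free linear layers, two RMSNorms per layer plus a final norm, and a parameter-free router for the single expert; under those assumptions it matches the stated count exactly.

# Back-of-the-envelope parameter count (assumptions noted above).
vocab, hidden, layers, ffn = 100_277, 1024, 24, 2752
n_q_heads, n_kv_heads = 16, 4
head_dim = hidden // n_q_heads                      # 64
kv_dim = n_kv_heads * head_dim                      # 256 (GQA: K/V projections are 4x smaller)

embeddings = 2 * vocab * hidden                     # untied input + output embedding matrices
attn = 2 * hidden * hidden + 2 * hidden * kv_dim    # Wq, Wo full-size; Wk, Wv reduced
ffn_params = 3 * hidden * ffn                       # gate, up, down projections (SwiGLU)
norms = 2 * hidden                                  # pre-attention + pre-FFN RMSNorm
per_layer = attn + ffn_params + norms

total = embeddings + layers * per_layer + hidden    # plus the final RMSNorm
print(f"{total:,}")                                 # 471,231,488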

Key Features

  • Sparse MoE: Only 1/1 experts active per token
  • GQA: Memory-efficient grouped query attention
  • SwiGLU: LLaMA/Mistral-style feed-forward (see the sketch after this list)
  • RoPE: Rotary position embeddings for length generalization
  • From Scratch: No pretrained weights, trained from random initialization
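
As referenced in the SwiGLU item above, the following is a minimal PyTorch sketch of a LLaMA-style SwiGLU feed-forward with the card's dimensions (1024 -> 2752 -> 1024). It is illustrative only and not the model's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Illustrative LLaMA-style SwiGLU feed-forward; dimensions taken from the card."""
    def __init__(self, hidden: int = 1024, ffn: int = 2752):
        super().__init__()
        self.gate = nn.Linear(hidden, ffn, bias=False)
        self.up = nn.Linear(hidden, ffn, bias=False)
        self.down = nn.Linear(ffn, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: down( silu(gate(x)) * up(x) )
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = SwiGLU()
y = ffn(torch.randn(1, 8, 1024))   # (batch, seq, hidden) in, same shape out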

Training

  • Data: FineWeb-Edu (educational web text)
  • Tokens Seen: 3,113,287,680
  • Best Validation Loss: 3.0578
  • Hardware: NVIDIA GH200 96GB HBM3e
  • Framework: PyTorch 2.5.1

Usage

# Import the custom model and tokenizer classes shipped with this repo
from hf_model import SentinelBrainForCausalLM
from hf_tokenizer import SentinelBrainTokenizer

model = SentinelBrainForCausalLM.from_pretrained("qubitpage/sentinel-prime-350m", trust_remote_code=True)
tokenizer = SentinelBrainTokenizer()

input_ids = tokenizer("The meaning of life is", return_tensors="pt")["input_ids"]
output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0]))
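
The call above uses greedy decoding. Since the model exposes the standard generate API, the usual sampling arguments can be passed as well; the variant below is an illustrative sketch, not a recommended configuration.

# Illustrative sampled generation (standard transformers generate() kwargs).
output = model.generate(
    input_ids,
    max_new_tokens=50,
    do_sample=True,     # sample instead of greedy decoding
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(output[0]))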

License

Apache 2.0

Benchmarks

Results are from the EleutherAI lm-evaluation-harness (latest release), run on a single NVIDIA GH200 96GB. Full results, configs, and per-sample logs are published in the companion dataset: qubitpage/sentinel-prime-350m-evals.

| Model | Params | Train Tokens | arc_challenge | arc_easy | hellaswag | lambada_openai | openbookqa | piqa | sciq | winogrande | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Sentinel-Prime-350M (ours) | 471M | 3.1B | 0.194 | 0.352 | 0.264 | 0.001 | 0.120 | 0.566 | 0.481 | 0.501 | 0.310 |
| Pythia-410M | 410M | 300B | 0.240 | 0.520 | 0.400 | 0.510 | 0.300 | 0.670 | 0.810 | 0.530 | 0.498 |
| GPT-Neo-125M | 125M | 300B | 0.190 | 0.430 | 0.300 | 0.370 | 0.260 | 0.630 | 0.760 | 0.520 | 0.433 |
| SmolLM-360M | 360M | 600B | 0.340 | 0.660 | 0.520 | 0.460 | 0.370 | 0.720 | 0.910 | 0.570 | 0.569 |
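
The Avg column is assumed to be an unweighted mean over the eight task scores; computed that way it reproduces the reported 0.310 for Sentinel-Prime.

# Unweighted macro-average over the eight task scores (Sentinel-Prime row).
scores = [0.194, 0.352, 0.264, 0.001, 0.120, 0.566, 0.481, 0.501]
print(round(sum(scores) / len(scores), 3))   # 0.310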

Training-Compute Context

| Model | Hardware | Tokens Seen | Compute Multiplier vs ours |
|---|---|---|---|
| Sentinel-Prime-350M | 1x NVIDIA GH200 96GB | 3.1 B | 1x (baseline) |
| Pythia-410M | TPU v4 cluster | 300 B | 97x |
| SmolLM-360M | 64x NVIDIA H100 | 600 B | 194x |

Sentinel-Prime-350M was trained on 3.1B tokens, versus 300B for Pythia-410M (97x more) and 600B for SmolLM-360M (194x more). The current avg of 0.310 vs Pythia's 0.498 and SmolLM's 0.569 therefore reflects an early-checkpoint snapshot at roughly 1% or less of the training budget typical for this size class.

Per Chinchilla (Hoffmann et al., 2022), a 471M-parameter dense model is compute-optimal at about 9.4B tokens (20x its parameter count), so we are at roughly 0.33x the compute-optimal budget. Continued pretraining on the same architecture and data is expected to scale predictably toward the reference band.
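
The multipliers and the compute-optimal figure follow directly from the token counts above:

# Worked arithmetic behind the comparison above.
ours, pythia, smollm = 3.1e9, 300e9, 600e9         # training tokens
print(round(pythia / ours), round(smollm / ours))  # 97 194  (data multipliers)
chinchilla_optimal = 20 * 471e6                    # ~20 tokens per parameter (Hoffmann et al. 2022)
print(chinchilla_optimal / 1e9)                    # 9.42 (billions of tokens)
print(round(ours / chinchilla_optimal, 2))         # 0.33 (fraction of compute-optimal)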

Reference scores are taken from the EleutherAI Pythia paper and the Hugging Face SmolLM model card. All models were evaluated 0-shot under identical prompt formats.

Reproduce locally:

pip install lm_eval
lm_eval --model hf \
  --model_args pretrained=qubitpage/sentinel-prime-350m,trust_remote_code=True,dtype=float32 \
  --tasks arc_challenge,arc_easy,hellaswag,lambada_openai,openbookqa,piqa,sciq,winogrande \
  --device cuda:0 --batch_size auto:4
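
If you prefer driving the harness from Python, recent lm-evaluation-harness releases also expose a simple_evaluate entry point. The call below is a sketch assuming that API and mirrors the CLI invocation above; adjust it to your installed version if needed.

from lm_eval import simple_evaluate

# Python equivalent of the CLI command above (sketch).
results = simple_evaluate(
    model="hf",
    model_args="pretrained=qubitpage/sentinel-prime-350m,trust_remote_code=True,dtype=float32",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "lambada_openai",
           "openbookqa", "piqa", "sciq", "winogrande"],
    device="cuda:0",
    batch_size="auto",
)
print(results["results"])   # per-task metrics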

Support This Project

Sentinel-Prime is being trained on a single GH200 against models that used hundreds of GPUs. If these results interest you and you want to help us close the 97x–194x compute gap, you can back the project here:

Support Sentinel-Prime on Surge

Every contribution funds more GH200 hours and brings the next checkpoint closer to (and past) the Pythia / SmolLM reference band.
