Sentinel Prime Nano – Sparse MoE Language Model
Sentinel Prime is a from-scratch sparse Mixture of Experts (MoE) transformer built by QubitPage Research.
Architecture
| Parameter | Value |
|---|---|
| Total Parameters | 322,435,584 |
| Active Parameters | ~161,217,792 per token |
| Hidden Dimension | 768 |
| Layers | 12 |
| Attention Heads | 12 (Q) / 4 (KV) |
| FFN Dimension | 2048 |
| Experts | 4 total, top-2 active |
| Vocab Size | 100,277 (tiktoken cl100k_base) |
| Max Sequence Length | 1024 |
| Position Encoding | RoPE (theta=500000.0) |
| Normalization | RMSNorm |
| FFN Type | SwiGLU |
| Attention | Grouped Query Attention (GQA) |
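For reference, the table above maps onto a configuration object roughly like the following. This is an illustrative sketch only: the field names (`hidden_dim`, `n_kv_heads`, etc.) are assumptions and may not match the actual `SentinelBrainConfig` fields.

```python
# Illustrative only: field names are assumptions, not the actual SentinelBrainConfig API.
from dataclasses import dataclass

@dataclass
class SentinelPrimeNanoConfig:
    vocab_size: int = 100_277      # tiktoken cl100k_base
    hidden_dim: int = 768
    n_layers: int = 12
    n_heads: int = 12              # query heads
    n_kv_heads: int = 4            # shared key/value heads (GQA)
    ffn_dim: int = 2048            # SwiGLU inner dimension
    n_experts: int = 4             # experts per MoE layer
    n_active_experts: int = 2      # top-2 routing
    max_seq_len: int = 1024
    rope_theta: float = 500_000.0
```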
Key Features
- Sparse MoE: Only 2 of 4 experts are active per token (see the routing sketch after this list)
- GQA: Memory-efficient grouped query attention
- SwiGLU: LLaMA/Mistral-style feed-forward
- RoPE: Rotary position embeddings for length generalization
- From Scratch: No pretrained weights, trained from random initialization
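To make the routing concrete, here is a minimal top-2 MoE layer sketch in PyTorch. It is not the repository's implementation: the expert is a generic SwiGLU feed-forward block and the router is a plain linear gate with softmax over the selected logits, which is one common way such layers are built. Dimensions follow the architecture table above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """LLaMA/Mistral-style SwiGLU feed-forward block."""
    def __init__(self, dim: int, ffn_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, ffn_dim, bias=False)
        self.up = nn.Linear(dim, ffn_dim, bias=False)
        self.down = nn.Linear(ffn_dim, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class Top2MoE(nn.Module):
    """Sparse MoE layer: each token is routed to the 2 highest-scoring of 4 experts."""
    def __init__(self, dim: int = 768, ffn_dim: int = 2048,
                 n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLUExpert(dim, ffn_dim) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                          # x: (batch, seq, dim)
        scores = self.router(x)                    # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)          # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

With 4 experts and top-2 routing, each token's feed-forward pass touches only two expert blocks, which is what keeps the active parameter count well below the total.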
Training
- Data: FineWeb-Edu (educational web text)
- Tokens Seen: 698,368
- Best Validation Loss: 10.1536
- Hardware: NVIDIA RTX 3060 12GB
- Framework: PyTorch 2.11.0+cu126
Usage
```python
# Import the custom model and tokenizer classes shipped with the repository
from hf_model import SentinelBrainConfig, SentinelBrainForCausalLM
from hf_tokenizer import SentinelBrainTokenizer

# Load the pretrained checkpoint and instantiate the tokenizer
model = SentinelBrainForCausalLM.from_pretrained("qubitpage/sentinel-prime-nano")
tokenizer = SentinelBrainTokenizer()

# Tokenize a prompt and generate a continuation
input_ids = tokenizer("The meaning of life is", return_tensors="pt")["input_ids"]
output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```
License
Apache 2.0