Sentinel Prime Nano – Sparse MoE Language Model
Sentinel Prime is a from-scratch sparse Mixture of Experts (MoE) transformer built by QubitPage Research.
Architecture
| Parameter | Value |
|---|---|
| Total Parameters | 322,435,584 |
| Active Parameters | ~161,217,792 per token |
| Hidden Dimension | 768 |
| Layers | 12 |
| Attention Heads | 12 (Q) / 4 (KV) |
| FFN Dimension | 2048 |
| Experts | 4 total, top-2 active |
| Vocab Size | 100,277 (tiktoken cl100k_base) |
| Max Sequence Length | 1024 |
| Position Encoding | RoPE (theta=500000.0) |
| Normalization | RMSNorm |
| FFN Type | SwiGLU |
| Attention | Grouped Query Attention (GQA) |
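For reference, the table above maps onto a configuration object roughly like the following. This is an illustrative sketch only: the field names (`hidden_dim`, `n_kv_heads`, etc.) are assumptions and may not match the actual `SentinelBrainConfig` fields.

```python
# Illustrative only: field names are assumptions, not the actual SentinelBrainConfig API.
from dataclasses import dataclass

@dataclass
class SentinelPrimeNanoConfig:
    vocab_size: int = 100_277      # tiktoken cl100k_base
    hidden_dim: int = 768
    n_layers: int = 12
    n_heads: int = 12              # query heads
    n_kv_heads: int = 4            # shared key/value heads (GQA)
    ffn_dim: int = 2048            # SwiGLU inner dimension
    n_experts: int = 4             # experts per MoE layer
    n_active_experts: int = 2      # top-2 routing
    max_seq_len: int = 1024
    rope_theta: float = 500_000.0
```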
Key Features
- Sparse MoE: Only 2 of 4 experts are active per token (see the routing sketch after this list)
- GQA: Memory-efficient grouped query attention
- SwiGLU: LLaMA/Mistral-style feed-forward
- RoPE: Rotary position embeddings for length generalization
- From Scratch: No pretrained weights, trained from random initialization
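To make the routing concrete, here is a minimal top-2 MoE layer sketch in PyTorch. It is not the repository's implementation: the expert is a generic SwiGLU feed-forward block and the router is a plain linear gate with softmax over the selected logits, which is one common way such layers are built. Dimensions follow the architecture table above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """LLaMA/Mistral-style SwiGLU feed-forward block."""
    def __init__(self, dim: int, ffn_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, ffn_dim, bias=False)
        self.up = nn.Linear(dim, ffn_dim, bias=False)
        self.down = nn.Linear(ffn_dim, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class Top2MoE(nn.Module):
    """Sparse MoE layer: each token is routed to the 2 highest-scoring of 4 experts."""
    def __init__(self, dim: int = 768, ffn_dim: int = 2048,
                 n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLUExpert(dim, ffn_dim) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                          # x: (batch, seq, dim)
        scores = self.router(x)                    # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)          # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

With 4 experts and top-2 routing, each token's feed-forward pass touches only two expert blocks, which is what keeps the active parameter count well below the total.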
Training
- Data: FineWeb-Edu (educational web text)
- Tokens Seen: 698,368
- Best Validation Loss: 10.1536
- Hardware: NVIDIA RTX 3060 12GB
- Framework: PyTorch 2.11.0+cu126
Usage
```python
# Import the custom model and tokenizer classes shipped with the repository
from hf_model import SentinelBrainConfig, SentinelBrainForCausalLM
from hf_tokenizer import SentinelBrainTokenizer

# Load the pretrained checkpoint and instantiate the tokenizer
model = SentinelBrainForCausalLM.from_pretrained("qubitpage/sentinel-prime-nano")
tokenizer = SentinelBrainTokenizer()

# Tokenize a prompt and generate a continuation
input_ids = tokenizer("The meaning of life is", return_tensors="pt")["input_ids"]
output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```
License
Apache 2.0