---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- moe
- mixture-of-experts
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- deepseek-routing
- hash-routing
- pretraining
- canada
pipeline_tag: text-generation
base_model: DJLougen/BOREAL-10B-MoE
---

# BOREAL-10B-MoE
**B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers

The target model of the BOREAL family: a ~10-billion-parameter Mixture-of-Experts
hybrid language model with ~2 billion active parameters per token. Trained from
scratch on 15–20 trillion tokens using Token Superposition Training.

BOREAL-10B-MoE combines the Gated DeltaNet architecture validated by Qwen3.5/3.6
with DeepSeek-V4's hash-based expert routing. The design goal: a model that performs
at Qwen3.5-9B levels with ~2B active parameters, 256K native context, and
inference throughput competitive with models 4–5x its active size.
## Architecture
| Component | Detail |
|-----------|--------|
| **Type** | Hybrid MoE: Gated DeltaNet + GQA + sparse experts |
| **Total parameters** | ~10B |
| **Active parameters** | ~2B per token |
| **Hidden size** | 2,560 |
| **Layers** | 40 (10 full attention + 20 DeltaNet with dense FFN + 10 DeltaNet with MoE FFN) |
| **Full attention** | GQA: 20 query heads, 4 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 32 V heads, head_dim=128 |
| **Conv kernel** | 4 |
| **Routed experts** | 128 total, 8 active per token |
| **Expert FFN** | SwiGLU, intermediate=768 per expert |
| **Shared expert** | 1 always-active expert, intermediate=1,536 |
| **Expert routing** | Sigmoid scoring + noaux_tc (zero auxiliary loss) |
| **Dense FFN** | SwiGLU, intermediate=7,680 (non-MoE layers) |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated attention and DeltaNet outputs |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 262,144 tokens native (256K) |
| **MTP** | 1 multi-token prediction head |
### Layer Layout
```
Layer 0:  Full GQA attention + dense FFN       ← first layer always full attention
Layer 1:  Gated DeltaNet + dense FFN
Layer 2:  Gated DeltaNet + dense FFN
Layer 3:  Gated DeltaNet + dense FFN
Layer 4:  Full GQA attention + dense FFN       ← every 4th layer
Layer 5:  Gated DeltaNet + MoE FFN (128 experts)
Layer 6:  Gated DeltaNet + dense FFN
Layer 7:  Gated DeltaNet + MoE FFN (128 experts)
Layer 8:  Full GQA attention + dense FFN
Layer 9:  Gated DeltaNet + MoE FFN (128 experts)
...
Layer 36: Full GQA attention + dense FFN
Layer 37: Gated DeltaNet + MoE FFN (128 experts)
Layer 38: Gated DeltaNet + dense FFN
Layer 39: Gated DeltaNet + dense FFN           ← last layer dense
```
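The token-mixer pattern in this listing (full attention at every 4th layer, Gated DeltaNet elsewhere) can be sketched in a few lines. This is an illustrative reconstruction, not the model's configuration code: the names `NUM_LAYERS`, `FULL_ATTENTION_EVERY`, and `mixer_type` are hypothetical, and FFN placement (dense vs. MoE) is left out because the listing elides layers 10 to 35.

```python
from collections import Counter

NUM_LAYERS = 40            # from the architecture table
FULL_ATTENTION_EVERY = 4   # assumption read off the listing: layers 0, 4, 8, ..., 36


def mixer_type(layer_idx: int) -> str:
    """Return the token mixer used at a given layer index."""
    if layer_idx % FULL_ATTENTION_EVERY == 0:
        return "full_attention"
    return "gated_deltanet"


layout = [mixer_type(i) for i in range(NUM_LAYERS)]
counts = Counter(layout)
print(counts["full_attention"], counts["gated_deltanet"])
```

Under these assumptions the pattern yields 10 full-attention layers and 30 Gated DeltaNet layers, which is the 3:1 linear-to-full ratio.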
MoE FFNs appear only after the initial dense layers (layers 0–4 in the listing
above) and are interleaved with dense FFNs from there on, so that expert
specialization can develop alongside the DeltaNet's recurrent state accumulation.
### Expert Routing: DeepSeek-V4 Style
**Sigmoid scoring.** Softmax routing makes expert scores compete, because they
must sum to 1 across all experts. Sigmoid scoring instead gives each expert an
independent score in (0, 1), so several experts can score highly for the same
token. The top 8 experts by score are then selected.

**noaux_tc.** No auxiliary loss for load balancing. Instead, each expert has a
bias term that is added to its score for top-k selection and adjusted during
training (raised for underloaded experts, lowered for overloaded ones) to balance
the load across experts. This avoids the quality degradation that auxiliary
load-balancing losses can impose on the main language modeling objective.
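The two mechanisms above can be illustrated with a small pure-Python sketch. It follows the auxiliary-loss-free scheme described for DeepSeek-V3 (bias used for selection only, gate weights normalized over the selected experts); whether BOREAL matches these details is an assumption, and `route`, `update_bias`, and the `step` size are hypothetical names and values chosen for illustration.

```python
import math


def route(logits, bias, k):
    """Select k experts by sigmoid score + bias; weight them by bias-free scores."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # The bias only affects which experts are chosen, not their gate weights.
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


def update_bias(bias, load, step=0.001):
    """Raise the bias of underloaded experts and lower it for overloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n < mean_load:
            new_bias.append(b + step)
        elif n > mean_load:
            new_bias.append(b - step)
        else:
            new_bias.append(b)
    return new_bias


# Toy example with 4 experts and k=2 (the model itself uses 128 experts, k=8).
gates = route([2.0, 1.0, 0.0, -1.0], bias=[0.0, 0.0, 0.0, 0.0], k=2)
biased = route([2.0, 1.0, 0.0, -1.0], bias=[0.0, 0.0, 1.0, 0.0], k=2)
print(sorted(gates), sorted(biased))
```

In the toy example, raising the bias of expert 2 pulls it into the selected set even though its raw score is unchanged, which is how the bias steers load without touching the training loss.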

**Fine-grained experts.** 128 experts with small FFN dims (768) rather than
fewer large experts. More experts means more specialization paths. With 8 of 128
active per token, each forward pass uses 6.25% of the routed experts.

**Shared expert.** One expert is always active with 2x the FFN capacity of a
routed expert (1,536 dim). It acts as a dense fallback for knowledge that
every token needs regardless of routing decisions, an approach used in both
DeepSeek-V3 and Nemotron 3.
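The expert sizes quoted above can be checked with a little arithmetic. The sketch below assumes each SwiGLU FFN consists of three bias-free projection matrices (gate, up, down) and ignores router weights; `swiglu_params` is a hypothetical helper, while the dimensions come from the architecture table.

```python
HIDDEN = 2560            # hidden size
ROUTED_EXPERTS = 128
ACTIVE_EXPERTS = 8


def swiglu_params(hidden: int, intermediate: int) -> int:
    """Parameters in one SwiGLU FFN: gate, up, and down projections (no biases assumed)."""
    return 3 * hidden * intermediate


routed = swiglu_params(HIDDEN, 768)     # one routed expert
shared = swiglu_params(HIDDEN, 1536)    # the always-active shared expert
dense = swiglu_params(HIDDEN, 7680)     # dense FFN in non-MoE layers

active_per_moe_layer = ACTIVE_EXPERTS * routed + shared
total_per_moe_layer = ROUTED_EXPERTS * routed + shared

print(f"routed expert:        {routed:,}")
print(f"active per MoE layer: {active_per_moe_layer:,}")
print(f"dense FFN:            {dense:,}")
print(f"total per MoE layer:  {total_per_moe_layer:,}")
```

Under these assumptions, the 8 routed experts plus the shared expert activate exactly as many FFN parameters as one dense FFN (8 × 768 + 1,536 = 7,680), so per-token FFN compute is the same in MoE and dense layers while an MoE layer stores about 13x more FFN parameters.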

**Hash routing (planned).** DeepSeek-V4-Pro introduces hash-based candidate
selection with Sinkhorn balancing (num_hash_layers=3, hc_mult=4,
hc_sinkhorn_iters=20). Instead of scoring all 128 experts for every token, a
learned hash function narrows candidates before the final top-k selection. This
is the planned routing upgrade for BOREAL-10B-MoE, reducing routing overhead
from O(E) to O(log E).
## Training
| Parameter | Value |
|-----------|-------|
| **Data tokens** | 15–20 trillion |
| **Corpus** | Web text (50%), code (20%), STEM/academic (15%), multilingual (15%) |
| **TST** | Token Superposition Training, s=4, r=0.5 |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 4.5e-4 (MoE requires higher LR than dense) |
| **Schedule** | Cosine decay to 10% of peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states, FP32 router logits |
| **Hardware** | 256–512 H100/H200 GPUs (target) |
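As a rough illustration of the schedule row, the sketch below implements a cosine decay from the peak learning rate to 10% of peak. The table does not mention warmup, so none is modeled here; the function name `cosine_lr` and the step counts in the example are hypothetical.

```python
import math

PEAK_LR = 4.5e-4          # from the table
MIN_LR = 0.10 * PEAK_LR   # "cosine decay to 10% of peak"


def cosine_lr(step: int, total_steps: int) -> float:
    """Learning rate at `step` for a cosine decay from PEAK_LR to MIN_LR (no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


# Example: sample the schedule at the start, midpoint, and end of a 1,000-step run.
for s in (0, 500, 1000):
    print(s, f"{cosine_lr(s, 1000):.3e}")
```

The schedule starts at 4.5e-4, passes through the average of peak and floor at the midpoint, and ends at 4.5e-5.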
### Training Phases
```
Phase 1 (TST, 7.5T tokens):       Superposition mode, s=4 bags
                                  Multi-hot CE on all DeltaNet and full-attn layers
                                  MoE layers active with standard routing
Phase 2 (Recovery, 7.5T tokens):  Standard autoregressive NTP
                                  Model recovers token-level precision
                                  Expert specialization deepens
Phase 3 (Context Extension):      Progressive 32K → 128K → 256K (~500B tokens)
                                  YaRN RoPE scaling
                                  Midtraining on long-document data
Phase 4 (Annealing, ~500B tokens): High-quality data upsample
                                  Decaying LR to 5e-6
                                  Final quality polish
```
## Expected Performance
| Benchmark | Target | Comparison |
|-----------|--------|-------------|
| MMLU-Pro | 45–50% | Qwen3-8B: ~42% |
| HellaSwag | 78–84% | Qwen3-8B: ~79% |
| ARC-Challenge | 65–72% | Qwen3-8B: ~66% |
| GPQA Diamond | 35–40% | Qwen3-8B: ~34% |
| HumanEval (coding) | 55–65% | Qwen3-8B: ~58% |
| MATH (reasoning) | 35–45% | Qwen3-8B: ~38% |
Target: match or exceed Qwen3-8B on core benchmarks while using 4x fewer
active parameters and supporting 256K native context. The MoE architecture
extracts more quality per active parameter, and TST extracts more signal
per training token.
### Inference Efficiency
```
At 256K context (262,144 tokens), batch=1:

Qwen3-8B (pure Transformer, 8B active):
  KV cache = 2 × 36 layers × 8 KV heads × 128 dim × 256K × 2 bytes = 36 GiB

BOREAL-10B-MoE (DeltaNet hybrid, ~2B active):
  Full attn KV    = 2 × 10 layers × 4 KV heads × 256 dim × 256K × 2 bytes = 10 GiB
  DeltaNet states = 30 layers × 2 × 32 V heads × 128² dims × 4 bytes = 120 MiB
  Total cache ≈ 10.1 GiB

Result: ~3.6x smaller cache at 256K context.
        ~4x fewer FLOPs per generated token.
```
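The memory formulas in this block can be evaluated directly. The sketch below is a plain transcription of those products into Python, reporting binary units (GiB, MiB); the helper names are hypothetical, and the factor of 2 in the DeltaNet state formula is taken from the block as written rather than derived here.

```python
GIB = 2**30
MIB = 2**20
CONTEXT = 262_144  # 256K tokens


def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Keys + values for every full-attention layer (BF16 by default)."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


def deltanet_state_bytes(layers, v_heads, head_dim, bytes_per_value=4):
    """Fixed-size recurrent state per DeltaNet layer (FP32), independent of context length."""
    return layers * 2 * v_heads * head_dim * head_dim * bytes_per_value


qwen_kv = kv_cache_bytes(layers=36, kv_heads=8, head_dim=128, tokens=CONTEXT)
boreal_kv = kv_cache_bytes(layers=10, kv_heads=4, head_dim=256, tokens=CONTEXT)
boreal_state = deltanet_state_bytes(layers=30, v_heads=32, head_dim=128)
ratio = qwen_kv / (boreal_kv + boreal_state)

print(f"Qwen3-8B KV cache:  {qwen_kv / GIB:.1f} GiB")
print(f"BOREAL full-attn KV: {boreal_kv / GIB:.1f} GiB")
print(f"BOREAL DeltaNet:     {boreal_state / MIB:.0f} MiB")
print(f"Reduction:           {ratio:.2f}x")
```

Because the DeltaNet state does not grow with sequence length, the reduction factor approaches 3.6x (the ratio of the two KV terms) as context grows.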
## The BOREAL Family
| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **[BOREAL-250M](https://huggingface.co/DJLougen/BOREAL-250M)** | 250M | Dense DeltaNet | 32K | Architecture validation |
| **[BOREAL-2B](https://huggingface.co/DJLougen/BOREAL-2B)** | 2B | Dense DeltaNet | 64K | Community release |
| **BOREAL-10B-MoE** | ~10B / ~2B active | DeltaNet + MoE | 256K | Target model |
## How It Was Built
### Architecture Decisions
```
DeltaNet hybrid:       Validated by Qwen3.5/3.6 (May/Nov 2025)
3:1 linear ratio:      Qwen3.5 proved this ratio for <10B models
head_dim=256:          Qwen3.5 and DeepSeek-V4 both moved to larger head dims
partial_rotary=0.25:   75% of each head is position-free, DeltaNet pathway
Swish output gates:    Qwen3.6 addition, prevents attention blowup
noaux_tc routing:      DeepSeek-V3 proved auxiliary-loss-free load balancing
Fine-grained MoE:      128 small experts > fewer large experts
Shared expert:         1 always-active 2x-capacity expert = dense fallback
TST training:          Nous Research (arXiv:2605.06546), 1.5–2.5x speedup
```
### Data Philosophy
BOREAL follows a data-quality-first approach. Pretraining data is the
differentiator, since everyone uses the same architectures now. Key principles:
- **No LLM-generated pretraining data.** Only human-authored text in the
base corpus. LLM-generated data is reserved for post-training.
- **Structural curation.** Quality filtering goes beyond perplexity scoring
to measure reasoning depth, self-correction density, and information content.
- **Curriculum annealing.** High-quality data concentrated in the final
annealing phase rather than diluted across the full run.
## Post-Training Pipeline (Planned)
```
SFT:     2–5M high-quality instruction pairs
         Agentic reasoning, code, math, multilingual
GRPO:    Multi-reward RL with GDPO normalization
         Format + tool-use + reasoning depth + self-consistency + diversity
Budget:  Thinking budget mechanism (Qwen3 innovation)
         Dynamic compute allocation per query complexity
```
## License
Apache 2.0.
## Author
Developed by [DJLougen](https://huggingface.co/DJLougen).

The BOREAL family is pretrained from scratch: no fine-tuning, no distillation,
no inherited weights. Born in Toronto. Trained with Canadian stubbornness.

[☕ Support on Ko-fi](https://ko-fi.com/djlougen)