BOREAL-10B-MoE
Balanced Orthogonal Recurrent Expert Attention Layers
The target: a ~10-billion-parameter Mixture-of-Experts hybrid language model with ~2 billion active parameters per token, trained from scratch on 15–20 trillion tokens using Token Superposition Training (TST).
BOREAL-10B-MoE combines the Gated DeltaNet architecture validated by Qwen3.5/3.6 with DeepSeek-V4's hash-based expert routing. The result: a model that punches at Qwen3.5-9B levels with ~2B active parameters, 256K native context, and inference throughput competitive with models 4–5x its active size.
Architecture
| Component | Detail |
|---|---|
| Type | Hybrid MoE: Gated DeltaNet + GQA + sparse experts |
| Total parameters | ~10B |
| Active parameters | ~2B per token |
| Hidden size | 2,560 |
| Layers | 40 (10 full GQA attention + 20 Gated DeltaNet with dense FFN + 10 Gated DeltaNet with MoE FFN) |
| Full attention | GQA: 20 query heads, 4 KV heads, head_dim=256 |
| DeltaNet | Gated linear attention: 8 QK heads, 32 V heads, head_dim=128 |
| Conv kernel | 4 |
| Routed experts | 128 total, 8 active per token |
| Expert FFN | SwiGLU, intermediate=768 per expert |
| Shared expert | 1 always-active expert, intermediate=1,536 |
| Expert routing | Sigmoid scoring + noaux_tc (zero auxiliary loss) |
| Dense FFN | SwiGLU, intermediate=7,680 (non-MoE layers) |
| Norm | RMSNorm, eps=1e-6 |
| Position | RoPE, theta=10M, partial_rotary_factor=0.25 |
| Output gate | Swish-gated attention and DeltaNet outputs |
| Vocab | 151,936 (Qwen3 tokenizer) |
| Context | 262,144 tokens native (256K) |
| MTP | 1 multi-token prediction head |
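For orientation, the table maps onto a configuration sketch like the one below. The field names are illustrative only, not the model's actual config schema.

```python
# Hypothetical configuration sketch for BOREAL-10B-MoE.
# Field names mirror the table above; they are not a real config file format.
from dataclasses import dataclass

@dataclass
class BorealMoEConfig:
    vocab_size: int = 151_936                # Qwen3 tokenizer
    hidden_size: int = 2_560
    num_layers: int = 40
    max_position_embeddings: int = 262_144   # 256K native context
    # Full attention (GQA) layers
    num_attention_heads: int = 20
    num_kv_heads: int = 4
    attn_head_dim: int = 256
    rope_theta: float = 10_000_000.0
    partial_rotary_factor: float = 0.25
    # Gated DeltaNet layers
    deltanet_qk_heads: int = 8
    deltanet_v_heads: int = 32
    deltanet_head_dim: int = 128
    conv_kernel_size: int = 4
    # FFN / MoE
    dense_intermediate: int = 7_680
    num_experts: int = 128
    experts_per_token: int = 8
    expert_intermediate: int = 768
    shared_expert_intermediate: int = 1_536
    rms_norm_eps: float = 1e-6
    num_mtp_heads: int = 1
```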
Layer Layout
Layer 0: Full GQA attention + dense FFN ← first layer always full attention
Layer 1: Gated DeltaNet + dense FFN
Layer 2: Gated DeltaNet + dense FFN
Layer 3: Gated DeltaNet + dense FFN
Layer 4: Full GQA attention + dense FFN ← every 4th layer
Layer 5: Gated DeltaNet + MoE FFN (128 experts)
Layer 6: Gated DeltaNet + dense FFN
Layer 7: Gated DeltaNet + MoE FFN (128 experts)
Layer 8: Full GQA attention + dense FFN
Layer 9: Gated DeltaNet + MoE FFN (128 experts)
...
Layer 36: Full GQA attention + dense FFN
Layer 37: Gated DeltaNet + MoE FFN (128 experts)
Layer 38: Gated DeltaNet + dense FFN
Layer 39: Gated DeltaNet + dense FFN ← last layer dense
MoE layers are interleaved after the first 4 dense warmup layers, creating a gradient-rich environment where expert specialization can emerge naturally alongside the DeltaNet's recurrent state accumulation.
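A minimal sketch of that schedule in code. The full-attention-every-4th-layer rule follows the layout above; the exact set of MoE FFN layers beyond the ones shown is not spelled out here, so it is passed in as an explicit (assumed) argument.

```python
# Sketch of the attention/FFN schedule implied by the layout above.
# The MoE layer set below is an assumption for illustration only.
def layer_kinds(num_layers: int = 40,
                full_attn_every: int = 4,
                moe_layers: set[int] | None = None) -> list[str]:
    """Return a per-layer description: attention block + FFN block."""
    if moe_layers is None:
        moe_layers = set()  # fill in with the final MoE interleaving schedule
    kinds = []
    for i in range(num_layers):
        attn = "full_gqa" if i % full_attn_every == 0 else "gated_deltanet"
        ffn = "moe_128x8" if i in moe_layers else "dense_swiglu"
        kinds.append(f"layer {i:2d}: {attn} + {ffn}")
    return kinds

# Example: MoE on the odd-numbered layers shown in the excerpt (assumption).
print("\n".join(layer_kinds(moe_layers={5, 7, 9, 37})))
```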
Expert Routing: DeepSeek-V4 Style
Sigmoid scoring. Unlike softmax routing, where experts compete for a fixed probability mass and a single expert tends to dominate, sigmoid scoring lets experts activate independently: each expert decides on its own whether it can help with the current token.
noaux_tc. No auxiliary loss for load balancing. Instead, each expert carries a bias term that is adjusted during training, based on observed load, to steer routing toward under-used experts. This avoids the quality degradation that auxiliary load-balancing losses impose on the main language-modeling objective.
Fine-grained experts. 128 experts with small FFN dims (768) rather than fewer large experts. More experts means more specialization paths. At 8 active per token, the model blends 6.25% of expert capacity per forward pass.
Shared expert. One expert is always active with 2x the FFN capacity of routed experts (1,536 dim). This acts as a dense fallback: knowledge that every token needs regardless of routing decisions. Proven effective by both DeepSeek-V3 and Nemotron 3.
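A minimal routing sketch under the assumptions above: sigmoid scores, a per-expert selection bias adjusted from observed load rather than from an auxiliary loss, and gate weights taken from the unbiased scores of the selected experts. This illustrates the technique, not BOREAL's actual implementation; the shared expert sits outside this router and is added for every token.

```python
import torch
import torch.nn as nn

class SigmoidNoAuxRouter(nn.Module):
    """Sketch of sigmoid routing with auxiliary-loss-free load balancing.

    Each expert is scored independently with a sigmoid. A non-learned but
    online-updated bias is added only when choosing the top-k experts, so the
    balancing signal never distorts the gate weights used to mix expert outputs.
    """
    def __init__(self, hidden_size: int = 2560, num_experts: int = 128,
                 top_k: int = 8, bias_update_rate: float = 1e-3):
        super().__init__()
        self.top_k = top_k
        self.score = nn.Linear(hidden_size, num_experts, bias=False)
        # Per-expert bias used only for selection, updated outside backprop.
        self.register_buffer("select_bias", torch.zeros(num_experts))
        self.bias_update_rate = bias_update_rate

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, hidden_size]; router logits kept in FP32.
        scores = torch.sigmoid(self.score(x).float())                 # [T, E]
        topk = torch.topk(scores + self.select_bias, self.top_k, dim=-1).indices
        gate = torch.gather(scores, -1, topk)                         # unbiased weights
        gate = gate / gate.sum(dim=-1, keepdim=True)                  # normalize over the 8 picks

        if self.training:
            # Push the bias up for under-used experts and down for over-used ones.
            load = torch.zeros_like(self.select_bias)
            load.scatter_add_(0, topk.reshape(-1),
                              torch.ones_like(topk.reshape(-1), dtype=load.dtype))
            self.select_bias += self.bias_update_rate * torch.sign(load.mean() - load)
        return topk, gate
```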
Hash routing (planned). DeepSeek-V4-Pro introduces hash-based candidate selection with Sinkhorn balancing (num_hash_layers=3, hc_mult=4, hc_sinkhorn_iters=20). Instead of scoring all 128 experts for every token, a learned hash function narrows candidates before the final top-k selection. This is the planned routing upgrade for BOREAL-10B-MoE, reducing routing overhead from O(E) to O(log E).
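Purely as an illustration of the candidate-narrowing idea (the hash routing itself is still planned), a hypothetical sketch: a learned projection buckets each token, and only that bucket's experts are scored before the usual top-k selection. Every name and shape here is an assumption; the Sinkhorn balancing step is omitted.

```python
import torch

def hash_candidates(x: torch.Tensor, hash_proj: torch.Tensor,
                    num_experts: int = 128, hc_mult: int = 4, top_k: int = 8):
    """Hypothetical sketch of hash-based candidate narrowing.

    A learned projection ('hash_proj') buckets each token, and only the experts
    mapped to that bucket are scored: hc_mult * top_k candidates per token
    instead of all num_experts. Shapes and names are illustrative assumptions.
    """
    num_candidates = hc_mult * top_k                        # 32 candidates per token
    num_buckets = num_experts // num_candidates             # 4 buckets of 32 experts
    bucket = (x @ hash_proj).argmax(dim=-1) % num_buckets   # [T] bucket id per token
    expert_ids = torch.arange(num_experts).reshape(num_buckets, num_candidates)
    return expert_ids[bucket]                               # [T, 32] candidate expert ids
```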
Training
| Parameter | Value |
|---|---|
| Data tokens | 15–20 trillion |
| Corpus | Web text (50%), code (20%), STEM/academic (15%), multilingual (15%) |
| TST | Token Superposition Training, s=4, r=0.5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Peak LR | 4.5e-4 (MoE requires higher LR than dense) |
| Schedule | Cosine decay to 10% of peak |
| Weight decay | 0.1 |
| Batch size | ~4M tokens/step |
| Precision | BF16 weights, FP32 DeltaNet states, FP32 router logits |
| Hardware | 256–512 H100/H200 GPUs (target) |
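The schedule row corresponds to a cosine decay of this shape. This is a sketch: the warmup length is an assumption, since the table only specifies the peak LR and the decay-to-10% target.

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 4.5e-4,
          warmup_steps: int = 2_000, min_ratio: float = 0.10) -> float:
    """Cosine decay from peak_lr to 10% of peak, with a linear warmup.

    warmup_steps is an assumed value, not specified in the table above.
    """
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)
```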
Training Phases
Phase 1 (TST, 7.5T tokens): Superposition mode, s=4 bags
Multi-hot CE on all DeltaNet and full-attn layers
MoE layers active with standard routing
Phase 2 (Recovery, 7.5T tokens): Standard autoregressive NTP
Model recovers token-level precision
Expert specialization deepens
Phase 3 (Context Extension): Progressive 32K → 128K → 256K (~500B tokens)
YaRN RoPE scaling
Midtraining on long-document data
Phase 4 (Annealing, ~500B): High-quality data upsample
Decaying LR to 5e-6
Final quality polish
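One plausible reading of the Phase 1 "multi-hot CE" objective, as a sketch: each position carries a bag of s=4 superposed target tokens, and the loss averages the cross-entropy over the bag. The exact TST loss is defined in the cited paper and may differ.

```python
import torch
import torch.nn.functional as F

def multi_hot_ce(logits: torch.Tensor, bag_targets: torch.Tensor) -> torch.Tensor:
    """Illustrative multi-hot cross-entropy for superposition training.

    logits:      [num_positions, vocab_size] model predictions
    bag_targets: [num_positions, s] the s=4 token ids superposed at each position
    The model is rewarded for spreading probability over all s superposed tokens.
    This is a sketch only, not the exact objective from the TST paper.
    """
    log_probs = F.log_softmax(logits.float(), dim=-1)   # [P, V]
    picked = torch.gather(log_probs, -1, bag_targets)   # [P, s]
    return -picked.mean()
```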
Expected Performance
| Benchmark | Target | Comparison |
|---|---|---|
| MMLU-Pro | 45–50% | Qwen3-8B: ~42% |
| HellaSwag | 78–84% | Qwen3-8B: ~79% |
| ARC-Challenge | 65–72% | Qwen3-8B: ~66% |
| GPQA Diamond | 35–40% | Qwen3-8B: ~34% |
| HumanEval (coding) | 55–65% | Qwen3-8B: ~58% |
| MATH (reasoning) | 35–45% | Qwen3-8B: ~38% |
Target: match or exceed Qwen3-8B on core benchmarks while using 4x fewer active parameters and supporting 256K native context. The MoE architecture extracts more quality per active parameter, and TST extracts more signal per training token.
Inference Efficiency
At 256K context, batch=1:
Qwen3-8B (pure Transformer, 8B active):
KV cache = 2 × 36 layers × 8 KV heads × 128 dim × 256K × 2 bytes ≈ 38.7 GB
BOREAL-10B-MoE (DeltaNet hybrid, ~2B active):
Full attn KV = 2 × 10 layers × 4 KV heads × 256 dim × 256K × 2 bytes ≈ 10.7 GB
DeltaNet states = 30 layers × 2 × 32 V heads × 128² dims × 4 bytes ≈ 126 MB
Total KV cache ≈ 10.9 GB
Result: ~3.6x smaller KV cache at 256K context.
~4x fewer FLOPs per generated token.
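The figures above can be reproduced with a few lines of arithmetic (decimal GB, BF16 cache entries; the DeltaNet state is a fixed-size matrix per layer, independent of sequence length):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of K and V cache for full-attention layers (the leading 2 is K + V)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

ctx = 262_144
qwen3_8b = kv_cache_bytes(36, 8, 128, ctx)       # ~38.7 GB
boreal_attn = kv_cache_bytes(10, 4, 256, ctx)    # ~10.7 GB
boreal_state = 30 * 2 * 32 * 128 * 128 * 4       # ~126 MB, sequence-length independent
total_boreal = boreal_attn + boreal_state
print(f"Qwen3-8B: {qwen3_8b/1e9:.1f} GB, BOREAL: {total_boreal/1e9:.1f} GB, "
      f"ratio: {qwen3_8b/total_boreal:.1f}x")
```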
The BOREAL Family
| Model | Params | Type | Context | Status |
|---|---|---|---|---|
| BOREAL-250M | 250M | Dense DeltaNet | 32K | Architecture validation |
| BOREAL-2B | 2B | Dense DeltaNet | 64K | Community release |
| BOREAL-10B-MoE | ~10B / ~2B active | DeltaNet + MoE | 256K | Target model |
How It Was Built
Architecture Decisions
DeltaNet hybrid: Validated by Qwen3.5/3.6 (May/Nov 2025)
3:1 linear ratio: Qwen3.5 proved this ratio for <10B models
head_dim=256: Qwen3.5 and DeepSeek-V4 both moved to larger head dims
partial_rotary=0.25: RoPE applied to only 25% of each attention head's dims; the remaining 75% is position-free, like the DeltaNet pathway
Swish output gates: Qwen3.6 addition, prevents attention blowup
noaux_tc routing: DeepSeek-V3 proved auxiliary-loss-free load balancing
Fine-grained MoE: 128 small experts > fewer large experts
Shared expert: 1 always-active 2x-capacity expert = dense fallback
TST training: Nous Research (arXiv:2605.06546), 1.5–2.5x speedup
Data Philosophy
BOREAL follows a data-quality-first approach. Pretraining data is the differentiator; everyone uses the same architectures now. Key principles:
- No LLM-generated pretraining data. Only human-authored text in the base corpus. LLM-generated data is reserved for post-training.
- Structural curation. Quality filtering goes beyond perplexity scoring to measure reasoning depth, self-correction density, and information content.
- Curriculum annealing. High-quality data concentrated in the final annealing phase rather than diluted across the full run.
Post-Training Pipeline (Planned)
SFT: 2β5M high-quality instruction pairs
Agentic reasoning, code, math, multilingual
GRPO: Multi-reward RL with GDPO normalization
Format + tool-use + reasoning depth + self-consistency + diversity
Budget: Thinking budget mechanism (Qwen3 innovation)
Dynamic compute allocation per query complexity
License
Apache 2.0.
Author
Developed by DJLougen.
The BOREAL family is pretrained from scratch: no fine-tuning, no distillation, no inherited weights. Born in Toronto. Trained with Canadian stubbornness.

docker model run hf.co/GestaltLabs/BOREAL-10B-MoE