---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- moe
- mixture-of-experts
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- deepseek-routing
- hash-routing
- pretraining
- canada
pipeline_tag: text-generation
base_model: DJLougen/BOREAL-10B-MoE
---
![BOREAL](https://huggingface.co/GestaltLabs/BOREAL-10B-MoE/resolve/main/Boreal.png)
# BOREAL-10B-MoE
**B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers
The target: a ~10-billion-parameter Mixture-of-Experts hybrid language model
with ~2 billion active parameters per token, trained from scratch on 15–20
trillion tokens using Token Superposition Training (TST).
BOREAL-10B-MoE combines the Gated DeltaNet architecture validated by Qwen3.5/3.6
with DeepSeek-V4's hash-based expert routing. The result: a model that punches
at Qwen3.5-9B levels with ~2B active parameters, 256K native context, and
inference throughput competitive with models 4–5x its active size.
## Architecture
| Component | Detail |
|-----------|--------|
| **Type** | Hybrid MoE: Gated DeltaNet + GQA + sparse experts |
| **Total parameters** | ~10B |
| **Active parameters** | ~2B per token |
| **Hidden size** | 2,560 |
| **Layers** | 40 total: 10 full GQA attention + 30 Gated DeltaNet (10 of the 40 use MoE FFN, 30 use dense FFN) |
| **Full attention** | GQA: 20 query heads, 4 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 32 V heads, head_dim=128 |
| **Conv kernel** | 4 |
| **Routed experts** | 128 total, 8 active per token |
| **Expert FFN** | SwiGLU, intermediate=768 per expert |
| **Shared expert** | 1 always-active expert, intermediate=1,536 |
| **Expert routing** | Sigmoid scoring + noaux_tc (zero auxiliary loss) |
| **Dense FFN** | SwiGLU, intermediate=7,680 (non-MoE layers) |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated attention and DeltaNet outputs |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 262,144 tokens native (256K) |
| **MTP** | 1 multi-token prediction head |
### Layer Layout
```
Layer 0: Full GQA attention + dense FFN ← first layer always full attention
Layer 1: Gated DeltaNet + dense FFN
Layer 2: Gated DeltaNet + dense FFN
Layer 3: Gated DeltaNet + dense FFN
Layer 4: Full GQA attention + dense FFN ← every 4th layer
Layer 5: Gated DeltaNet + MoE FFN (128 experts)
Layer 6: Gated DeltaNet + dense FFN
Layer 7: Gated DeltaNet + MoE FFN (128 experts)
Layer 8: Full GQA attention + dense FFN
Layer 9: Gated DeltaNet + MoE FFN (128 experts)
...
Layer 36: Full GQA attention + dense FFN
Layer 37: Gated DeltaNet + MoE FFN (128 experts)
Layer 38: Gated DeltaNet + dense FFN
Layer 39: Gated DeltaNet + dense FFN ← last layer dense
```
MoE layers are interleaved after an initial run of dense layers (layers 0–4 in
the layout above), creating a gradient-rich environment where expert
specialization can emerge naturally alongside the DeltaNet's recurrent state
accumulation.
### Expert Routing: DeepSeek-V4 Style
**Sigmoid scoring.** Unlike softmax routing, where experts compete for a single
shared probability mass, sigmoid scoring gives each expert an independent
affinity score, so multiple experts can activate without suppressing one another.
Each expert effectively decides on its own whether it can help with the current token.
**noaux_tc.** No auxiliary loss for load balancing. Instead, each expert carries a
bias term on its routing score that is adjusted during training to keep the load
balanced across experts. This avoids the quality degradation that auxiliary
load-balancing losses impose on the main language modeling objective.
**Fine-grained experts.** 128 experts with small FFN dims (768) rather than
fewer large experts. More experts means more specialization paths. At 8 active
per token, the model blends 6.25% of expert capacity per forward pass.
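
A minimal PyTorch sketch of this routing scheme, assuming a plain linear scorer. The bias update below is a generic "boost underused experts" rule standing in for noaux_tc, whose exact update rule is not spelled out in this card:

```python
import torch
import torch.nn as nn

class SigmoidTopKRouter(nn.Module):
    """Sigmoid-scored top-k routing with a bias-based, auxiliary-loss-free
    balancer. A sketch of the idea described above, not the exact noaux_tc rule."""

    def __init__(self, hidden_size=2560, num_experts=128, top_k=8, bias_rate=1e-3):
        super().__init__()
        self.top_k = top_k
        self.bias_rate = bias_rate
        self.scorer = nn.Linear(hidden_size, num_experts, bias=False)
        # Per-expert bias used only to pick experts; updated outside backprop.
        self.register_buffer("load_bias", torch.zeros(num_experts))

    def forward(self, hidden):                               # (tokens, hidden)
        scores = torch.sigmoid(self.scorer(hidden).float())  # router math in FP32
        # The bias shifts which experts get *selected*; the combine weights
        # still come from the unbiased scores.
        _, expert_idx = torch.topk(scores + self.load_bias, self.top_k, dim=-1)
        weights = torch.gather(scores, -1, expert_idx)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        if self.training:
            # Nudge the bias up for underused experts and down for overused
            # ones, based on how often each expert was picked in this batch.
            load = torch.zeros_like(self.load_bias)
            load.scatter_add_(0, expert_idx.reshape(-1),
                              torch.ones_like(expert_idx, dtype=load.dtype).reshape(-1))
            self.load_bias += self.bias_rate * (load.mean() - load)
        return expert_idx, weights                           # both (tokens, top_k)
```
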
**Shared expert.** One expert is always active with 2x the FFN capacity of
routed experts (1,536 dim). This acts as a dense fallback: knowledge that
every token needs regardless of routing decisions. Proven effective by both
DeepSeek-V3 and Nemotron 3.
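
Continuing that sketch, this is roughly how an MoE FFN layer could combine the 128 routed experts with the always-on shared expert. Expert sizes follow the table above; `SigmoidTopKRouter` is the routing sketch from the previous section, and the per-expert loop is written for clarity rather than speed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, hidden_size, intermediate):
        super().__init__()
        self.gate = nn.Linear(hidden_size, intermediate, bias=False)
        self.up = nn.Linear(hidden_size, intermediate, bias=False)
        self.down = nn.Linear(intermediate, hidden_size, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class BorealMoEFFNSketch(nn.Module):
    """128 small routed experts (intermediate=768) plus one always-active
    shared expert with twice the capacity (intermediate=1536)."""

    def __init__(self, hidden_size=2560, num_experts=128, top_k=8):
        super().__init__()
        self.router = SigmoidTopKRouter(hidden_size, num_experts, top_k)
        self.experts = nn.ModuleList(
            [SwiGLU(hidden_size, 768) for _ in range(num_experts)])
        self.shared_expert = SwiGLU(hidden_size, 1536)

    def forward(self, hidden):                          # (tokens, hidden)
        expert_idx, weights = self.router(hidden)       # each (tokens, top_k)
        routed = torch.zeros_like(hidden)
        # Naive per-expert dispatch for readability; real kernels batch this.
        for e in expert_idx.unique().tolist():
            tok, slot = (expert_idx == e).nonzero(as_tuple=True)
            routed[tok] += weights[tok, slot].unsqueeze(-1) * self.experts[e](hidden[tok])
        # The shared expert sees every token and is simply added on top.
        return self.shared_expert(hidden) + routed
```
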
**Hash routing (planned).** DeepSeek-V4-Pro introduces hash-based candidate
selection with Sinkhorn balancing (num_hash_layers=3, hc_mult=4,
hc_sinkhorn_iters=20). Instead of scoring all 128 experts for every token, a
learned hash function narrows candidates before the final top-k selection. This
is the planned routing upgrade for BOREAL-10B-MoE, reducing routing overhead
from O(E) to O(log E).
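
Since that mechanism is only planned and its details are not public, the following is a conceptual sketch of hash-narrowed candidate selection rather than DeepSeek-V4's actual design: a few learned projections each pick a bucket of experts, the union of those buckets becomes the candidate set, and sigmoid top-k runs over candidates only. The bucket assignment, the hard argmax, and the omission of Sinkhorn balancing are all simplifying assumptions.

```python
import torch
import torch.nn as nn

class HashNarrowedRouterSketch(nn.Module):
    """Conceptual sketch only: candidate narrowing before top-k selection.
    num_hash_layers / hc_mult follow the names in the text; everything else
    is an assumption."""

    def __init__(self, hidden_size=2560, num_experts=128, top_k=8,
                 num_hash_layers=3, hc_mult=4):
        super().__init__()
        self.top_k = top_k
        bucket_size = hc_mult * top_k                 # 32 experts per bucket
        num_buckets = num_experts // bucket_size      # 4 buckets of 32
        self.hashes = nn.ModuleList(
            [nn.Linear(hidden_size, num_buckets, bias=False)
             for _ in range(num_hash_layers)])
        # Static expert-to-bucket map: bucket b owns a contiguous slice.
        self.register_buffer(
            "bucket_experts",
            torch.arange(num_experts).reshape(num_buckets, bucket_size))
        self.scorer = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, hidden):                        # (tokens, hidden)
        # For clarity this still scores every expert; the point of the real
        # mechanism is to score only the narrowed candidate set.
        scores = torch.sigmoid(self.scorer(hidden))   # (tokens, num_experts)
        candidates = torch.zeros_like(scores, dtype=torch.bool)
        for hash_proj in self.hashes:
            # Hard bucket choice per hash layer; training this selection would
            # need a straight-through trick or auxiliary objective (omitted).
            buckets = hash_proj(hidden).argmax(dim=-1)                 # (tokens,)
            candidates.scatter_(1, self.bucket_experts[buckets], True)
        scores = scores.masked_fill(~candidates, float("-inf"))
        weights, expert_idx = torch.topk(scores, self.top_k, dim=-1)
        return expert_idx, weights / weights.sum(dim=-1, keepdim=True)
```
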
## Training
| Parameter | Value |
|-----------|-------|
| **Data tokens** | 15–20 trillion |
| **Corpus** | Web text (50%), code (20%), STEM/academic (15%), multilingual (15%) |
| **TST** | Token Superposition Training, s=4, r=0.5 |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 4.5e-4 (MoE requires higher LR than dense) |
| **Schedule** | Cosine decay to 10% of peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states, FP32 router logits |
| **Hardware** | 256–512 H100/H200 GPUs (target) |
### Training Phases
```
Phase 1 (TST, 7.5T tokens): Superposition mode, s=4 bags
    Multi-hot CE on all DeltaNet and full-attn layers
    MoE layers active with standard routing
Phase 2 (Recovery, 7.5T tokens): Standard autoregressive NTP
    Model recovers token-level precision
    Expert specialization deepens
Phase 3 (Context Extension): Progressive 32K → 128K → 256K (~500B tokens)
    YaRN RoPE scaling
    Midtraining on long-document data
Phase 4 (Annealing, ~500B): High-quality data upsample
    Decaying LR to 5e-6
    Final quality polish
```
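
The superposition objective itself, sketched under stated assumptions: the card describes multi-hot cross-entropy over bags of s=4 tokens, so this toy snippet builds bag targets and the matching loss. The exact TST recipe (including the role of r=0.5) lives in the cited paper and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def superposition_targets(token_ids, s=4, vocab_size=151_936):
    """Multi-hot 'bag' targets: for each position, the set of the next s tokens.
    A toy sketch of the idea described above, not the paper's exact recipe."""
    n = token_ids.shape[0]
    targets = torch.zeros(n - s, vocab_size)
    rows = torch.arange(n - s)
    for offset in range(1, s + 1):
        targets[rows, token_ids[offset:offset + n - s]] = 1.0
    # Normalize each row into a distribution over its bag members.
    return targets / targets.sum(dim=-1, keepdim=True)

def multi_hot_ce(logits, targets):
    """Cross-entropy of model logits against the soft multi-hot targets."""
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# A 10-token sequence gives 6 training positions, each predicting a bag of
# the following 4 tokens instead of a single next token.
tokens = torch.randint(0, 151_936, (10,))
bags = superposition_targets(tokens)      # shape (6, 151936)
loss = multi_hot_ce(torch.randn(6, 151_936), bags)
```
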
## Expected Performance
| Benchmark | Target | Comparison |
|-----------|--------|-------------|
| MMLU-Pro | 45–50% | Qwen3-8B: ~42% |
| HellaSwag | 78–84% | Qwen3-8B: ~79% |
| ARC-Challenge | 65–72% | Qwen3-8B: ~66% |
| GPQA Diamond | 35–40% | Qwen3-8B: ~34% |
| HumanEval (coding) | 55–65% | Qwen3-8B: ~58% |
| MATH (reasoning) | 35–45% | Qwen3-8B: ~38% |
Target: match or exceed Qwen3-8B on core benchmarks while using 4x fewer
active parameters and supporting 256K native context. The MoE architecture
extracts more quality per active parameter, and TST extracts more signal
per training token.
### Inference Efficiency
```
At 256K context, batch=1:
Qwen3-8B (pure Transformer, 8B active):
  KV cache = 2 × 36 layers × 8 KV heads × 128 dim × 256K × 2 bytes ≈ 37 GB
BOREAL-10B-MoE (DeltaNet hybrid, ~2B active):
  Full attn KV = 2 × 10 layers × 4 KV heads × 256 dim × 256K × 2 bytes = 10 GB
  DeltaNet states = 30 layers × 2 × 32 V heads × 128² dims × 4 bytes ≈ 126 MB
  Total KV cache ≈ 10 GB
Result: 3.7x smaller KV cache at 256K context.
~4x fewer FLOPs per generated token.
```
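
The arithmetic above, as a small script so it can be re-checked or re-run at other context lengths (GiB vs GB rounding accounts for the small differences from the figures quoted in the block):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per=2):
    # K and V caches: one vector per layer, per KV head, per position.
    return 2 * layers * kv_heads * head_dim * context * bytes_per

ctx = 262_144  # 256K tokens

qwen3_8b   = kv_cache_bytes(layers=36, kv_heads=8, head_dim=128, context=ctx)
boreal_gqa = kv_cache_bytes(layers=10, kv_heads=4, head_dim=256, context=ctx)
# DeltaNet layers keep a fixed-size recurrent state instead of a growing
# KV cache (FP32 states, hence 4 bytes; formula as quoted above).
boreal_delta = 30 * 2 * 32 * 128 * 128 * 4

print(f"Qwen3-8B KV cache:   {qwen3_8b / 2**30:.1f} GiB")     # 36.0 GiB
print(f"BOREAL full-attn KV: {boreal_gqa / 2**30:.1f} GiB")    # 10.0 GiB
print(f"BOREAL DeltaNet:     {boreal_delta / 2**20:.0f} MiB")  # 120 MiB
```

The DeltaNet state stays constant as context grows, while both KV caches scale linearly with it.
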
## The BOREAL Family
| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **[BOREAL-250M](https://huggingface.co/DJLougen/BOREAL-250M)** | 250M | Dense DeltaNet | 32K | Architecture validation |
| **[BOREAL-2B](https://huggingface.co/DJLougen/BOREAL-2B)** | 2B | Dense DeltaNet | 64K | Community release |
| **BOREAL-10B-MoE** | ~10B / ~2B active | DeltaNet + MoE | 256K | Target model |
## How It Was Built
### Architecture Decisions
```
DeltaNet hybrid: Validated by Qwen3.5/3.6 (May/Nov 2025)
3:1 linear ratio: Qwen3.5 proved this ratio for <10B models
head_dim=256: Qwen3.5 and DeepSeek-V4 both moved to larger head dims
partial_rotary=0.25: 75% of each head is position-free, DeltaNet pathway
Swish output gates: Qwen3.6 addition, prevents attention blowup
noaux_tc routing: DeepSeek-V3 proved auxiliary-loss-free load balancing
Fine-grained MoE: 128 small experts > fewer large experts
Shared expert: 1 always-active 2x-capacity expert = dense fallback
TST training: Nous Research (arXiv:2605.06546), 1.5–2.5x speedup
```
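
One of the less common choices in that list is partial_rotary=0.25. The sketch below shows what it means mechanically, using generic rotary-embedding code rather than the model's own implementation: only the first quarter of each attention head's dimensions is rotated, and the rest is left untouched.

```python
import torch

def apply_partial_rope(q, positions, head_dim=256, partial_rotary_factor=0.25,
                       theta=10_000_000.0):
    """Rotate only the first `partial_rotary_factor` fraction of each head.
    q: (..., seq, head_dim); positions: (seq,). Generic sketch, not model code."""
    rot_dim = int(head_dim * partial_rotary_factor)            # 64 of 256 dims
    inv_freq = 1.0 / theta ** (torch.arange(0, rot_dim, 2).float() / rot_dim)
    angles = positions.float()[:, None] * inv_freq[None, :]    # (seq, rot_dim/2)
    cos, sin = angles.cos(), angles.sin()

    q_rot, q_pass = q[..., :rot_dim], q[..., rot_dim:]         # rotary / pass-through
    q1, q2 = q_rot[..., 0::2], q_rot[..., 1::2]                # interleaved pairs
    rotated = torch.stack((q1 * cos - q2 * sin,
                           q1 * sin + q2 * cos), dim=-1).flatten(-2)
    # The remaining dimensions carry no positional signal at all.
    return torch.cat((rotated, q_pass), dim=-1)

q = torch.randn(1, 20, 1024, 256)      # (batch, query heads, seq, head_dim)
q_rope = apply_partial_rope(q, torch.arange(1024))
```

With head_dim=256 and a factor of 0.25, 64 dimensions per head get rotated and 192 stay position-free, which is the "75% of each head is position-free" note above.
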
### Data Philosophy
BOREAL follows a data-quality-first approach. Pretraining data is the
differentiator; everyone uses the same architectures now. Key principles:
- **No LLM-generated pretraining data.** Only human-authored text in the
base corpus. LLM-generated data is reserved for post-training.
- **Structural curation.** Quality filtering goes beyond perplexity scoring
to measure reasoning depth, self-correction density, and information content.
- **Curriculum annealing.** High-quality data concentrated in the final
annealing phase rather than diluted across the full run.
## Post-Training Pipeline (Planned)
```
SFT:    2–5M high-quality instruction pairs
        Agentic reasoning, code, math, multilingual
GRPO:   Multi-reward RL with GDPO normalization
        Format + tool-use + reasoning depth + self-consistency + diversity
Budget: Thinking budget mechanism (Qwen3 innovation)
        Dynamic compute allocation per query complexity
```
## License
Apache 2.0.
## Author
Developed by [DJLougen](https://huggingface.co/DJLougen).
The BOREAL family is pretrained from scratch: no fine-tuning, no distillation,
no inherited weights. Born in Toronto. Trained with Canadian stubbornness.
[☕ Support on Ko-fi](https://ko-fi.com/djlougen)