| ---
|
| language:
|
| - en
|
| license: apache-2.0
|
| library_name: transformers
|
| tags:
|
| - boreal
|
| - deltanet
|
| - hybrid
|
| - moe
|
| - mixture-of-experts
|
| - linear-attention
|
| - swiglu
|
| - rmsnorm
|
| - rope
|
| - gqa
|
| - deepseek-routing
|
| - hash-routing
|
| - pretraining
|
| - canada
|
| pipeline_tag: text-generation
|
| base_model: DJLougen/BOREAL-10B-MoE
|
| ---
|
|
|
|
|
|
| # BOREAL-10B-MoE
|
|
|
| **B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers
|
|
|
The target model of the family: a ~10-billion-parameter Mixture-of-Experts hybrid language model with ~2 billion active parameters per token, trained from scratch on 15–20 trillion tokens using Token Superposition Training.
|
|
|
BOREAL-10B-MoE combines the Gated DeltaNet architecture validated by Qwen3.5/3.6 with DeepSeek-V4's hash-based expert routing. The result: a model that punches at Qwen3.5-9B level with ~2B active parameters, 256K native context, and inference throughput competitive with models 4–5x its active size.
|
|
|
| ## Architecture
|
|
|
| Component | Detail |
|-----------|--------|
| **Type** | Hybrid MoE: Gated DeltaNet + GQA + sparse experts |
| **Total parameters** | ~10B |
| **Active parameters** | ~2B per token |
| **Hidden size** | 2,560 |
| **Layers** | 40 (10 full attention + 30 DeltaNet; 10 layers use MoE FFNs) |
| **Full attention** | GQA: 20 query heads, 4 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 32 V heads, head_dim=128 |
| **Conv kernel** | 4 |
| **Routed experts** | 128 total, 8 active per token |
| **Expert FFN** | SwiGLU, intermediate=768 per expert |
| **Shared expert** | 1 always-active expert, intermediate=1,536 |
| **Expert routing** | Sigmoid scoring + noaux_tc (zero auxiliary loss) |
| **Dense FFN** | SwiGLU, intermediate=7,680 (non-MoE layers) |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated attention and DeltaNet outputs |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 262,144 tokens native (256K) |
| **MTP** | 1 multi-token prediction head |
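
For reference, the table collapses to a small config sketch. Key names below are illustrative only and do not match the actual `transformers` config schema:

```python
# Hypothetical config dict mirroring the table above; key names are
# illustrative, not BOREAL's actual configuration class.
boreal_10b_moe = {
    "hidden_size": 2560,
    "num_layers": 40,
    "full_attention": {"q_heads": 20, "kv_heads": 4, "head_dim": 256},
    "deltanet": {"qk_heads": 8, "v_heads": 32, "head_dim": 128, "conv_kernel": 4},
    "moe": {
        "num_experts": 128,
        "experts_per_token": 8,
        "expert_intermediate": 768,
        "shared_expert_intermediate": 1536,
        "routing": "sigmoid + noaux_tc",
    },
    "dense_intermediate": 7680,
    "rms_norm_eps": 1e-6,
    "rope_theta": 10_000_000,
    "partial_rotary_factor": 0.25,
    "vocab_size": 151_936,
    "max_position_embeddings": 262_144,
    "num_mtp_heads": 1,
}
```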
|
|
|
| ### Layer Layout
|
|
|
| ```
|
| Layer 0: Full GQA attention + dense FFN β first layer always full attention
|
| Layer 1: Gated DeltaNet + dense FFN
|
| Layer 2: Gated DeltaNet + dense FFN
|
| Layer 3: Gated DeltaNet + dense FFN
|
| Layer 4: Full GQA attention + dense FFN β every 4th layer
|
| Layer 5: Gated DeltaNet + MoE FFN (128 experts)
|
| Layer 6: Gated DeltaNet + dense FFN
|
| Layer 7: Gated DeltaNet + MoE FFN (128 experts)
|
| Layer 8: Full GQA attention + dense FFN
|
| Layer 9: Gated DeltaNet + MoE FFN (128 experts)
|
| ...
|
|
|
| Layer 36: Full GQA attention + dense FFN
|
| Layer 37: Gated DeltaNet + MoE FFN (128 experts)
|
| Layer 38: Gated DeltaNet + dense FFN
|
| Layer 39: Gated DeltaNet + dense FFN β last layer dense
|
| ```
|
|
|
MoE layers are interleaved after the first 4 dense warmup layers, creating a gradient-rich environment where expert specialization can emerge naturally alongside the DeltaNet's recurrent state accumulation.
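
As a quick sanity check, the attention schedule above reduces to a one-line rule (the MoE-FFN placement is only partially shown in the layout, so it is not encoded here):

```python
def attention_type(layer_idx: int) -> str:
    """Attention schedule implied by the layout above: every 4th layer
    (0, 4, ..., 36) is full GQA; the other 30 layers are Gated DeltaNet."""
    return "full_gqa" if layer_idx % 4 == 0 else "gated_deltanet"

# 10 full-attention layers and 30 DeltaNet layers across the 40-layer stack:
assert sum(attention_type(i) == "full_gqa" for i in range(40)) == 10
```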
|
|
|
| ### Expert Routing: DeepSeek-V4 Style
|
|
|
**Sigmoid scoring.** Unlike softmax routing, where expert scores compete for a fixed probability mass and a single expert tends to dominate, sigmoid scoring allows multiple experts to activate independently. Each expert decides on its own whether it can help with the current token.
|
|
**noaux_tc.** No auxiliary loss for load balancing. Instead, each expert has a learnable bias term that adjusts during training to naturally balance the load across experts. This avoids the quality degradation that auxiliary load-balancing losses impose on the main language modeling objective.
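
A minimal routing sketch under DeepSeek-V3-style assumptions: the bias shifts only the top-k *selection*, while gate values come from the bias-free scores, and the bias itself is nudged by a load heuristic rather than backprop. All module and method names here are illustrative, not BOREAL's actual API:

```python
import torch
import torch.nn as nn

class NoAuxRouter(nn.Module):
    """Sketch of sigmoid scoring + noaux_tc balancing (DeepSeek-V3 style)."""

    def __init__(self, hidden_size: int = 2560, n_experts: int = 128, top_k: int = 8):
        super().__init__()
        self.gate = nn.Linear(hidden_size, n_experts, bias=False)
        # Balancing bias: updated by a load heuristic, not trained by backprop.
        self.register_buffer("expert_bias", torch.zeros(n_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor):                # x: [tokens, hidden]
        scores = torch.sigmoid(self.gate(x))           # independent per-expert scores
        # Bias is added only for selection, steering tokens to underused experts.
        topk_idx = torch.topk(scores + self.expert_bias, self.top_k, dim=-1).indices
        weights = torch.gather(scores, -1, topk_idx)   # gate values stay bias-free
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return topk_idx, weights

    @torch.no_grad()
    def update_bias(self, expert_load, target_load, step_size: float = 1e-3):
        # noaux_tc heuristic: raise the bias of underloaded experts and lower
        # it for overloaded ones, balancing load without an auxiliary loss.
        self.expert_bias += step_size * torch.sign(target_load - expert_load)
```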
|
|
|
**Fine-grained experts.** 128 experts with small FFN dims (768) rather than fewer large experts. More experts means more specialization paths. At 8 active per token, the model engages 8/128 = 6.25% of routed-expert capacity per forward pass.
|
|
|
**Shared expert.** One expert is always active with 2x the FFN capacity of routed experts (intermediate=1,536). This acts as a dense fallback: knowledge that every token needs regardless of routing decisions. Proven effective by both DeepSeek-V3 and Nemotron 3.
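
Putting the pieces together, a hypothetical MoE forward pass might look like this; the per-expert loop stands in for a real batched dispatch kernel:

```python
import torch

def moe_forward(x, router, routed_experts, shared_expert):
    """Sketch of the MoE block: the shared expert output is added
    unconditionally, with the 8 routed experts blended on top."""
    topk_idx, weights = router(x)                   # see NoAuxRouter above
    out = shared_expert(x)                          # always-active dense fallback
    for slot in range(topk_idx.shape[-1]):
        idx, w = topk_idx[:, slot], weights[:, slot]
        for e in idx.unique().tolist():             # illustrative, not a real kernel
            mask = idx == e
            out[mask] = out[mask] + w[mask].unsqueeze(-1) * routed_experts[e](x[mask])
    return out
```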
|
|
|
**Hash routing (planned).** DeepSeek-V4-Pro introduces hash-based candidate selection with Sinkhorn balancing (num_hash_layers=3, hc_mult=4, hc_sinkhorn_iters=20). Instead of scoring all 128 experts for every token, a learned hash function narrows candidates before the final top-k selection. This is the planned routing upgrade for BOREAL-10B-MoE, reducing routing overhead from O(E) to O(log E).
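
Since the V4-Pro mechanism is only sketched above, here is one hypothetical shape the candidate-narrowing step could take, using a fixed sign-bit hash in place of the learned hash and omitting Sinkhorn balancing entirely:

```python
import torch

def hash_candidates(x, hash_proj, n_experts=128, n_candidates=32):
    """Hypothetical candidate narrowing: a sign-bit hash picks a bucket, each
    bucket owns a slice of experts, and only those hc_mult * top_k = 32
    candidates get scored instead of all 128."""
    n_bits = hash_proj.shape[-1]                        # e.g. 2 bits -> 4 buckets
    bits = (x @ hash_proj > 0).long()                   # [tokens, n_bits]
    powers = 2 ** torch.arange(n_bits, device=x.device)
    bucket = (bits * powers).sum(dim=-1)                # [tokens] bucket id
    start = bucket * (n_experts // 2 ** n_bits)         # slice start per bucket
    offsets = torch.arange(n_candidates, device=x.device)
    return (start.unsqueeze(-1) + offsets) % n_experts  # [tokens, n_candidates]
```

Top-k routing then runs only over this candidate set, which is where the claimed O(E) → O(log E) reduction would come from.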
|
|
|
| ## Training
|
|
|
| Parameter | Value |
|-----------|-------|
| **Data tokens** | 15–20 trillion |
| **Corpus** | Web text (50%), code (20%), STEM/academic (15%), multilingual (15%) |
| **TST** | Token Superposition Training, s=4, r=0.5 |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 4.5e-4 (MoE requires a higher LR than dense) |
| **Schedule** | Cosine decay to 10% of peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states, FP32 router logits |
| **Hardware** | 256–512 H100/H200 GPUs (target) |
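
In PyTorch terms, the optimizer rows translate to roughly the following. The placeholder model and step count are assumptions, and warmup is omitted since the table does not specify one:

```python
import torch

model = torch.nn.Linear(8, 8)   # stand-in; the real 10B model goes here
total_steps = 4_000_000         # assumed: ~16T tokens / ~4M tokens per step

# AdamW(β₁=0.9, β₂=0.95), weight decay 0.1, cosine decay from the
# 4.5e-4 peak down to 10% of peak.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=4.5e-4, betas=(0.9, 0.95), weight_decay=0.1
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps, eta_min=4.5e-5    # floor = 10% of peak LR
)
```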
|
|
|
| ### Training Phases
|
|
|
| ```
|
| Phase 1 (TST, 7.5T tokens): Superposition mode, s=4 bags
|
| Multi-hot CE on all DeltaNet and full-attn layers
|
| MoE layers active with standard routing
|
|
|
| Phase 2 (Recovery, 7.5T tokens): Standard autoregressive NTP
|
| Model recovers token-level precision
|
| Expert specialization deepens
|
|
|
| Phase 3 (Context Extension): Progressive 32K β 128K β 256K (~500B tokens)
|
| YaRN RoPE scaling
|
| Midtraining on long-document data
|
|
|
| Phase 4 (Annealing, ~500B): High-quality data upsample
|
| Decaying LR to 5e-6
|
| Final quality polish
|
| ```
|
|
|
| ## Expected Performance
|
|
|
| Benchmark | Target | Comparison |
|-----------|--------|------------|
| MMLU-Pro | 45–50% | Qwen3-8B: ~42% |
| HellaSwag | 78–84% | Qwen3-8B: ~79% |
| ARC-Challenge | 65–72% | Qwen3-8B: ~66% |
| GPQA Diamond | 35–40% | Qwen3-8B: ~34% |
| HumanEval (coding) | 55–65% | Qwen3-8B: ~58% |
| MATH (reasoning) | 35–45% | Qwen3-8B: ~38% |
|
|
|
Target: match or exceed Qwen3-8B on core benchmarks while using 4x fewer active parameters and supporting 256K native context. The MoE architecture extracts more quality per active parameter, and TST extracts more signal per training token.
|
|
|
| ### Inference Efficiency
|
|
|
| ```
|
| At 256K context, batch=1:
|
|
|
| Qwen3-8B (pure Transformer, 8B active):
|
| KV cache = 2 Γ 36 layers Γ 8 KV heads Γ 128 dim Γ 256K Γ 2 bytes = 37 GB
|
|
|
| BOREAL-10B-MoE (DeltaNet hybrid, ~2B active):
|
| Full attn KV = 2 Γ 10 layers Γ 4 KV heads Γ 256 dim Γ 256K Γ 2 bytes = 10 GB
|
| DeltaNet states = 30 layers Γ 2 Γ 32 V heads Γ 128Β² dims Γ 4 bytes = 12 MB
|
| Total KV cache β 10 GB
|
|
|
| Result: 3.7x smaller KV cache at 256K context.
|
| ~4x fewer FLOPs per generated token.
|
| ```
|
|
|
| ## The BOREAL Family
|
|
|
| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **[BOREAL-250M](https://huggingface.co/DJLougen/BOREAL-250M)** | 250M | Dense DeltaNet | 32K | Architecture validation |
| **[BOREAL-2B](https://huggingface.co/DJLougen/BOREAL-2B)** | 2B | Dense DeltaNet | 64K | Community release |
| **BOREAL-10B-MoE** | ~10B / ~2B active | DeltaNet + MoE | 256K | Target model |
|
|
|
| ## How It Was Built
|
|
|
| ### Architecture Decisions
|
|
|
| ```
|
| DeltaNet hybrid: Validated by Qwen3.5/3.6 (May/Nov 2025)
|
| 3:1 linear ratio: Qwen3.5 proved this ratio for <10B models
|
| head_dim=256: Qwen3.5 and DeepSeek-V4 both moved to larger head dims
|
| partial_rotary=0.25: 75% of each head is position-free, DeltaNet pathway
|
| Swish output gates: Qwen3.6 addition, prevents attention blowup
|
| noaux_tc routing: DeepSeek-V3 proved auxiliary-loss-free load balancing
|
| Fine-grained MoE: 128 small experts > fewer large experts
|
| Shared expert: 1 always-active 2x-capacity expert = dense fallback
|
| TST training: Nous Research (arXiv:2605.06546), 1.5β2.5x speedup
|
| ```
|
|
|
| ### Data Philosophy
|
|
|
BOREAL follows a data-quality-first approach. Pretraining data is the differentiator; everyone uses the same architectures now. Key principles:

- **No LLM-generated pretraining data.** Only human-authored text in the base corpus. LLM-generated data is reserved for post-training.
- **Structural curation.** Quality filtering goes beyond perplexity scoring to measure reasoning depth, self-correction density, and information content.
- **Curriculum annealing.** High-quality data is concentrated in the final annealing phase rather than diluted across the full run.
|
|
|
| ## Post-Training Pipeline (Planned)
|
|
|
| ```
|
| SFT: 2β5M high-quality instruction pairs
|
| Agentic reasoning, code, math, multilingual
|
|
|
| GRPO: Multi-reward RL with GDPO normalization
|
| Format + tool-use + reasoning depth + self-consistency + diversity
|
|
|
| Budget: Thinking budget mechanism (Qwen3 innovation)
|
| Dynamic compute allocation per query complexity
|
| ```
|
|
|
| ## License
|
|
|
| Apache 2.0.
|
|
|
| ## Author
|
|
|
| Developed by [DJLougen](https://huggingface.co/DJLougen).
|
|
|
The BOREAL family is pretrained from scratch: no fine-tuning, no distillation, no inherited weights. Born in Toronto. Trained with Canadian stubbornness.
|
|
|
[☕ Support on Ko-fi](https://ko-fi.com/djlougen)