---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- moe
- mixture-of-experts
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- deepseek-routing
- hash-routing
- pretraining
- canada
pipeline_tag: text-generation
base_model: DJLougen/BOREAL-10B-MoE
---

![BOREAL](https://huggingface.co/GestaltLabs/BOREAL-10B-MoE/resolve/main/Boreal.png)

# BOREAL-10B-MoE

**B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers

The target: a ~10-billion-parameter Mixture-of-Experts hybrid language model with ~2 billion active parameters per token, trained from scratch on 15–20 trillion tokens using Token Superposition Training.

BOREAL-10B-MoE combines the Gated DeltaNet architecture validated by Qwen3.5/3.6 with DeepSeek-V4's hash-based expert routing. The result: a model that punches at Qwen3.5-9B levels with ~2B active parameters, 256K native context, and inference throughput competitive with models 4–5x its active size.

## Architecture

| Component | Detail |
|-----------|--------|
| **Type** | Hybrid MoE — Gated DeltaNet + GQA + sparse experts |
| **Total parameters** | ~10B |
| **Active parameters** | ~2B per token |
| **Hidden size** | 2,560 |
| **Layers** | 40 (10 full attention + 20 DeltaNet w/ dense FFN + 10 DeltaNet w/ MoE FFN) |
| **Full attention** | GQA: 20 query heads, 4 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 32 V heads, head_dim=128 |
| **Conv kernel** | 4 |
| **Routed experts** | 128 total, 8 active per token |
| **Expert FFN** | SwiGLU, intermediate=768 per expert |
| **Shared expert** | 1 always-active expert, intermediate=1,536 |
| **Expert routing** | Sigmoid scoring + noaux_tc (zero auxiliary loss) |
| **Dense FFN** | SwiGLU, intermediate=7,680 (non-MoE layers) |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated attention and DeltaNet outputs |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 262,144 tokens native (256K) |
| **MTP** | 1 multi-token prediction head |

### Layer Layout

```
Layer 0: Full GQA attention + dense FFN ← first layer always full attention
Layer 1: Gated DeltaNet + dense FFN
Layer 2: Gated DeltaNet + dense FFN
Layer 3: Gated DeltaNet + dense FFN
Layer 4: Full GQA attention + dense FFN ← every 4th layer
Layer 5: Gated DeltaNet + MoE FFN (128 experts)
Layer 6: Gated DeltaNet + dense FFN
Layer 7: Gated DeltaNet + MoE FFN (128 experts)
Layer 8: Full GQA attention + dense FFN
Layer 9: Gated DeltaNet + MoE FFN (128 experts)
...
Layer 36: Full GQA attention + dense FFN
Layer 37: Gated DeltaNet + MoE FFN (128 experts)
Layer 38: Gated DeltaNet + dense FFN
Layer 39: Gated DeltaNet + dense FFN ← last layer dense
```

MoE layers are interleaved after the first five dense layers, creating a gradient-rich environment where expert specialization can emerge naturally alongside the DeltaNet's recurrent state accumulation.

### Expert Routing: DeepSeek-V4 Style

**Sigmoid scoring.** Unlike softmax routing (which makes experts compete for a fixed probability mass), sigmoid scoring lets each expert activate independently: every expert decides on its own whether it can help with the current token.

**noaux_tc.** No auxiliary loss for load balancing. Instead, each expert has a learnable bias term that adjusts during training to naturally balance the load across experts. This avoids the quality degradation that auxiliary load-balancing losses impose on the main language modeling objective.
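Both mechanisms are easy to express in a few lines. Below is a minimal PyTorch sketch, not BOREAL's actual routing code: the class name, shapes, and bias update rate are illustrative, and the balancing bias is applied only when selecting experts, never when weighting their outputs.

```python
import torch
import torch.nn as nn

class SigmoidNoAuxRouter(nn.Module):
    """Illustrative sigmoid-scored, bias-balanced (aux-loss-free) top-k router."""

    def __init__(self, hidden_size=2560, n_experts=128, top_k=8, bias_update_rate=1e-3):
        super().__init__()
        self.gate = nn.Linear(hidden_size, n_experts, bias=False)
        # Per-expert selection bias, nudged toward balance outside of backprop.
        self.register_buffer("expert_bias", torch.zeros(n_experts))
        self.top_k = top_k
        self.bias_update_rate = bias_update_rate

    def forward(self, x):                      # x: [num_tokens, hidden_size]
        scores = torch.sigmoid(self.gate(x))   # independent per-expert scores in (0, 1)
        # The bias influences WHICH experts are picked, not how much they contribute.
        _, expert_idx = torch.topk(scores + self.expert_bias, self.top_k, dim=-1)
        weights = torch.gather(scores, -1, expert_idx)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize kept experts

        if self.training:
            with torch.no_grad():
                # Count how often each expert was selected in this batch...
                load = torch.zeros_like(self.expert_bias)
                load.scatter_add_(0, expert_idx.reshape(-1),
                                  torch.ones(expert_idx.numel(),
                                             device=load.device, dtype=load.dtype))
                # ...then push over-loaded biases down and under-loaded biases up.
                self.expert_bias += self.bias_update_rate * torch.sign(load.mean() - load)

        return expert_idx, weights
```

Because the balancing signal lives in a selection bias rather than an extra loss term, nothing competes with the language-modeling gradient. For a batch of 16 tokens, `SigmoidNoAuxRouter()(torch.randn(16, 2560))` returns `[16, 8]` expert indices and `[16, 8]` gate weights; the hash-based candidate narrowing described below would simply shrink the set of experts scored before this top-k step.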
**Fine-grained experts.** 128 experts with small FFN dims (768) rather than fewer large experts. More experts means more specialization paths. At 8 active per token, the model blends 6.25% of expert capacity per forward pass.

**Shared expert.** One expert is always active with 2x the FFN capacity of routed experts (1,536 dim). This acts as a dense fallback — knowledge that every token needs regardless of routing decisions. Proven effective by both DeepSeek-V3 and Nemotron 3.

**Hash routing (planned).** DeepSeek-V4-Pro introduces hash-based candidate selection with Sinkhorn balancing (num_hash_layers=3, hc_mult=4, hc_sinkhorn_iters=20). Instead of scoring all 128 experts for every token, a learned hash function narrows the candidate set before the final top-k selection. This is the planned routing upgrade for BOREAL-10B-MoE, reducing routing overhead from O(E) to O(log E).

## Training

| Parameter | Value |
|-----------|-------|
| **Data tokens** | 15–20 trillion |
| **Corpus** | Web text (50%), code (20%), STEM/academic (15%), multilingual (15%) |
| **TST** | Token Superposition Training, s=4, r=0.5 |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 4.5e-4 (MoE requires a higher LR than dense) |
| **Schedule** | Cosine decay to 10% of peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states, FP32 router logits |
| **Hardware** | 256–512 H100/H200 GPUs (target) |

### Training Phases

```
Phase 1 (TST, 7.5T tokens): Superposition mode, s=4 bags
  Multi-hot CE on all DeltaNet and full-attn layers
  MoE layers active with standard routing

Phase 2 (Recovery, 7.5T tokens): Standard autoregressive NTP
  Model recovers token-level precision
  Expert specialization deepens

Phase 3 (Context Extension, ~500B tokens): Progressive 32K → 128K → 256K
  YaRN RoPE scaling
  Midtraining on long-document data

Phase 4 (Annealing, ~500B tokens): High-quality data upsample
  Decaying LR to 5e-6
  Final quality polish
```

## Expected Performance

| Benchmark | Target | Comparison |
|-----------|--------|------------|
| MMLU-Pro | 45–50% | Qwen3-8B: ~42% |
| HellaSwag | 78–84% | Qwen3-8B: ~79% |
| ARC-Challenge | 65–72% | Qwen3-8B: ~66% |
| GPQA Diamond | 35–40% | Qwen3-8B: ~34% |
| HumanEval (coding) | 55–65% | Qwen3-8B: ~58% |
| MATH (reasoning) | 35–45% | Qwen3-8B: ~38% |

Target: match or exceed Qwen3-8B on core benchmarks while using 4x fewer active parameters and supporting 256K native context. The MoE architecture extracts more quality per active parameter, and TST extracts more signal per training token.

### Inference Efficiency

```
At 256K context, batch=1:

Qwen3-8B (pure Transformer, 8B active):
  KV cache = 2 × 36 layers × 8 KV heads × 128 dim × 256K × 2 bytes ≈ 37 GB

BOREAL-10B-MoE (DeltaNet hybrid, ~2B active):
  Full attn KV    = 2 × 10 layers × 4 KV heads × 256 dim × 256K × 2 bytes ≈ 10 GB
  DeltaNet states = 30 layers × 2 × 32 V heads × 128² dims × 4 bytes ≈ 126 MB
  Total KV cache  ≈ 10 GB

Result: ~3.7x smaller KV cache at 256K context. ~4x fewer FLOPs per generated token.
```
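A back-of-envelope check of these numbers, in plain Python. This is not part of any codebase; the layer counts, head counts, and the extra factor of 2 on the DeltaNet state simply follow the estimate above, and the exact GB figures shift slightly depending on whether "256K" means 262,144 tokens and whether GB or GiB is intended.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Keys + values (the leading factor of 2) for standard attention layers, 16-bit cache."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

def deltanet_state_bytes(layers, v_heads, head_dim, bytes_per_elem=4, copies=2):
    """Fixed-size recurrent state: a head_dim x head_dim matrix per V head per layer, FP32.
    copies=2 mirrors the factor of 2 used in the estimate above."""
    return layers * copies * v_heads * head_dim * head_dim * bytes_per_elem

SEQ = 262_144  # 256K context

qwen_kv   = kv_cache_bytes(layers=36, kv_heads=8, head_dim=128, seq_len=SEQ)  # ≈ 38.7 GB
boreal_kv = kv_cache_bytes(layers=10, kv_heads=4, head_dim=256, seq_len=SEQ)  # ≈ 10.7 GB
boreal_dn = deltanet_state_bytes(layers=30, v_heads=32, head_dim=128)         # ≈ 126 MB

print(f"Qwen3-8B KV cache:      {qwen_kv / 1e9:.1f} GB")
print(f"BOREAL full-attn KV:    {boreal_kv / 1e9:.1f} GB")
print(f"BOREAL DeltaNet states: {boreal_dn / 1e6:.0f} MB (constant in context length)")
print(f"Cache ratio:            {qwen_kv / (boreal_kv + boreal_dn):.1f}x")
```

With these exact inputs the ratio lands around 3.6x (the 3.7x above comes from the rounded 37 GB / 10 GB figures), and the gap only widens with longer context: the KV cache grows linearly with sequence length while the DeltaNet state does not.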
## The BOREAL Family

| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **[BOREAL-250M](https://huggingface.co/DJLougen/BOREAL-250M)** | 250M | Dense DeltaNet | 32K | Architecture validation |
| **[BOREAL-2B](https://huggingface.co/DJLougen/BOREAL-2B)** | 2B | Dense DeltaNet | 64K | Community release |
| **BOREAL-10B-MoE** | ~10B / ~2B active | DeltaNet + MoE | 256K | Target model |

## How It Was Built

### Architecture Decisions

```
DeltaNet hybrid: Validated by Qwen3.5/3.6 (May/Nov 2025)
3:1 linear ratio: Qwen3.5 proved this ratio for <10B models
head_dim=256: Qwen3.5 and DeepSeek-V4 both moved to larger head dims
partial_rotary=0.25: 75% of each head is position-free, DeltaNet pathway
Swish output gates: Qwen3.6 addition, prevents attention blowup
noaux_tc routing: DeepSeek-V3 proved auxiliary-loss-free load balancing
Fine-grained MoE: 128 small experts > fewer large experts
Shared expert: 1 always-active 2x-capacity expert = dense fallback
TST training: Nous Research (arXiv:2605.06546), 1.5–2.5x speedup
```

### Data Philosophy

BOREAL follows a data-quality-first approach. Pretraining data is the differentiator — everyone uses the same architectures now.

Key principles:

- **No LLM-generated pretraining data.** Only human-authored text in the base corpus. LLM-generated data is reserved for post-training.
- **Structural curation.** Quality filtering goes beyond perplexity scoring to measure reasoning depth, self-correction density, and information content.
- **Curriculum annealing.** High-quality data concentrated in the final annealing phase rather than diluted across the full run.

## Post-Training Pipeline (Planned)

```
SFT: 2–5M high-quality instruction pairs
  Agentic reasoning, code, math, multilingual

GRPO: Multi-reward RL with GDPO normalization
  Format + tool-use + reasoning depth + self-consistency + diversity

Budget: Thinking budget mechanism (Qwen3 innovation)
  Dynamic compute allocation per query complexity
```

## License

Apache 2.0.

## Author

Developed by [DJLougen](https://huggingface.co/DJLougen). The BOREAL family is pretrained from scratch — no fine-tuning, no distillation, no inherited weights.

Born in Toronto. Trained with Canadian stubbornness.

[☕ Support on Ko-fi](https://ko-fi.com/djlougen)