BOREAL

BOREAL-10B-MoE

Balanced Orthogonal Recurrent Expert Attention Layers

The target. A ~10-billion-parameter Mixture-of-Experts hybrid language model with ~2 billion active parameters per token. Trained from scratch on 15–20 trillion tokens using Token Superposition Training.

BOREAL-10B-MoE combines the Gated DeltaNet architecture validated by Qwen3.5/3.6 with DeepSeek-V4's hash-based expert routing. The result: a model that punches at Qwen3.5-9B levels with ~2B active parameters, 256K native context, and inference throughput competitive with models 4–5x its active size.

Architecture

| Component | Detail |
|---|---|
| Type | Hybrid MoE: Gated DeltaNet + GQA + sparse experts |
| Total parameters | ~10B |
| Active parameters | ~2B per token |
| Hidden size | 2,560 |
| Layers | 40 total: 10 full GQA attention + 30 Gated DeltaNet; 10 DeltaNet layers use MoE FFN, the remaining 30 layers use dense FFN |
| Full attention | GQA: 20 query heads, 4 KV heads, head_dim=256 |
| DeltaNet | Gated linear attention: 8 QK heads, 32 V heads, head_dim=128 |
| Conv kernel | 4 |
| Routed experts | 128 total, 8 active per token |
| Expert FFN | SwiGLU, intermediate=768 per expert |
| Shared expert | 1 always-active expert, intermediate=1,536 |
| Expert routing | Sigmoid scoring + noaux_tc (zero auxiliary loss) |
| Dense FFN | SwiGLU, intermediate=7,680 (non-MoE layers) |
| Norm | RMSNorm, eps=1e-6 |
| Position | RoPE, theta=10M, partial_rotary_factor=0.25 |
| Output gate | Swish-gated attention and DeltaNet outputs |
| Vocab | 151,936 (Qwen3 tokenizer) |
| Context | 262,144 tokens native (256K) |
| MTP | 1 multi-token prediction head |
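
For concreteness, the table maps onto a config along these lines (a hypothetical sketch with illustrative field names, loosely following Hugging Face conventions; this is not the model's actual config file):

```python
# Hypothetical config sketch for BOREAL-10B-MoE.
# Field names are illustrative, not official.
boreal_10b_moe_config = {
    "hidden_size": 2560,
    "num_hidden_layers": 40,
    "full_attention_every_n": 4,              # layers 0, 4, 8, ..., 36 use full GQA
    "num_moe_ffn_layers": 10,                 # remaining 30 layers use dense FFN
    "attn_num_query_heads": 20,
    "attn_num_kv_heads": 4,
    "attn_head_dim": 256,
    "deltanet_num_qk_heads": 8,
    "deltanet_num_v_heads": 32,
    "deltanet_head_dim": 128,
    "conv_kernel_size": 4,
    "num_routed_experts": 128,
    "num_experts_per_token": 8,
    "expert_intermediate_size": 768,
    "shared_expert_intermediate_size": 1536,
    "dense_intermediate_size": 7680,
    "rms_norm_eps": 1e-6,
    "rope_theta": 10_000_000,
    "partial_rotary_factor": 0.25,
    "vocab_size": 151936,
    "max_position_embeddings": 262144,
    "num_mtp_heads": 1,
}
```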

Layer Layout

Layer  0:   Full GQA attention + dense FFN         ← first layer always full attention
Layer  1:   Gated DeltaNet + dense FFN
Layer  2:   Gated DeltaNet + dense FFN
Layer  3:   Gated DeltaNet + dense FFN
Layer  4:   Full GQA attention + dense FFN         ← every 4th layer
Layer  5:   Gated DeltaNet + MoE FFN (128 experts)
Layer  6:   Gated DeltaNet + dense FFN
Layer  7:   Gated DeltaNet + MoE FFN (128 experts)
Layer  8:   Full GQA attention + dense FFN
Layer  9:   Gated DeltaNet + MoE FFN (128 experts)
...

Layer 36:   Full GQA attention + dense FFN
Layer 37:   Gated DeltaNet + MoE FFN (128 experts)
Layer 38:   Gated DeltaNet + dense FFN
Layer 39:   Gated DeltaNet + dense FFN             ← last layer dense

MoE layers are interleaved after the first 4 dense warmup layers, creating a gradient-rich environment where expert specialization can emerge naturally alongside the DeltaNet's recurrent state accumulation.

Expert Routing: DeepSeek-V4 Style

Sigmoid scoring. Unlike softmax routing, where experts compete for a fixed probability mass and one expert tends to dominate, sigmoid scoring lets experts activate independently: each expert produces its own score for whether it can help with the current token.

noaux_tc. No auxiliary loss for load balancing. Instead, each expert carries a bias term on its routing score, adjusted during training (outside the gradient, based on observed load) so that traffic naturally balances across experts. This avoids the quality degradation that auxiliary load-balancing losses impose on the main language-modeling objective.
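
A minimal sketch of how these two pieces can fit together, assuming DeepSeek-V3-style mechanics (tensor shapes and the bias-update rule are illustrative, not BOREAL's exact implementation):

```python
import torch

def route_tokens(hidden, router_weight, expert_bias, top_k=8):
    """Sigmoid routing with bias-based balancing (noaux_tc style, assumed).

    hidden:        [num_tokens, hidden_size]
    router_weight: [num_experts, hidden_size]
    expert_bias:   [num_experts], adjusted outside the gradient
    """
    scores = torch.sigmoid(hidden @ router_weight.T)   # independent per-expert scores
    # Bias influences only which experts are picked, not their weights,
    # so the balancing pressure never distorts the output mixture.
    _, expert_idx = torch.topk(scores + expert_bias, top_k, dim=-1)
    gates = torch.gather(scores, -1, expert_idx)
    gates = gates / gates.sum(dim=-1, keepdim=True)    # normalize selected gates
    return expert_idx, gates

@torch.no_grad()
def update_bias(expert_bias, expert_idx, num_experts, gamma=1e-3):
    """Nudge biases toward balance: overloaded experts down, underloaded up."""
    load = torch.bincount(expert_idx.flatten(), minlength=num_experts).float()
    expert_bias -= gamma * torch.sign(load - load.mean())
```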

Fine-grained experts. 128 experts with small FFN dims (768) rather than fewer large experts. More experts means more specialization paths. With 8 of 128 experts active, each token touches 6.25% of routed-expert capacity.

Shared expert. One expert is always active with 2x the FFN capacity of routed experts (1,536 dim). This acts as a dense fallback β€” knowledge that every token needs regardless of routing decisions. Proven effective by both DeepSeek-V3 and Nemotron 3.

Hash routing (planned). DeepSeek-V4-Pro introduces hash-based candidate selection with Sinkhorn balancing (num_hash_layers=3, hc_mult=4, hc_sinkhorn_iters=20). Instead of scoring all 128 experts for every token, a learned hash function narrows candidates before the final top-k selection. This is the planned routing upgrade for BOREAL-10B-MoE, reducing routing overhead from O(E) to O(log E).
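
The Sinkhorn balancing step itself is simple to state. Below is a generic sketch of that normalization over a token-expert affinity matrix (standard Sinkhorn iteration; the learned hash candidate selection is omitted, and this is not DeepSeek-V4's actual routing code):

```python
import torch

def sinkhorn(affinity, n_iters=20, eps=1e-9):
    """Alternately normalize rows (tokens) and columns (experts) so each
    token's routing mass sums to 1 and each expert receives roughly
    equal total mass."""
    probs = torch.exp(affinity - affinity.max())  # stabilized exponentiation
    for _ in range(n_iters):
        probs = probs / (probs.sum(dim=1, keepdim=True) + eps)  # per-token
        probs = probs / (probs.sum(dim=0, keepdim=True) + eps)  # per-expert
    return probs
```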

Training

| Parameter | Value |
|---|---|
| Data tokens | 15–20 trillion |
| Corpus | Web text (50%), code (20%), STEM/academic (15%), multilingual (15%) |
| TST | Token Superposition Training, s=4, r=0.5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Peak LR | 4.5e-4 (MoE requires higher LR than dense) |
| Schedule | Cosine decay to 10% of peak |
| Weight decay | 0.1 |
| Batch size | ~4M tokens/step |
| Precision | BF16 weights, FP32 DeltaNet states, FP32 router logits |
| Hardware | 256–512 H100/H200 GPUs (target) |
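
For reference, the schedule row corresponds to an LR curve like this (a sketch assuming pure cosine decay to the 10% floor; warmup is not specified above, so none is modeled):

```python
import math

def lr_at(step, total_steps, peak_lr=4.5e-4, floor_ratio=0.10):
    """Cosine decay from peak_lr down to floor_ratio * peak_lr."""
    progress = min(step / total_steps, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return peak_lr * (floor_ratio + (1.0 - floor_ratio) * cosine)
```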

Training Phases

Phase 1 (TST, 7.5T tokens):      Superposition mode, s=4 bags
                                 Multi-hot CE on all DeltaNet and full-attn layers
                                 MoE layers active with standard routing

Phase 2 (Recovery, 7.5T tokens): Standard autoregressive NTP
                                 Model recovers token-level precision
                                 Expert specialization deepens

Phase 3 (Context Extension):     Progressive 32K → 128K → 256K (~500B tokens)
                                 YaRN RoPE scaling
                                 Midtraining on long-document data

Phase 4 (Annealing, ~500B):      High-quality data upsample
                                 Decaying LR to 5e-6
                                 Final quality polish
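
A rough sketch of the Phase 1 objective as described above: cross-entropy against a multi-hot target spread over each bag of s=4 tokens. This is one reading of the description, not the TST paper's reference code; `bag_targets` is a hypothetical layout, and the role of r=0.5 is not specified here, so it is not modeled:

```python
import torch
import torch.nn.functional as F

def multi_hot_ce(logits, bag_targets):
    """CE against a uniform multi-hot target over each bag's s tokens.

    logits:      [batch, vocab_size]
    bag_targets: [batch, s] token ids superposed at this position (hypothetical)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Average the log-probs of every token in the bag, i.e. cross-entropy
    # against a target with mass 1/s on each bag member.
    bag_log_probs = torch.gather(log_probs, -1, bag_targets)
    return -bag_log_probs.mean()
```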

Expected Performance

| Benchmark | Target | Comparison |
|---|---|---|
| MMLU-Pro | 45–50% | Qwen3-8B: ~42% |
| HellaSwag | 78–84% | Qwen3-8B: ~79% |
| ARC-Challenge | 65–72% | Qwen3-8B: ~66% |
| GPQA Diamond | 35–40% | Qwen3-8B: ~34% |
| HumanEval (coding) | 55–65% | Qwen3-8B: ~58% |
| MATH (reasoning) | 35–45% | Qwen3-8B: ~38% |

Target: match or exceed Qwen3-8B on core benchmarks while using 4x fewer active parameters and supporting 256K native context. The MoE architecture extracts more quality per active parameter, and TST extracts more signal per training token.

Inference Efficiency

At 256K context, batch=1:

Qwen3-8B (pure Transformer, 8B active):
  KV cache = 2 × 36 layers × 8 KV heads × 128 dim × 256K × 2 bytes ≈ 36 GB

BOREAL-10B-MoE (DeltaNet hybrid, ~2B active):
  Full attn KV    = 2 × 10 layers × 4 KV heads × 256 dim × 256K × 2 bytes ≈ 10 GB
  DeltaNet states = 30 layers × 2 × 32 V heads × 128² dims × 4 bytes ≈ 120 MB (fixed, independent of context length)
  Total cache     ≈ 10 GB

Result: ~3.6x smaller KV cache at 256K context.
        ~4x fewer FLOPs per generated token.
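
These figures are easy to reproduce (assuming 2-byte BF16 KV entries and 4-byte FP32 DeltaNet states, per the precision table above):

```python
CTX, BF16, FP32 = 262_144, 2, 4  # context length, bytes per element

qwen_kv   = 2 * 36 * 8 * 128 * CTX * BF16   # K+V, 36 layers, 8 KV heads, dim 128
boreal_kv = 2 * 10 * 4 * 256 * CTX * BF16   # K+V, 10 full-attn layers, 4 KV heads, dim 256
delta_st  = 30 * 2 * 32 * 128 * 128 * FP32  # 30 DeltaNet layers, 32 V heads, 128x128 states

print(f"Qwen3-8B KV cache: {qwen_kv / 2**30:.1f} GB")   # -> 36.0
print(f"BOREAL KV cache:   {boreal_kv / 2**30:.1f} GB")  # -> 10.0
print(f"DeltaNet states:   {delta_st / 2**20:.0f} MB")   # -> 120, constant as context grows
```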

The BOREAL Family

| Model | Params | Type | Context | Status |
|---|---|---|---|---|
| BOREAL-250M | 250M | Dense DeltaNet | 32K | Architecture validation |
| BOREAL-2B | 2B | Dense DeltaNet | 64K | Community release |
| BOREAL-10B-MoE | ~10B / ~2B active | DeltaNet + MoE | 256K | Target model |

How It Was Built

Architecture Decisions

DeltaNet hybrid:     Validated by Qwen3.5/3.6 (May/Nov 2025)
3:1 linear ratio:    Qwen3.5 proved this ratio for <10B models
head_dim=256:        Qwen3.5 and DeepSeek-V4 both moved to larger head dims
partial_rotary=0.25: RoPE on 25% of each head's dims; the rest are position-free, like the DeltaNet pathway
Swish output gates:  Qwen3.6 addition, prevents attention blowup
noaux_tc routing:    DeepSeek-V3 proved auxiliary-loss-free load balancing
Fine-grained MoE:    128 small experts > fewer large experts
Shared expert:       1 always-active 2x-capacity expert = dense fallback
TST training:        Nous Research (arXiv:2605.06546), 1.5–2.5x speedup
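
To make the partial_rotary line concrete: with head_dim=256 and a factor of 0.25, rotary position encoding applies only to the first 64 dims of each head. A minimal sketch of standard partial-RoPE application (rotate-half convention; not BOREAL's actual code):

```python
import torch

def apply_partial_rope(q, cos, sin, rotary_dim=64):
    """Rotate the first rotary_dim channels of each head; pass the rest through.

    q:        [..., head_dim] query (or key) vectors, head_dim=256 here
    cos, sin: [..., rotary_dim] precomputed RoPE tables for these positions
    """
    q_rot, q_pass = q[..., :rotary_dim], q[..., rotary_dim:]
    half = rotary_dim // 2
    q1, q2 = q_rot[..., :half], q_rot[..., half:]
    rotated = torch.cat((-q2, q1), dim=-1)  # rotate-half convention
    return torch.cat((q_rot * cos + rotated * sin, q_pass), dim=-1)
```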

Data Philosophy

BOREAL follows a data-quality-first approach. Pretraining data is the differentiator; everyone uses the same architectures now. Key principles:

  • No LLM-generated pretraining data. Only human-authored text in the base corpus. LLM-generated data is reserved for post-training.
  • Structural curation. Quality filtering goes beyond perplexity scoring to measure reasoning depth, self-correction density, and information content.
  • Curriculum annealing. High-quality data concentrated in the final annealing phase rather than diluted across the full run.

Post-Training Pipeline (Planned)

SFT:        2–5M high-quality instruction pairs
            Agentic reasoning, code, math, multilingual

GRPO:       Multi-reward RL with GDPO normalization
            Format + tool-use + reasoning depth + self-consistency + diversity

Budget:     Thinking budget mechanism (Qwen3 innovation)
            Dynamic compute allocation per query complexity
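
At its core, the GRPO stage reduces to group-relative advantage normalization; a minimal sketch of that step (the multi-reward mix and GDPO normalization are not specified here, so only the standard group-relative part is shown):

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each sampled group of completions
    against its own mean and standard deviation.

    rewards: [num_prompts, group_size] scalar reward per completion
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```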

License

Apache 2.0.

Author

Developed by DJLougen.

The BOREAL family is pretrained from scratch: no fine-tuning, no distillation, no inherited weights. Born in Toronto. Trained with Canadian stubbornness.

☕ Support on Ko-fi
