BOREAL

BOREAL-10B-MoE

Balanced Orthogonal Recurrent Expert Attention Layers

The target. A ~10-billion-parameter Mixture-of-Experts hybrid language model with ~2 billion active parameters per token. Trained from scratch on 15–20 trillion tokens using Token Superposition Training.

BOREAL-10B-MoE combines the Gated DeltaNet architecture validated by Qwen3.5/3.6 with DeepSeek-V4's hash-based expert routing. The result: a model that punches at Qwen3.5-9B levels with ~2B active parameters, 256K native context, and inference throughput competitive with models 4–5x its active size.

Architecture

| Component | Detail |
|---|---|
| Type | Hybrid MoE: Gated DeltaNet + GQA + sparse experts |
| Total parameters | ~10B |
| Active parameters | ~2B per token |
| Hidden size | 2,560 |
| Layers | 40 total: 10 full GQA attention + 30 Gated DeltaNet; 10 DeltaNet layers use MoE FFN, the remaining 30 layers use dense FFN |
| Full attention | GQA: 20 query heads, 4 KV heads, head_dim=256 |
| DeltaNet | Gated linear attention: 8 QK heads, 32 V heads, head_dim=128 |
| Conv kernel | 4 |
| Routed experts | 128 total, 8 active per token |
| Expert FFN | SwiGLU, intermediate=768 per expert |
| Shared expert | 1 always-active expert, intermediate=1,536 |
| Expert routing | Sigmoid scoring + noaux_tc (zero auxiliary loss) |
| Dense FFN | SwiGLU, intermediate=7,680 (non-MoE layers) |
| Norm | RMSNorm, eps=1e-6 |
| Position | RoPE, theta=10M, partial_rotary_factor=0.25 |
| Output gate | Swish-gated attention and DeltaNet outputs |
| Vocab | 151,936 (Qwen3 tokenizer) |
| Context | 262,144 tokens native (256K) |
| MTP | 1 multi-token prediction head |
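
For concreteness, the table maps onto a config along these lines (a hypothetical sketch with illustrative field names, loosely following Hugging Face conventions; this is not the model's actual config file):

```python
# Hypothetical config sketch for BOREAL-10B-MoE.
# Field names are illustrative, not official.
boreal_10b_moe_config = {
    "hidden_size": 2560,
    "num_hidden_layers": 40,
    "full_attention_every_n": 4,              # layers 0, 4, 8, ..., 36 use full GQA
    "num_moe_ffn_layers": 10,                 # remaining 30 layers use dense FFN
    "attn_num_query_heads": 20,
    "attn_num_kv_heads": 4,
    "attn_head_dim": 256,
    "deltanet_num_qk_heads": 8,
    "deltanet_num_v_heads": 32,
    "deltanet_head_dim": 128,
    "conv_kernel_size": 4,
    "num_routed_experts": 128,
    "num_experts_per_token": 8,
    "expert_intermediate_size": 768,
    "shared_expert_intermediate_size": 1536,
    "dense_intermediate_size": 7680,
    "rms_norm_eps": 1e-6,
    "rope_theta": 10_000_000,
    "partial_rotary_factor": 0.25,
    "vocab_size": 151936,
    "max_position_embeddings": 262144,
    "num_mtp_heads": 1,
}
```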

Layer Layout

Layer  0:   Full GQA attention + dense FFN         ← first layer always full attention
Layer  1:   Gated DeltaNet + dense FFN
Layer  2:   Gated DeltaNet + dense FFN
Layer  3:   Gated DeltaNet + dense FFN
Layer  4:   Full GQA attention + dense FFN         ← every 4th layer
Layer  5:   Gated DeltaNet + MoE FFN (128 experts)
Layer  6:   Gated DeltaNet + dense FFN
Layer  7:   Gated DeltaNet + MoE FFN (128 experts)
Layer  8:   Full GQA attention + dense FFN
Layer  9:   Gated DeltaNet + MoE FFN (128 experts)
...

Layer 36:   Full GQA attention + dense FFN
Layer 37:   Gated DeltaNet + MoE FFN (128 experts)
Layer 38:   Gated DeltaNet + dense FFN
Layer 39:   Gated DeltaNet + dense FFN             ← last layer dense

MoE layers are interleaved after the first 4 dense warmup layers, creating a gradient-rich environment where expert specialization can emerge naturally alongside the DeltaNet's recurrent state accumulation.

Expert Routing: DeepSeek-V4 Style

Sigmoid scoring. Unlike softmax routing, where experts compete for a fixed probability mass and one expert tends to dominate, sigmoid scoring lets experts activate independently: each expert produces its own score for whether it can help with the current token.

noaux_tc. No auxiliary loss for load balancing. Instead, each expert carries a bias term on its routing score, adjusted during training (outside the gradient, based on observed load) so that traffic naturally balances across experts. This avoids the quality degradation that auxiliary load-balancing losses impose on the main language-modeling objective.
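
A minimal sketch of how these two pieces can fit together, assuming DeepSeek-V3-style mechanics (tensor shapes and the bias-update rule are illustrative, not BOREAL's exact implementation):

```python
import torch

def route_tokens(hidden, router_weight, expert_bias, top_k=8):
    """Sigmoid routing with bias-based balancing (noaux_tc style, assumed).

    hidden:        [num_tokens, hidden_size]
    router_weight: [num_experts, hidden_size]
    expert_bias:   [num_experts], adjusted outside the gradient
    """
    scores = torch.sigmoid(hidden @ router_weight.T)   # independent per-expert scores
    # Bias influences only which experts are picked, not their weights,
    # so the balancing pressure never distorts the output mixture.
    _, expert_idx = torch.topk(scores + expert_bias, top_k, dim=-1)
    gates = torch.gather(scores, -1, expert_idx)
    gates = gates / gates.sum(dim=-1, keepdim=True)    # normalize selected gates
    return expert_idx, gates

@torch.no_grad()
def update_bias(expert_bias, expert_idx, num_experts, gamma=1e-3):
    """Nudge biases toward balance: overloaded experts down, underloaded up."""
    load = torch.bincount(expert_idx.flatten(), minlength=num_experts).float()
    expert_bias -= gamma * torch.sign(load - load.mean())
```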

Fine-grained experts. 128 experts with small FFN dims (768) rather than fewer large experts. More experts means more specialization paths. With 8 of 128 experts active, each token touches 6.25% of routed-expert capacity.

Shared expert. One expert is always active with 2x the FFN capacity of routed experts (1,536 dim). This acts as a dense fallback β€” knowledge that every token needs regardless of routing decisions. Proven effective by both DeepSeek-V3 and Nemotron 3.

Hash routing (planned). DeepSeek-V4-Pro introduces hash-based candidate selection with Sinkhorn balancing (num_hash_layers=3, hc_mult=4, hc_sinkhorn_iters=20). Instead of scoring all 128 experts for every token, a learned hash function narrows candidates before the final top-k selection. This is the planned routing upgrade for BOREAL-10B-MoE, reducing routing overhead from O(E) to O(log E).
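
The Sinkhorn balancing step itself is simple to state. Below is a generic sketch of that normalization over a token-expert affinity matrix (standard Sinkhorn iteration; the learned hash candidate selection is omitted, and this is not DeepSeek-V4's actual routing code):

```python
import torch

def sinkhorn(affinity, n_iters=20, eps=1e-9):
    """Alternately normalize rows (tokens) and columns (experts) so each
    token's routing mass sums to 1 and each expert receives roughly
    equal total mass."""
    probs = torch.exp(affinity - affinity.max())  # stabilized exponentiation
    for _ in range(n_iters):
        probs = probs / (probs.sum(dim=1, keepdim=True) + eps)  # per-token
        probs = probs / (probs.sum(dim=0, keepdim=True) + eps)  # per-expert
    return probs
```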

Training

| Parameter | Value |
|---|---|
| Data tokens | 15–20 trillion |
| Corpus | Web text (50%), code (20%), STEM/academic (15%), multilingual (15%) |
| TST | Token Superposition Training, s=4, r=0.5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Peak LR | 4.5e-4 (MoE requires higher LR than dense) |
| Schedule | Cosine decay to 10% of peak |
| Weight decay | 0.1 |
| Batch size | ~4M tokens/step |
| Precision | BF16 weights, FP32 DeltaNet states, FP32 router logits |
| Hardware | 256–512 H100/H200 GPUs (target) |
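
For reference, the schedule row corresponds to an LR curve like this (a sketch assuming pure cosine decay to the 10% floor; warmup is not specified above, so none is modeled):

```python
import math

def lr_at(step, total_steps, peak_lr=4.5e-4, floor_ratio=0.10):
    """Cosine decay from peak_lr down to floor_ratio * peak_lr."""
    progress = min(step / total_steps, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return peak_lr * (floor_ratio + (1.0 - floor_ratio) * cosine)
```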

Training Phases

Phase 1 (TST, 7.5T tokens):      Superposition mode, s=4 bags
                                 Multi-hot CE on all DeltaNet and full-attn layers
                                 MoE layers active with standard routing

Phase 2 (Recovery, 7.5T tokens): Standard autoregressive NTP
                                 Model recovers token-level precision
                                 Expert specialization deepens

Phase 3 (Context Extension):     Progressive 32K → 128K → 256K (~500B tokens)
                                 YaRN RoPE scaling
                                 Midtraining on long-document data

Phase 4 (Annealing, ~500B):      High-quality data upsample
                                 Decaying LR to 5e-6
                                 Final quality polish
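
A rough sketch of the Phase 1 objective as described above: cross-entropy against a multi-hot target spread over each bag of s=4 tokens. This is one reading of the description, not the TST paper's reference code; `bag_targets` is a hypothetical layout, and the role of r=0.5 is not specified here, so it is not modeled:

```python
import torch
import torch.nn.functional as F

def multi_hot_ce(logits, bag_targets):
    """CE against a uniform multi-hot target over each bag's s tokens.

    logits:      [batch, vocab_size]
    bag_targets: [batch, s] token ids superposed at this position (hypothetical)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Average the log-probs of every token in the bag, i.e. cross-entropy
    # against a target with mass 1/s on each bag member.
    bag_log_probs = torch.gather(log_probs, -1, bag_targets)
    return -bag_log_probs.mean()
```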

Expected Performance

| Benchmark | Target | Comparison |
|---|---|---|
| MMLU-Pro | 45–50% | Qwen3-8B: ~42% |
| HellaSwag | 78–84% | Qwen3-8B: ~79% |
| ARC-Challenge | 65–72% | Qwen3-8B: ~66% |
| GPQA Diamond | 35–40% | Qwen3-8B: ~34% |
| HumanEval (coding) | 55–65% | Qwen3-8B: ~58% |
| MATH (reasoning) | 35–45% | Qwen3-8B: ~38% |

Target: match or exceed Qwen3-8B on core benchmarks while using 4x fewer active parameters and supporting 256K native context. The MoE architecture extracts more quality per active parameter, and TST extracts more signal per training token.

Inference Efficiency

At 256K context, batch=1:

Qwen3-8B (pure Transformer, 8B active):
  KV cache = 2 × 36 layers × 8 KV heads × 128 dim × 256K × 2 bytes ≈ 36 GB

BOREAL-10B-MoE (DeltaNet hybrid, ~2B active):
  Full attn KV    = 2 × 10 layers × 4 KV heads × 256 dim × 256K × 2 bytes ≈ 10 GB
  DeltaNet states = 30 layers × 2 × 32 V heads × 128² dims × 4 bytes ≈ 120 MB (fixed, independent of context length)
  Total cache     ≈ 10 GB

Result: ~3.6x smaller KV cache at 256K context.
        ~4x fewer FLOPs per generated token.
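
These figures are easy to reproduce (assuming 2-byte BF16 KV entries and 4-byte FP32 DeltaNet states, per the precision table above):

```python
CTX, BF16, FP32 = 262_144, 2, 4  # context length, bytes per element

qwen_kv   = 2 * 36 * 8 * 128 * CTX * BF16   # K+V, 36 layers, 8 KV heads, dim 128
boreal_kv = 2 * 10 * 4 * 256 * CTX * BF16   # K+V, 10 full-attn layers, 4 KV heads, dim 256
delta_st  = 30 * 2 * 32 * 128 * 128 * FP32  # 30 DeltaNet layers, 32 V heads, 128x128 states

print(f"Qwen3-8B KV cache: {qwen_kv / 2**30:.1f} GB")   # -> 36.0
print(f"BOREAL KV cache:   {boreal_kv / 2**30:.1f} GB")  # -> 10.0
print(f"DeltaNet states:   {delta_st / 2**20:.0f} MB")   # -> 120, constant as context grows
```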

The BOREAL Family

| Model | Params | Type | Context | Status |
|---|---|---|---|---|
| BOREAL-250M | 250M | Dense DeltaNet | 32K | Architecture validation |
| BOREAL-2B | 2B | Dense DeltaNet | 64K | Community release |
| BOREAL-10B-MoE | ~10B / ~2B active | DeltaNet + MoE | 256K | Target model |

How It Was Built

Architecture Decisions

DeltaNet hybrid:     Validated by Qwen3.5/3.6 (May/Nov 2025)
3:1 linear ratio:    Qwen3.5 proved this ratio for <10B models
head_dim=256:        Qwen3.5 and DeepSeek-V4 both moved to larger head dims
partial_rotary=0.25: RoPE on 25% of each head's dims; the rest are position-free, like the DeltaNet pathway
Swish output gates:  Qwen3.6 addition, prevents attention blowup
noaux_tc routing:    DeepSeek-V3 proved auxiliary-loss-free load balancing
Fine-grained MoE:    128 small experts > fewer large experts
Shared expert:       1 always-active 2x-capacity expert = dense fallback
TST training:        Nous Research (arXiv:2605.06546), 1.5–2.5x speedup
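
To make the partial_rotary line concrete: with head_dim=256 and a factor of 0.25, rotary position encoding applies only to the first 64 dims of each head. A minimal sketch of standard partial-RoPE application (rotate-half convention; not BOREAL's actual code):

```python
import torch

def apply_partial_rope(q, cos, sin, rotary_dim=64):
    """Rotate the first rotary_dim channels of each head; pass the rest through.

    q:        [..., head_dim] query (or key) vectors, head_dim=256 here
    cos, sin: [..., rotary_dim] precomputed RoPE tables for these positions
    """
    q_rot, q_pass = q[..., :rotary_dim], q[..., rotary_dim:]
    half = rotary_dim // 2
    q1, q2 = q_rot[..., :half], q_rot[..., half:]
    rotated = torch.cat((-q2, q1), dim=-1)  # rotate-half convention
    return torch.cat((q_rot * cos + rotated * sin, q_pass), dim=-1)
```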

Data Philosophy

BOREAL follows a data-quality-first approach. Pretraining data is the differentiator; everyone uses the same architectures now. Key principles:

  • No LLM-generated pretraining data. Only human-authored text in the base corpus. LLM-generated data is reserved for post-training.
  • Structural curation. Quality filtering goes beyond perplexity scoring to measure reasoning depth, self-correction density, and information content.
  • Curriculum annealing. High-quality data concentrated in the final annealing phase rather than diluted across the full run.

Post-Training Pipeline (Planned)

SFT:        2–5M high-quality instruction pairs
            Agentic reasoning, code, math, multilingual

GRPO:       Multi-reward RL with GDPO normalization
            Format + tool-use + reasoning depth + self-consistency + diversity

Budget:     Thinking budget mechanism (Qwen3 innovation)
            Dynamic compute allocation per query complexity
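
At its core, the GRPO stage reduces to group-relative advantage normalization; a minimal sketch of that step (the multi-reward mix and GDPO normalization are not specified here, so only the standard group-relative part is shown):

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each sampled group of completions
    against its own mean and standard deviation.

    rewards: [num_prompts, group_size] scalar reward per completion
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```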

License

Apache 2.0.

Author

Developed by DJLougen.

The BOREAL family is pretrained from scratch: no fine-tuning, no distillation, no inherited weights. Born in Toronto. Trained with Canadian stubbornness.

☕ Support on Ko-fi
