---

language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- moe
- mixture-of-experts
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- deepseek-routing
- hash-routing
- pretraining
- canada
pipeline_tag: text-generation
base_model: DJLougen/BOREAL-10B-MoE
---


![BOREAL](https://huggingface.co/GestaltLabs/BOREAL-10B-MoE/resolve/main/Boreal.png)

# BOREAL-10B-MoE

**B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers

BOREAL-10B-MoE is the target model of the BOREAL family: a ~10-billion-parameter
Mixture-of-Experts hybrid language model with ~2 billion active parameters per
token, trained from scratch on a planned 15–20 trillion tokens using Token
Superposition Training (TST).

It combines the Gated DeltaNet hybrid architecture validated by Qwen3.5/3.6
with DeepSeek-style expert routing (hash-based candidate selection is a planned
upgrade). The design goal is Qwen3.5-9B-level quality with ~2B active
parameters, 256K native context, and inference throughput competitive with
models 4–5x its active size.

## Architecture

| Component | Detail |
|-----------|--------|
| **Type** | Hybrid MoE: Gated DeltaNet + GQA + sparse experts |
| **Total parameters** | ~10B |
| **Active parameters** | ~2B per token |
| **Hidden size** | 2,560 |
| **Layers** | 40 (10 full attention + 30 Gated DeltaNet token mixers; 10 of the FFNs are MoE, the rest dense) |
| **Full attention** | GQA: 20 query heads, 4 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 32 V heads, head_dim=128 |
| **Conv kernel** | 4 |
| **Routed experts** | 128 total, 8 active per token |
| **Expert FFN** | SwiGLU, intermediate=768 per expert |
| **Shared expert** | 1 always-active expert, intermediate=1,536 |
| **Expert routing** | Sigmoid scoring + noaux_tc (zero auxiliary loss) |
| **Dense FFN** | SwiGLU, intermediate=7,680 (non-MoE layers) |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated attention and DeltaNet outputs |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 262,144 tokens native (256K) |
| **MTP** | 1 multi-token prediction head |
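
To make the `partial_rotary_factor=0.25` entry concrete: with head_dim=256, only the first 64 dimensions of a full-attention head receive a rotary position encoding, and the remaining 192 pass through unchanged. The sketch below is illustrative only. The function name is hypothetical, and the rotate-half pairing (dimension `i` paired with `i + rot_dim/2`) is an assumption borrowed from common Hugging Face implementations, not something this card specifies.

```python
import math

def apply_partial_rope(vec, position, theta=10_000_000.0, partial_rotary_factor=0.25):
    """Rotate only the leading fraction of one head vector; leave the rest as-is.

    Assumes the rotate-half pairing convention (an assumption, see lead-in).
    """
    head_dim = len(vec)
    rot_dim = int(head_dim * partial_rotary_factor)  # 64 when head_dim is 256
    half = rot_dim // 2
    out = list(vec)
    for i in range(half):
        # Per-pair frequency: theta ** (-2i / rot_dim)
        angle = position * theta ** (-2.0 * i / rot_dim)
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x1 * s + x2 * c
    return out

query_head = [float(i % 7) for i in range(256)]
rotated = apply_partial_rope(query_head, position=5)
print(rotated[64:] == query_head[64:])  # the last 192 dims carry no position signal
```

Keeping most of each head position-free means most of the query/key content is matched without regard to token distance.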



### Layer Layout



```
Layer  0:   Full GQA attention + dense FFN        ← first layer always full attention
Layer  1:   Gated DeltaNet + dense FFN
Layer  2:   Gated DeltaNet + dense FFN
Layer  3:   Gated DeltaNet + dense FFN
Layer  4:   Full GQA attention + dense FFN         ← every 4th layer
Layer  5:   Gated DeltaNet + MoE FFN (128 experts)
Layer  6:   Gated DeltaNet + dense FFN
Layer  7:   Gated DeltaNet + MoE FFN (128 experts)
Layer  8:   Full GQA attention + dense FFN
Layer  9:   Gated DeltaNet + MoE FFN (128 experts)
...

Layer 36:   Full GQA attention + dense FFN
Layer 37:   Gated DeltaNet + MoE FFN (128 experts)
Layer 38:   Gated DeltaNet + dense FFN
Layer 39:   Gated DeltaNet + dense FFN             ← last layer dense
```
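
The token-mixer pattern in the listing above (full attention at layer 0 and every 4th layer after it, Gated DeltaNet everywhere else) can be written as a one-line rule. This is a minimal sketch with hypothetical names. It derives only the mixer type; the dense-vs-MoE FFN assignment is not reproduced because the listing elides the middle layers.

```python
def mixer_types(num_layers: int = 40, full_attn_interval: int = 4) -> list[str]:
    """Return the token-mixer type for each layer index."""
    return [
        "full_attention" if i % full_attn_interval == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]

layout = mixer_types()
print(layout.count("full_attention"), layout.count("gated_deltanet"))  # 10 30
```

The 10:30 split is the 3:1 linear-to-full ratio referred to elsewhere in this card.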



MoE FFNs are interleaved with dense FFNs after the initial dense warmup layers
(the first MoE FFN is at layer 5 in the layout above), so that expert
specialization can emerge alongside the DeltaNet layers' recurrent state
accumulation.



### Expert Routing: DeepSeek-V4 Style



**Sigmoid scoring.** Softmax routing makes experts compete for a fixed
probability mass, so raising one expert's score lowers all the others. Sigmoid
scoring instead gives each expert an independent affinity in (0, 1) for the
current token, and the 8 experts with the highest affinity are selected.



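**noaux_tc.** No auxiliary loss is used for load balancing. Instead, as in
DeepSeek-V3's auxiliary-loss-free scheme, each expert carries a bias term that
is added to its score for top-k selection. During training the bias is nudged
down for overloaded experts and up for underloaded ones, which balances the
load without adding a term to the loss. This avoids the quality degradation
that auxiliary load-balancing losses can impose on the main language modeling
objective.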
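
A minimal single-token sketch of sigmoid scoring with a selection bias follows. It is modeled on the DeepSeek-V3 description of auxiliary-loss-free balancing rather than on BOREAL's actual code. The function names, the `step_size` value, and the choice to normalise the selected gates so they sum to 1 are all assumptions made for illustration.

```python
import math

def select_experts(logits, select_bias, top_k=8):
    """Route one token: sigmoid scores, bias-adjusted top-k, normalised gates."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # independent per expert
    biased = [s + b for s, b in zip(scores, select_bias)]  # bias affects selection only
    chosen = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:top_k]
    total = sum(scores[e] for e in chosen)
    # Gate values use the unbiased scores (normalisation is an assumption).
    return {e: scores[e] / total for e in chosen}

def update_bias(select_bias, expert_load, step_size=1e-3):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(expert_load) / len(expert_load)
    updated = []
    for bias, load in zip(select_bias, expert_load):
        if load > mean_load:
            bias -= step_size
        elif load < mean_load:
            bias += step_size
        updated.append(bias)
    return updated

# Toy example with 4 experts and top_k=2: the bias pulls expert 2 into the selection.
gates = select_experts([2.0, 1.0, 0.0, -1.0], [0.0, 0.0, 1.0, 0.0], top_k=2)
print(sorted(gates))  # [0, 2]
```

Because the bias never enters the gate values, it changes which experts are chosen without changing how their outputs are weighted.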



**Fine-grained experts.** 128 experts with small FFN dims (768) rather than
fewer large experts. More experts means more specialization paths. With 8
active per token, each token uses 6.25% of the routed experts in an MoE layer.



**Shared expert.** One expert is always active with 2x the FFN capacity of
routed experts (1,536 dim). This acts as a dense fallback: knowledge that
every token needs regardless of routing decisions. Proven effective by both
DeepSeek-V3 and Nemotron 3.
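
The expert sizes above can be checked with a short parameter count. This sketch assumes each SwiGLU FFN has three bias-free weight matrices (gate, up, down) and ignores the router weights; the helper name is hypothetical.

```python
HIDDEN = 2560

def swiglu_params(hidden: int, intermediate: int) -> int:
    """Gate, up, and down projections; biases assumed absent."""
    return 3 * hidden * intermediate

routed = swiglu_params(HIDDEN, 768)    # one routed expert
shared = swiglu_params(HIDDEN, 1536)   # the always-active shared expert
dense = swiglu_params(HIDDEN, 7680)    # dense FFN in non-MoE layers

total_per_moe_layer = 128 * routed + shared
active_per_moe_layer = 8 * routed + shared

print(f"{total_per_moe_layer:,}")   # 766,771,200
print(f"{active_per_moe_layer:,}")  # 58,982,400
print(active_per_moe_layer == dense)  # True
```

Under these assumptions, 8 routed experts plus the shared expert have a combined width of 8 × 768 + 1,536 = 7,680, so a token passing through an MoE layer uses the same number of FFN parameters as in a dense layer, while the layer stores about 13x more.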



**Hash routing (planned).** DeepSeek-V4-Pro introduces hash-based candidate
selection with Sinkhorn balancing (num_hash_layers=3, hc_mult=4,
hc_sinkhorn_iters=20). Instead of scoring all 128 experts for every token, a
learned hash function narrows candidates before the final top-k selection. This
is the planned routing upgrade for BOREAL-10B-MoE, reducing routing overhead
from O(E) to O(log E).

## Training

| Parameter | Value |
|-----------|-------|
| **Data tokens** | 15–20 trillion |
| **Corpus** | Web text (50%), code (20%), STEM/academic (15%), multilingual (15%) |
| **TST** | Token Superposition Training, s=4, r=0.5 |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 4.5e-4 (MoE requires higher LR than dense) |
| **Schedule** | Cosine decay to 10% of peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states, FP32 router logits |
| **Hardware** | 256–512 H100/H200 GPUs (target) |
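
The peak LR and schedule rows can be read as the following function. This is a sketch, not the training code: the linear warmup and its `warmup_steps` value are assumptions (the table does not mention warmup), and the function name is hypothetical.

```python
import math

def learning_rate(step, total_steps, peak_lr=4.5e-4, final_ratio=0.1, warmup_steps=2000):
    """Linear warmup (assumed) followed by cosine decay to final_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = peak_lr * final_ratio
    return min_lr + (peak_lr - min_lr) * cosine

# At ~4M tokens per step, 15T tokens is roughly 3.75M optimizer steps.
total = 3_750_000
for step in (0, 2_000, total // 2, total):
    print(step, learning_rate(step, total))
```

The floor of the cosine phase is 10% of peak, i.e. 4.5e-5.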

### Training Phases

```
Phase 1 (TST, 7.5T tokens):       Superposition mode, s=4 bags
                                  Multi-hot CE on all DeltaNet and full-attn layers
                                  MoE layers active with standard routing

Phase 2 (Recovery, 7.5T tokens):  Standard autoregressive NTP
                                  Model recovers token-level precision
                                  Expert specialization deepens

Phase 3 (Context Extension):      Progressive 32K → 128K → 256K (~500B tokens)
                                  YaRN RoPE scaling
                                  Midtraining on long-document data

Phase 4 (Annealing, ~500B):       High-quality data upsample
                                  Decaying LR to 5e-6
                                  Final quality polish
```

## Expected Performance

| Benchmark | Target | Comparison |
|-----------|--------|-------------|
| MMLU-Pro | 45–50% | Qwen3-8B: ~42% |
| HellaSwag | 78–84% | Qwen3-8B: ~79% |
| ARC-Challenge | 65–72% | Qwen3-8B: ~66% |
| GPQA Diamond | 35–40% | Qwen3-8B: ~34% |
| HumanEval (coding) | 55–65% | Qwen3-8B: ~58% |
| MATH (reasoning) | 35–45% | Qwen3-8B: ~38% |

Target: match or exceed Qwen3-8B on core benchmarks while using 4x fewer
active parameters and supporting 256K native context. The design bets that the
MoE architecture extracts more quality per active parameter and that TST
extracts more signal per training token.

### Inference Efficiency

```
At 256K context, batch=1 (GB = 2^30 bytes):

Qwen3-8B (pure Transformer, 8B active):
  KV cache = 2 × 36 layers × 8 KV heads × 128 dim × 256K × 2 bytes ≈ 36 GB

BOREAL-10B-MoE (DeltaNet hybrid, ~2B active):
  Full attn KV    = 2 × 10 layers × 4 KV heads × 256 dim × 256K × 2 bytes ≈ 10 GB
  DeltaNet states = 30 layers × 2 × 32 V heads × 128² dims × 4 bytes ≈ 120 MB
  Total cache ≈ 10.1 GB

Result: ~3.6x smaller cache at 256K context.
        ~4x fewer FLOPs per generated token.
```
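
The cache arithmetic above can be reproduced with a few lines. This sketch simply evaluates the products as written in the block (including the factor of 2 in the DeltaNet state term) and reports sizes in units of 2^30 bytes; the helper name is hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per_value=2):
    """Keys plus values for every full-attention layer, BF16 by default."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value

CONTEXT = 262_144  # 256K tokens
GIB = 2**30

qwen3_8b = kv_cache_bytes(layers=36, kv_heads=8, head_dim=128, context=CONTEXT)
boreal_attn = kv_cache_bytes(layers=10, kv_heads=4, head_dim=256, context=CONTEXT)
# Fixed-size recurrent state, independent of context length, stored in FP32.
boreal_state = 30 * 2 * 32 * 128 * 128 * 4
boreal_total = boreal_attn + boreal_state
ratio = qwen3_8b / boreal_total

print(qwen3_8b / GIB)           # 36.0
print(boreal_attn / GIB)        # 10.0
print(boreal_state / 2**20)     # 120.0 (MiB)
print(round(ratio, 2))          # 3.56
```

The DeltaNet state does not grow with context, so the ratio is set almost entirely by the number of full-attention layers and their KV width.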

## The BOREAL Family

| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **[BOREAL-250M](https://huggingface.co/DJLougen/BOREAL-250M)** | 250M | Dense DeltaNet | 32K | Architecture validation |
| **[BOREAL-2B](https://huggingface.co/DJLougen/BOREAL-2B)** | 2B | Dense DeltaNet | 64K | Community release |
| **BOREAL-10B-MoE** | ~10B / ~2B active | DeltaNet + MoE | 256K | Target model |

## How It Was Built

### Architecture Decisions

```
DeltaNet hybrid:     Validated by Qwen3.5/3.6 (May/Nov 2025)
3:1 linear ratio:    Qwen3.5 proved this ratio for <10B models
head_dim=256:        Qwen3.5 and DeepSeek-V4 both moved to larger head dims
partial_rotary=0.25: 75% of each head is position-free, DeltaNet pathway
Swish output gates:  Qwen3.6 addition, prevents attention blowup
noaux_tc routing:    DeepSeek-V3 proved auxiliary-loss-free load balancing
Fine-grained MoE:    128 small experts > fewer large experts
Shared expert:       1 always-active 2x-capacity expert = dense fallback
TST training:        Nous Research (arXiv:2605.06546), 1.5–2.5x speedup
```

### Data Philosophy

BOREAL follows a data-quality-first approach. Pretraining data is the
differentiator — everyone uses the same architectures now. Key principles:

- **No LLM-generated pretraining data.** Only human-authored text in the
  base corpus. LLM-generated data is reserved for post-training.
- **Structural curation.** Quality filtering goes beyond perplexity scoring
  to measure reasoning depth, self-correction density, and information content.
- **Curriculum annealing.** High-quality data concentrated in the final
  annealing phase rather than diluted across the full run.

## Post-Training Pipeline (Planned)

```
SFT:        2–5M high-quality instruction pairs
            Agentic reasoning, code, math, multilingual

GRPO:       Multi-reward RL with GDPO normalization
            Format + tool-use + reasoning depth + self-consistency + diversity

Budget:     Thinking budget mechanism (Qwen3 innovation)
            Dynamic compute allocation per query complexity
```

## License

Apache 2.0.

## Author

Developed by [DJLougen](https://huggingface.co/DJLougen).

The BOREAL family is pretrained from scratch — no fine-tuning, no distillation,
no inherited weights. Born in Toronto. Trained with Canadian stubbornness.

[☕ Support on Ko-fi](https://ko-fi.com/djlougen)