---

language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- moe
- mixture-of-experts
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- deepseek-routing
- hash-routing
- pretraining
- canada
pipeline_tag: text-generation
base_model: DJLougen/BOREAL-10B-MoE
---


![BOREAL](https://huggingface.co/GestaltLabs/BOREAL-10B-MoE/resolve/main/Boreal.png)

# BOREAL-10B-MoE

**B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers

The target. A ~10-billion-parameter Mixture-of-Experts hybrid language model
with ~2 billion active parameters per token. Trained from scratch on 15–20
trillion tokens using Token Superposition Training.

BOREAL-10B-MoE combines the Gated DeltaNet architecture validated by Qwen3.5/3.6
with DeepSeek-style expert routing (sigmoid scoring with auxiliary-loss-free
balancing; DeepSeek-V4's hash-based routing is a planned upgrade). The goal: a
model that performs at Qwen3.5-9B levels with ~2B active parameters, 256K native
context, and inference throughput competitive with models 4–5x its active size.

## Architecture

| Component | Detail |
|-----------|--------|
| **Type** | Hybrid MoE: Gated DeltaNet + GQA + sparse experts |
| **Total parameters** | ~10B |
| **Active parameters** | ~2B per token |
| **Hidden size** | 2,560 |
| **Layers** | 40 (10 full attention + 20 DeltaNet + 10 MoE FFN) |
| **Full attention** | GQA: 20 query heads, 4 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 32 V heads, head_dim=128 |
| **Conv kernel** | 4 |
| **Routed experts** | 128 total, 8 active per token |
| **Expert FFN** | SwiGLU, intermediate=768 per expert |
| **Shared expert** | 1 always-active expert, intermediate=1,536 |
| **Expert routing** | Sigmoid scoring + noaux_tc (zero auxiliary loss) |
| **Dense FFN** | SwiGLU, intermediate=7,680 (non-MoE layers) |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated attention and DeltaNet outputs |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 262,144 tokens native (256K) |
| **MTP** | 1 multi-token prediction head |

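A few derived quantities follow directly from the table. The sketch below collects the table values in a hypothetical `BorealConfig` dataclass (the class and field names are illustrative assumptions, not the released configuration) and computes the attention projection widths, the GQA group size, and the number of rotary dimensions per head.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BorealConfig:
    """Values copied from the architecture table; all names are hypothetical."""
    hidden_size: int = 2560
    num_layers: int = 40
    num_query_heads: int = 20
    num_kv_heads: int = 4
    head_dim: int = 256
    partial_rotary_factor: float = 0.25
    num_routed_experts: int = 128
    experts_per_token: int = 8
    max_context: int = 262_144

    @property
    def q_width(self) -> int:
        # Total width of the query projection across all query heads.
        return self.num_query_heads * self.head_dim

    @property
    def kv_width(self) -> int:
        # Width of the key (or value) projection across all KV heads.
        return self.num_kv_heads * self.head_dim

    @property
    def queries_per_kv_head(self) -> int:
        # GQA group size: how many query heads share one KV head.
        return self.num_query_heads // self.num_kv_heads

    @property
    def rotary_dim(self) -> int:
        # Number of dimensions per head that receive RoPE.
        return int(self.head_dim * self.partial_rotary_factor)


cfg = BorealConfig()
print(cfg.q_width, cfg.kv_width, cfg.queries_per_kv_head, cfg.rotary_dim)
```

Under these values the query projection (5,120) is twice the hidden size, each KV head serves 5 query heads, and 64 of the 256 dimensions in each head are rotated.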

### Layer Layout

```
Layer  0:   Full GQA attention + dense FFN         ← first layer always full attention
Layer  1:   Gated DeltaNet + dense FFN
Layer  2:   Gated DeltaNet + dense FFN
Layer  3:   Gated DeltaNet + dense FFN
Layer  4:   Full GQA attention + dense FFN         ← every 4th layer
Layer  5:   Gated DeltaNet + MoE FFN (128 experts)
Layer  6:   Gated DeltaNet + dense FFN
Layer  7:   Gated DeltaNet + MoE FFN (128 experts)
Layer  8:   Full GQA attention + dense FFN
Layer  9:   Gated DeltaNet + MoE FFN (128 experts)
...
Layer 36:   Full GQA attention + dense FFN
Layer 37:   Gated DeltaNet + MoE FFN (128 experts)
Layer 38:   Gated DeltaNet + dense FFN
Layer 39:   Gated DeltaNet + dense FFN             ← last layer dense
```
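The attention pattern in the layout can be expressed as a simple rule: every 4th layer, starting at layer 0, uses full GQA attention and the rest use Gated DeltaNet. The sketch below covers only the sequence mixer; the dense-versus-MoE FFN assignment is not fully specified by the abbreviated layout, so it is deliberately left out. The function name `mixer_type` is illustrative.

```python
def mixer_type(layer_idx: int, full_attention_interval: int = 4) -> str:
    """Return the sequence mixer for a layer under the every-4th-layer rule."""
    if layer_idx % full_attention_interval == 0:
        return "full_attention"
    return "gated_deltanet"


layout = [mixer_type(i) for i in range(40)]
print(layout.count("full_attention"), layout.count("gated_deltanet"))
```

For 40 layers this gives 10 full-attention layers and 30 DeltaNet layers, the 3:1 linear-to-full ratio.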



MoE layers are interleaved after the first 4 dense warmup layers, creating a
gradient-rich environment where expert specialization can emerge naturally
alongside the DeltaNet's recurrent state accumulation.


### Expert Routing: DeepSeek-V4 Style

**Sigmoid scoring.** Unlike softmax routing, where expert scores compete because
they must sum to 1, sigmoid scoring rates each expert independently, so several
experts can score highly for the same token. Each expert's score reflects only
how well it can help with the current token.

**noaux_tc.** No auxiliary loss for load balancing. Instead, each expert has a
learnable bias term that adjusts during training to naturally balance the load
across experts. This avoids the quality degradation that auxiliary load-balancing
losses impose on the main language modeling objective.

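The two ideas above can be sketched together in a few lines. This is a toy, pure-Python illustration, not BOREAL's router. It assumes the behavior described for DeepSeek-V3's auxiliary-loss-free balancing: the bias affects only which experts are selected, gate weights come from the unbiased sigmoid scores, and the bias is nudged by a fixed step according to observed load. The function names, step size, and weight normalization are all assumptions.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route(logits: list[float], bias: list[float], k: int = 8):
    """Pick k experts by (sigmoid score + bias); weight them by the raw scores."""
    scores = [sigmoid(z) for z in logits]
    # The bias influences only which experts are selected, not their weights.
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[e] for e in chosen)
    weights = [scores[e] / total for e in chosen]  # normalized to sum to 1 (assumed)
    return chosen, weights


def update_bias(bias: list[float], load: list[int], step: float = 1e-3) -> list[float]:
    """Raise the bias of underloaded experts and lower it for overloaded ones."""
    mean_load = sum(load) / len(load)
    return [b + step if n < mean_load else b - step if n > mean_load else b
            for b, n in zip(bias, load)]


random.seed(0)
num_experts = 128
bias = [0.0] * num_experts
load = [0] * num_experts
for _ in range(256):  # a toy batch of 256 tokens with random router logits
    logits = [random.gauss(0.0, 1.0) for _ in range(num_experts)]
    chosen, weights = route(logits, bias)
    for e in chosen:
        load[e] += 1
bias = update_bias(bias, load)
print(sum(load), max(load), min(load))
```

Because balancing happens through the selection bias rather than a loss term, the language modeling gradient is left untouched, which is the property the paragraph above relies on.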

**Fine-grained experts.** 128 experts with small FFN dims (768) rather than
fewer large experts. More experts means more specialization paths. At 8 active
per token, the model uses 6.25% of its routed experts (8 of 128) per forward pass.

**Shared expert.** One expert is always active with 2x the FFN capacity of
routed experts (1,536 dim). This acts as a dense fallback for knowledge that
every token needs regardless of routing decisions. Proven effective by both
DeepSeek-V3 and Nemotron 3.

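The expert sizes above translate into parameter counts with simple arithmetic. The sketch below assumes each SwiGLU block consists of three bias-free projections (gate, up, down) and ignores the router's own parameters; `swiglu_params` is an illustrative helper.

```python
def swiglu_params(hidden: int, intermediate: int) -> int:
    """Gate, up, and down projections, assuming no bias terms."""
    return 3 * hidden * intermediate


hidden = 2560
routed = swiglu_params(hidden, 768)    # one routed expert
shared = swiglu_params(hidden, 1536)   # the always-active shared expert
dense = swiglu_params(hidden, 7680)    # dense FFN used in non-MoE layers

total_per_moe_layer = 128 * routed + shared
active_per_moe_layer = 8 * routed + shared

print(f"routed expert:         {routed / 1e6:.1f}M")
print(f"shared expert:         {shared / 1e6:.1f}M")
print(f"MoE layer, all experts: {total_per_moe_layer / 1e6:.1f}M")
print(f"MoE layer, active:      {active_per_moe_layer / 1e6:.1f}M")
print(f"dense FFN:              {dense / 1e6:.1f}M")
```

One consequence of these numbers: 8 routed experts plus the shared expert have a combined intermediate width of 8 × 768 + 1,536 = 7,680, the same as the dense FFN, so under these assumptions an MoE layer activates as many FFN parameters per token as a dense layer.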

**Hash routing (planned).** DeepSeek-V4-Pro introduces hash-based candidate
selection with Sinkhorn balancing (num_hash_layers=3, hc_mult=4,
hc_sinkhorn_iters=20). Instead of scoring all 128 experts for every token, a
learned hash function narrows candidates before the final top-k selection. This
is the planned routing upgrade for BOREAL-10B-MoE, reducing routing overhead
from O(E) to O(log E).

## Training

| Parameter | Value |
|-----------|-------|
| **Data tokens** | 15–20 trillion |
| **Corpus** | Web text (50%), code (20%), STEM/academic (15%), multilingual (15%) |
| **TST** | Token Superposition Training, s=4, r=0.5 |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 4.5e-4 (MoE requires higher LR than dense) |
| **Schedule** | Cosine decay to 10% of peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states, FP32 router logits |
| **Hardware** | 256–512 H100/H200 GPUs (target) |

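As a rough illustration of the schedule row, the sketch below implements cosine decay from the peak learning rate to 10% of peak. No warmup is modeled because the table does not mention one, and the step count is a hypothetical figure derived from 15T tokens at ~4M tokens per step. How the schedule interacts with the later training phases is not specified here.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 4.5e-4,
              final_ratio: float = 0.10) -> float:
    """Cosine decay from peak_lr to final_ratio * peak_lr (no warmup modeled)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    floor = final_ratio * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))


total_steps = 3_750_000  # hypothetical: 15T tokens / ~4M tokens per step
for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps):.3e}")
```
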
### Training Phases

```
Phase 1 (TST, 7.5T tokens):      Superposition mode, s=4 bags
                                 Multi-hot CE on all DeltaNet and full-attn layers
                                 MoE layers active with standard routing

Phase 2 (Recovery, 7.5T tokens): Standard autoregressive NTP
                                 Model recovers token-level precision
                                 Expert specialization deepens

Phase 3 (Context Extension):     Progressive 32K → 128K → 256K (~500B tokens)
                                 YaRN RoPE scaling
                                 Midtraining on long-document data

Phase 4 (Annealing, ~500B):      High-quality data upsample
                                 Decaying LR to 5e-6
                                 Final quality polish
```
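A quick tally of the phase budgets, using the token counts listed above (the two ~500B figures are treated as exactly 0.5T for this sketch):

```python
phase_tokens_trillions = {
    "Phase 1 (TST)": 7.5,
    "Phase 2 (Recovery)": 7.5,
    "Phase 3 (Context Extension)": 0.5,
    "Phase 4 (Annealing)": 0.5,
}

total = sum(phase_tokens_trillions.values())
for name, tokens in phase_tokens_trillions.items():
    # Show each phase's budget and its share of the whole run.
    print(f"{name:<30}{tokens:>5.1f}T  {tokens / total:6.1%}")
print(f"{'Total':<30}{total:>5.1f}T")
```

The phases sum to about 16T tokens, which sits inside the stated 15–20 trillion range.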

## Expected Performance

| Benchmark | Target | Comparison |
|-----------|--------|-------------|
| MMLU-Pro | 45–50% | Qwen3-8B: ~42% |
| HellaSwag | 78–84% | Qwen3-8B: ~79% |
| ARC-Challenge | 65–72% | Qwen3-8B: ~66% |
| GPQA Diamond | 35–40% | Qwen3-8B: ~34% |
| HumanEval (coding) | 55–65% | Qwen3-8B: ~58% |
| MATH (reasoning) | 35–45% | Qwen3-8B: ~38% |

Target: match or exceed Qwen3-8B on core benchmarks while using 4x fewer
active parameters and supporting 256K native context. The MoE architecture
extracts more quality per active parameter, and TST extracts more signal
per training token.

### Inference Efficiency

```
At 256K context, batch=1:

Qwen3-8B (pure Transformer, 8B active):
  KV cache = 2 × 36 layers × 8 KV heads × 128 dim × 256K × 2 bytes = 36 GiB

BOREAL-10B-MoE (DeltaNet hybrid, ~2B active):
  Full attn KV = 2 × 10 layers × 4 KV heads × 256 dim × 256K × 2 bytes = 10 GiB
  DeltaNet states = 30 layers × 2 × 32 V heads × 128² dims × 4 bytes = 120 MiB
  Total cache ≈ 10.1 GiB

Result: ~3.6x smaller cache at 256K context.
        ~4x fewer FLOPs per generated token.
```
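The cache arithmetic can be checked by evaluating the formulas above exactly as written, with 256K taken as 262,144 tokens and results reported in binary units (GiB, MiB). The function names are illustrative, and the DeltaNet state formula simply reuses the factors listed above without asserting what each one represents.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
CONTEXT = 262_144  # 256K tokens


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2) -> int:
    """Keys plus values for every attention layer, cached for every token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


def deltanet_state_bytes(layers: int = 30, v_heads: int = 32, dim: int = 128,
                         bytes_per_value: int = 4) -> int:
    """Fixed-size recurrent state; independent of context length."""
    return layers * 2 * v_heads * dim * dim * bytes_per_value


qwen = kv_cache_bytes(36, 8, 128, CONTEXT)
boreal_kv = kv_cache_bytes(10, 4, 256, CONTEXT)
boreal_state = deltanet_state_bytes()
boreal_total = boreal_kv + boreal_state

print(f"Qwen3-8B KV cache:     {qwen / GIB:.1f} GiB")
print(f"BOREAL full-attn KV:   {boreal_kv / GIB:.1f} GiB")
print(f"BOREAL DeltaNet state: {boreal_state / MIB:.0f} MiB")
print(f"Ratio:                 {qwen / boreal_total:.2f}x")
```

The DeltaNet state does not grow with context length, so the gap widens at longer contexts and narrows at shorter ones.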

## The BOREAL Family

| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **[BOREAL-250M](https://huggingface.co/DJLougen/BOREAL-250M)** | 250M | Dense DeltaNet | 32K | Architecture validation |
| **[BOREAL-2B](https://huggingface.co/DJLougen/BOREAL-2B)** | 2B | Dense DeltaNet | 64K | Community release |
| **BOREAL-10B-MoE** | ~10B / ~2B active | DeltaNet + MoE | 256K | Target model |

## How It Was Built

### Architecture Decisions

```
DeltaNet hybrid:     Validated by Qwen3.5/3.6 (May/Nov 2025)
3:1 linear ratio:    Qwen3.5 proved this ratio for <10B models
head_dim=256:        Qwen3.5 and DeepSeek-V4 both moved to larger head dims
partial_rotary=0.25: 75% of each head is position-free, DeltaNet pathway
Swish output gates:  Qwen3.6 addition, prevents attention blowup
noaux_tc routing:    DeepSeek-V3 proved auxiliary-loss-free load balancing
Fine-grained MoE:    128 small experts > fewer large experts
Shared expert:       1 always-active 2x-capacity expert = dense fallback
TST training:        Nous Research (arXiv:2605.06546), 1.5–2.5x speedup
```
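To make the `partial_rotary=0.25` entry concrete, here is a minimal pure-Python sketch of partial RoPE on a single head vector: the first 25% of dimensions are rotated by position-dependent angles and the remaining 75% pass through unchanged. The half-split pairing convention (dimension `i` paired with `i + rotary_dim/2`) is an assumption; BOREAL's actual implementation may pair dimensions differently.

```python
import math


def partial_rope(vec: list[float], position: int, rotary_factor: float = 0.25,
                 theta: float = 10_000_000.0) -> list[float]:
    """Rotate the first rotary_factor of the dims; pass the rest through unchanged."""
    rotary_dim = int(len(vec) * rotary_factor)
    half = rotary_dim // 2
    out = list(vec)
    for i in range(half):
        # Pair dimension i with dimension i + half (half-split convention, assumed).
        angle = position * theta ** (-2.0 * i / rotary_dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + half]
        out[i] = x * cos_a - y * sin_a
        out[i + half] = x * sin_a + y * cos_a
    return out


head = [1.0] * 256
rotated = partial_rope(head, position=1000)
print(rotated[64:] == head[64:])  # dims 64..255 carry no positional signal
```

With head_dim=256, only dimensions 0–63 encode position; the other 192 are position-free.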

### Data Philosophy

BOREAL follows a data-quality-first approach. Pretraining data is the
differentiator; everyone uses the same architectures now. Key principles:

- **No LLM-generated pretraining data.** Only human-authored text in the
  base corpus. LLM-generated data is reserved for post-training.
- **Structural curation.** Quality filtering goes beyond perplexity scoring
  to measure reasoning depth, self-correction density, and information content.
- **Curriculum annealing.** High-quality data concentrated in the final
  annealing phase rather than diluted across the full run.

## Post-Training Pipeline (Planned)

```
SFT:        2–5M high-quality instruction pairs
            Agentic reasoning, code, math, multilingual

GRPO:       Multi-reward RL with GDPO normalization
            Format + tool-use + reasoning depth + self-consistency + diversity

Budget:     Thinking budget mechanism (Qwen3 innovation)
            Dynamic compute allocation per query complexity
```

## License

Apache 2.0.

## Author

Developed by [DJLougen](https://huggingface.co/DJLougen).

The BOREAL family is pretrained from scratch: no fine-tuning, no distillation,
no inherited weights. Born in Toronto. Trained with Canadian stubbornness.

[☕ Support on Ko-fi](https://ko-fi.com/djlougen)