---

language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- pretraining
- tst
- crucible
- ddm
- submodular
- data-curation
- sovereign-ai
- canadian-ai
- edge
- efficient
- canada
pipeline_tag: text-generation
base_model: GestaltLabs/BOREAL-250M
---


![BOREAL](https://huggingface.co/GestaltLabs/BOREAL-250M/resolve/main/Boreal.png)

# BOREAL-250M — Sovereign Canadian AI

**B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers

Built in Toronto. Trained on Canadian soil. Not dependent on anyone's compute,
anyone's models, or anyone's permission.

A 250M-parameter dense hybrid language model pretrained from scratch. Built on the
Gated DeltaNet architecture — the same hybrid linear-attention design that powers
Qwen3.5 and Qwen3.6 — trained with Token Superposition Training (TST) for maximum
data efficiency per GPU-hour.

BOREAL-250M is the smallest member of the BOREAL family and the first proof point
in a sovereign Canadian AI pipeline. It validates that a single researcher on a
single GPU in Toronto can build competitive model architectures without relying
on US hyperscaler compute, Chinese base models, or EU consortia. The boreal forest
covers 60% of Canada. BOREAL models cover the gap between Canadian AI ambition
and imported infrastructure.

## Why Canadian AI Sovereignty Matters

Every major Canadian AI deployment today runs on someone else's model. Qwen
(Alibaba). Llama (Meta). DeepSeek. Mistral. Claude. Canada produces world-class
AI researchers — UofT, Mila, Vector, Amii — and ships them to San Francisco
and Beijing. The models, the compute, and the decisions about what gets built
stay elsewhere.

BOREAL is a bet that this doesn't have to be true. A single DGX Spark in a
Toronto apartment, an Apache 2.0 license, and an architecture that combines
proven innovations from open research. No distillation from proprietary models.
No dependency on anyone's API. Built here. Owned here.

## Architecture

| Component | Detail |
|-----------|--------|
| **Type** | Dense hybrid — Gated DeltaNet + GQA |
| **Parameters** | 250M |
| **Hidden size** | 1,024 |
| **Layers** | 12 (9 DeltaNet + 3 full attention) |
| **Ratio** | 3:1 linear-to-full attention |
| **Full attention** | GQA: 8 query heads, 2 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| **Conv kernel** | 4 (local context mixing) |
| **FFN** | SwiGLU, intermediate=3,072 |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated attention outputs |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 32,768 tokens native |
| **MTP** | 1 multi-token prediction head |

### Architecture Rationale

**Gated DeltaNet over pure attention.** 75% of layers use linear attention
with data-dependent forgetting gates. Each DeltaNet layer maintains a
fixed-size recurrent state `S_t = β_t · S_{t-1} + k_t ⊗ v_t`, where `β_t`
is a learned sigmoid gate controlling information retention. The result: O(n)
on 75% of layers, enabling native long-context processing without quadratic
memory blowup.
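
A minimal single-timestep sketch of that recurrence (shapes and function
names here are illustrative, not the training code; the real kernel is
chunked and fused, and keeps the state in FP32 as noted in the training
table):

```python
import torch

def gated_delta_step(S, k, v, beta):
    """One recurrent step per head: S_t = beta_t * S_{t-1} + k_t (outer) v_t.

    S:    (num_heads, d_k, d_v) running state
    k:    (num_heads, d_k)      current key
    v:    (num_heads, d_v)      current value
    beta: (num_heads, 1, 1)     learned sigmoid forget gate in (0, 1)
    """
    # Decay the old state, then write the new key-value outer product.
    return beta * S + torch.einsum("hk,hv->hkv", k, v)

def read_state(S, q):
    """Query the state for the current output: o_t = q_t^T S_t per head."""
    return torch.einsum("hk,hkv->hv", q, S)
```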

**Larger head dims (256).** Following Qwen3.5 and DeepSeek-V4, head_dim jumps
from the traditional 128 to 256. Fewer heads with more per-head capacity,
paired with aggressive GQA (4:1 query-to-KV ratio).

**Partial RoPE (0.25).** Only 25% of each head's dimensions receive rotary
positional encoding. The remaining 75% pass through position-agnostically,
creating a natural pathway for the DeltaNet's recurrent state.
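
A hedged sketch of the partial rotation, assuming the rotary slice is the
leading quarter of each head and a rotate-half layout (both are common
conventions, not confirmed details of the BOREAL kernel):

```python
import torch

def apply_partial_rope(x, cos, sin, rotary_frac=0.25):
    """Rotate only the leading rotary_frac of head dims; pass the rest through.

    x:        (..., head_dim) query or key
    cos, sin: (..., rot_dim)  precomputed from position ids and theta,
              tiled as in rotate-half implementations
    """
    rot_dim = int(x.shape[-1] * rotary_frac)
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    # Standard rotate-half RoPE, applied to the rotary slice only.
    x1, x2 = x_rot.chunk(2, dim=-1)
    x_rot = x_rot * cos + torch.cat((-x2, x1), dim=-1) * sin
    return torch.cat((x_rot, x_pass), dim=-1)
```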



**Output gating.** Every attention and DeltaNet output passes through a
learned Swish gate: `output = attention(x) * silu(W_gate * x)`.
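
A minimal sketch of that gate as a wrapper module (`GatedMixerOutput` and
its wiring are illustrative, not the model's actual module layout):

```python
import torch.nn as nn
import torch.nn.functional as F

class GatedMixerOutput(nn.Module):
    """Wraps a token mixer so its output is Swish-gated by the layer input."""

    def __init__(self, hidden_size: int, mixer: nn.Module):
        super().__init__()
        self.mixer = mixer  # full attention or DeltaNet block
        self.w_gate = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x):
        # output = mixer(x) * silu(W_gate @ x)
        return self.mixer(x) * F.silu(self.w_gate(x))
```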

## Training

| Parameter | Value |
|-----------|-------|
| **Data tokens** | 10B–200B (overtrained regime) |
| **Corpus** | FineWeb-Edu + StarCoder2 code |
| **Training method** | Token Superposition Training (Nous Research) |
| **TST config** | s=4 bags, r=0.5 fraction |
| **Objective** | Phase 1: multi-hot cross-entropy → Phase 2: standard NTP |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 3e-4 (from MuP sweep) |
| **Schedule** | Cosine decay to 10% peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states |
| **Location** | Toronto, Ontario, Canada |

### Data Pipeline

BOREAL is built on **Crucible** — a submodular data selection framework:

- **RUPS** (Reward-Utility Pareto Skyline): multi-axis scoring fusion
  computing per-sample weights from quality and complexity axes.

- **EESD** (Embedding-Ensemble Submodular Diversity): lazy-greedy
  maximization of a weighted facility-location objective, with the formal
  (1 - 1/e) approximation guarantee. Runs on 50K+ items without
  materializing the full pairwise similarity matrix (a minimal sketch
  follows this list).

- **Length-debiasing**: stratified selection across response-length
  quantiles.
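
A minimal sketch of the EESD step (the similarity matrix is materialized
here only for clarity, and `lazy_greedy_facility_location`, its argument
names, and the assumption of nonnegative similarities are illustrative,
not the Crucible API):

```python
import heapq
import numpy as np

def lazy_greedy_facility_location(sim, weights, budget):
    """Lazy-greedy maximization of the weighted facility-location objective
    F(S) = sum_j weights[j] * max_{i in S} sim[i, j].

    sim:     (n, n) pairwise similarities, assumed nonnegative (built here
             for clarity; the real pipeline avoids materializing it)
    weights: (n,) per-sample RUPS weights
    budget:  number of samples to select
    """
    n = sim.shape[0]
    coverage = np.zeros(n)          # current best similarity per element
    selected = []
    # Max-heap of (-gain, item, round). Submodularity means stale gains are
    # upper bounds, so most items never need re-evaluation.
    heap = [(-(weights * sim[i]).sum(), i, 0) for i in range(n)]
    heapq.heapify(heap)
    while len(selected) < budget and heap:
        neg_gain, i, rnd = heapq.heappop(heap)
        if rnd == len(selected):    # gain is fresh: greedily take the item
            selected.append(i)
            coverage = np.maximum(coverage, sim[i])
        else:                       # stale: recompute against current coverage
            gain = (weights * np.maximum(sim[i] - coverage, 0.0)).sum()
            heapq.heappush(heap, (-gain, i, len(selected)))
    return selected
```
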

Samples are scored by the **DDM analyzer** (Drift Diffusion Model):

1. Reasoning traces segmented at cognitive boundaries
2. Per-segment quality extracted (self-correction, verification,
   exploration density)
3. Ornstein-Uhlenbeck evidence accumulation: `dx = θ(μ - x)·dt + σ·dW + v·signal(t)·dt` (sketched after this list)
4. Sustained boundary crossings flag degenerate reasoning
5. Samples classified into curriculum bins with per-sample loss weights
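
A hedged sketch of step 3, discretizing the stated SDE with Euler-Maruyama
(the function name and parameter defaults are illustrative, not the DDM
analyzer's actual settings):

```python
import numpy as np

def ou_accumulate(signal, theta=1.0, mu=0.0, sigma=0.3, v=1.0,
                  dt=0.05, threshold=1.0, seed=0):
    """Euler-Maruyama discretization of
    dx = theta*(mu - x)*dt + sigma*dW + v*signal(t)*dt.

    signal: per-segment quality signal (one value per reasoning segment)
    Returns the trajectory and the indices of boundary crossings; it is
    sustained crossings that flag degenerate reasoning.
    """
    rng = np.random.default_rng(seed)
    x, xs = 0.0, []
    for s in signal:
        dW = rng.normal(0.0, np.sqrt(dt))   # Brownian increment
        x += theta * (mu - x) * dt + sigma * dW + v * s * dt
        xs.append(x)
    xs = np.asarray(xs)
    return xs, np.flatnonzero(np.abs(xs) > threshold)
```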

This pipeline produced Harmonic-9B and Ornstein-27B, where 799 DDM-curated
rows outperformed datasets 20–100x larger. The 8-factor quality model backs
it: lexical diversity r=+0.967, semantic diversity r=+0.947, verb uniqueness
r=+0.852.

### TST (Token Superposition Training)

Drop-in training acceleration from Nous Research (arXiv:2605.06546). First
50% of training uses superposed embeddings — 4 tokens averaged into one,
predicting the next bag of 4 via multi-hot cross-entropy. 4x data throughput.
Second 50% reverts to standard NTP. 1.5–2.5x wall-time reduction with zero
architecture changes.
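
A simplified sketch of the Phase-1 objective (function names, shapes, and
the uniform bag target are assumptions of this illustration; the exact
batch construction, including the `r=0.5` fraction, follows the paper):

```python
import torch
import torch.nn.functional as F

def superpose(input_ids, embed, s=4):
    """Average the embeddings of s consecutive tokens into one slot."""
    B, T = input_ids.shape
    T = (T // s) * s                               # trim to a multiple of s
    emb = embed(input_ids[:, :T])                  # (B, T, D)
    return emb.view(B, T // s, s, -1).mean(dim=2)  # (B, T//s, D)

def multi_hot_ce(logits, target_bags):
    """Multi-hot cross-entropy: each slot predicts the next bag of s tokens.

    logits:      (B, N, V) model outputs over superposed positions
    target_bags: (B, N, s) token ids belonging to the next bag
    """
    target = torch.zeros_like(logits).scatter_(-1, target_bags, 1.0)
    target = target / target.sum(-1, keepdim=True)  # uniform over the bag
    return F.cross_entropy(logits.flatten(0, 1), target.flatten(0, 1))
```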

## Sovereign AI Scaling Ladder

Every model trained in Canada. Every weight learned from random init.

| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **BOREAL-250M** | 250M | Dense | 32K | Planned — seeking compute |
| **[BOREAL-2B](https://huggingface.co/GestaltLabs/BOREAL-2B)** | 2B | Dense DeltaNet | 64K | Planned |
| **[BOREAL-10B-MoE](https://huggingface.co/GestaltLabs/BOREAL-10B-MoE)** | ~10B / ~2B active | DeltaNet + MoE | 256K | Planned |

## Expectations

BOREAL-250M is an architecture validation tool. Expect:
- Coherent text generation
- Above-random benchmarks: 35–40% HellaSwag, 55–60% ARC-Easy
- Clean scaling curves through 200B+ tokens
- Long-context advantage over pure Transformer baselines at 8K+

For a model you'd actually use, see BOREAL-2B.

## License

Apache 2.0. Built for Canadian researchers, startups, and institutions.
No strings. No API keys. No foreign dependency.

## Author

Built by [DJLougen](https://huggingface.co/DJLougen) / [GestaltLabs](https://huggingface.co/GestaltLabs).

PhD candidate in visual neuroscience, University of Toronto.
Pretraining on a DGX Spark in a Toronto apartment.
Compute self-funded. No institutional backing. No cloud credits.
Just a thesis to finish and a conviction that Canada should
own its own models.

[☕ Support sovereign AI on Ko-fi](https://ko-fi.com/djlougen)