---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- pretraining
- tst
- crucible
- ddm
- submodular
- data-curation
- sovereign-ai
- canadian-ai
- edge
- efficient
- canada
pipeline_tag: text-generation
---

# BOREAL-250M — Sovereign Canadian AI
**B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers

Built in Toronto. Trained on Canadian soil. Not dependent on anyone's compute,
anyone's models, or anyone's permission.

A 250M-parameter dense hybrid language model pretrained from scratch. Built on the
Gated DeltaNet architecture — the same hybrid linear-attention design that powers
Qwen3.5 and Qwen3.6 — trained with Token Superposition Training (TST) for maximum
data efficiency per GPU-hour.

BOREAL-250M is the smallest member of the BOREAL family and the first proof point
in a sovereign Canadian AI pipeline. It aims to prove that a single researcher on a
single GPU in Toronto can build competitive model architectures without relying
on US hyperscaler compute, Chinese base models, or EU consortia. The boreal forest
covers 60% of Canada. BOREAL models cover the gap between Canadian AI ambition
and imported infrastructure.
## Why Canadian AI Sovereignty Matters
Every major Canadian AI deployment today runs on someone else's model. Qwen
(Alibaba). Llama (Meta). DeepSeek. Mistral. Claude. Canada produces world-class
AI researchers — UofT, Mila, Vector, Amii — and ships them to San Francisco
and Beijing. The models, the compute, and the decisions about what gets built
stay elsewhere.

BOREAL is a bet that this doesn't have to be true. A single DGX Spark in a
Toronto apartment, an Apache 2.0 license, and an architecture that combines
proven innovations from open research. No distillation from proprietary models.
No dependency on anyone's API. Built here. Owned here.
## Architecture
| Component | Detail |
|-----------|--------|
| **Type** | Dense hybrid — Gated DeltaNet + GQA |
| **Parameters** | 250M |
| **Hidden size** | 1,024 |
| **Layers** | 12 (9 DeltaNet + 3 full attention) |
| **Ratio** | 3:1 linear-to-full attention |
| **Full attention** | GQA: 8 query heads, 2 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| **Conv kernel** | 4 (local context mixing) |
| **FFN** | SwiGLU, intermediate=3,072 |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated attention outputs |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 32,768 tokens native |
| **MTP** | 1 multi-token prediction head |
### Architecture Rationale
**Gated DeltaNet over pure attention.** 75% of layers use linear attention
with data-dependent forgetting gates. Each DeltaNet layer maintains a
fixed-size recurrent state `S_t = beta_t * S_{t-1} + k_t ⊗ v_t`, where beta_t
is a learned sigmoid gate controlling information retention. The result: O(n)
on 75% of layers, enabling native long-context processing without quadratic
memory blowup.
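
The recurrence above can be sketched in a few lines. This is a toy NumPy reference, not the chunked, fused kernel a real implementation would use; the readout `o_t = S_t^T q_t` and the shapes are illustrative assumptions.

```python
import numpy as np

def gated_linear_attention(q, k, v, beta):
    """Per-step recurrence S_t = beta_t * S_{t-1} + k_t (outer) v_t,
    with readout o_t = S_t^T q_t.
    q, k: (T, d_k); v: (T, d_v); beta: (T,) sigmoid gates in (0, 1)."""
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))                     # fixed-size recurrent state
    out = np.empty((T, d_v))
    for t in range(T):
        S = beta[t] * S + np.outer(k[t], v[t])   # gated state update
        out[t] = S.T @ q[t]                      # linear-attention readout
    return out
```

Note that the state is a constant `d_k × d_v` matrix regardless of sequence length, which is why these layers run in O(n) time with O(1) state per step.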

**Larger head dims (256).** Following Qwen3.5 and DeepSeek-V4, head_dim jumps
from the traditional 128 to 256. Fewer heads with more per-head capacity,
paired with aggressive GQA (4:1 query-to-KV ratio).

**Partial RoPE (0.25).** Only 25% of each head's dimensions receive rotary
positional encoding. The remaining 75% pass through position-agnostically,
creating a natural pathway for the DeltaNet's recurrent state.
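
A minimal sketch of partial RoPE, assuming an interleaved sin/cos pairing (the exact pairing convention and frequency layout are assumptions; implementations differ):

```python
import numpy as np

def partial_rope(x, positions, rotary_factor=0.25, theta=10_000_000.0):
    """Rotate only the first `rotary_factor` fraction of each head's dims;
    the remaining dims pass through position-agnostically.
    x: (T, head_dim); positions: (T,) token indices."""
    T, head_dim = x.shape
    rot_dim = int(head_dim * rotary_factor)       # e.g. 64 of 256 dims
    inv_freq = theta ** (-np.arange(0, rot_dim, 2) / rot_dim)
    ang = positions[:, None] * inv_freq[None, :]  # (T, rot_dim // 2)
    cos, sin = np.cos(ang), np.sin(ang)
    x_rot, x_pass = x[:, :rot_dim], x[:, rot_dim:]
    x1, x2 = x_rot[:, 0::2], x_rot[:, 1::2]       # interleaved pairs
    rotated = np.empty_like(x_rot)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return np.concatenate([rotated, x_pass], axis=1)
```

Because each 2-dim pair undergoes a pure rotation, the rotated slice keeps its norm, and 75% of the channels are untouched at every position.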

**Output gating.** Every attention and DeltaNet output passes through a
learned Swish gate: `output = attention(x) * silu(W_gate * x)`.
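
The gate itself is one elementwise multiply. A sketch assuming per-token shapes `x: (T, d)` and a square, bias-free gate projection (both assumptions for illustration):

```python
import numpy as np

def silu(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def gated_output(attn_out, x, W_gate):
    """output = attention(x) * silu(W_gate applied to x), elementwise.
    attn_out, x: (T, d); W_gate: (d, d) per-token linear map."""
    return attn_out * silu(x @ W_gate)
```

When the gate pre-activation is strongly positive the output passes nearly unscaled; when strongly negative the channel is shut off, letting the model learn per-channel output suppression.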
## Training
| Parameter | Value |
|-----------|-------|
| **Data tokens** | 10B–200B (overtrained regime) |
| **Corpus** | FineWeb-Edu + StarCoder2 code |
| **Training method** | Token Superposition Training (Nous Research) |
| **TST config** | s=4 bags, r=0.5 fraction |
| **Objective** | Phase 1: multi-hot cross-entropy → Phase 2: standard NTP |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 3e-4 (from MuP sweep) |
| **Schedule** | Cosine decay to 10% peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states |
| **Location** | Toronto, Ontario, Canada |
### Data Pipeline
BOREAL is built on **Crucible** — a submodular data selection framework:
- **RUPS** (Reward-Utility Pareto Skyline): multi-axis scoring fusion
computing per-sample weights from quality and complexity axes.
- **EESD** (Embedding-Ensemble Submodular Diversity): lazy-greedy
maximization of weighted facility-location objective, formal
(1 - 1/e) approximation guarantee. Runs on 50K+ items without
materializing the full pairwise similarity matrix.
- **Length-debiasing**: stratified selection across response-length
quantiles.
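
To make the EESD objective concrete, here is a plain-greedy facility-location selector. It is a simplification: it materializes the full similarity matrix that EESD's lazy-greedy variant avoids, and all names and parameters are illustrative, not Crucible's API.

```python
import numpy as np

def facility_location_select(emb, weights, k):
    """Greedy maximization of the weighted facility-location objective
    F(S) = sum_i w_i * max_{j in S} sim(i, j), which carries the standard
    (1 - 1/e) approximation guarantee for monotone submodular functions.
    emb: (n, d) L2-normalized embeddings; weights: (n,); k: budget."""
    sim = emb @ emb.T                 # cosine similarity (toy: full matrix)
    n = emb.shape[0]
    best = np.zeros(n)                # current max-sim coverage per item
    selected = []
    for _ in range(k):
        # Marginal gain of adding each candidate column j.
        gains = (weights[:, None] * np.maximum(sim, best[:, None])).sum(0) \
                - (weights * best).sum()
        gains[selected] = -np.inf     # never re-pick a selected item
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[:, j])
    return selected
```

Diminishing returns are visible directly: once a cluster is covered, further picks from it add almost nothing, so the budget spreads across the embedding space.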

Samples are scored by the **DDM analyzer** (Drift Diffusion Model):
1. Reasoning traces segmented at cognitive boundaries
2. Per-segment quality extracted (self-correction, verification,
exploration density)
3. Ornstein-Uhlenbeck evidence accumulation: `dx = θ(μ - x)·dt + σ·dW + v·signal(t)·dt`
4. Sustained boundary crossings flag degenerate reasoning
5. Samples classified into curriculum bins with per-sample loss weights
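
Step 3 can be simulated with a simple Euler-Maruyama discretization of the OU equation above. All parameter values here are illustrative defaults, not the analyzer's actual configuration:

```python
import numpy as np

def ou_accumulate(signal, theta=0.5, mu=0.0, sigma=0.1, v=1.0,
                  dt=0.01, bound=1.0, sustain=20, seed=0):
    """Simulate dx = theta*(mu - x)*dt + sigma*dW + v*signal(t)*dt and
    flag the trace if |x| stays beyond `bound` for `sustain` consecutive
    steps — a sustained boundary crossing."""
    rng = np.random.default_rng(seed)
    x, run = 0.0, 0
    trace = []
    for s in signal:
        dW = rng.normal(0.0, np.sqrt(dt))         # Brownian increment
        x += theta * (mu - x) * dt + sigma * dW + v * s * dt
        trace.append(x)
        run = run + 1 if abs(x) > bound else 0    # count consecutive crossings
        if run >= sustain:
            return trace, True                    # degenerate-reasoning flag
    return trace, False
```

The mean-reverting theta term pulls evidence back toward mu, so only a persistent per-segment signal, not transient noise, can hold the accumulator past the boundary long enough to trip the flag.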

This pipeline produced Harmonic-9B and Ornstein-27B, where 799 DDM-curated
rows outperformed datasets 20–100x larger. The 8-factor quality model backs
it: lexical diversity r=+0.967, semantic diversity r=+0.947, verb uniqueness
r=+0.852.
### TST (Token Superposition Training)
Drop-in training acceleration from Nous Research (arXiv:2605.06546). First
50% of training uses superposed embeddings — 4 tokens averaged into one,
predicting the next bag of 4 via multi-hot cross-entropy. 4x data throughput.
Second 50% reverts to standard NTP. 1.5–2.5x wall-time reduction with zero
architecture changes.
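
The phase-1 data path can be sketched as follows: average each bag of `s` token embeddings into one superposed input and build a multi-hot target over the next bag. This is an illustrative reconstruction from the description above, not Nous Research's reference implementation; the loss applied to these targets would be multi-hot cross-entropy.

```python
import numpy as np

def tst_phase1_batch(token_embed, token_ids, s=4):
    """Build TST phase-1 inputs/targets from a token stream.
    token_embed: (vocab, d) embedding table; token_ids: sequence of ids.
    Returns superposed inputs (n_bags - 1, d) and multi-hot targets
    (n_bags - 1, vocab): bag i predicts the members of bag i + 1."""
    vocab, d = token_embed.shape
    ids = np.asarray(token_ids)
    n_bags = len(ids) // s
    bags = ids[:n_bags * s].reshape(n_bags, s)
    inputs = token_embed[bags].mean(axis=1)       # s tokens averaged into one
    targets = np.zeros((n_bags - 1, vocab))
    for i in range(n_bags - 1):
        targets[i, bags[i + 1]] = 1.0             # multi-hot over the next bag
    return inputs[:-1], targets
```

Each training step thus consumes `s` tokens per position, which is where the 4x data-throughput factor comes from before phase 2 reverts to one-token-per-position NTP.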
## Sovereign AI Scaling Ladder
Every model trained in Canada. Every weight learned from random init.

| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **BOREAL-250M** | 250M | Dense | 32K | Planned — seeking compute |
| **[BOREAL-2B](https://huggingface.co/GestaltLabs/BOREAL-2B)** | 2B | Dense DeltaNet | 64K | Planned |
| **[BOREAL-10B-MoE](https://huggingface.co/GestaltLabs/BOREAL-10B-MoE)** | ~10B / ~2B active | DeltaNet + MoE | 256K | Planned |
## Expectations
BOREAL-250M is an architecture validation tool. Expect:
- Coherent text generation
- Above-random benchmarks: 35–40% HellaSwag, 55–60% ARC-Easy
- Clean scaling curves through 200B+ tokens
- Long-context advantage over pure Transformer baselines at 8K+

For a model you'd actually use, see BOREAL-2B.
## License
Apache 2.0. Built for Canadian researchers, startups, and institutions.
No strings. No API keys. No foreign dependency.
## Author
Built by [DJLougen](https://huggingface.co/DJLougen) / [GestaltLabs](https://huggingface.co/GestaltLabs).
PhD candidate in visual neuroscience, University of Toronto.
Pretraining on a DGX Spark in a Toronto apartment.
Compute self-funded. No institutional backing. No cloud credits.
Just a thesis to finish and a conviction that Canada should
own its own models.

[☕ Support sovereign AI on Ko-fi](https://ko-fi.com/djlougen)