---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- pretraining
- tst
- crucible
- ddm
- submodular
- data-curation
- sovereign-ai
- canadian-ai
- edge
- efficient
- canada
pipeline_tag: text-generation
base_model: GestaltLabs/BOREAL-250M
---
# BOREAL-250M — Sovereign Canadian AI
Balanced Orthogonal Recurrent Expert Attention Layers
Built in Toronto. Trained on Canadian soil. Not dependent on anyone's compute, anyone's models, or anyone's permission.
A 250M-parameter dense hybrid language model pretrained from scratch. Built on the Gated DeltaNet architecture — the same hybrid linear-attention design that powers Qwen3.5 and Qwen3.6 — trained with Token Superposition Training (TST) for maximum data efficiency per GPU-hour.
BOREAL-250M is the smallest member of the BOREAL family and the first proof point in a sovereign Canadian AI pipeline. It validates that a single researcher on a single GPU in Toronto can build competitive model architectures without relying on US hyperscaler compute, Chinese base models, or EU consortia. The boreal forest covers 60% of Canada. BOREAL models cover the gap between Canadian AI ambition and imported infrastructure.
## Why Canadian AI Sovereignty Matters
Every major Canadian AI deployment today runs on someone else's model. Qwen (Alibaba). Llama (Meta). DeepSeek. Mistral. Claude. Canada produces world-class AI researchers — UofT, Mila, Vector, Amii — and ships them to San Francisco and Beijing. The models, the compute, and the decisions about what gets built stay elsewhere.
BOREAL is a bet that this doesn't have to be true. A single DGX Spark in a Toronto apartment, an Apache 2.0 license, and an architecture that combines proven innovations from open research. No distillation from proprietary models. No dependency on anyone's API. Built here. Owned here.
## Architecture
| Component | Detail |
|---|---|
| Type | Dense hybrid — Gated DeltaNet + GQA |
| Parameters | 250M |
| Hidden size | 1,024 |
| Layers | 12 (9 DeltaNet + 3 full attention) |
| Ratio | 3:1 linear-to-full attention |
| Full attention | GQA: 8 query heads, 2 KV heads, head_dim=256 |
| DeltaNet | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| Conv kernel | 4 (local context mixing) |
| FFN | SwiGLU, intermediate=3,072 |
| Norm | RMSNorm, eps=1e-6 |
| Position | RoPE, theta=10M, partial_rotary_factor=0.25 |
| Output gate | Swish-gated attention outputs |
| Vocab | 151,936 (Qwen3 tokenizer) |
| Context | 32,768 tokens native |
| MTP | 1 multi-token prediction head |
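
The FFN row above follows the standard SwiGLU layout. A minimal PyTorch sketch with the listed sizes (hidden=1,024, intermediate=3,072); module names are illustrative, not the released layer names:

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, hidden_size=1024, intermediate_size=3072):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # SwiGLU: down( silu(gate(x)) * up(x) )
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```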
### Architecture Rationale
Gated DeltaNet over pure attention. 75% of layers use linear attention with data-dependent forgetting gates. Each DeltaNet layer maintains a fixed-size recurrent state S_t = β_t · S_{t-1} + k_t ⊗ v_t, where β_t is a learned sigmoid gate controlling information retention. The result: O(n) compute and a constant-size state on 75% of layers, enabling native long-context processing without quadratic memory blowup.
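
A minimal PyTorch sketch of that recurrent update; shapes and names are illustrative, not the BOREAL implementation:

```python
import torch

def gated_linear_attention_step(S_prev, k_t, v_t, beta_t):
    """One step of S_t = beta_t * S_{t-1} + k_t (outer) v_t.

    S_prev : (d_k, d_v) fixed-size recurrent state
    k_t    : (d_k,)     key for the current token
    v_t    : (d_v,)     value for the current token
    beta_t : scalar in (0, 1), the learned sigmoid forgetting gate
    """
    return beta_t * S_prev + torch.outer(k_t, v_t)

# The state stays (d_k, d_v) regardless of sequence length, which is what
# gives O(n) compute and constant memory on the DeltaNet layers.
d_k, d_v = 128, 128
S = torch.zeros(d_k, d_v)
for _ in range(1024):
    k, v = torch.randn(d_k), torch.randn(d_v)
    beta = torch.sigmoid(torch.randn(()))   # stand-in for the learned gate
    S = gated_linear_attention_step(S, k, v, beta)
```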
Larger head dims (256). Following Qwen3.5 and DeepSeek-V4, head_dim jumps from the traditional 128 to 256. Fewer heads with more per-head capacity, paired with aggressive GQA (4:1 query-to-KV ratio).
Partial RoPE (0.25). Only 25% of each head's dimensions receive rotary positional encoding. The remaining 75% pass through without positional rotation, creating a natural position-agnostic pathway that complements the DeltaNet layers' recurrent state.
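
A hedged sketch of the partial-RoPE split, assuming the standard rotate-half formulation; function and argument names are illustrative:

```python
import torch

def apply_partial_rope(x, cos, sin, partial_rotary_factor=0.25):
    """x: (..., head_dim); cos/sin: (..., rotary_dim) position tables."""
    head_dim = x.shape[-1]
    rotary_dim = int(head_dim * partial_rotary_factor)   # e.g. 64 of 256
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
    # rotate-half applied to the rotary slice only
    half = rotary_dim // 2
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    x_rot = x_rot * cos + torch.cat((-x2, x1), dim=-1) * sin
    # the remaining 75% of channels carry no positional rotation
    return torch.cat((x_rot, x_pass), dim=-1)
```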
Output gating. Every attention and DeltaNet output passes through a learned Swish gate: output = attention(x) * silu(W_gate * x).
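
A minimal sketch of that gate as a module; W_gate and the module layout are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwishOutputGate(nn.Module):
    """output = attention(x) * silu(W_gate @ x)"""
    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        self.w_gate = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor, mixer_out: torch.Tensor) -> torch.Tensor:
        # mixer_out is the attention (or DeltaNet) output for the same tokens
        return mixer_out * F.silu(self.w_gate(x))
```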
## Training
| Parameter | Value |
|---|---|
| Data tokens | 10B–200B (overtrained regime) |
| Corpus | FineWeb-Edu + StarCoder2 code |
| Training method | Token Superposition Training (Nous Research) |
| TST config | s=4 bags, r=0.5 fraction |
| Objective | Phase 1: multi-hot cross-entropy → Phase 2: standard NTP |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Peak LR | 3e-4 (from MuP sweep) |
| Schedule | Cosine decay to 10% peak |
| Weight decay | 0.1 |
| Batch size | ~4M tokens/step |
| Precision | BF16 weights, FP32 DeltaNet states |
| Location | Toronto, Ontario, Canada |
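
For reference, the schedule row above (cosine decay from the 3e-4 peak down to 10% of peak) corresponds roughly to the sketch below; the warmup length is an assumption, not a stated hyperparameter:

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, min_ratio=0.10, warmup_steps=2000):
    """Linear warmup (assumed), then cosine decay from peak_lr to 10% of peak."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)
```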
## Data Pipeline
BOREAL is built on Crucible — a submodular data selection framework:
- RUPS (Reward-Utility Pareto Skyline): multi-axis scoring fusion computing per-sample weights from quality and complexity axes.
- EESD (Embedding-Ensemble Submodular Diversity): lazy-greedy maximization of a weighted facility-location objective with a formal (1 - 1/e) approximation guarantee. Runs on 50K+ items without materializing the full pairwise similarity matrix (a minimal sketch follows this list).
- Length-debiasing: stratified selection across response-length quantiles.
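
A minimal sketch of weighted facility-location selection with the lazy-greedy trick, in the spirit of EESD; similarity here is plain cosine over unit-normalized embeddings, and all names are illustrative rather than the Crucible API:

```python
import heapq
import numpy as np

def lazy_greedy_facility_location(emb, weights, k):
    """emb: (n, d) unit-normalized embeddings; weights: (n,) per-sample weights.
    Greedily selects k indices maximizing sum_i w_i * max_{j in S} <emb_i, emb_j>."""
    n = emb.shape[0]
    covered = np.zeros(n)                        # best similarity achieved per item

    def gain(j):                                 # marginal gain of adding item j
        return float(weights @ np.maximum(emb @ emb[j] - covered, 0.0))

    # Lazy greedy: gains only shrink as the selection grows, so stale heap
    # entries are valid upper bounds and most items never need re-evaluation.
    heap = [(-gain(j), j) for j in range(n)]
    heapq.heapify(heap)
    selected = []
    while heap and len(selected) < k:
        _, j = heapq.heappop(heap)
        fresh = gain(j)
        if heap and fresh < -heap[0][0]:         # stale bound: push back, try again
            heapq.heappush(heap, (-fresh, j))
            continue
        selected.append(j)
        covered = np.maximum(covered, emb @ emb[j])
    return selected
```

Only one row of the similarity matrix is computed at a time, which is what lets this run on 50K+ items without the full pairwise matrix.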
Samples are scored by the DDM analyzer (Drift Diffusion Model):
- Reasoning traces segmented at cognitive boundaries
- Per-segment quality extracted (self-correction, verification, exploration density)
- Ornstein-Uhlenbeck evidence accumulation: dx = θ(μ - x)·dt + σ·dW + v·signal(t)·dt (sketched after this list)
- Sustained boundary crossings flag degenerate reasoning
- Samples classified into curriculum bins with per-sample loss weights
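
A hedged Euler-Maruyama sketch of that accumulator; parameter values and the boundary rule are illustrative, not the DDM analyzer's actual settings:

```python
import numpy as np

def ou_accumulate(signal, theta=1.0, mu=0.0, sigma=0.3, v=1.0,
                  dt=0.01, boundary=1.0, sustain=25, seed=0):
    """signal: per-segment evidence values. Returns the trajectory and whether
    the accumulator stayed past the boundary for `sustain` consecutive steps
    (the degenerate-reasoning flag)."""
    rng = np.random.default_rng(seed)
    x, run, traj = 0.0, 0, []
    for s in signal:
        dW = rng.normal(0.0, np.sqrt(dt))                  # Brownian increment
        x += theta * (mu - x) * dt + sigma * dW + v * s * dt
        traj.append(x)
        run = run + 1 if abs(x) >= boundary else 0
        if run >= sustain:
            return np.array(traj), True
    return np.array(traj), False
```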
This pipeline produced Harmonic-9B and Ornstein-27B, where 799 DDM-curated rows outperformed datasets 20–100x larger. The 8-factor quality model backs it: lexical diversity r=+0.967, semantic diversity r=+0.947, verb uniqueness r=+0.852.
## TST (Token Superposition Training)
Drop-in training acceleration from Nous Research (arXiv:2605.06546). First 50% of training uses superposed embeddings — 4 tokens averaged into one, predicting the next bag of 4 via multi-hot cross-entropy. 4x data throughput. Second 50% reverts to standard NTP. 1.5–2.5x wall-time reduction with zero architecture changes.
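
A minimal sketch of the phase-1 objective under stated assumptions (s=4 tokens averaged into one input position, uniform multi-hot target over the next bag); tensor layout and the exact loss form are assumptions, not the TST reference code:

```python
import torch
import torch.nn.functional as F

def superpose(token_embeddings, s=4):
    """(batch, seq, d) -> (batch, seq // s, d): average non-overlapping bags of s tokens."""
    b, t, d = token_embeddings.shape
    return token_embeddings[:, : t - t % s].view(b, t // s, s, d).mean(dim=2)

def multi_hot_cross_entropy(logits, bag_token_ids):
    """logits: (batch, bags, vocab); bag_token_ids: (batch, bags, s).
    Target is uniform over the (unique) token ids in each next bag."""
    target = torch.zeros_like(logits).scatter_(-1, bag_token_ids, 1.0)
    target = target / target.sum(dim=-1, keepdim=True)
    return -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```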
## Sovereign AI Scaling Ladder
Every model trained in Canada. Every weight learned from random init.
| Model | Params | Type | Context | Status |
|---|---|---|---|---|
| BOREAL-250M | 250M | Dense | 32K | Planned — seeking compute |
| BOREAL-2B | 2B | Dense DeltaNet | 64K | Planned |
| BOREAL-10B-MoE | ~10B / ~2B active | DeltaNet + MoE | 256K | Planned |
## Expectations
BOREAL-250M is an architecture validation tool. Expect:
- Coherent text generation
- Above-random benchmarks: 35–40% HellaSwag, 55–60% ARC-Easy
- Clean scaling curves through 200B+ tokens
- Long-context advantage over pure Transformer baselines at 8K+
For a model you'd actually use, see BOREAL-2B.
## License
Apache 2.0. Built for Canadian researchers, startups, and institutions. No strings. No API keys. No foreign dependency.
## Author
Built by DJLougen / GestaltLabs.
PhD candidate in visual neuroscience, University of Toronto. Pretraining on a DGX Spark in a Toronto apartment. Compute self-funded. No institutional backing. No cloud credits. Just a thesis to finish and a conviction that Canada should own its own models.
