---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- pretraining
- tst
- crucible
- ddm
- submodular
- data-curation
- sovereign-ai
- canadian-ai
- edge
- efficient
- canada
pipeline_tag: text-generation
---
# BOREAL-250M — Sovereign Canadian AI

**B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers

Built in Toronto. Trained on Canadian soil. Not dependent on anyone's compute,
anyone's models, or anyone's permission.

A 250M-parameter dense hybrid language model pretrained from scratch. Built on the
Gated DeltaNet architecture — the same hybrid linear-attention design that powers
Qwen3.5 and Qwen3.6 — and trained with Token Superposition Training (TST) for
maximum data efficiency per GPU-hour.

BOREAL-250M is the smallest member of the BOREAL family and the first proof point
in a sovereign Canadian AI pipeline. It validates that a single researcher on a
single GPU in Toronto can build competitive model architectures without relying
on US hyperscaler compute, Chinese base models, or EU consortia. The boreal forest
covers 60% of Canada. BOREAL models cover the gap between Canadian AI ambition
and imported infrastructure.

## Why Canadian AI Sovereignty Matters

Every major Canadian AI deployment today runs on someone else's model. Qwen
(Alibaba). Llama (Meta). DeepSeek. Mistral. Claude. Canada produces world-class
AI researchers — UofT, Mila, Vector, Amii — and ships them to San Francisco
and Beijing. The models, the compute, and the decisions about what gets built
stay elsewhere.

BOREAL is a bet that this doesn't have to be true. A single DGX Spark in a
Toronto apartment, an Apache 2.0 license, and an architecture that combines
proven innovations from open research. No distillation from proprietary models.
No dependency on anyone's API. Built here. Owned here.

## Architecture

| Component | Detail |
|-----------|--------|
| **Type** | Dense hybrid — Gated DeltaNet + GQA |
| **Parameters** | 250M |
| **Hidden size** | 1,024 |
| **Layers** | 12 (9 DeltaNet + 3 full attention) |
| **Ratio** | 3:1 linear-to-full attention |
| **Full attention** | GQA: 8 query heads, 2 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| **Conv kernel** | 4 (local context mixing) |
| **FFN** | SwiGLU, intermediate=3,072 |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated attention outputs |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 32,768 tokens native |
| **MTP** | 1 multi-token prediction head |

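
For reference, the sketch below collects the table above as a plain Python dictionary. The key names are illustrative, not the model's actual config schema; consult the released config for the real field names.

```python
# Illustrative hyperparameters transcribed from the architecture table.
# Key names are ours, not the model's actual config schema.
BOREAL_250M_CONFIG = {
    "hidden_size": 1024,
    "num_hidden_layers": 12,        # 9 Gated DeltaNet + 3 full-attention layers
    "num_full_attention_layers": 3, # 3:1 linear-to-full ratio
    "num_attention_heads": 8,       # GQA query heads (full-attention layers)
    "num_key_value_heads": 2,
    "attention_head_dim": 256,
    "deltanet_qk_heads": 8,
    "deltanet_v_heads": 16,
    "deltanet_head_dim": 128,
    "conv_kernel_size": 4,          # short conv for local context mixing
    "intermediate_size": 3072,      # SwiGLU FFN
    "rms_norm_eps": 1e-6,
    "rope_theta": 10_000_000,
    "partial_rotary_factor": 0.25,
    "vocab_size": 151936,
    "max_position_embeddings": 32768,
    "num_mtp_heads": 1,
}
```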
### Architecture Rationale

**Gated DeltaNet over pure attention.** 75% of layers use linear attention
with data-dependent forgetting gates. Each DeltaNet layer maintains a
fixed-size recurrent state updated by a gated delta rule,
S_t = alpha_t * S_{t-1} * (I - beta_t * k_t k_t^T) + beta_t * v_t k_t^T,
where alpha_t is a learned sigmoid gate controlling information retention and
beta_t is a learned write strength. The result: O(n) cost on 75% of layers,
enabling native long-context processing without quadratic memory blowup.

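
For readers who want the recurrence spelled out, here is a minimal per-token PyTorch reference of the gated delta-rule update above. It is illustrative only: the function and variable names are ours, and real training kernels compute this update chunkwise in parallel rather than token by token.

```python
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """Naive per-token reference for the gated delta-rule recurrence.

    q, k:  (T, d_k) queries and keys (keys assumed L2-normalized)
    v:     (T, d_v) values
    alpha: (T,) sigmoid retention gate in (0, 1)
    beta:  (T,) sigmoid write strength in (0, 1)
    Returns per-token readouts o_t = S_t q_t with shape (T, d_v).
    """
    d_k, d_v = k.shape[-1], v.shape[-1]
    S = torch.zeros(d_v, d_k)                      # fixed-size recurrent state
    outputs = []
    for t in range(k.shape[0]):
        # decay the old state and erase the stale value stored under k_t ...
        S = alpha[t] * (S - beta[t] * torch.outer(S @ k[t], k[t]))
        # ... then write the new key/value association
        S = S + beta[t] * torch.outer(v[t], k[t])
        outputs.append(S @ q[t])
    return torch.stack(outputs)
```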
**Larger head dims (256).** Following Qwen3.5 and DeepSeek-V4, head_dim jumps
from the traditional 128 to 256. Fewer heads with more per-head capacity,
paired with aggressive GQA (4:1 query-to-KV ratio).

**Partial RoPE (0.25).** Only 25% of each head's dimensions receive rotary
positional encoding. The remaining 75% pass through with no positional
rotation, leaving a position-agnostic pathway that complements the DeltaNet
layers' recurrent state.

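
As a concrete illustration of the partial rotary factor, the sketch below rotates only the first quarter of each head dimension. The helper name and the cos/sin cache layout are assumptions, not the model's exact implementation.

```python
import torch

def apply_partial_rope(x, cos, sin, rotary_frac=0.25):
    """Rotate only the first `rotary_frac` of each head's dimensions.

    x:        (batch, heads, seq, head_dim)
    cos, sin: (seq, rot_dim), a standard RoPE cache with
              rot_dim = int(head_dim * rotary_frac), values repeated across halves
    """
    rot_dim = int(x.shape[-1] * rotary_frac)
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]

    # rotate-half formulation, applied to the rotary slice only
    x1, x2 = x_rot.chunk(2, dim=-1)
    x_rot = x_rot * cos + torch.cat((-x2, x1), dim=-1) * sin

    # the remaining 75% of dimensions carry no positional rotation
    return torch.cat((x_rot, x_pass), dim=-1)
```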
**Output gating.** Every attention and DeltaNet output passes through a
learned Swish gate: `output = attention(x) * silu(W_gate * x)`.

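
In module form, that gate reads as follows (a minimal sketch; the class and attribute names are ours):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwishOutputGate(nn.Module):
    """output = mixer(x) * silu(W_gate x), applied after attention or DeltaNet."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.w_gate = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, mixer_out, x):
        # mixer_out: the attention/DeltaNet output; x: the layer's input
        return mixer_out * F.silu(self.w_gate(x))
```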
## Training

| Parameter | Value |
|-----------|-------|
| **Data tokens** | 10B–200B (overtrained regime) |
| **Corpus** | FineWeb-Edu + StarCoder2 code |
| **Training method** | Token Superposition Training (Nous Research) |
| **TST config** | s=4 (tokens per bag), r=0.5 (superposed fraction of training) |
| **Objective** | Phase 1: multi-hot cross-entropy → Phase 2: standard NTP |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 3e-4 (from µP sweep) |
| **Schedule** | Cosine decay to 10% of peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states |
| **Location** | Toronto, Ontario, Canada |

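
A minimal PyTorch sketch of the optimizer and schedule rows above, assuming plain AdamW plus a cosine decay that bottoms out at 10% of the peak learning rate. Warmup and any µP scaling are omitted; `model` and `total_steps` are placeholders.

```python
import math
import torch

def build_optimizer_and_schedule(model, total_steps, peak_lr=3e-4):
    """Hypothetical AdamW + cosine-to-10%-of-peak setup matching the table."""
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=peak_lr,
        betas=(0.9, 0.95),
        weight_decay=0.1,
    )

    def cosine_to_10pct(step):
        # multiplicative LR factor: 1.0 at step 0, decaying to 0.1 at total_steps
        progress = min(step / max(total_steps, 1), 1.0)
        return 0.1 + 0.45 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, cosine_to_10pct)
    return optimizer, scheduler
```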
### Data Pipeline

BOREAL is built on **Crucible** — a submodular data selection framework:

- **RUPS** (Reward-Utility Pareto Skyline): multi-axis scoring fusion
  computing per-sample weights from quality and complexity axes.
- **EESD** (Embedding-Ensemble Submodular Diversity): lazy-greedy
  maximization of a weighted facility-location objective with a formal
  (1 - 1/e) approximation guarantee. Runs on 50K+ items without
  materializing the full pairwise similarity matrix (a minimal selection
  sketch follows this list).
- **Length-debiasing**: stratified selection across response-length
  quantiles.

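
To make the EESD step concrete, here is a small illustrative lazy-greedy selector for the weighted facility-location objective. It is a sketch under simplifying assumptions (a single embedding model, cosine similarity on L2-normalized vectors); the function and variable names are not Crucible's API.

```python
import heapq
import numpy as np

def lazy_greedy_facility_location(X, weights, k):
    """Select k items maximizing f(S) = sum_i w_i * max_{j in S} sim(i, j).

    X: (n, d) L2-normalized embeddings; weights: (n,) per-sample weights.
    Similarity rows are computed on demand, so the full n x n matrix is
    never materialized.
    """
    n = X.shape[0]
    coverage = np.zeros(n)            # best similarity of each item to the selected set
    heap = [(-np.inf, j) for j in range(n)]   # (negated stale gain bound, candidate)
    heapq.heapify(heap)
    selected = []
    while len(selected) < k and heap:
        _, j = heapq.heappop(heap)
        sim_j = X @ X[j]              # one similarity row, computed lazily
        gain = float(np.sum(weights * np.maximum(sim_j - coverage, 0.0)))
        # lazy evaluation: accept j only if its fresh gain still beats the
        # best remaining (possibly stale) bound on the heap
        if not heap or gain >= -heap[0][0]:
            selected.append(j)
            coverage = np.maximum(coverage, sim_j)
        else:
            heapq.heappush(heap, (-gain, j))
    return selected
```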
Samples are scored by the **DDM analyzer** (Drift Diffusion Model):

1. Reasoning traces are segmented at cognitive boundaries
2. Per-segment quality signals are extracted (self-correction, verification,
   exploration density)
3. Ornstein-Uhlenbeck evidence accumulation: `dx = θ(μ - x)·dt + σ·dW + v·signal(t)·dt`
   (a discretized sketch follows this list)
4. Sustained boundary crossings flag degenerate reasoning
5. Samples are classified into curriculum bins with per-sample loss weights

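
The sketch below discretizes the accumulation step with Euler-Maruyama updates. The parameter values, the `signal` input, and the sustained-crossing rule are illustrative assumptions, not the analyzer's actual configuration.

```python
import numpy as np

def ou_evidence(signal, theta=1.0, mu=0.0, sigma=0.3, v=1.0, dt=0.05,
                threshold=1.0, seed=0):
    """Euler-Maruyama discretization of dx = θ(μ - x)·dt + σ·dW + v·signal(t)·dt.

    `signal` is a per-segment quality signal; returns the evidence trajectory
    and a flag for sustained boundary crossings (here: any threshold crossing
    held for 3+ consecutive steps). Defaults are illustrative only.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(len(signal) + 1)
    for t, s in enumerate(signal):
        dW = rng.normal(0.0, np.sqrt(dt))
        x[t + 1] = x[t] + theta * (mu - x[t]) * dt + sigma * dW + v * s * dt
    crossed = np.abs(x) > threshold
    sustained = any(crossed[t] and crossed[t + 1] and crossed[t + 2]
                    for t in range(len(x) - 2))
    return x, sustained
```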
This pipeline produced Harmonic-9B and Ornstein-27B, where 799 DDM-curated
rows outperformed datasets 20–100x larger. An 8-factor quality model backs
the selection: lexical diversity r=+0.967, semantic diversity r=+0.947, verb
uniqueness r=+0.852.

### TST (Token Superposition Training)

Drop-in training acceleration from Nous Research (arXiv:2605.06546). The first
50% of training uses superposed embeddings: 4 tokens are averaged into a single
input position, and the model predicts the next bag of 4 tokens via multi-hot
cross-entropy, giving 4x data throughput. The second 50% reverts to standard
next-token prediction. Net effect: a 1.5–2.5x wall-time reduction with zero
architecture changes.

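
A minimal sketch of the phase-1 objective, reading "multi-hot cross-entropy" as a binary cross-entropy over the vocabulary (an assumption; the paper's exact loss may differ). `embed` and `lm_head` stand in for the model's embedding and output projection, with the backbone omitted for brevity.

```python
import torch
import torch.nn.functional as F

def superposed_step(embed, lm_head, token_ids, s=4):
    """Illustrative phase-1 TST step: average each bag of s tokens into one
    input embedding and score the *next* bag with a multi-hot objective.

    token_ids: (batch, seq) with seq divisible by s.
    """
    b, seq = token_ids.shape
    bags = token_ids.view(b, seq // s, s)       # (b, n_bags, s)
    inputs = embed(bags).mean(dim=2)            # average s embeddings per bag
    logits = lm_head(inputs[:, :-1])            # predict the following bag
    # multi-hot targets: 1 for every token id present in the next bag
    targets = torch.zeros_like(logits).scatter_(2, bags[:, 1:], 1.0)
    return F.binary_cross_entropy_with_logits(logits, targets)
```

In phase 2, the same model trains with ordinary next-token cross-entropy on un-bagged sequences.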
## Sovereign AI Scaling Ladder

Every model trained in Canada. Every weight learned from random init.

| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **BOREAL-250M** | 250M | Dense | 32K | Planned — seeking compute |
| **[BOREAL-2B](https://huggingface.co/GestaltLabs/BOREAL-2B)** | 2B | Dense DeltaNet | 64K | Planned |
| **[BOREAL-10B-MoE](https://huggingface.co/GestaltLabs/BOREAL-10B-MoE)** | ~10B / ~2B active | DeltaNet + MoE | 256K | Planned |

## Expectations

BOREAL-250M is an architecture validation tool. Expect:

- Coherent text generation
- Above-random benchmarks: 35–40% HellaSwag, 55–60% ARC-Easy
- Clean scaling curves through 200B+ tokens
- Long-context advantage over pure Transformer baselines at 8K+ tokens

For a model you'd actually use, see [BOREAL-2B](https://huggingface.co/GestaltLabs/BOREAL-2B).

## License

Apache 2.0. Built for Canadian researchers, startups, and institutions.
No strings. No API keys. No foreign dependency.

## Author

Built by [DJLougen](https://huggingface.co/DJLougen) / [GestaltLabs](https://huggingface.co/GestaltLabs).

PhD candidate in visual neuroscience, University of Toronto.
Pretraining on a DGX Spark in a Toronto apartment.
Compute self-funded. No institutional backing. No cloud credits.
Just a thesis to finish and a conviction that Canada should
own its own models.

[☕ Support sovereign AI on Ko-fi](https://ko-fi.com/djlougen)