---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- pretraining
- tst
- crucible
- ddm
- submodular
- data-curation
- sovereign-ai
- canadian-ai
- edge
- efficient
- canada
pipeline_tag: text-generation
base_model: GestaltLabs/BOREAL-250M
---

![BOREAL](https://huggingface.co/GestaltLabs/BOREAL-250M/resolve/main/Boreal.png)

# BOREAL-250M: Sovereign Canadian AI

**B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers

Built in Toronto. Trained on Canadian soil. Not dependent on anyone's compute, anyone's models, or anyone's permission.

A 250M-parameter dense hybrid language model pretrained from scratch. Built on the Gated DeltaNet architecture (the same hybrid linear-attention design that powers Qwen3.5 and Qwen3.6) and trained with Token Superposition Training (TST) for maximum data efficiency per GPU-hour.

BOREAL-250M is the smallest member of the BOREAL family and the first proof point in a sovereign Canadian AI pipeline. It validates that a single researcher on a single GPU in Toronto can build competitive model architectures without relying on US hyperscaler compute, Chinese base models, or EU consortia.

The boreal forest covers 60% of Canada. BOREAL models cover the gap between Canadian AI ambition and imported infrastructure.

## Why Canadian AI Sovereignty Matters

Every major Canadian AI deployment today runs on someone else's model. Qwen (Alibaba). Llama (Meta). DeepSeek. Mistral. Claude. Canada produces world-class AI researchers at UofT, Mila, Vector, and Amii, then ships them to San Francisco and Beijing. The models, the compute, and the decisions about what gets built stay elsewhere.

BOREAL is a bet that this doesn't have to be true. A single DGX Spark in a Toronto apartment, an Apache 2.0 license, and an architecture that combines proven innovations from open research. No distillation from proprietary models. No dependency on anyone's API. Built here. Owned here.

## Architecture

| Component | Detail |
|-----------|--------|
| **Type** | Dense hybrid: Gated DeltaNet + GQA |
| **Parameters** | 250M |
| **Hidden size** | 1,024 |
| **Layers** | 12 (9 DeltaNet + 3 full attention) |
| **Ratio** | 3:1 linear-to-full attention |
| **Full attention** | GQA: 8 query heads, 2 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| **Conv kernel** | 4 (local context mixing) |
| **FFN** | SwiGLU, intermediate=3,072 |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated attention outputs |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 32,768 tokens native |
| **MTP** | 1 multi-token prediction head |

### Architecture Rationale

**Gated DeltaNet over pure attention.** 75% of layers use linear attention with data-dependent forgetting gates. Each DeltaNet layer maintains a fixed-size recurrent state `S_t = β_t · S_{t-1} + k_t ⊗ v_t`, where `β_t` is a learned sigmoid gate controlling information retention. The result: O(n) cost on 75% of layers, enabling native long-context processing without quadratic memory blowup. Minimal sketches of the recurrence, the partial-RoPE split, and the output gate follow at the end of this section.

**Larger head dims (256).** Following Qwen3.5 and DeepSeek-V4, head_dim jumps from the traditional 128 to 256: fewer heads with more per-head capacity, paired with aggressive GQA (a 4:1 query-to-KV ratio).

**Partial RoPE (0.25).** Only 25% of each head's dimensions receive rotary positional encoding. The remaining 75% pass through without explicit positional signal, creating a natural pathway for the DeltaNet layers' recurrent state.

**Output gating.** Every attention and DeltaNet output passes through a learned Swish gate: `output = attention(x) * silu(W_gate * x)`.
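A minimal single-head PyTorch sketch of the DeltaNet recurrence above. It implements only the simplified update from the rationale, not a full delta-rule kernel; real implementations use chunked parallel scans rather than this O(seq_len) loop, and all shapes are illustrative:

```python
import torch

def gated_linear_attention(q, k, v, beta):
    """Per-head recurrence S_t = beta_t * S_{t-1} + k_t (outer) v_t, read out as y_t = S_t^T q_t.

    q, k: (seq_len, d_k); v: (seq_len, d_v); beta: (seq_len,) with values in (0, 1).
    """
    d_k, d_v = k.shape[-1], v.shape[-1]
    S = torch.zeros(d_k, d_v)                       # fixed-size state: memory does not grow with seq_len
    ys = []
    for t in range(q.shape[0]):
        S = beta[t] * S + torch.outer(k[t], v[t])   # forget old state, write new key-value pair
        ys.append(S.T @ q[t])                       # query the compressed state
    return torch.stack(ys)                          # (seq_len, d_v)

# Toy usage at the card's DeltaNet head_dim of 128:
q, k, v = (torch.randn(16, 128) for _ in range(3))
beta = torch.sigmoid(torch.randn(16))               # learned gate; random here for illustration
print(gated_linear_attention(q, k, v, beta).shape)  # torch.Size([16, 128])
```

The point of the sketch is the state shape: `S` stays (d_k, d_v) no matter how long the sequence grows, which is where the O(n) claim comes from.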
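A sketch of the partial-RoPE split, assuming the interleaved-pair rotation convention (actual layouts vary between implementations, and `apply_partial_rope` is an illustrative name, not BOREAL's code):

```python
import torch

def apply_partial_rope(x, cos, sin, rotary_frac=0.25):
    """Rotate only the first `rotary_frac` of head dims; pass the rest through unchanged.

    x: (seq_len, head_dim); cos, sin: (seq_len, rot // 2) from theta=10M frequencies.
    """
    rot = int(x.shape[-1] * rotary_frac)            # 64 of 256 dims at factor 0.25
    x_rot, x_pass = x[..., :rot], x[..., rot:]
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]     # even/odd interleaved pairs
    rotated = torch.stack((x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos), dim=-1).flatten(-2)
    return torch.cat((rotated, x_pass), dim=-1)     # position signal lives only in the first 25%

# Toy usage at the card's full-attention head_dim of 256:
seq_len, head_dim, rot = 16, 256, 64
inv_freq = 1.0 / (10_000_000.0 ** (torch.arange(0, rot, 2).float() / rot))  # theta = 10M
angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]         # (seq_len, 32)
q = torch.randn(seq_len, head_dim)
print(apply_partial_rope(q, angles.cos(), angles.sin()).shape)  # torch.Size([16, 256])
```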
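The output gate is a one-liner. A hedged sketch, assuming the gate projection maps hidden size to hidden size (the card does not pin down the gate width); module and parameter names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwishOutputGate(nn.Module):
    """Sketch of output = attention(x) * silu(W_gate * x) from the rationale above."""

    def __init__(self, hidden_size: int = 1024):   # 1,024 matches the card's hidden size
        super().__init__()
        self.w_gate = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # Gate is computed from the layer input x, then applied elementwise
        # to the attention (or DeltaNet) output.
        return attn_out * F.silu(self.w_gate(x))

gate = SwishOutputGate()
x, attn_out = torch.randn(2, 16, 1024), torch.randn(2, 16, 1024)  # (batch, seq, hidden)
print(gate(x, attn_out).shape)                                    # torch.Size([2, 16, 1024])
```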
## Training

| Parameter | Value |
|-----------|-------|
| **Data tokens** | 10B–200B (overtrained regime) |
| **Corpus** | FineWeb-Edu + StarCoder2 code |
| **Training method** | Token Superposition Training (Nous Research) |
| **TST config** | s=4 bags, r=0.5 fraction |
| **Objective** | Phase 1: multi-hot cross-entropy → Phase 2: standard NTP |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 3e-4 (from MuP sweep) |
| **Schedule** | Cosine decay to 10% of peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states |
| **Location** | Toronto, Ontario, Canada |

### Data Pipeline

BOREAL is built on **Crucible**, a submodular data selection framework:

- **RUPS** (Reward-Utility Pareto Skyline): multi-axis scoring fusion computing per-sample weights from quality and complexity axes.
- **EESD** (Embedding-Ensemble Submodular Diversity): lazy-greedy maximization of a weighted facility-location objective, with a formal (1 − 1/e) approximation guarantee. Runs on 50K+ items without materializing the full pairwise similarity matrix.
- **Length-debiasing**: stratified selection across response-length quantiles.

Samples are scored by the **DDM analyzer** (Drift Diffusion Model); step 3's accumulator is sketched after this list:

1. Reasoning traces are segmented at cognitive boundaries
2. Per-segment quality signals are extracted (self-correction, verification, exploration density)
3. Ornstein-Uhlenbeck evidence accumulation: `dx = θ(μ - x)·dt + σ·dW + v·signal(t)·dt`
4. Sustained boundary crossings flag degenerate reasoning
5. Samples are classified into curriculum bins with per-sample loss weights
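A minimal Euler-Maruyama integration of the step-3 accumulator. The parameter values, boundary, and toy signal are illustrative assumptions, not the analyzer's actual configuration:

```python
import numpy as np

def ou_accumulate(signal, theta=1.0, mu=0.0, sigma=0.3, v=1.0, dt=0.01, seed=0):
    """Integrate dx = theta*(mu - x)*dt + sigma*dW + v*signal(t)*dt step by step."""
    rng = np.random.default_rng(seed)
    x = np.zeros(len(signal) + 1)
    for t, s in enumerate(signal):
        dW = rng.normal(0.0, np.sqrt(dt))           # Brownian increment over dt
        x[t + 1] = x[t] + theta * (mu - x[t]) * dt + sigma * dW + v * s * dt
    return x

# Toy trace: good segments (+1) followed by a long degenerate run (-1).
signal = np.concatenate([np.ones(200), -np.ones(400)])
x = ou_accumulate(signal)
boundary = -0.5                                     # illustrative threshold
print(f"steps past boundary: {(x < boundary).sum()} of {len(x)}")
```

The mean-reverting term `θ(μ - x)` pulls the accumulator back toward baseline, so only a sustained negative quality signal drives it past the boundary, which is what lets the analyzer distinguish degenerate reasoning from momentary noise.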
This pipeline produced Harmonic-9B and Ornstein-27B, where 799 DDM-curated rows outperformed datasets 20–100x larger. The 8-factor quality model backs it: lexical diversity r=+0.967, semantic diversity r=+0.947, verb uniqueness r=+0.852.

### TST (Token Superposition Training)

Drop-in training acceleration from Nous Research (arXiv:2605.06546). The first 50% of training uses superposed embeddings: 4 token embeddings are averaged into one position, and the model predicts the next bag of 4 tokens via multi-hot cross-entropy, for 4x data throughput. The second 50% reverts to standard next-token prediction. The result is a 1.5–2.5x wall-time reduction with zero architecture changes.

## Sovereign AI Scaling Ladder

Every model trained in Canada. Every weight learned from random init.

| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **BOREAL-250M** | 250M | Dense | 32K | Planned (seeking compute) |
| **[BOREAL-2B](https://huggingface.co/GestaltLabs/BOREAL-2B)** | 2B | Dense DeltaNet | 64K | Planned |
| **[BOREAL-10B-MoE](https://huggingface.co/GestaltLabs/BOREAL-10B-MoE)** | ~10B / ~2B active | DeltaNet + MoE | 256K | Planned |

## Expectations

BOREAL-250M is an architecture validation tool. Expect:

- Coherent text generation
- Above-random benchmarks: 35–40% HellaSwag, 55–60% ARC-Easy
- Clean scaling curves through 200B+ tokens
- A long-context advantage over pure-Transformer baselines at 8K+ tokens

For a model you'd actually use, see BOREAL-2B.

## License

Apache 2.0. Built for Canadian researchers, startups, and institutions. No strings. No API keys. No foreign dependency.

## Author

Built by [DJLougen](https://huggingface.co/DJLougen) / [GestaltLabs](https://huggingface.co/GestaltLabs). PhD candidate in visual neuroscience, University of Toronto. Pretraining on a DGX Spark in a Toronto apartment. Compute self-funded. No institutional backing. No cloud credits. Just a thesis to finish and a conviction that Canada should own its own models.

[☕ Support sovereign AI on Ko-fi](https://ko-fi.com/djlougen)