---
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - boreal
  - deltanet
  - hybrid
  - linear-attention
  - swiglu
  - rmsnorm
  - rope
  - gqa
  - pretraining
  - tst
  - crucible
  - ddm
  - submodular
  - data-curation
  - sovereign-ai
  - canadian-ai
  - edge
  - efficient
  - canada
pipeline_tag: text-generation
base_model: GestaltLabs/BOREAL-250M
---

# BOREAL-250M — Sovereign Canadian AI

*Balanced Orthogonal Recurrent Expert Attention Layers*

Built in Toronto. Trained on Canadian soil. Not dependent on anyone's compute, anyone's models, or anyone's permission.

A 250M-parameter dense hybrid language model pretrained from scratch. Built on the Gated DeltaNet architecture — the same hybrid linear-attention design that powers Qwen3.5 and Qwen3.6 — trained with Token Superposition Training (TST) for maximum data efficiency per GPU-hour.

BOREAL-250M is the smallest member of the BOREAL family and the first proof point in a sovereign Canadian AI pipeline. It validates that a single researcher on a single GPU in Toronto can build competitive model architectures without relying on US hyperscaler compute, Chinese base models, or EU consortia. The boreal forest covers 60% of Canada. BOREAL models cover the gap between Canadian AI ambition and imported infrastructure.

## Why Canadian AI Sovereignty Matters

Every major Canadian AI deployment today runs on someone else's model. Qwen (Alibaba). Llama (Meta). DeepSeek. Mistral. Claude. Canada produces world-class AI researchers — UofT, Mila, Vector, Amii — and ships them to San Francisco and Beijing. The models, the compute, and the decisions about what gets built stay elsewhere.

BOREAL is a bet that this doesn't have to be true. A single DGX Spark in a Toronto apartment, an Apache 2.0 license, and an architecture that combines proven innovations from open research. No distillation from proprietary models. No dependency on anyone's API. Built here. Owned here.

## Architecture

| Component | Detail |
|---|---|
| Type | Dense hybrid — Gated DeltaNet + GQA |
| Parameters | 250M |
| Hidden size | 1,024 |
| Layers | 12 (9 DeltaNet + 3 full attention) |
| Ratio | 3:1 linear-to-full attention |
| Full attention | GQA: 8 query heads, 2 KV heads, head_dim=256 |
| DeltaNet | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| Conv kernel | 4 (local context mixing) |
| FFN | SwiGLU, intermediate=3,072 |
| Norm | RMSNorm, eps=1e-6 |
| Position | RoPE, theta=10M, partial_rotary_factor=0.25 |
| Output gate | Swish-gated attention outputs |
| Vocab | 151,936 (Qwen3 tokenizer) |
| Context | 32,768 tokens native |
| MTP | 1 multi-token prediction head |

### Architecture Rationale

**Gated DeltaNet over pure attention.** 75% of layers use linear attention with data-dependent forgetting gates. Each DeltaNet layer maintains a fixed-size recurrent state updated by the gated delta rule S_t = alpha_t * S_{t-1} (I - beta_t k_t k_tᵀ) + beta_t (v_t ⊗ k_t), where alpha_t is a learned sigmoid forgetting gate controlling retention and beta_t a learned write strength for the delta-rule correction. The result: O(n) on 75% of layers, enabling native long-context processing without quadratic memory blowup.
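One step of this state update can be sketched as follows: decay the state, erase the old association along the current key, then write the new key-value pair. This is a minimal NumPy illustration with scalar gates and small illustrative shapes, not the model's actual per-head kernel:

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule state update (illustrative shapes).

    S:     (d_v, d_k) recurrent state; readout for a query q is S @ q
    k, v:  (d_k,), (d_v,) unit-norm key and value for the current token
    alpha: scalar forgetting gate in (0, 1), controls retention
    beta:  scalar write strength in (0, 1)
    """
    d_k = k.shape[0]
    # Decay + erase the component along k, then write the new association
    S = alpha * S @ (np.eye(d_k) - beta * np.outer(k, k)) + beta * np.outer(v, k)
    return S
```

Because the state has fixed size (d_v x d_k), the per-token cost is constant regardless of sequence length, which is where the O(n) total cost comes from.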

**Larger head dims (256).** Following Qwen3.5 and DeepSeek-V4, head_dim jumps from the traditional 128 to 256. Fewer heads with more per-head capacity, paired with aggressive GQA (4:1 query-to-KV ratio).
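The GQA sharing pattern amounts to expanding each KV head across its group of query heads. A minimal sketch with the stated 8-query / 2-KV configuration (sequence length and the expansion helper are illustrative):

```python
import numpy as np

def repeat_kv(kv, n_rep):
    """Expand KV heads for grouped-query attention.

    kv: (num_kv_heads, seq, head_dim)
    Returns (num_kv_heads * n_rep, seq, head_dim), with each KV head
    repeated consecutively so it serves a contiguous group of query heads.
    """
    return np.repeat(kv, n_rep, axis=0)

# BOREAL-250M full-attention config: 8 query heads, 2 KV heads, head_dim=256
k = np.random.randn(2, 16, 256)
k_expanded = repeat_kv(k, 8 // 2)  # (8, 16, 256): each KV head serves 4 query heads
```

The KV cache stores only the 2 original heads; the expansion is free at attention time, which is what makes the 4:1 ratio cheap.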

**Partial RoPE (0.25).** Only 25% of each head's dimensions receive rotary positional encoding. The remaining 75% pass through position-agnostically, creating a natural position-independent pathway that complements the DeltaNet layers' recurrent state.
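Partial RoPE can be sketched as rotating only the first quarter of a head vector and concatenating the untouched remainder. The interleaved pairing and single-vector shape below are illustrative, not necessarily how the model's kernel lays out dimensions:

```python
import numpy as np

def apply_partial_rope(x, pos, theta=10_000_000.0, partial_rotary_factor=0.25):
    """Rotate only the first fraction of head dims; pass the rest through.

    x: (head_dim,) one head's vector at position `pos`.
    """
    head_dim = x.shape[0]
    rot_dim = int(head_dim * partial_rotary_factor)   # e.g. 64 of 256 dims
    freqs = theta ** (-np.arange(0, rot_dim, 2) / rot_dim)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0:rot_dim:2], x[1:rot_dim:2]           # interleaved (even, odd) pairs
    rotated = np.empty(rot_dim)
    rotated[0::2] = x1 * cos - x2 * sin
    rotated[1::2] = x1 * sin + x2 * cos
    return np.concatenate([rotated, x[rot_dim:]])     # remaining 75% untouched
```

At position 0 the rotation is the identity, and rotation never changes the vector's norm, so the position-agnostic 75% is exactly preserved at every position.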

**Output gating.** Every attention and DeltaNet output passes through a learned Swish gate: output = attention(x) * silu(W_gate * x).
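The gate is a direct translation of that formula; a minimal sketch (single-vector shapes are illustrative):

```python
import numpy as np

def silu(x):
    """Swish / SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def gated_output(attn_out, x, W_gate):
    """Elementwise Swish gate computed from the layer input x.

    attn_out: (d,) attention or DeltaNet output
    x:        (d,) layer input
    W_gate:   (d, d) learned gate projection
    """
    return attn_out * silu(W_gate @ x)
```

Because silu(0) = 0, the gate can fully suppress a head's contribution for a given input, which is the point of making it input-dependent.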

## Training

| Parameter | Value |
|---|---|
| Data tokens | 10B–200B (overtrained regime) |
| Corpus | FineWeb-Edu + StarCoder2 code |
| Training method | Token Superposition Training (Nous Research) |
| TST config | s=4 bags, r=0.5 fraction |
| Objective | Phase 1: multi-hot cross-entropy → Phase 2: standard NTP |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Peak LR | 3e-4 (from MuP sweep) |
| Schedule | Cosine decay to 10% of peak |
| Weight decay | 0.1 |
| Batch size | ~4M tokens/step |
| Precision | BF16 weights, FP32 DeltaNet states |
| Location | Toronto, Ontario, Canada |
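The schedule in the table (cosine from a 3e-4 peak down to 10% of peak) can be sketched as below; the warmup term is a common addition and an assumption here, not something the table specifies:

```python
import math

def cosine_lr(step, total_steps, peak_lr=3e-4, min_ratio=0.10, warmup_steps=0):
    """Cosine decay from peak_lr to min_ratio * peak_lr over total_steps.

    Optional linear warmup (assumed, not stated in the card) for the
    first `warmup_steps` steps.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)  # progress in [0, 1]
    min_lr = peak_lr * min_ratio
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

At step 0 this returns the full 3e-4 peak; at the final step it has decayed to 3e-5.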

## Data Pipeline

BOREAL is built on Crucible — a submodular data selection framework:

- **RUPS (Reward-Utility Pareto Skyline):** multi-axis scoring fusion computing per-sample weights from quality and complexity axes.

- **EESD (Embedding-Ensemble Submodular Diversity):** lazy-greedy maximization of a weighted facility-location objective with a formal (1 - 1/e) approximation guarantee. Runs on 50K+ items without materializing the full pairwise similarity matrix.

- **Length-debiasing:** stratified selection across response-length quantiles.
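To make the facility-location objective concrete, here is a plain greedy sketch of F(S) = Σᵢ wᵢ · max_{j∈S} sim[i, j]. It materializes the full similarity matrix and uses naive (not lazy) greedy, so it illustrates the objective rather than Crucible's EESD implementation:

```python
import numpy as np

def facility_location_greedy(sim, weights, k):
    """Greedy maximization of a weighted facility-location objective.

    sim:     (n, n) pairwise similarity matrix
    weights: (n,) per-sample weights (e.g. from RUPS-style scoring)
    Returns indices of k selected items.
    """
    n = sim.shape[0]
    selected = []
    best = np.zeros(n)  # best[i] = current coverage of item i by the selection
    for _ in range(k):
        # Marginal gain of adding each candidate column j
        gains = (weights[:, None] * np.maximum(sim, best[:, None])).sum(axis=0) \
                - (weights * best).sum()
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[:, j])
    return selected
```

Since the objective is monotone submodular, this greedy rule carries the standard (1 - 1/e) approximation guarantee the bullet above refers to; the lazy variant gives the same selection faster.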

Samples are scored by the DDM analyzer (Drift Diffusion Model):

  1. Reasoning traces segmented at cognitive boundaries
  2. Per-segment quality extracted (self-correction, verification, exploration density)
  3. Ornstein-Uhlenbeck evidence accumulation: dx = θ(μ - x)·dt + σ·dW + v·signal(t)·dt
  4. Sustained boundary crossings flag degenerate reasoning
  5. Samples classified into curriculum bins with per-sample loss weights
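Step 3 above can be sketched as an Euler-Maruyama simulation of the stated SDE, dx = θ(μ - x)·dt + σ·dW + v·signal(t)·dt. Parameter values here are illustrative, not the analyzer's actual settings:

```python
import numpy as np

def ou_accumulate(signal, theta=1.0, mu=0.0, sigma=0.1, v=1.0, dt=0.01, seed=0):
    """Euler-Maruyama simulation of OU evidence accumulation with drift input.

    signal: per-step array of segment-quality drift (the v·signal(t) term)
    Returns the accumulated-evidence trajectory x, length len(signal) + 1.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(len(signal) + 1)
    for t, s in enumerate(signal):
        dW = rng.normal(0.0, np.sqrt(dt))  # Brownian increment
        x[t + 1] = x[t] + theta * (mu - x[t]) * dt + sigma * dW + v * s * dt
    return x
```

With a sustained positive signal the trajectory relaxes toward μ + v·s/θ rather than growing without bound, which is what lets sustained boundary crossings (step 4) serve as a stable flag rather than an inevitability.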

This pipeline produced Harmonic-9B and Ornstein-27B, where 799 DDM-curated rows outperformed datasets 20–100x larger. The 8-factor quality model backs it: lexical diversity r=+0.967, semantic diversity r=+0.947, verb uniqueness r=+0.852.

## TST (Token Superposition Training)

Drop-in training acceleration from Nous Research (arXiv:2605.06546). The first 50% of training uses superposed embeddings: s=4 tokens averaged into one input vector, with the model predicting the next bag of 4 via multi-hot cross-entropy, for 4x data throughput. The second 50% reverts to standard next-token prediction. Net effect: 1.5–2.5x wall-time reduction with zero architecture changes.
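The phase-1 input/target construction can be sketched as follows. Shapes and the bag-alignment convention (predict bag t+1 from superposed input t) are illustrative assumptions, not the paper's reference code:

```python
import numpy as np

def superpose(token_embeddings, token_ids, s=4):
    """Build TST phase-1 inputs and targets (illustrative sketch).

    token_embeddings: (vocab, d) embedding table
    token_ids:        (seq,) with seq divisible by s
    Returns superposed inputs (n_bags - 1, d) and multi-hot targets
    (n_bags - 1, vocab) for the *next* bag.
    """
    vocab, d = token_embeddings.shape
    bags = token_ids.reshape(-1, s)                     # (n_bags, s)
    inputs = token_embeddings[bags].mean(axis=1)        # average s embeddings per bag
    targets = np.zeros((len(bags), vocab))
    targets[np.arange(len(bags))[:, None], bags] = 1.0  # multi-hot over each bag
    return inputs[:-1], targets[1:]                     # predict bag t+1 from bag t
```

Each optimizer step therefore consumes s times as many tokens as standard next-token prediction at the same sequence length, which is where the 4x data throughput comes from.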

## Sovereign AI Scaling Ladder

Every model trained in Canada. Every weight learned from random init.

| Model | Params | Type | Context | Status |
|---|---|---|---|---|
| BOREAL-250M | 250M | Dense | 32K | Planned — seeking compute |
| BOREAL-2B | 2B | Dense DeltaNet | 64K | Planned |
| BOREAL-10B-MoE | ~10B / ~2B active | DeltaNet + MoE | 256K | Planned |

## Expectations

BOREAL-250M is an architecture validation tool. Expect:

- Coherent text generation
- Above-random benchmarks: 35–40% HellaSwag, 55–60% ARC-Easy
- Clean scaling curves through 200B+ tokens
- Long-context advantage over pure Transformer baselines at 8K+

For a model you'd actually use, see BOREAL-2B.

## License

Apache 2.0. Built for Canadian researchers, startups, and institutions. No strings. No API keys. No foreign dependency.

## Author

Built by DJLougen / GestaltLabs.

PhD candidate in visual neuroscience, University of Toronto. Pretraining on a DGX Spark in a Toronto apartment. Compute self-funded. No institutional backing. No cloud credits. Just a thesis to finish and a conviction that Canada should own its own models.

☕ Support sovereign AI on Ko-fi