---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- pretraining
- tst
- crucible
- ddm
- submodular
- data-curation
- sovereign-ai
- canadian-ai
- edge
- efficient
- canada
pipeline_tag: text-generation
base_model: GestaltLabs/BOREAL-250M
---
![BOREAL](https://huggingface.co/GestaltLabs/BOREAL-250M/resolve/main/Boreal.png)
# BOREAL-250M — Sovereign Canadian AI
**B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers
Built in Toronto. Trained on Canadian soil. Not dependent on anyone's compute,
anyone's models, or anyone's permission.
A 250M-parameter dense hybrid language model pretrained from scratch. Built on the
Gated DeltaNet architecture — the same hybrid linear-attention design that powers
Qwen3.5 and Qwen3.6 — trained with Token Superposition Training (TST) for maximum
data efficiency per GPU-hour.
BOREAL-250M is the smallest member of the BOREAL family and the first proof point
in a sovereign Canadian AI pipeline. It validates that a single researcher on a
single GPU in Toronto can build competitive model architectures without relying
on US hyperscaler compute, Chinese base models, or EU consortia. The boreal forest
covers 60% of Canada. BOREAL models cover the gap between Canadian AI ambition
and imported infrastructure.
## Why Canadian AI Sovereignty Matters
Every major Canadian AI deployment today runs on someone else's model. Qwen
(Alibaba). Llama (Meta). DeepSeek. Mistral. Claude. Canada produces world-class
AI researchers — UofT, Mila, Vector, Amii — and ships them to San Francisco
and Beijing. The models, the compute, and the decisions about what gets built
stay elsewhere.
BOREAL is a bet that this doesn't have to be true. A single DGX Spark in a
Toronto apartment, an Apache 2.0 license, and an architecture that combines
proven innovations from open research. No distillation from proprietary models.
No dependency on anyone's API. Built here. Owned here.
## Architecture
| Component | Detail |
|-----------|--------|
| **Type** | Dense hybrid — Gated DeltaNet + GQA |
| **Parameters** | 250M |
| **Hidden size** | 1,024 |
| **Layers** | 12 (9 DeltaNet + 3 full attention) |
| **Ratio** | 3:1 linear-to-full attention |
| **Full attention** | GQA: 8 query heads, 2 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| **Conv kernel** | 4 (local context mixing) |
| **FFN** | SwiGLU, intermediate=3,072 |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated attention outputs |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 32,768 tokens native |
| **MTP** | 1 multi-token prediction head |
### Architecture Rationale
**Gated DeltaNet over pure attention.** 75% of layers use linear attention
with data-dependent forgetting gates. Each DeltaNet layer maintains a
fixed-size recurrent state `S_t = beta_t * S_{t-1} + k_t ⊗ v_t`, where `beta_t`
is a learned sigmoid gate controlling information retention. The result: O(n)
on 75% of layers, enabling native long-context processing without quadratic
memory blowup.
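A minimal PyTorch sketch of that simplified recurrence (generic shapes and names, not the release kernels):

```python
import torch

def gated_state_update(S, k, v, beta):
    """One step of the simplified gated update S_t = beta_t * S_{t-1} + k_t (outer) v_t.

    S:    (d_k, d_v) fixed-size recurrent state
    k:    (d_k,)     key at time t
    v:    (d_v,)     value at time t
    beta: scalar in (0, 1), the learned sigmoid forgetting gate
    """
    return beta * S + torch.outer(k, v)

# Toy rollout: the state stays (d_k, d_v) no matter how long the sequence gets.
d_k, d_v = 4, 8
S = torch.zeros(d_k, d_v)
for _ in range(1000):
    k, v = torch.randn(d_k), torch.randn(d_v)
    beta = torch.sigmoid(torch.randn(()))      # data-dependent gate
    S = gated_state_update(S, k, v, beta)
print(S.shape)                                  # torch.Size([4, 8])
```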
**Larger head dims (256).** Following Qwen3.5 and DeepSeek-V4, head_dim jumps
from the traditional 128 to 256. Fewer heads with more per-head capacity,
paired with aggressive GQA (4:1 query-to-KV ratio).
**Partial RoPE (0.25).** Only 25% of each head's dimensions receive rotary
positional encoding. The remaining 75% pass through without positional
rotation, giving the model position-agnostic channels that pair naturally
with the DeltaNet layers' recurrent state.
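A minimal sketch of what `partial_rotary_factor=0.25` means in practice, using the theta=10M value from the table (illustrative code, not the model's implementation):

```python
import torch

def apply_partial_rope(x, cos, sin, partial_rotary_factor=0.25):
    """Rotate only the first fraction of each head's dimensions.

    x:        (seq, head_dim) activations for one head
    cos, sin: (seq, rot_dim)  precomputed rotary tables
    """
    rot_dim = int(x.shape[-1] * partial_rotary_factor)
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]    # 25% rotated, 75% untouched
    x1, x2 = x_rot.chunk(2, dim=-1)                       # rotate-half formulation
    x_rot = x_rot * cos + torch.cat((-x2, x1), dim=-1) * sin
    return torch.cat((x_rot, x_pass), dim=-1)

# Build rotary tables for head_dim=256, so only the first 64 dims carry positions.
seq, head_dim, theta = 8, 256, 10_000_000
rot_dim = head_dim // 4
inv_freq = 1.0 / (theta ** (torch.arange(0, rot_dim, 2).float() / rot_dim))
freqs = torch.outer(torch.arange(seq).float(), inv_freq)
cos, sin = map(lambda f: torch.cat((f, f), dim=-1), (freqs.cos(), freqs.sin()))
out = apply_partial_rope(torch.randn(seq, head_dim), cos, sin)
print(out.shape)  # torch.Size([8, 256])
```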
**Output gating.** Every attention and DeltaNet output passes through a
learned Swish gate: `output = attention(x) * silu(W_gate * x)`.
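A small module sketch of that gate (names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwishOutputGate(nn.Module):
    """output = mixer(x) * silu(W_gate @ x), applied to attention and DeltaNet outputs."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, mixer_out: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # mixer_out: the attention (or DeltaNet) output for the same tokens x
        return mixer_out * F.silu(self.gate_proj(x))
```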
## Training
| Parameter | Value |
|-----------|-------|
| **Data tokens** | 10B–200B (overtrained regime) |
| **Corpus** | FineWeb-Edu + StarCoder2 code |
| **Training method** | Token Superposition Training (Nous Research) |
| **TST config** | s=4 bags, r=0.5 fraction |
| **Objective** | Phase 1: multi-hot cross-entropy → Phase 2: standard NTP |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 3e-4 (from MuP sweep) |
| **Schedule** | Cosine decay to 10% peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states |
| **Location** | Toronto, Ontario, Canada |
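For reference, the optimizer and schedule rows above translate roughly into standard PyTorch pieces like this (a sketch; the warmup length is an assumption, since the table does not give one):

```python
import math
import torch

def build_optimizer_and_schedule(model, total_steps, warmup_steps=1_000):
    # AdamW with the betas and weight decay from the table.
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4,
                            betas=(0.9, 0.95), weight_decay=0.1)

    def lr_lambda(step):
        if step < warmup_steps:                          # linear warmup (assumed)
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine to 10% of peak

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```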
### Data Pipeline
BOREAL is built on **Crucible** — a submodular data selection framework:
- **RUPS** (Reward-Utility Pareto Skyline): multi-axis scoring fusion
computing per-sample weights from quality and complexity axes.
- **EESD** (Embedding-Ensemble Submodular Diversity): lazy-greedy
  maximization of a weighted facility-location objective with the formal
  (1 - 1/e) approximation guarantee. Runs on 50K+ items without
  materializing the full pairwise similarity matrix (a minimal sketch
  follows this list).
- **Length-debiasing**: stratified selection across response-length
quantiles.
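A minimal sketch of the lazy-greedy, weighted facility-location step from the EESD bullet above, computing one similarity row at a time rather than the full pairwise matrix (not the Crucible implementation; names and defaults are illustrative):

```python
import heapq
import numpy as np

def lazy_greedy_facility_location(X, weights, k):
    """Pick k rows of X maximizing F(A) = sum_i w_i * max_{j in A} sim(i, j)
    with cosine similarity, refreshing marginal gains lazily via a max-heap."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)           # cosine geometry
    n = X.shape[0]
    cur_max = np.zeros(n)                                       # best coverage per item so far
    # Initial exact gains (similarities clipped at zero), one O(n*d) row at a time.
    heap = [(-float(weights @ np.maximum(X @ X[j], 0.0)), j, 0) for j in range(n)]
    heapq.heapify(heap)
    selected = []
    while len(selected) < k and heap:
        neg_gain, j, stamp = heapq.heappop(heap)
        if stamp == len(selected):                              # gain is up to date: accept
            selected.append(j)
            cur_max = np.maximum(cur_max, X @ X[j])
        else:                                                   # stale upper bound: refresh, push back
            gain = float(weights @ np.maximum(X @ X[j] - cur_max, 0.0))
            heapq.heappush(heap, (-gain, j, len(selected)))
    return selected

# Toy usage: 1,000 items, 32-dim embeddings, uniform weights, pick 10.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 32))
print(lazy_greedy_facility_location(X, np.ones(1_000), 10))
```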
Samples are scored by the **DDM analyzer** (Drift Diffusion Model):
1. Reasoning traces segmented at cognitive boundaries
2. Per-segment quality extracted (self-correction, verification,
exploration density)
3. Ornstein-Uhlenbeck evidence accumulation: `dx = θ(μ - x)·dt + σ·dW + v·signal(t)·dt` (sketched after this list)
4. Sustained boundary crossings flag degenerate reasoning
5. Samples classified into curriculum bins with per-sample loss weights
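The OU accumulation in step 3 can be simulated with a few lines of Euler-Maruyama integration (parameter values here are made up for illustration, not the analyzer's settings):

```python
import numpy as np

def ou_accumulate(signal, theta=1.0, mu=0.0, sigma=0.3, v=1.0, dt=0.01, bound=0.6, seed=0):
    """Integrate dx = theta*(mu - x)*dt + sigma*dW + v*signal(t)*dt and report
    the steps where |x| crosses the boundary (candidate degenerate-reasoning flags)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(len(signal) + 1)
    for t, s in enumerate(signal):
        dW = rng.normal(0.0, np.sqrt(dt))                     # Brownian increment
        x[t + 1] = x[t] + theta * (mu - x[t]) * dt + sigma * dW + v * s * dt
    return x, np.flatnonzero(np.abs(x) >= bound)

# Example: a per-segment quality signal that turns negative midway through a trace.
signal = np.concatenate([np.full(200, 0.2), np.full(200, -0.8)])
x, crossings = ou_accumulate(signal)
print("first boundary crossing:", int(crossings[0]) if len(crossings) else None)
```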
This pipeline produced Harmonic-9B and Ornstein-27B, where 799 DDM-curated
rows outperformed datasets 20–100x larger. An 8-factor quality model backs
the scoring: lexical diversity r=+0.967, semantic diversity r=+0.947, verb
uniqueness r=+0.852.
### TST (Token Superposition Training)
Drop-in training acceleration from Nous Research (arXiv:2605.06546). First
50% of training uses superposed embeddings — 4 tokens averaged into one,
predicting the next bag of 4 via multi-hot cross-entropy. 4x data throughput.
Second 50% reverts to standard NTP. 1.5–2.5x wall-time reduction with zero
architecture changes.
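A toy sketch of the Phase 1 mechanics, under the assumption that bag composition is simple embedding averaging and the multi-hot target is normalized into a distribution (the paper has the authoritative recipe):

```python
import torch
import torch.nn.functional as F

def superpose_and_targets(token_ids, embedding, vocab_size, s=4):
    """Fold every bag of s consecutive tokens into one averaged embedding and
    build a multi-hot target over the next bag."""
    ids = token_ids[: (len(token_ids) // s) * s].view(-1, s)      # (num_bags, s)
    superposed = embedding(ids).mean(dim=1)                       # (num_bags, d)
    targets = torch.zeros(len(ids), vocab_size).scatter_(1, ids, 1.0)
    targets = targets / targets.sum(dim=1, keepdim=True)          # multi-hot -> distribution
    return superposed[:-1], targets[1:]                           # predict the *next* bag

# Toy usage with a random embedding table and a stand-in output head.
vocab, d = 100, 16
emb = torch.nn.Embedding(vocab, d)
inputs, targets = superpose_and_targets(torch.randint(0, vocab, (64,)), emb, vocab)
logits = inputs @ torch.randn(d, vocab)
loss = F.cross_entropy(logits, targets)       # soft-label cross-entropy over bags
print(inputs.shape, targets.shape, round(loss.item(), 3))
```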
## Sovereign AI Scaling Ladder
Every model trained in Canada. Every weight learned from random init.
| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **BOREAL-250M** | 250M | Dense | 32K | Planned — seeking compute |
| **[BOREAL-2B](https://huggingface.co/GestaltLabs/BOREAL-2B)** | 2B | Dense DeltaNet | 64K | Planned |
| **[BOREAL-10B-MoE](https://huggingface.co/GestaltLabs/BOREAL-10B-MoE)** | ~10B / ~2B active | DeltaNet + MoE | 256K | Planned |
## Expectations
BOREAL-250M is an architecture validation tool. Expect:
- Coherent text generation
- Above-random benchmarks: 35–40% HellaSwag, 55–60% ARC-Easy
- Clean scaling curves through 200B+ tokens
- Long-context advantage over pure Transformer baselines at 8K+
For a model you'd actually use, see BOREAL-2B.
## License
Apache 2.0. Built for Canadian researchers, startups, and institutions.
No strings. No API keys. No foreign dependency.
## Author
Built by [DJLougen](https://huggingface.co/DJLougen) / [GestaltLabs](https://huggingface.co/GestaltLabs).
PhD candidate in visual neuroscience, University of Toronto.
Pretraining on a DGX Spark in a Toronto apartment.
Compute self-funded. No institutional backing. No cloud credits.
Just a thesis to finish and a conviction that Canada should
own its own models.
[☕ Support sovereign AI on Ko-fi](https://ko-fi.com/djlougen)