BOREAL-2B – Canadian Sovereign AI
Balanced Orthogonal Recurrent Expert Attention Layers
Built in Toronto. Apache 2.0.

```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("GestaltLabs/BOREAL-2B", dtype="auto")
```
A 2-billion-parameter dense hybrid language model pretrained from scratch on 500B–2T tokens. BOREAL-2B is the first model in the BOREAL family intended for actual downstream use: the one you download, fine-tune, quantize, and build on. It carries forward the Gated DeltaNet architecture validated by BOREAL-250M and scales it to a size where benchmarks become meaningful.
It targets competitive performance against Qwen3-1.7B and SmolLM2-1.7B while offering native 64K context, 4x what pure Transformers at this scale can practically support.
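At the heart of the architecture is the gated delta rule: each DeltaNet layer maintains a fast-weight state that is decayed by a forget gate and corrected with a rank-1 update per token. A minimal single-head sketch of that recurrence, following the standard Gated DeltaNet formulation from the literature (shapes, gate names, and the sequential loop are illustrative; the actual layers use 8 QK heads, 16 V heads, and a chunked kernel):

```python
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """Sequential gated delta rule over one head.

    q, k:  (T, d_k) queries/keys (assumed L2-normalized)
    v:     (T, d_v) values
    alpha: (T,) forget gate in (0, 1)
    beta:  (T,) write strength in (0, 1)
    Returns per-token outputs of shape (T, d_v).
    """
    T, d_k = k.shape
    S = torch.zeros(d_k, v.shape[-1])   # recurrent fast-weight state
    outs = []
    for t in range(T):
        kt, vt = k[t], v[t]
        # S <- alpha * (I - beta k k^T) S + beta k v^T  (delta update)
        S = alpha[t] * (S - beta[t] * torch.outer(kt, kt @ S))
        S = S + beta[t] * torch.outer(kt, vt)
        outs.append(q[t] @ S)           # read the state with the query
    return torch.stack(outs)
```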
Architecture
| Component | Detail |
|---|---|
| Type | Dense hybrid: Gated DeltaNet + GQA |
| Parameters | ~2B |
| Hidden size | 2,048 |
| Layers | 32 (24 DeltaNet + 8 full attention) |
| Ratio | 3:1 linear-to-full attention |
| Full attention | GQA: 16 query heads, 4 KV heads, head_dim=256 |
| DeltaNet | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| Conv kernel | 4 |
| FFN | SwiGLU, intermediate=6,144 |
| Norm | RMSNorm, eps=1e-6 |
| Position | RoPE, theta=10M, partial_rotary_factor=0.25 |
| Output gate | Swish-gated |
| Vocab | 151,936 (Qwen3 tokenizer) |
| Context | 65,536 tokens native, extensible to 256K |
| MTP | 1 multi-token prediction head |
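One detail worth unpacking from the table: with partial_rotary_factor=0.25 and head_dim=256, only the first 64 dimensions of each full-attention head receive rotary position encoding, and theta=10M gives those rotations very long wavelengths. A minimal sketch of partial-RoPE application (the tensor layout and rotation pairing are assumptions, not read from BOREAL's code):

```python
import torch

def partial_rope(x, theta=10_000_000.0, rotary_factor=0.25):
    """Rotate only the first rotary_factor fraction of each head's dims.

    x: (T, n_heads, head_dim). With head_dim=256 and factor 0.25,
    dims 0..63 get RoPE and dims 64..255 pass through unrotated.
    """
    T, H, D = x.shape
    d_rot = int(D * rotary_factor)
    x_rot, x_pass = x[..., :d_rot], x[..., d_rot:]
    half = d_rot // 2
    inv_freq = theta ** (-torch.arange(half) / half)      # (half,)
    ang = torch.outer(torch.arange(T).float(), inv_freq)  # (T, half)
    cos, sin = ang.cos()[:, None, :], ang.sin()[:, None, :]
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin,
                         x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```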
Training
| Parameter | Value |
|---|---|
| Data tokens | 500B–2T |
| Corpus | FineWeb-Edu + StarCoder2 + OpenWebMath + curated multilingual |
| Method | Token Superposition Training (Nous Research) |
| TST config | s=4, r=0.5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Peak LR | 3e-4 |
| Schedule | Cosine decay to 10% peak |
| Weight decay | 0.1 |
| Batch size | ~4M tokens/step |
| Precision | BF16 weights, FP32 DeltaNet states |
| Location | Toronto, Ontario, Canada |
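The optimizer rows above translate directly into PyTorch. A sketch under stated assumptions: the warmup length is invented here (the table doesn't give one), the step count follows from 2T tokens at ~4M tokens/step, and `model` is a stand-in:

```python
import math
import torch

model = torch.nn.Linear(8, 8)        # placeholder for the actual 2B model
peak_lr, min_ratio = 3e-4, 0.10      # cosine decays to 10% of peak
warmup_steps = 2_000                 # assumption: not specified above
total_steps = 500_000                # ~2T tokens / ~4M tokens per step

optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step):
    if step < warmup_steps:          # linear warmup (assumed)
        return step / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * t))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```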
Data Pipeline
Built on Crucible, which combines RUPS skyline weighting with EESD submodular diversity selection under formal (1 - 1/e) approximation guarantees, and on the DDM analyzer, which models reasoning as Ornstein-Uhlenbeck evidence accumulation. This is the same pipeline that produced Harmonic-9B and Ornstein-27B, where 799 DDM-curated rows outperformed datasets 20–100x larger.
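EESD's internals aren't spelled out here, but the (1 - 1/e) bound is the classic Nemhauser-Wolsey-Fisher guarantee for greedy maximization of a monotone submodular objective. A generic illustration using facility location (the textbook algorithm, not necessarily EESD itself):

```python
import numpy as np

def greedy_facility_location(emb, k):
    """Greedily maximize F(S) = sum_i max_{j in S} sim(i, j).

    Greedy selection of k items enjoys the (1 - 1/e) approximation
    guarantee because F is monotone submodular.
    emb: (n, d) unit-normalized embeddings. Returns k selected indices.
    """
    sim = emb @ emb.T                  # cosine similarities, (n, n)
    best = np.zeros(sim.shape[0])      # current best coverage per point
    selected = []
    for _ in range(k):
        # marginal gain of adding each candidate column j
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        gains[selected] = -np.inf      # never re-pick an item
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[:, j])
    return selected
```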
Training Phases
Phase 1 (TST): First 50% of tokens in superposition mode
Phase 2 (Recovery): Remaining 50% as standard autoregressive NTP
Phase 3 (Extension): Mid-training at 64K context with YaRN positional scaling (sketched below)
Phase 4 (Anneal): Crucible-selected high-quality data, DDM loss weights
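For Phase 3, YaRN-style context extension rescales RoPE frequencies by wavelength rather than uniformly: short-wavelength dimensions keep their resolution while long-wavelength ones are interpolated. A simplified sketch of the "NTK-by-parts" frequency adjustment (all constants here are illustrative, not BOREAL's recipe; full YaRN also adds an attention temperature term, omitted):

```python
import math
import torch

def yarn_inv_freq(d_rot=64, theta=10_000_000.0, scale=4.0,
                  orig_ctx=16_384, beta_fast=32, beta_slow=1):
    """Simplified YaRN ("NTK-by-parts") frequency rescaling.

    Extends context by `scale` (16K -> 64K in this illustrative setup).
    Dimensions whose wavelength fits comfortably inside the original
    context keep their frequency; long-wavelength dims are interpolated;
    a linear ramp blends the two regimes in between.
    """
    half = d_rot // 2
    inv_freq = theta ** (-torch.arange(half) / half)
    wavelen = 2 * math.pi / inv_freq
    low, high = orig_ctx / beta_fast, orig_ctx / beta_slow
    ramp = ((wavelen - low) / (high - low)).clamp(0.0, 1.0)
    return inv_freq * ((1 - ramp) + ramp / scale)
```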
Expected Performance
| Benchmark | Target | Comparison |
|---|---|---|
| HellaSwag | 55β62% | Qwen3-1.7B: ~58% |
| ARC-Easy | 65β72% | Qwen3-1.7B: ~68% |
| PIQA | 72β78% | Qwen3-1.7B: ~75% |
| WinoGrande | 58β64% | Qwen3-1.7B: ~60% |
| MMLU (5-shot) | 28β35% | Qwen3-1.7B: ~32% |
BOREAL-2B targets parity with Qwen3-1.7B while supporting 4x the native context length and using roughly half the inference memory at long context.
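Once weights exist, these targets could be checked with EleutherAI's lm-evaluation-harness; a minimal sketch using the harness's stock task names (MMLU is the only row above scored 5-shot, so it would need num_fewshot=5 in a separate pass):

```python
# Hypothetical evaluation run; the numbers above are targets, not results.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=GestaltLabs/BOREAL-2B,dtype=bfloat16",
    tasks=["hellaswag", "arc_easy", "piqa", "winogrande", "mmlu"],
)
print(results["results"])
```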
The BOREAL Family
Every model trained in Canada. Every weight learned from random init.
| Model | Params | Type | Context | Status |
|---|---|---|---|---|
| BOREAL-250M | 250M | Dense | 32K | Seeking compute |
| BOREAL-2B | 2B | Dense | 64K | Seeking compute |
| BOREAL-10B-MoE | ~10B / ~2B active | DeltaNet + MoE | 256K | Cluster required |
License
Apache 2.0. Built for Canadian researchers, startups, and institutions. No strings. No API keys. No foreign dependency.
Author
Built by DJLougen / GestaltLabs. University of Toronto. Toronto, Canada.

Pipeline Usage

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="GestaltLabs/BOREAL-2B")
```