---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- pretraining
- tst
- crucible
- ddm
- submodular
- data-curation
- sovereign-ai
- canadian-ai
- community
- canada
pipeline_tag: text-generation
base_model: GestaltLabs/BOREAL-2B
---

# BOREAL-2B: Canadian Sovereign AI
**B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers
Built in Toronto. Apache 2.0.
A 2-billion-parameter dense hybrid language model pretrained from scratch on
500B–2T tokens. BOREAL-2B is the first model in the BOREAL family intended for
actual downstream use: the one you download, fine-tune, quantize, and build on.
It carries forward the Gated DeltaNet architecture validated by BOREAL-250M and
scales it to a size where benchmarks become meaningful.
Targets competitive performance against Qwen3-1.7B and SmolLM2-1.7B while
offering native 64K context, 4x what pure Transformers at this scale can
practically support.
## Architecture
| Component | Detail |
|-----------|--------|
| **Type** | Dense hybrid: Gated DeltaNet + GQA |
| **Parameters** | ~2B |
| **Hidden size** | 2,048 |
| **Layers** | 32 (24 DeltaNet + 8 full attention) |
| **Ratio** | 3:1 linear-to-full attention |
| **Full attention** | GQA: 16 query heads, 4 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| **Conv kernel** | 4 |
| **FFN** | SwiGLU, intermediate=6,144 |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 65,536 tokens native, extensible to 256K |
| **MTP** | 1 multi-token prediction head |
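For orientation, here is how the table above might translate into a configuration object. This is a minimal sketch: the field names are illustrative assumptions, not the released config schema.

```python
# Hypothetical config sketch assembled from the architecture table.
# Field names are illustrative assumptions, not the released schema.
boreal_2b_config = {
    "hidden_size": 2048,
    "num_hidden_layers": 32,
    # 3:1 linear-to-full ratio: 24 DeltaNet + 8 full-attention layers.
    "layer_types": (["deltanet"] * 3 + ["full_attention"]) * 8,
    # Full attention (GQA).
    "num_attention_heads": 16,       # query heads
    "num_key_value_heads": 4,        # KV heads
    "attention_head_dim": 256,
    # Gated DeltaNet.
    "deltanet_qk_heads": 8,
    "deltanet_v_heads": 16,
    "deltanet_head_dim": 128,
    "conv_kernel_size": 4,
    # FFN / norm / position.
    "intermediate_size": 6144,       # SwiGLU
    "rms_norm_eps": 1e-6,
    "rope_theta": 10_000_000,
    "partial_rotary_factor": 0.25,
    "vocab_size": 151936,            # Qwen3 tokenizer
    "max_position_embeddings": 65536,
    "num_mtp_heads": 1,              # multi-token prediction
}
```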
## Training
| Parameter | Value |
|-----------|-------|
| **Data tokens** | 500B–2T |
| **Corpus** | FineWeb-Edu + StarCoder2 + OpenWebMath + curated multilingual |
| **Method** | Token Superposition Training (Nous Research) |
| **TST config** | s=4, r=0.5 |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 3e-4 |
| **Schedule** | Cosine decay to 10% peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states |
| **Location** | Toronto, Ontario, Canada |
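The optimizer and schedule above reduce to standard PyTorch pieces. A minimal sketch; the warmup length and `total_steps` are assumptions, since the card does not state them:

```python
import math
import torch

PEAK_LR, FLOOR = 3e-4, 0.10  # cosine decays to 10% of peak

def lr_at(step, total_steps, warmup_steps=2000):
    """Linear warmup (length assumed; not stated in the card),
    then cosine decay from PEAK_LR down to FLOOR * PEAK_LR."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * (FLOOR + (1 - FLOOR) * 0.5 * (1 + math.cos(math.pi * min(t, 1.0))))

model = torch.nn.Linear(2048, 2048)  # stand-in for the real model
opt = torch.optim.AdamW(model.parameters(), lr=PEAK_LR,
                        betas=(0.9, 0.95), weight_decay=0.1)
# LambdaLR multiplies the base LR, so we pass the ratio to peak.
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda s: lr_at(s, total_steps=100_000) / PEAK_LR)
```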
### Data Pipeline
Built on **Crucible** (RUPS skyline weighting plus EESD submodular diversity
selection with formal (1 - 1/e) guarantees) and the **DDM analyzer**, which
models reasoning as Ornstein-Uhlenbeck evidence accumulation. This is the same
pipeline that produced Harmonic-9B and Ornstein-27B, where 799 DDM-curated
rows outperformed datasets 20–100x larger.
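Crucible's EESD objective is not spelled out in this card, but the (1 - 1/e) guarantee is the classic bound for greedy maximization of a monotone submodular function (Nemhauser et al., 1978). A stand-in sketch, using facility location over document similarities rather than the actual EESD objective:

```python
import numpy as np

def greedy_facility_location(sim: np.ndarray, k: int) -> list[int]:
    """Greedily pick k documents maximizing the facility-location objective
    f(S) = sum_j max_{i in S} sim[i, j]. Greedy on a monotone submodular f
    achieves at least (1 - 1/e) of the optimum. `sim` is an n x n
    document-similarity matrix; this objective is a stand-in assumption,
    not Crucible's published EESD criterion."""
    n = sim.shape[0]
    selected: list[int] = []
    coverage = np.zeros(n)  # max similarity of each doc to the selected set
    for _ in range(k):
        # Marginal gain f(S + {i}) - f(S) for every candidate i at once.
        gains = np.maximum(sim, coverage).sum(axis=1) - coverage.sum()
        gains[selected] = -np.inf  # never re-pick
        best = int(np.argmax(gains))
        selected.append(best)
        coverage = np.maximum(coverage, sim[best])
    return selected
```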
### Training Phases
```
Phase 1 (TST): First 50% of tokens in superposition mode
Phase 2 (Recovery): Remaining 50% as standard autoregressive NTP
Phase 3 (Extension): Mid-training at 64K context, YaRN scaling
Phase 4 (Anneal): Crucible-selected high-quality data, DDM loss weights
```
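Once weights are available, Phase 3's YaRN extension would in principle be exercised at load time. A hedged sketch; whether the released config class accepts `rope_scaling` in this transformers-style form is an assumption:

```python
from transformers import AutoModelForCausalLM

# Assumption: the config class supports transformers-style YaRN rope scaling.
model = AutoModelForCausalLM.from_pretrained(
    "GestaltLabs/BOREAL-2B",
    torch_dtype="bfloat16",
    rope_scaling={"rope_type": "yarn", "factor": 4.0},  # 65,536 * 4 = 262,144
)
```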
## Expected Performance
| Benchmark | Target | Comparison |
|-----------|--------|-------------|
| HellaSwag | 55–62% | Qwen3-1.7B: ~58% |
| ARC-Easy | 65–72% | Qwen3-1.7B: ~68% |
| PIQA | 72–78% | Qwen3-1.7B: ~75% |
| WinoGrande | 58–64% | Qwen3-1.7B: ~60% |
| MMLU (5-shot) | 28–35% | Qwen3-1.7B: ~32% |
BOREAL-2B targets parity with Qwen3-1.7B while supporting 4x the native context
length and using roughly half the inference memory at long context.
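The memory claim follows from the KV cache: only the 8 full-attention layers cache keys and values, while DeltaNet layers carry a fixed-size recurrent state instead. Back-of-envelope arithmetic (BF16 cache, weights excluded; the dense peer is a hypothetical same-shape model, not Qwen3's actual geometry):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per=2):
    # K and V each hold layers * kv_heads * head_dim * seq_len entries.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per

seq = 65_536
boreal = kv_cache_bytes(layers=8,  kv_heads=4, head_dim=256, seq_len=seq)
dense  = kv_cache_bytes(layers=32, kv_heads=4, head_dim=256, seq_len=seq)
print(f"BOREAL-2B KV cache @64K:      {boreal / 2**30:.1f} GiB")  # ~2.0 GiB
print(f"All-attention peer @64K:      {dense  / 2**30:.1f} GiB")  # ~8.0 GiB
```

The cache alone shrinks ~4x; adding ~4 GB of BF16 weights on both sides brings the total footprint at 64K close to the "roughly half" figure above.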
## The BOREAL Family
Every model trained in Canada. Every weight learned from random init.
| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **[BOREAL-250M](https://huggingface.co/GestaltLabs/BOREAL-250M)** | 250M | Dense | 32K | Seeking compute |
| **BOREAL-2B** | 2B | Dense | 64K | Seeking compute |
| **[BOREAL-10B-MoE](https://huggingface.co/GestaltLabs/BOREAL-10B-MoE)** | ~10B / ~2B active | DeltaNet + MoE | 256K | Cluster required |
## License
Apache 2.0. Built for Canadian researchers, startups, and institutions.
No strings. No API keys. No foreign dependency.
## Author
Built by [DJLougen](https://huggingface.co/DJLougen) / [GestaltLabs](https://huggingface.co/GestaltLabs).
University of Toronto. Toronto, Canada.
[Support sovereign AI on Ko-fi](https://ko-fi.com/djlougen)