---

language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- pretraining
- tst
- crucible
- ddm
- submodular
- data-curation
- sovereign-ai
- canadian-ai
- community
- canada
pipeline_tag: text-generation
base_model: GestaltLabs/BOREAL-2B
---


![BOREAL](https://huggingface.co/GestaltLabs/BOREAL-2B/resolve/main/Boreal.png)

# BOREAL-2B: Canadian Sovereign AI

**B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers

Built in Toronto. Apache 2.0.

BOREAL-2B is a 2-billion-parameter dense hybrid language model pretrained from
scratch on 500B–2T tokens. It is the first model in the BOREAL family intended
for actual downstream use: the one you download, fine-tune, quantize, and build
on. It carries forward the Gated DeltaNet architecture validated by BOREAL-250M
and scales it to a size where benchmarks become meaningful.

The model targets competitive performance against Qwen3-1.7B and SmolLM2-1.7B
while offering native 64K context, 4x what pure Transformers at this scale can
practically support.
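
If the released checkpoint follows the standard `transformers` causal-LM
interface (the custom DeltaNet/GQA hybrid layers will likely require
`trust_remote_code=True`), loading should look like the usual pattern. This is
a minimal sketch under those assumptions, not a tested recipe:

```python
# Minimal load-and-generate sketch. Assumes a transformers-compatible
# checkpoint; trust_remote_code may be needed for the hybrid DeltaNet layers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GestaltLabs/BOREAL-2B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # matches the BF16 training precision
    trust_remote_code=True,
)

inputs = tokenizer("The boreal forest stretches across", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```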

## Architecture

| Component | Detail |
|-----------|--------|
| **Type** | Dense hybrid: Gated DeltaNet + GQA |
| **Parameters** | ~2B |
| **Hidden size** | 2,048 |
| **Layers** | 32 (24 DeltaNet + 8 full attention) |
| **Ratio** | 3:1 linear-to-full attention |
| **Full attention** | GQA: 16 query heads, 4 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| **Conv kernel** | 4 |
| **FFN** | SwiGLU, intermediate=6,144 |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 65,536 tokens native, extensible to 256K |
| **MTP** | 1 multi-token prediction head |
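
The DeltaNet layers replace softmax attention with a fast-weight state updated
by a gated delta rule. The exact BOREAL formulation is not spelled out here,
but a naive single-head reference recurrence in the style of Gated DeltaNet
looks roughly like the following (production kernels are chunked and
parallelized; this loop is only for clarity):

```python
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """Naive single-head gated delta-rule recurrence (illustrative only).

    q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,) gates in (0, 1).
    State S is a (d_v, d_k) fast-weight matrix; the output is o_t = S_t q_t.
    """
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_v, d_k)
    outputs = []
    for t in range(T):
        k_t = k[t] / k[t].norm().clamp_min(1e-6)          # unit-norm key
        # decay the state, erase the slot addressed by k_t, then write v_t
        S = alpha[t] * (S - beta[t] * torch.outer(S @ k_t, k_t)) \
            + beta[t] * torch.outer(v[t], k_t)
        outputs.append(S @ q[t])
    return torch.stack(outputs)                           # (T, d_v)
```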

## Training

| Parameter | Value |
|-----------|-------|
| **Data tokens** | 500B–2T |
| **Corpus** | FineWeb-Edu + StarCoder2 + OpenWebMath + curated multilingual |
| **Method** | Token Superposition Training (Nous Research) |
| **TST config** | s=4, r=0.5 |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 3e-4 |
| **Schedule** | Cosine decay to 10% peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states |
| **Location** | Toronto, Ontario, Canada |
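
At ~4M tokens per step, the 500B–2T token budget works out to roughly
125k–500k optimizer steps. The table specifies only the peak LR and a cosine
decay to 10% of peak; a sketch of that schedule (the warmup length is an
assumption, not stated above) is:

```python
import math

def lr_at(step, total_steps, peak_lr=3e-4, min_ratio=0.10, warmup_steps=2000):
    # Linear warmup (warmup_steps is an assumption, not stated in the card),
    # then cosine decay from peak_lr down to min_ratio * peak_lr.
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)
```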

### Data Pipeline

Built on **Crucible** (RUPS skyline weighting plus EESD submodular diversity
selection with formal (1 - 1/e) guarantees) and the **DDM analyzer**, which
models reasoning as Ornstein-Uhlenbeck evidence accumulation. This is the same
pipeline that produced Harmonic-9B and Ornstein-27B, where 799 DDM-curated
rows outperformed datasets 20–100x larger.
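
The EESD objective itself is not published here, but the (1 - 1/e) bound is the
standard Nemhauser et al. guarantee for greedy maximization of a monotone
submodular function. As a generic illustration (not the Crucible/EESD
implementation), a facility-location-style greedy selector over embedding
similarities looks like this:

```python
import numpy as np

def greedy_facility_location(embeddings, k):
    """Greedy selection maximizing a facility-location coverage objective.

    embeddings: (n, d) row-normalized vectors; k: selection budget.
    f(S) = sum_i max_{j in S} sim(i, j) is monotone submodular, so greedy
    selection is within (1 - 1/e) of the optimum. Generic sketch only.
    """
    sim = embeddings @ embeddings.T          # pairwise cosine similarities
    n = sim.shape[0]
    selected = []
    coverage = np.zeros(n)                   # best similarity to selected set
    for _ in range(min(k, n)):
        gains = np.maximum(sim, coverage[None, :]).sum(axis=1) - coverage.sum()
        if selected:
            gains[selected] = -np.inf        # don't pick the same row twice
        best = int(np.argmax(gains))
        selected.append(best)
        coverage = np.maximum(coverage, sim[best])
    return selected
```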

### Training Phases

```
Phase 1 (TST):       First 50% of tokens in superposition mode
Phase 2 (Recovery):  Remaining 50% as standard autoregressive NTP
Phase 3 (Extension): Mid-training at 64K context, YaRN scaling
Phase 4 (Anneal):    Crucible-selected high-quality data, DDM loss weights
```

## Expected Performance

| Benchmark | Target | Comparison |
|-----------|--------|-------------|
| HellaSwag | 55–62% | Qwen3-1.7B: ~58% |
| ARC-Easy | 65–72% | Qwen3-1.7B: ~68% |
| PIQA | 72–78% | Qwen3-1.7B: ~75% |
| WinoGrande | 58–64% | Qwen3-1.7B: ~60% |
| MMLU (5-shot) | 28–35% | Qwen3-1.7B: ~32% |

BOREAL-2B targets parity with Qwen3-1.7B while supporting 4x the native context
length and using roughly half the inference memory at long context.

## The BOREAL Family

Every model trained in Canada. Every weight learned from random init.

| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **[BOREAL-250M](https://huggingface.co/GestaltLabs/BOREAL-250M)** | 250M | Dense | 32K | Seeking compute |
| **BOREAL-2B** | 2B | Dense | 64K | Seeking compute |
| **[BOREAL-10B-MoE](https://huggingface.co/GestaltLabs/BOREAL-10B-MoE)** | ~10B / ~2B active | DeltaNet + MoE | 256K | Cluster required |

## License

Apache 2.0. Built for Canadian researchers, startups, and institutions.
No strings. No API keys. No foreign dependency.

## Author

Built by [DJLougen](https://huggingface.co/DJLougen) / [GestaltLabs](https://huggingface.co/GestaltLabs).
University of Toronto. Toronto, Canada.

[☕ Support sovereign AI on Ko-fi](https://ko-fi.com/djlougen)