---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- pretraining
- tst
- crucible
- ddm
- submodular
- data-curation
- sovereign-ai
- canadian-ai
- community
- canada
pipeline_tag: text-generation
base_model: GestaltLabs/BOREAL-2B
---
![BOREAL](https://huggingface.co/GestaltLabs/BOREAL-2B/resolve/main/Boreal.png)
# BOREAL-2B: Canadian Sovereign AI
**B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers
Built in Toronto. Apache 2.0.
A 2-billion-parameter dense hybrid language model pretrained from scratch on
500B–2T tokens. BOREAL-2B is the first model in the BOREAL family intended for
actual downstream use: the one you download, fine-tune, quantize, and build on.
It carries forward the Gated DeltaNet architecture validated by BOREAL-250M and
scales it to a size where benchmarks become meaningful. It targets competitive
performance against Qwen3-1.7B and SmolLM2-1.7B while offering native 64K
context, 4x what pure Transformers at this scale can practically support.
## Architecture
| Component | Detail |
|-----------|--------|
| **Type** | Dense hybrid: Gated DeltaNet + GQA |
| **Parameters** | ~2B |
| **Hidden size** | 2,048 |
| **Layers** | 32 (24 DeltaNet + 8 full attention) |
| **Ratio** | 3:1 linear-to-full attention |
| **Full attention** | GQA: 16 query heads, 4 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| **Conv kernel** | 4 |
| **FFN** | SwiGLU, intermediate=6,144 |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 65,536 tokens native, extensible to 256K |
| **MTP** | 1 multi-token prediction head |
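For readers new to the linear-attention half, the sketch below shows the gated delta rule that a DeltaNet layer runs per head, following the published Gated DeltaNet formulation. It is an illustration, not this repo's fused kernel, and the shapes are generic (`d_k`, `d_v`) rather than BOREAL's specific head dims.

```python
import torch

# Minimal sketch of the gated delta rule (illustrative; not this repo's
# fused kernel). Per head: q, k are (T, d_k), v is (T, d_v); alpha (decay)
# and beta (write strength) are per-step gates in (0, 1). All fp32,
# matching the "FP32 DeltaNet states" row in the training table.
def gated_delta_rule(q, k, v, alpha, beta):
    d_k, d_v = k.shape[-1], v.shape[-1]
    S = torch.zeros(d_k, d_v)                     # fixed-size key-to-value state
    out = torch.empty(v.shape)
    for t in range(len(k)):
        S = alpha[t] * S                          # gated decay of old associations
        err = k[t] @ S - v[t]                     # what the state predicts vs. target
        S = S - beta[t] * torch.outer(k[t], err)  # delta-rule erase-then-write
        out[t] = q[t] @ S                         # read out with the query
    return out
```

Because `S` is a fixed `(d_k, d_v)` matrix per head, the 24 DeltaNet layers use constant memory in sequence length; only the 8 GQA layers keep a growing KV cache, which is what makes 64K native context practical at this size.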
## Training
| Parameter | Value |
|-----------|-------|
| **Data tokens** | 500B–2T |
| **Corpus** | FineWeb-Edu + StarCoder2 + OpenWebMath + curated multilingual |
| **Method** | Token Superposition Training (Nous Research) |
| **TST config** | s=4, r=0.5 |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 3e-4 |
| **Schedule** | Cosine decay to 10% of peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states |
| **Location** | Toronto, Ontario, Canada |
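Translated to code, the optimizer and schedule rows above look roughly like the sketch below; the warmup length is an assumption, since the table only specifies peak LR, betas, weight decay, and decay to 10% of peak.

```python
import math
import torch

# Sketch of the optimizer/schedule rows above. Warmup length is an
# assumption not stated in the table.
def build_optimizer(model, peak_lr=3e-4, weight_decay=0.1):
    return torch.optim.AdamW(model.parameters(), lr=peak_lr,
                             betas=(0.9, 0.95), weight_decay=weight_decay)

def lr_at(step, total_steps, peak_lr=3e-4, warmup=2000, floor_frac=0.10):
    if step < warmup:                             # linear warmup (assumed)
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    floor = floor_frac * peak_lr                  # cosine decay lands at 10% of peak
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))
```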
### Data Pipeline
Built on **Crucible**, which combines RUPS skyline weighting with EESD
submodular diversity selection under formal (1 - 1/e) guarantees, and on the
**DDM analyzer**, which models reasoning as Ornstein-Uhlenbeck evidence
accumulation. This is the same pipeline that produced Harmonic-9B and
Ornstein-27B, where 799 DDM-curated rows outperformed datasets 20–100x larger.
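The (1 - 1/e) figure is the classic Nemhauser-Wolsey-Fisher guarantee for greedy maximization of a monotone submodular objective. As an illustration of that selection step (a generic facility-location objective, not the actual EESD implementation):

```python
import numpy as np

# Greedy facility-location selection: pick k rows maximizing total coverage
# of the candidate pool. The objective is monotone submodular, so greedy is
# within (1 - 1/e) of optimal. Illustrative only; not the Crucible/EESD code.
def greedy_select(sim: np.ndarray, k: int) -> list[int]:
    """sim[i, j] = similarity between candidate rows i and j."""
    n = sim.shape[0]
    selected: list[int] = []
    coverage = np.zeros(n)          # best similarity of each row to the chosen set
    for _ in range(k):
        # Marginal gain of adding each candidate to the current set.
        gains = np.maximum(sim, coverage[None, :]).sum(axis=1) - coverage.sum()
        gains[selected] = -np.inf   # never re-pick a chosen row
        best = int(np.argmax(gains))
        selected.append(best)
        coverage = np.maximum(coverage, sim[best])
    return selected
```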
### Training Phases
```
Phase 1 (TST): First 50% of tokens in superposition mode
Phase 2 (Recovery): Remaining 50% as standard autoregressive NTP
Phase 3 (Extension): Mid-training at 64K context, YaRN scaling
Phase 4 (Anneal): Crucible-selected high-quality data, DDM loss weights
```
## Expected Performance
| Benchmark | Target | Comparison |
|-----------|--------|-------------|
| HellaSwag | 55–62% | Qwen3-1.7B: ~58% |
| ARC-Easy | 65–72% | Qwen3-1.7B: ~68% |
| PIQA | 72–78% | Qwen3-1.7B: ~75% |
| WinoGrande | 58–64% | Qwen3-1.7B: ~60% |
| MMLU (5-shot) | 28–35% | Qwen3-1.7B: ~32% |
BOREAL-2B targets parity with Qwen3-1.7B while supporting 4x the native context
length and using roughly half the inference memory at long context.
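The memory claim follows from KV-cache arithmetic: only the 8 full-attention layers cache keys and values, while the DeltaNet layers hold fixed-size states. A back-of-envelope sketch (bf16 cache, batch size 1; DeltaNet states and activations ignored):

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes/elem
def kv_cache_bytes(layers, kv_heads=4, head_dim=256, seq_len=65536, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per

print(kv_cache_bytes(layers=8) / 2**30)    # BOREAL-2B hybrid:  2.0 GiB at 64K
print(kv_cache_bytes(layers=32) / 2**30)   # all-attention 32L: 8.0 GiB at 64K
```

With roughly 4 GB of bf16 weights on top, that is about 6 GB versus 12 GB at 64K context, consistent with the "roughly half" figure above.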
## The BOREAL Family
Every model trained in Canada. Every weight learned from random init.
| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **[BOREAL-250M](https://huggingface.co/GestaltLabs/BOREAL-250M)** | 250M | Dense | 32K | Seeking compute |
| **BOREAL-2B** | 2B | Dense | 64K | Seeking compute |
| **[BOREAL-10B-MoE](https://huggingface.co/GestaltLabs/BOREAL-10B-MoE)** | ~10B / ~2B active | DeltaNet + MoE | 256K | Cluster required |
## License
Apache 2.0. Built for Canadian researchers, startups, and institutions.
No strings. No API keys. No foreign dependency.
## Author
Built by [DJLougen](https://huggingface.co/DJLougen) / [GestaltLabs](https://huggingface.co/GestaltLabs).
University of Toronto. Toronto, Canada.
[☕ Support sovereign AI on Ko-fi](https://ko-fi.com/djlougen)