| ---
|
| language:
|
| - en
|
| license: apache-2.0
|
| library_name: transformers
|
| tags:
|
| - boreal
|
| - deltanet
|
| - hybrid
|
| - linear-attention
|
| - swiglu
|
| - rmsnorm
|
| - rope
|
| - gqa
|
| - pretraining
|
| - tst
|
| - crucible
|
| - ddm
|
| - submodular
|
| - data-curation
|
| - sovereign-ai
|
| - canadian-ai
|
| - community
|
| - canada
|
| pipeline_tag: text-generation
|
| base_model: GestaltLabs/BOREAL-2B
|
| ---
|
|
|
| 
|
|
|
# BOREAL-2B – Canadian Sovereign AI
|
|
|
| **B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers
|
|
|
| Built in Toronto. Apache 2.0.
|
|
|
A 2-billion-parameter dense hybrid language model pretrained from scratch on
500B–2T tokens. BOREAL-2B is the first model in the BOREAL family intended for
actual downstream use: the one you download, fine-tune, quantize, and build on.
It carries forward the Gated DeltaNet architecture validated by BOREAL-250M and
scales it to a size where benchmarks become meaningful.
|
|
|
Targets competitive performance against Qwen3-1.7B and SmolLM2-1.7B while
offering native 64K context, 4x what pure Transformers at this scale can
practically support.
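
Once weights are published, loading should look like any other causal LM. A
minimal usage sketch, assuming the checkpoint exposes the standard
`transformers` interface and ships its hybrid DeltaNet blocks as remote code
(hence `trust_remote_code=True`); neither detail is confirmed by this card:

```python
# Minimal loading sketch. Assumes standard transformers support and
# remote code for the custom DeltaNet layers -- both unconfirmed here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GestaltLabs/BOREAL-2B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 training precision
    trust_remote_code=True,      # hybrid DeltaNet blocks are not in core transformers
)

inputs = tokenizer("The aurora over Toronto", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```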
|
|
|
| ## Architecture
|
|
|
| Component | Detail |
|-----------|--------|
| **Type** | Dense hybrid: Gated DeltaNet + GQA |
| **Parameters** | ~2B |
| **Hidden size** | 2,048 |
| **Layers** | 32 (24 DeltaNet + 8 full attention) |
| **Ratio** | 3:1 linear-to-full attention |
| **Full attention** | GQA: 16 query heads, 4 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| **Conv kernel** | 4 |
| **FFN** | SwiGLU, intermediate=6,144 |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 65,536 tokens native, extensible to 256K |
| **MTP** | 1 multi-token prediction head |
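
The ~2B figure is consistent with the table. A back-of-envelope count in
Python, ignoring norms, output gates, conv kernels, and the MTP head, and
assuming tied input/output embeddings (an assumption this card does not state):

```python
# Back-of-envelope parameter count from the architecture table.
# Ignores norms, output gates, conv kernels, and the MTP head, and
# assumes tied input/output embeddings -- an approximation, not the
# official accounting.
hidden, vocab = 2048, 151_936
embed = vocab * hidden                  # ~311M (tied embeddings)

ffn = 32 * (3 * hidden * 6144)          # SwiGLU gate/up/down, ~1.21B

q = hidden * 16 * 256                   # 16 query heads, dim 256
kv = 2 * hidden * 4 * 256               # 4 KV heads (GQA)
o = 16 * 256 * hidden
attn = 8 * (q + kv + o)                 # 8 full-attention layers, ~0.17B

dq = dk = hidden * 8 * 128              # 8 QK heads, dim 128
dv = hidden * 16 * 128                  # 16 V heads, dim 128
do = 16 * 128 * hidden
delta = 24 * (dq + dk + dv + do)        # 24 DeltaNet layers, ~0.30B

total = embed + ffn + attn + delta
print(f"{total / 1e9:.2f}B")            # ~1.99B
```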
|
|
|
| ## Training
|
|
|
| Parameter | Value |
|-----------|-------|
| **Data tokens** | 500B–2T |
| **Corpus** | FineWeb-Edu + StarCoder2 + OpenWebMath + curated multilingual |
| **Method** | Token Superposition Training (Nous Research) |
| **TST config** | s=4, r=0.5 |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 3e-4 |
| **Schedule** | Cosine decay to 10% peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states |
| **Location** | Toronto, Ontario, Canada |
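
A PyTorch sketch of the optimizer and schedule rows, for reference; the warmup
length is illustrative, since the card does not specify one:

```python
import math
import torch

# AdamW with betas (0.9, 0.95), peak LR 3e-4, weight decay 0.1, and
# cosine decay to 10% of peak, as in the table. The warmup length is
# an illustrative placeholder, not a documented value.
def build_optimizer(model, total_steps, warmup_steps=2000):
    opt = torch.optim.AdamW(
        model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
    )

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)          # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1 + math.cos(math.pi * progress))
        return 0.1 + 0.9 * cosine                       # decay to 10% of peak

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```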
|
|
|
| ### Data Pipeline
|
|
|
Built on **Crucible** (RUPS skyline weighting plus EESD submodular diversity
selection, with formal (1 - 1/e) guarantees) and on the **DDM analyzer**, which
models reasoning as Ornstein-Uhlenbeck evidence accumulation. This is the same
pipeline that produced Harmonic-9B and Ornstein-27B, where 799 DDM-curated
rows outperformed datasets 20–100x larger.
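
Crucible's internals are not public, but the (1 - 1/e) guarantee is the
classic bound for greedy maximization of a monotone submodular objective. A
generic facility-location sketch of that mechanism, with an illustrative
similarity matrix standing in for whatever EESD actually uses:

```python
import numpy as np

# Generic greedy submodular selection. Facility-location coverage is a
# stand-in objective; Crucible's actual EESD objective and similarity
# measure are not public. Greedy selection of a monotone submodular
# function achieves at least (1 - 1/e) of the optimal value
# (Nemhauser et al., 1978).
def greedy_select(sim: np.ndarray, k: int) -> list[int]:
    """sim[i, j]: similarity between corpus items i and j."""
    n = sim.shape[0]
    selected: list[int] = []
    coverage = np.zeros(n)  # best similarity of each item to the selected set
    for _ in range(k):
        # marginal coverage gain of adding each candidate
        gains = np.maximum(sim, coverage[None, :]).sum(axis=1) - coverage.sum()
        gains[selected] = -np.inf          # never reselect
        best = int(np.argmax(gains))
        selected.append(best)
        coverage = np.maximum(coverage, sim[best])
    return selected
```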
|
|
|
| ### Training Phases
|
|
|
| ```
|
| Phase 1 (TST): First 50% of tokens in superposition mode
|
| Phase 2 (Recovery): Remaining 50% as standard autoregressive NTP
|
| Phase 3 (Extension): Mid-training at 64K context, YaRN scaling
|
| Phase 4 (Anneal): Crucible-selected high-quality data, DDM loss weights
|
| ```
|
|
|
| ## Expected Performance
|
|
|
| Benchmark | Target | Comparison |
|-----------|--------|------------|
| HellaSwag | 55–62% | Qwen3-1.7B: ~58% |
| ARC-Easy | 65–72% | Qwen3-1.7B: ~68% |
| PIQA | 72–78% | Qwen3-1.7B: ~75% |
| WinoGrande | 58–64% | Qwen3-1.7B: ~60% |
| MMLU (5-shot) | 28–35% | Qwen3-1.7B: ~32% |
|
|
|
BOREAL-2B targets parity with Qwen3-1.7B while supporting 4x the native context
length and using roughly half the inference memory at long context.
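
The memory claim follows from the layer mix: only the 8 GQA layers grow a KV
cache with sequence length, while the 24 DeltaNet layers hold a fixed-size
recurrent state. A rough BF16 cache-only estimate at the native 64K context:

```python
# Rough KV-cache comparison at 64K context, BF16 (2 bytes per value).
# Counts cache only, not weights or activations.
seq, bytes_per = 65_536, 2
kv_per_layer = 2 * 4 * 256 * seq * bytes_per   # K+V, 4 KV heads, dim 256

hybrid = 8 * kv_per_layer                      # only the 8 GQA layers cache KV
pure = 32 * kv_per_layer                       # all-attention baseline

print(f"hybrid: {hybrid / 2**30:.1f} GiB")     # ~2.0 GiB
print(f"pure:   {pure / 2**30:.1f} GiB")       # ~8.0 GiB
```

Cache alone differs by 4x; adding the ~4 GB of BF16 weights shared by both
configurations brings the totals to roughly 6 GB versus 12 GB, consistent with
the "roughly half" figure above.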
|
|
|
| ## The BOREAL Family
|
|
|
| Every model trained in Canada. Every weight learned from random init.
|
|
|
| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **[BOREAL-250M](https://huggingface.co/GestaltLabs/BOREAL-250M)** | 250M | Dense | 32K | Seeking compute |
| **BOREAL-2B** | 2B | Dense | 64K | Seeking compute |
| **[BOREAL-10B-MoE](https://huggingface.co/GestaltLabs/BOREAL-10B-MoE)** | ~10B / ~2B active | DeltaNet + MoE | 256K | Cluster required |
|
|
|
| ## License
|
|
|
Apache 2.0. Built for Canadian researchers, startups, and institutions.
No strings. No API keys. No foreign dependency.
|
|
|
| ## Author
|
|
|
Built by [DJLougen](https://huggingface.co/DJLougen) / [GestaltLabs](https://huggingface.co/GestaltLabs).
University of Toronto. Toronto, Canada.
|
|
|
[☕ Support sovereign AI on Ko-fi](https://ko-fi.com/djlougen)