---
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - boreal
  - deltanet
  - hybrid
  - linear-attention
  - swiglu
  - rmsnorm
  - rope
  - gqa
  - pretraining
  - tst
  - crucible
  - ddm
  - submodular
  - data-curation
  - sovereign-ai
  - canadian-ai
  - community
  - canada
pipeline_tag: text-generation
base_model: GestaltLabs/BOREAL-2B
---

# BOREAL-2B: Canadian Sovereign AI

**Balanced Orthogonal Recurrent Expert Attention Layers**

Built in Toronto. Apache 2.0.

A 2-billion-parameter dense hybrid language model pretrained from scratch on 500B–2T tokens. BOREAL-2B is the first model in the BOREAL family intended for real downstream use: the one you download, fine-tune, quantize, and build on. It carries forward the Gated DeltaNet architecture validated by BOREAL-250M and scales it to a size where benchmarks become meaningful.

It targets competitive performance against Qwen3-1.7B and SmolLM2-1.7B while offering a native 64K context, 4x what pure Transformers at this scale can practically support.
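Once the checkpoint is published, loading should follow the standard `transformers` flow. A minimal sketch, assuming the repo ships custom DeltaNet modeling code (hence `trust_remote_code=True`):

```python
# Minimal loading sketch. Assumes the published checkpoint bundles custom
# DeltaNet modeling code, which requires trust_remote_code=True.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GestaltLabs/BOREAL-2B")
model = AutoModelForCausalLM.from_pretrained(
    "GestaltLabs/BOREAL-2B",
    torch_dtype="bfloat16",  # weights are trained in BF16
    trust_remote_code=True,
)

inputs = tokenizer("The boreal forest stretches", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```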

## Architecture

| Component | Detail |
| --- | --- |
| Type | Dense hybrid: Gated DeltaNet + GQA |
| Parameters | ~2B |
| Hidden size | 2,048 |
| Layers | 32 (24 DeltaNet + 8 full attention) |
| Ratio | 3:1 linear-to-full attention |
| Full attention | GQA: 16 query heads, 4 KV heads, head_dim=256 |
| DeltaNet | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| Conv kernel | 4 |
| FFN | SwiGLU, intermediate=6,144 |
| Norm | RMSNorm, eps=1e-6 |
| Position | RoPE, theta=10M, partial_rotary_factor=0.25 |
| Output gate | Swish-gated |
| Vocab | 151,936 (Qwen3 tokenizer) |
| Context | 65,536 tokens native, extensible to 256K |
| MTP | 1 multi-token prediction head |
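For reference, the table above collected as a plain Python dataclass. Field names here are illustrative, not the released config schema:

```python
from dataclasses import dataclass

# Illustrative summary of the architecture table; field names are not
# the released config schema.
@dataclass
class Boreal2BConfig:
    hidden_size: int = 2048
    num_layers: int = 32             # 24 DeltaNet + 8 full attention (3:1)
    full_attn_layers: int = 8        # GQA layers
    num_q_heads: int = 16            # GQA query heads
    num_kv_heads: int = 4            # GQA KV heads
    attn_head_dim: int = 256
    deltanet_qk_heads: int = 8
    deltanet_v_heads: int = 16
    deltanet_head_dim: int = 128
    conv_kernel: int = 4
    intermediate_size: int = 6144    # SwiGLU FFN
    rms_norm_eps: float = 1e-6
    rope_theta: float = 10_000_000
    partial_rotary_factor: float = 0.25
    vocab_size: int = 151_936        # Qwen3 tokenizer
    max_position_embeddings: int = 65_536
    num_mtp_heads: int = 1           # multi-token prediction
```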

## Training

| Parameter | Value |
| --- | --- |
| Data tokens | 500B–2T |
| Corpus | FineWeb-Edu + StarCoder2 + OpenWebMath + curated multilingual |
| Method | Token Superposition Training (Nous Research) |
| TST config | s=4, r=0.5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Peak LR | 3e-4 |
| Schedule | Cosine decay to 10% of peak |
| Weight decay | 0.1 |
| Batch size | ~4M tokens/step |
| Precision | BF16 weights, FP32 DeltaNet states |
| Location | Toronto, Ontario, Canada |
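A minimal PyTorch sketch of the stated optimizer and schedule (AdamW with β₁=0.9, β₂=0.95, weight decay 0.1, cosine decay from the 3e-4 peak to 10% of peak). The warmup and total step counts are assumptions, since the card does not state them:

```python
import math
import torch

PEAK_LR, FINAL_FRAC = 3e-4, 0.10
WARMUP_STEPS, TOTAL_STEPS = 2_000, 250_000  # assumed; not stated on the card

model = torch.nn.Linear(8, 8)  # stand-in module; substitute the real model

optimizer = torch.optim.AdamW(
    model.parameters(), lr=PEAK_LR, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    # Linear warmup, then cosine decay to 10% of the peak LR.
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1 + math.cos(math.pi * min(1.0, progress)))
    return FINAL_FRAC + (1 - FINAL_FRAC) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```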

## Data Pipeline

Built on Crucible (RUPS skyline weighting plus EESD submodular diversity selection with formal (1 - 1/e) guarantees) and the DDM analyzer, which models reasoning as Ornstein-Uhlenbeck evidence accumulation. This is the same pipeline that produced Harmonic-9B and Ornstein-27B, where 799 DDM-curated rows outperformed datasets 20–100x larger.
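Crucible's exact EESD objective is not spelled out here, but the (1 - 1/e) bound is the classic Nemhauser-Wolsey-Fisher guarantee for greedy maximization of any monotone submodular function under a cardinality constraint. A generic sketch with a simple coverage objective (all names illustrative, not Crucible's API):

```python
# Greedy selection for a monotone submodular objective. For any such
# objective under a cardinality constraint, greedy reaches at least
# (1 - 1/e) of the optimum (Nemhauser, Wolsey, Fisher, 1978).
# The coverage objective below is illustrative, not Crucible's EESD.

def greedy_select(docs: dict[str, set[str]], k: int) -> list[str]:
    """docs maps a doc id to its feature set; pick k docs maximizing coverage."""
    selected: list[str] = []
    covered: set[str] = set()
    for _ in range(k):
        best_id, best_gain = None, 0
        for doc_id, feats in docs.items():
            if doc_id in selected:
                continue
            gain = len(feats - covered)  # marginal coverage gain
            if gain > best_gain:
                best_id, best_gain = doc_id, gain
        if best_id is None:  # no remaining doc adds coverage
            break
        selected.append(best_id)
        covered |= docs[best_id]
    return selected

corpus = {
    "a": {"math", "proof"},
    "b": {"math", "code"},
    "c": {"code", "web", "chat"},
}
print(greedy_select(corpus, k=2))  # ['c', 'a'] -- covers all 5 features
```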

## Training Phases

```
Phase 1 (TST):       First 50% of tokens in superposition mode
Phase 2 (Recovery):  Remaining 50% as standard autoregressive NTP
Phase 3 (Extension): Mid-training at 64K context, YaRN scaling
Phase 4 (Anneal):    Crucible-selected high-quality data, DDM loss weights
```
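For Phase 3-style extension beyond the 64K native window, `transformers` normally enables YaRN through the config's `rope_scaling` entry. A sketch of the usual pattern for Qwen-style models; whether BOREAL's custom modeling code reads the same keys is an assumption:

```python
# Sketch: extending the 64K native window to 256K with YaRN, following the
# rope_scaling convention used by Qwen-style models in transformers.
# Whether BOREAL's custom code reads these exact keys is an assumption.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("GestaltLabs/BOREAL-2B", trust_remote_code=True)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                             # 65,536 * 4 = 262,144
    "original_max_position_embeddings": 65536,
}
config.max_position_embeddings = 262144

model = AutoModelForCausalLM.from_pretrained(
    "GestaltLabs/BOREAL-2B", config=config, trust_remote_code=True
)
```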

## Expected Performance

| Benchmark | Target | Comparison |
| --- | --- | --- |
| HellaSwag | 55–62% | Qwen3-1.7B: ~58% |
| ARC-Easy | 65–72% | Qwen3-1.7B: ~68% |
| PIQA | 72–78% | Qwen3-1.7B: ~75% |
| WinoGrande | 58–64% | Qwen3-1.7B: ~60% |
| MMLU (5-shot) | 28–35% | Qwen3-1.7B: ~32% |

BOREAL-2B targets parity with Qwen3-1.7B while supporting 4x the native context length and using roughly half the inference memory at long context.
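The long-context memory advantage follows from the layer mix: only the 8 GQA layers keep a per-token KV cache, while the 24 DeltaNet layers hold a fixed-size recurrent state regardless of sequence length. A back-of-envelope estimate from the architecture table (BF16, single sequence, ignoring the constant DeltaNet state and framework overhead):

```python
# KV-cache estimate from the architecture table: only the full-attention
# (GQA) layers cache K/V; DeltaNet layers use a constant-size state.
KV_LAYERS = 8       # full-attention layers (of 32 total)
KV_HEADS  = 4       # GQA KV heads
HEAD_DIM  = 256
BYTES     = 2       # BF16
SEQ_LEN   = 65_536  # native context

per_token = 2 * KV_HEADS * HEAD_DIM * BYTES * KV_LAYERS  # K and V
total_gib = per_token * SEQ_LEN / 2**30
print(f"{per_token} B/token -> {total_gib:.1f} GiB at 64K")  # 32768 B/token -> 2.0 GiB

# A same-shape pure transformer caching K/V in all 32 layers would need 4x
# this. The "roughly half of Qwen3-1.7B" figure depends on Qwen3's own KV
# layout, so treat it as the card's estimate rather than a measurement.
```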

## The BOREAL Family

Every model trained in Canada. Every weight learned from random init.

| Model | Params | Type | Context | Status |
| --- | --- | --- | --- | --- |
| BOREAL-250M | 250M | Dense | 32K | Seeking compute |
| BOREAL-2B | 2B | Dense | 64K | Seeking compute |
| BOREAL-10B-MoE | ~10B / ~2B active | DeltaNet + MoE | 256K | Cluster required |

## License

Apache 2.0. Built for Canadian researchers, startups, and institutions. No strings. No API keys. No foreign dependency.

## Author

Built by DJLougen / GestaltLabs. University of Toronto. Toronto, Canada.

☕ Support sovereign AI on Ko-fi