---
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - boreal
  - deltanet
  - hybrid
  - linear-attention
  - swiglu
  - rmsnorm
  - rope
  - gqa
  - pretraining
  - tst
  - crucible
  - ddm
  - submodular
  - data-curation
  - sovereign-ai
  - canadian-ai
  - community
  - canada
pipeline_tag: text-generation
base_model: GestaltLabs/BOREAL-2B
---

# BOREAL-2B: Canadian Sovereign AI

**Balanced Orthogonal Recurrent Expert Attention Layers**

Built in Toronto. Apache 2.0.

A 2-billion-parameter dense hybrid language model pretrained from scratch on 500B–2T tokens. BOREAL-2B is the first model in the BOREAL family intended for real downstream use: the one you download, fine-tune, quantize, and build on. It carries forward the Gated DeltaNet architecture validated by BOREAL-250M and scales it to a size where benchmarks become meaningful.

It targets competitive performance against Qwen3-1.7B and SmolLM2-1.7B while offering a native 64K context, 4x what pure Transformers at this scale can practically support.
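Once the checkpoint is published, loading should follow the standard `transformers` flow. A minimal sketch, assuming the repo ships custom DeltaNet modeling code (hence `trust_remote_code=True`):

```python
# Minimal loading sketch. Assumes the published checkpoint bundles custom
# DeltaNet modeling code, which requires trust_remote_code=True.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GestaltLabs/BOREAL-2B")
model = AutoModelForCausalLM.from_pretrained(
    "GestaltLabs/BOREAL-2B",
    torch_dtype="bfloat16",  # weights are trained in BF16
    trust_remote_code=True,
)

inputs = tokenizer("The boreal forest stretches", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```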

## Architecture

| Component | Detail |
| --- | --- |
| Type | Dense hybrid: Gated DeltaNet + GQA |
| Parameters | ~2B |
| Hidden size | 2,048 |
| Layers | 32 (24 DeltaNet + 8 full attention) |
| Ratio | 3:1 linear-to-full attention |
| Full attention | GQA: 16 query heads, 4 KV heads, head_dim=256 |
| DeltaNet | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| Conv kernel | 4 |
| FFN | SwiGLU, intermediate=6,144 |
| Norm | RMSNorm, eps=1e-6 |
| Position | RoPE, theta=10M, partial_rotary_factor=0.25 |
| Output gate | Swish-gated |
| Vocab | 151,936 (Qwen3 tokenizer) |
| Context | 65,536 tokens native, extensible to 256K |
| MTP | 1 multi-token prediction head |
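For reference, the table above collected as a plain Python dataclass. Field names here are illustrative, not the released config schema:

```python
from dataclasses import dataclass

# Illustrative summary of the architecture table; field names are not
# the released config schema.
@dataclass
class Boreal2BConfig:
    hidden_size: int = 2048
    num_layers: int = 32             # 24 DeltaNet + 8 full attention (3:1)
    full_attn_layers: int = 8        # GQA layers
    num_q_heads: int = 16            # GQA query heads
    num_kv_heads: int = 4            # GQA KV heads
    attn_head_dim: int = 256
    deltanet_qk_heads: int = 8
    deltanet_v_heads: int = 16
    deltanet_head_dim: int = 128
    conv_kernel: int = 4
    intermediate_size: int = 6144    # SwiGLU FFN
    rms_norm_eps: float = 1e-6
    rope_theta: float = 10_000_000
    partial_rotary_factor: float = 0.25
    vocab_size: int = 151_936        # Qwen3 tokenizer
    max_position_embeddings: int = 65_536
    num_mtp_heads: int = 1           # multi-token prediction
```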

## Training

| Parameter | Value |
| --- | --- |
| Data tokens | 500B–2T |
| Corpus | FineWeb-Edu + StarCoder2 + OpenWebMath + curated multilingual |
| Method | Token Superposition Training (Nous Research) |
| TST config | s=4, r=0.5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Peak LR | 3e-4 |
| Schedule | Cosine decay to 10% of peak |
| Weight decay | 0.1 |
| Batch size | ~4M tokens/step |
| Precision | BF16 weights, FP32 DeltaNet states |
| Location | Toronto, Ontario, Canada |
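A minimal PyTorch sketch of the stated optimizer and schedule (AdamW with β₁=0.9, β₂=0.95, weight decay 0.1, cosine decay from the 3e-4 peak to 10% of peak). The warmup and total step counts are assumptions, since the card does not state them:

```python
import math
import torch

PEAK_LR, FINAL_FRAC = 3e-4, 0.10
WARMUP_STEPS, TOTAL_STEPS = 2_000, 250_000  # assumed; not stated on the card

model = torch.nn.Linear(8, 8)  # stand-in module; substitute the real model

optimizer = torch.optim.AdamW(
    model.parameters(), lr=PEAK_LR, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    # Linear warmup, then cosine decay to 10% of the peak LR.
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1 + math.cos(math.pi * min(1.0, progress)))
    return FINAL_FRAC + (1 - FINAL_FRAC) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```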

## Data Pipeline

Built on Crucible (RUPS skyline weighting plus EESD submodular diversity selection with formal (1 - 1/e) guarantees) and the DDM analyzer, which models reasoning as Ornstein-Uhlenbeck evidence accumulation. This is the same pipeline that produced Harmonic-9B and Ornstein-27B, where 799 DDM-curated rows outperformed datasets 20–100x larger.
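Crucible's exact EESD objective is not spelled out here, but the (1 - 1/e) bound is the classic Nemhauser-Wolsey-Fisher guarantee for greedy maximization of any monotone submodular function under a cardinality constraint. A generic sketch with a simple coverage objective (all names illustrative, not Crucible's API):

```python
# Greedy selection for a monotone submodular objective. For any such
# objective under a cardinality constraint, greedy reaches at least
# (1 - 1/e) of the optimum (Nemhauser, Wolsey, Fisher, 1978).
# The coverage objective below is illustrative, not Crucible's EESD.

def greedy_select(docs: dict[str, set[str]], k: int) -> list[str]:
    """docs maps a doc id to its feature set; pick k docs maximizing coverage."""
    selected: list[str] = []
    covered: set[str] = set()
    for _ in range(k):
        best_id, best_gain = None, 0
        for doc_id, feats in docs.items():
            if doc_id in selected:
                continue
            gain = len(feats - covered)  # marginal coverage gain
            if gain > best_gain:
                best_id, best_gain = doc_id, gain
        if best_id is None:  # no remaining doc adds coverage
            break
        selected.append(best_id)
        covered |= docs[best_id]
    return selected

corpus = {
    "a": {"math", "proof"},
    "b": {"math", "code"},
    "c": {"code", "web", "chat"},
}
print(greedy_select(corpus, k=2))  # ['c', 'a'] -- covers all 5 features
```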

## Training Phases

```
Phase 1 (TST):       First 50% of tokens in superposition mode
Phase 2 (Recovery):  Remaining 50% as standard autoregressive NTP
Phase 3 (Extension): Mid-training at 64K context, YaRN scaling
Phase 4 (Anneal):    Crucible-selected high-quality data, DDM loss weights
```
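For Phase 3-style extension beyond the 64K native window, `transformers` normally enables YaRN through the config's `rope_scaling` entry. A sketch of the usual pattern for Qwen-style models; whether BOREAL's custom modeling code reads the same keys is an assumption:

```python
# Sketch: extending the 64K native window to 256K with YaRN, following the
# rope_scaling convention used by Qwen-style models in transformers.
# Whether BOREAL's custom code reads these exact keys is an assumption.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("GestaltLabs/BOREAL-2B", trust_remote_code=True)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                             # 65,536 * 4 = 262,144
    "original_max_position_embeddings": 65536,
}
config.max_position_embeddings = 262144

model = AutoModelForCausalLM.from_pretrained(
    "GestaltLabs/BOREAL-2B", config=config, trust_remote_code=True
)
```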

## Expected Performance

| Benchmark | Target | Comparison |
| --- | --- | --- |
| HellaSwag | 55–62% | Qwen3-1.7B: ~58% |
| ARC-Easy | 65–72% | Qwen3-1.7B: ~68% |
| PIQA | 72–78% | Qwen3-1.7B: ~75% |
| WinoGrande | 58–64% | Qwen3-1.7B: ~60% |
| MMLU (5-shot) | 28–35% | Qwen3-1.7B: ~32% |

BOREAL-2B targets parity with Qwen3-1.7B while supporting 4x the native context length and using roughly half the inference memory at long context.
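The long-context memory advantage follows from the layer mix: only the 8 GQA layers keep a per-token KV cache, while the 24 DeltaNet layers hold a fixed-size recurrent state regardless of sequence length. A back-of-envelope estimate from the architecture table (BF16, single sequence, ignoring the constant DeltaNet state and framework overhead):

```python
# KV-cache estimate from the architecture table: only the full-attention
# (GQA) layers cache K/V; DeltaNet layers use a constant-size state.
KV_LAYERS = 8       # full-attention layers (of 32 total)
KV_HEADS  = 4       # GQA KV heads
HEAD_DIM  = 256
BYTES     = 2       # BF16
SEQ_LEN   = 65_536  # native context

per_token = 2 * KV_HEADS * HEAD_DIM * BYTES * KV_LAYERS  # K and V
total_gib = per_token * SEQ_LEN / 2**30
print(f"{per_token} B/token -> {total_gib:.1f} GiB at 64K")  # 32768 B/token -> 2.0 GiB

# A same-shape pure transformer caching K/V in all 32 layers would need 4x
# this. The "roughly half of Qwen3-1.7B" figure depends on Qwen3's own KV
# layout, so treat it as the card's estimate rather than a measurement.
```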

## The BOREAL Family

Every model trained in Canada. Every weight learned from random init.

| Model | Params | Type | Context | Status |
| --- | --- | --- | --- | --- |
| BOREAL-250M | 250M | Dense | 32K | Seeking compute |
| BOREAL-2B | 2B | Dense | 64K | Seeking compute |
| BOREAL-10B-MoE | ~10B / ~2B active | DeltaNet + MoE | 256K | Cluster required |

## License

Apache 2.0. Built for Canadian researchers, startups, and institutions. No strings. No API keys. No foreign dependency.

## Author

Built by DJLougen / GestaltLabs. University of Toronto. Toronto, Canada.

☕ Support sovereign AI on Ko-fi