---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- pretraining
- tst
- crucible
- ddm
- submodular
- data-curation
- sovereign-ai
- canadian-ai
- edge
- efficient
- canada
pipeline_tag: text-generation
base_model: GestaltLabs/BOREAL-250M
---
# BOREAL-250M — Sovereign Canadian AI
Balanced Orthogonal Recurrent Expert Attention Layers
Built in Toronto. Trained on Canadian soil. Not dependent on anyone's compute, anyone's models, or anyone's permission.
A 250M-parameter dense hybrid language model pretrained from scratch. Built on the Gated DeltaNet architecture — the same hybrid linear-attention design that powers Qwen3.5 and Qwen3.6 — trained with Token Superposition Training (TST) for maximum data efficiency per GPU-hour.
BOREAL-250M is the smallest member of the BOREAL family and the first proof point in a sovereign Canadian AI pipeline. It validates that a single researcher on a single GPU in Toronto can build competitive model architectures without relying on US hyperscaler compute, Chinese base models, or EU consortia. The boreal forest covers 60% of Canada. BOREAL models cover the gap between Canadian AI ambition and imported infrastructure.
## Why Canadian AI Sovereignty Matters
Every major Canadian AI deployment today runs on someone else's model. Qwen (Alibaba). Llama (Meta). DeepSeek. Mistral. Claude. Canada produces world-class AI researchers — UofT, Mila, Vector, Amii — and ships them to San Francisco and Beijing. The models, the compute, and the decisions about what gets built stay elsewhere.
BOREAL is a bet that this doesn't have to be true. A single DGX Spark in a Toronto apartment, an Apache 2.0 license, and an architecture that combines proven innovations from open research. No distillation from proprietary models. No dependency on anyone's API. Built here. Owned here.
## Architecture
| Component | Detail |
|---|---|
| Type | Dense hybrid — Gated DeltaNet + GQA |
| Parameters | 250M |
| Hidden size | 1,024 |
| Layers | 12 (9 DeltaNet + 3 full attention) |
| Ratio | 3:1 linear-to-full attention |
| Full attention | GQA: 8 query heads, 2 KV heads, head_dim=256 |
| DeltaNet | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| Conv kernel | 4 (local context mixing) |
| FFN | SwiGLU, intermediate=3,072 |
| Norm | RMSNorm, eps=1e-6 |
| Position | RoPE, theta=10M, partial_rotary_factor=0.25 |
| Output gate | Swish-gated attention outputs |
| Vocab | 151,936 (Qwen3 tokenizer) |
| Context | 32,768 tokens native |
| MTP | 1 multi-token prediction head |
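
The FFN row above follows the standard SwiGLU layout. A minimal PyTorch sketch with the listed sizes (hidden=1,024, intermediate=3,072); module names are illustrative, not the released layer names:

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, hidden_size=1024, intermediate_size=3072):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # SwiGLU: down( silu(gate(x)) * up(x) )
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```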
### Architecture Rationale
Gated DeltaNet over pure attention. 75% of layers use linear attention with data-dependent forgetting gates. Each DeltaNet layer maintains a fixed-size recurrent state S_t = β_t · S_{t-1} + k_t ⊗ v_t, where β_t is a learned sigmoid gate controlling information retention. The result: O(n) compute and a constant-size state on 75% of layers, enabling native long-context processing without quadratic memory blowup.
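
A minimal PyTorch sketch of that recurrent update; shapes and names are illustrative, not the BOREAL implementation:

```python
import torch

def gated_linear_attention_step(S_prev, k_t, v_t, beta_t):
    """One step of S_t = beta_t * S_{t-1} + k_t (outer) v_t.

    S_prev : (d_k, d_v) fixed-size recurrent state
    k_t    : (d_k,)     key for the current token
    v_t    : (d_v,)     value for the current token
    beta_t : scalar in (0, 1), the learned sigmoid forgetting gate
    """
    return beta_t * S_prev + torch.outer(k_t, v_t)

# The state stays (d_k, d_v) regardless of sequence length, which is what
# gives O(n) compute and constant memory on the DeltaNet layers.
d_k, d_v = 128, 128
S = torch.zeros(d_k, d_v)
for _ in range(1024):
    k, v = torch.randn(d_k), torch.randn(d_v)
    beta = torch.sigmoid(torch.randn(()))   # stand-in for the learned gate
    S = gated_linear_attention_step(S, k, v, beta)
```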
Larger head dims (256). Following Qwen3.5 and DeepSeek-V4, head_dim jumps from the traditional 128 to 256. Fewer heads with more per-head capacity, paired with aggressive GQA (4:1 query-to-KV ratio).
Partial RoPE (0.25). Only 25% of each head's dimensions receive rotary positional encoding. The remaining 75% pass through without positional rotation, creating a natural position-agnostic pathway that complements the DeltaNet layers' recurrent state.
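
A hedged sketch of the partial-RoPE split, assuming the standard rotate-half formulation; function and argument names are illustrative:

```python
import torch

def apply_partial_rope(x, cos, sin, partial_rotary_factor=0.25):
    """x: (..., head_dim); cos/sin: (..., rotary_dim) position tables."""
    head_dim = x.shape[-1]
    rotary_dim = int(head_dim * partial_rotary_factor)   # e.g. 64 of 256
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
    # rotate-half applied to the rotary slice only
    half = rotary_dim // 2
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    x_rot = x_rot * cos + torch.cat((-x2, x1), dim=-1) * sin
    # the remaining 75% of channels carry no positional rotation
    return torch.cat((x_rot, x_pass), dim=-1)
```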
Output gating. Every attention and DeltaNet output passes through a learned Swish gate: output = attention(x) * silu(W_gate * x).
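
A minimal sketch of that gate as a module; W_gate and the module layout are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwishOutputGate(nn.Module):
    """output = attention(x) * silu(W_gate @ x)"""
    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        self.w_gate = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor, mixer_out: torch.Tensor) -> torch.Tensor:
        # mixer_out is the attention (or DeltaNet) output for the same tokens
        return mixer_out * F.silu(self.w_gate(x))
```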
## Training
| Parameter | Value |
|---|---|
| Data tokens | 10B–200B (overtrained regime) |
| Corpus | FineWeb-Edu + StarCoder2 code |
| Training method | Token Superposition Training (Nous Research) |
| TST config | s=4 bags, r=0.5 fraction |
| Objective | Phase 1: multi-hot cross-entropy → Phase 2: standard NTP |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Peak LR | 3e-4 (from MuP sweep) |
| Schedule | Cosine decay to 10% peak |
| Weight decay | 0.1 |
| Batch size | ~4M tokens/step |
| Precision | BF16 weights, FP32 DeltaNet states |
| Location | Toronto, Ontario, Canada |
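
For reference, the schedule row above (cosine decay from the 3e-4 peak down to 10% of peak) corresponds roughly to the sketch below; the warmup length is an assumption, not a stated hyperparameter:

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, min_ratio=0.10, warmup_steps=2000):
    """Linear warmup (assumed), then cosine decay from peak_lr to 10% of peak."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)
```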
## Data Pipeline
BOREAL is built on Crucible — a submodular data selection framework:
- RUPS (Reward-Utility Pareto Skyline): multi-axis scoring fusion computing per-sample weights from quality and complexity axes.
- EESD (Embedding-Ensemble Submodular Diversity): lazy-greedy maximization of a weighted facility-location objective with a formal (1 - 1/e) approximation guarantee. Runs on 50K+ items without materializing the full pairwise similarity matrix (a minimal sketch follows this list).
- Length-debiasing: stratified selection across response-length quantiles.
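
A minimal sketch of weighted facility-location selection with the lazy-greedy trick, in the spirit of EESD; similarity here is plain cosine over unit-normalized embeddings, and all names are illustrative rather than the Crucible API:

```python
import heapq
import numpy as np

def lazy_greedy_facility_location(emb, weights, k):
    """emb: (n, d) unit-normalized embeddings; weights: (n,) per-sample weights.
    Greedily selects k indices maximizing sum_i w_i * max_{j in S} <emb_i, emb_j>."""
    n = emb.shape[0]
    covered = np.zeros(n)                        # best similarity achieved per item

    def gain(j):                                 # marginal gain of adding item j
        return float(weights @ np.maximum(emb @ emb[j] - covered, 0.0))

    # Lazy greedy: gains only shrink as the selection grows, so stale heap
    # entries are valid upper bounds and most items never need re-evaluation.
    heap = [(-gain(j), j) for j in range(n)]
    heapq.heapify(heap)
    selected = []
    while heap and len(selected) < k:
        _, j = heapq.heappop(heap)
        fresh = gain(j)
        if heap and fresh < -heap[0][0]:         # stale bound: push back, try again
            heapq.heappush(heap, (-fresh, j))
            continue
        selected.append(j)
        covered = np.maximum(covered, emb @ emb[j])
    return selected
```

Only one row of the similarity matrix is computed at a time, which is what lets this run on 50K+ items without the full pairwise matrix.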
Samples are scored by the DDM analyzer (Drift Diffusion Model):
- Reasoning traces segmented at cognitive boundaries
- Per-segment quality extracted (self-correction, verification, exploration density)
- Ornstein-Uhlenbeck evidence accumulation: dx = θ(μ - x)·dt + σ·dW + v·signal(t)·dt (sketched after this list)
- Sustained boundary crossings flag degenerate reasoning
- Samples classified into curriculum bins with per-sample loss weights
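
A hedged Euler-Maruyama sketch of that accumulator; parameter values and the boundary rule are illustrative, not the DDM analyzer's actual settings:

```python
import numpy as np

def ou_accumulate(signal, theta=1.0, mu=0.0, sigma=0.3, v=1.0,
                  dt=0.01, boundary=1.0, sustain=25, seed=0):
    """signal: per-segment evidence values. Returns the trajectory and whether
    the accumulator stayed past the boundary for `sustain` consecutive steps
    (the degenerate-reasoning flag)."""
    rng = np.random.default_rng(seed)
    x, run, traj = 0.0, 0, []
    for s in signal:
        dW = rng.normal(0.0, np.sqrt(dt))                  # Brownian increment
        x += theta * (mu - x) * dt + sigma * dW + v * s * dt
        traj.append(x)
        run = run + 1 if abs(x) >= boundary else 0
        if run >= sustain:
            return np.array(traj), True
    return np.array(traj), False
```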
This pipeline produced Harmonic-9B and Ornstein-27B, where 799 DDM-curated rows outperformed datasets 20–100x larger. The 8-factor quality model backs it: lexical diversity r=+0.967, semantic diversity r=+0.947, verb uniqueness r=+0.852.
## TST (Token Superposition Training)
Drop-in training acceleration from Nous Research (arXiv:2605.06546). First 50% of training uses superposed embeddings — 4 tokens averaged into one, predicting the next bag of 4 via multi-hot cross-entropy. 4x data throughput. Second 50% reverts to standard NTP. 1.5–2.5x wall-time reduction with zero architecture changes.
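
A minimal sketch of the phase-1 objective under stated assumptions (s=4 tokens averaged into one input position, uniform multi-hot target over the next bag); tensor layout and the exact loss form are assumptions, not the TST reference code:

```python
import torch
import torch.nn.functional as F

def superpose(token_embeddings, s=4):
    """(batch, seq, d) -> (batch, seq // s, d): average non-overlapping bags of s tokens."""
    b, t, d = token_embeddings.shape
    return token_embeddings[:, : t - t % s].view(b, t // s, s, d).mean(dim=2)

def multi_hot_cross_entropy(logits, bag_token_ids):
    """logits: (batch, bags, vocab); bag_token_ids: (batch, bags, s).
    Target is uniform over the (unique) token ids in each next bag."""
    target = torch.zeros_like(logits).scatter_(-1, bag_token_ids, 1.0)
    target = target / target.sum(dim=-1, keepdim=True)
    return -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```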
## Sovereign AI Scaling Ladder
Every model trained in Canada. Every weight learned from random init.
| Model | Params | Type | Context | Status |
|---|---|---|---|---|
| BOREAL-250M | 250M | Dense | 32K | Planned — seeking compute |
| BOREAL-2B | 2B | Dense DeltaNet | 64K | Planned |
| BOREAL-10B-MoE | ~10B / ~2B active | DeltaNet + MoE | 256K | Planned |
## Expectations
BOREAL-250M is an architecture validation tool. Expect:
- Coherent text generation
- Above-random benchmarks: 35–40% HellaSwag, 55–60% ARC-Easy
- Clean scaling curves through 200B+ tokens
- Long-context advantage over pure Transformer baselines at 8K+
For a model you'd actually use, see BOREAL-2B.
## License
Apache 2.0. Built for Canadian researchers, startups, and institutions. No strings. No API keys. No foreign dependency.
## Author
Built by DJLougen / GestaltLabs.
PhD candidate in visual neuroscience, University of Toronto. Pretraining on a DGX Spark in a Toronto apartment. Compute self-funded. No institutional backing. No cloud credits. Just a thesis to finish and a conviction that Canada should own its own models.
