---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- pretraining
- tst
- crucible
- ddm
- submodular
- data-curation
- sovereign-ai
- canadian-ai
- edge
- efficient
- canada
pipeline_tag: text-generation
base_model: GestaltLabs/BOREAL-250M
---
![BOREAL](https://huggingface.co/GestaltLabs/BOREAL-250M/resolve/main/Boreal.png)
# BOREAL-250M — Sovereign Canadian AI
**B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers
Built in Toronto. Trained on Canadian soil. Not dependent on anyone's compute,
anyone's models, or anyone's permission.
A 250M-parameter dense hybrid language model pretrained from scratch. Built on the
Gated DeltaNet architecture — the same hybrid linear-attention design that powers
Qwen3.5 and Qwen3.6 — trained with Token Superposition Training (TST) for maximum
data efficiency per GPU-hour.
BOREAL-250M is the smallest member of the BOREAL family and the first proof point
in a sovereign Canadian AI pipeline. It validates that a single researcher on a
single GPU in Toronto can build competitive model architectures without relying
on US hyperscaler compute, Chinese base models, or EU consortia. The boreal forest
covers 60% of Canada. BOREAL models cover the gap between Canadian AI ambition
and imported infrastructure.
## Why Canadian AI Sovereignty Matters
Every major Canadian AI deployment today runs on someone else's model. Qwen
(Alibaba). Llama (Meta). DeepSeek. Mistral. Claude. Canada produces world-class
AI researchers — UofT, Mila, Vector, Amii — and ships them to San Francisco
and Beijing. The models, the compute, and the decisions about what gets built
stay elsewhere.
BOREAL is a bet that this doesn't have to be true. A single DGX Spark in a
Toronto apartment, an Apache 2.0 license, and an architecture that combines
proven innovations from open research. No distillation from proprietary models.
No dependency on anyone's API. Built here. Owned here.
## Architecture
| Component | Detail |
|-----------|--------|
| **Type** | Dense hybrid — Gated DeltaNet + GQA |
| **Parameters** | 250M |
| **Hidden size** | 1,024 |
| **Layers** | 12 (9 DeltaNet + 3 full attention) |
| **Ratio** | 3:1 linear-to-full attention |
| **Full attention** | GQA: 8 query heads, 2 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| **Conv kernel** | 4 (local context mixing) |
| **FFN** | SwiGLU, intermediate=3,072 |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated attention outputs |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 32,768 tokens native |
| **MTP** | 1 multi-token prediction head |
### Architecture Rationale
**Gated DeltaNet over pure attention.** 75% of layers use linear attention
with data-dependent forgetting gates. Each DeltaNet layer maintains a
fixed-size recurrent state `S_t = beta_t * S_{t-1} + k_t ⊗ v_t`, where `beta_t`
is a learned sigmoid gate controlling information retention. The result: O(n)
on 75% of layers, enabling native long-context processing without quadratic
memory blowup.
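A minimal PyTorch sketch of that simplified recurrence (generic shapes and names, not the release kernels):

```python
import torch

def gated_state_update(S, k, v, beta):
    """One step of the simplified gated update S_t = beta_t * S_{t-1} + k_t (outer) v_t.

    S:    (d_k, d_v) fixed-size recurrent state
    k:    (d_k,)     key at time t
    v:    (d_v,)     value at time t
    beta: scalar in (0, 1), the learned sigmoid forgetting gate
    """
    return beta * S + torch.outer(k, v)

# Toy rollout: the state stays (d_k, d_v) no matter how long the sequence gets.
d_k, d_v = 4, 8
S = torch.zeros(d_k, d_v)
for _ in range(1000):
    k, v = torch.randn(d_k), torch.randn(d_v)
    beta = torch.sigmoid(torch.randn(()))      # data-dependent gate
    S = gated_state_update(S, k, v, beta)
print(S.shape)                                  # torch.Size([4, 8])
```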
**Larger head dims (256).** Following Qwen3.5 and DeepSeek-V4, head_dim jumps
from the traditional 128 to 256. Fewer heads with more per-head capacity,
paired with aggressive GQA (4:1 query-to-KV ratio).
**Partial RoPE (0.25).** Only 25% of each head's dimensions receive rotary
positional encoding. The remaining 75% pass through without positional
rotation, giving the model position-agnostic channels that pair naturally
with the DeltaNet layers' recurrent state.
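A minimal sketch of what `partial_rotary_factor=0.25` means in practice, using the theta=10M value from the table (illustrative code, not the model's implementation):

```python
import torch

def apply_partial_rope(x, cos, sin, partial_rotary_factor=0.25):
    """Rotate only the first fraction of each head's dimensions.

    x:        (seq, head_dim) activations for one head
    cos, sin: (seq, rot_dim)  precomputed rotary tables
    """
    rot_dim = int(x.shape[-1] * partial_rotary_factor)
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]    # 25% rotated, 75% untouched
    x1, x2 = x_rot.chunk(2, dim=-1)                       # rotate-half formulation
    x_rot = x_rot * cos + torch.cat((-x2, x1), dim=-1) * sin
    return torch.cat((x_rot, x_pass), dim=-1)

# Build rotary tables for head_dim=256, so only the first 64 dims carry positions.
seq, head_dim, theta = 8, 256, 10_000_000
rot_dim = head_dim // 4
inv_freq = 1.0 / (theta ** (torch.arange(0, rot_dim, 2).float() / rot_dim))
freqs = torch.outer(torch.arange(seq).float(), inv_freq)
cos, sin = map(lambda f: torch.cat((f, f), dim=-1), (freqs.cos(), freqs.sin()))
out = apply_partial_rope(torch.randn(seq, head_dim), cos, sin)
print(out.shape)  # torch.Size([8, 256])
```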
**Output gating.** Every attention and DeltaNet output passes through a
learned Swish gate: `output = attention(x) * silu(W_gate * x)`.
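A small module sketch of that gate (names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwishOutputGate(nn.Module):
    """output = mixer(x) * silu(W_gate @ x), applied to attention and DeltaNet outputs."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, mixer_out: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # mixer_out: the attention (or DeltaNet) output for the same tokens x
        return mixer_out * F.silu(self.gate_proj(x))
```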
## Training
| Parameter | Value |
|-----------|-------|
| **Data tokens** | 10B–200B (overtrained regime) |
| **Corpus** | FineWeb-Edu + StarCoder2 code |
| **Training method** | Token Superposition Training (Nous Research) |
| **TST config** | s=4 bags, r=0.5 fraction |
| **Objective** | Phase 1: multi-hot cross-entropy → Phase 2: standard NTP |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 3e-4 (from MuP sweep) |
| **Schedule** | Cosine decay to 10% peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states |
| **Location** | Toronto, Ontario, Canada |
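For reference, the optimizer and schedule rows above translate roughly into standard PyTorch pieces like this (a sketch; the warmup length is an assumption, since the table does not give one):

```python
import math
import torch

def build_optimizer_and_schedule(model, total_steps, warmup_steps=1_000):
    # AdamW with the betas and weight decay from the table.
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4,
                            betas=(0.9, 0.95), weight_decay=0.1)

    def lr_lambda(step):
        if step < warmup_steps:                          # linear warmup (assumed)
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine to 10% of peak

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```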
### Data Pipeline
BOREAL is built on **Crucible** — a submodular data selection framework:
- **RUPS** (Reward-Utility Pareto Skyline): multi-axis scoring fusion
computing per-sample weights from quality and complexity axes.
- **EESD** (Embedding-Ensemble Submodular Diversity): lazy-greedy
  maximization of a weighted facility-location objective with the formal
  (1 - 1/e) approximation guarantee. Runs on 50K+ items without
  materializing the full pairwise similarity matrix (a minimal sketch
  follows this list).
- **Length-debiasing**: stratified selection across response-length
quantiles.
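A minimal sketch of the lazy-greedy, weighted facility-location step from the EESD bullet above, computing one similarity row at a time rather than the full pairwise matrix (not the Crucible implementation; names and defaults are illustrative):

```python
import heapq
import numpy as np

def lazy_greedy_facility_location(X, weights, k):
    """Pick k rows of X maximizing F(A) = sum_i w_i * max_{j in A} sim(i, j)
    with cosine similarity, refreshing marginal gains lazily via a max-heap."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)           # cosine geometry
    n = X.shape[0]
    cur_max = np.zeros(n)                                       # best coverage per item so far
    # Initial exact gains (similarities clipped at zero), one O(n*d) row at a time.
    heap = [(-float(weights @ np.maximum(X @ X[j], 0.0)), j, 0) for j in range(n)]
    heapq.heapify(heap)
    selected = []
    while len(selected) < k and heap:
        neg_gain, j, stamp = heapq.heappop(heap)
        if stamp == len(selected):                              # gain is up to date: accept
            selected.append(j)
            cur_max = np.maximum(cur_max, X @ X[j])
        else:                                                   # stale upper bound: refresh, push back
            gain = float(weights @ np.maximum(X @ X[j] - cur_max, 0.0))
            heapq.heappush(heap, (-gain, j, len(selected)))
    return selected

# Toy usage: 1,000 items, 32-dim embeddings, uniform weights, pick 10.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 32))
print(lazy_greedy_facility_location(X, np.ones(1_000), 10))
```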
Samples are scored by the **DDM analyzer** (Drift Diffusion Model):
1. Reasoning traces segmented at cognitive boundaries
2. Per-segment quality extracted (self-correction, verification,
exploration density)
3. Ornstein-Uhlenbeck evidence accumulation: `dx = θ(μ - x)·dt + σ·dW + v·signal(t)·dt` (sketched after this list)
4. Sustained boundary crossings flag degenerate reasoning
5. Samples classified into curriculum bins with per-sample loss weights
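The OU accumulation in step 3 can be simulated with a few lines of Euler-Maruyama integration (parameter values here are made up for illustration, not the analyzer's settings):

```python
import numpy as np

def ou_accumulate(signal, theta=1.0, mu=0.0, sigma=0.3, v=1.0, dt=0.01, bound=0.6, seed=0):
    """Integrate dx = theta*(mu - x)*dt + sigma*dW + v*signal(t)*dt and report
    the steps where |x| crosses the boundary (candidate degenerate-reasoning flags)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(len(signal) + 1)
    for t, s in enumerate(signal):
        dW = rng.normal(0.0, np.sqrt(dt))                     # Brownian increment
        x[t + 1] = x[t] + theta * (mu - x[t]) * dt + sigma * dW + v * s * dt
    return x, np.flatnonzero(np.abs(x) >= bound)

# Example: a per-segment quality signal that turns negative midway through a trace.
signal = np.concatenate([np.full(200, 0.2), np.full(200, -0.8)])
x, crossings = ou_accumulate(signal)
print("first boundary crossing:", int(crossings[0]) if len(crossings) else None)
```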
This pipeline produced Harmonic-9B and Ornstein-27B, where 799 DDM-curated
rows outperformed datasets 20–100x larger. An 8-factor quality model backs
the scoring: lexical diversity r=+0.967, semantic diversity r=+0.947, verb
uniqueness r=+0.852.
### TST (Token Superposition Training)
Drop-in training acceleration from Nous Research (arXiv:2605.06546). First
50% of training uses superposed embeddings — 4 tokens averaged into one,
predicting the next bag of 4 via multi-hot cross-entropy. 4x data throughput.
Second 50% reverts to standard NTP. 1.5–2.5x wall-time reduction with zero
architecture changes.
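A toy sketch of the Phase 1 mechanics, under the assumption that bag composition is simple embedding averaging and the multi-hot target is normalized into a distribution (the paper has the authoritative recipe):

```python
import torch
import torch.nn.functional as F

def superpose_and_targets(token_ids, embedding, vocab_size, s=4):
    """Fold every bag of s consecutive tokens into one averaged embedding and
    build a multi-hot target over the next bag."""
    ids = token_ids[: (len(token_ids) // s) * s].view(-1, s)      # (num_bags, s)
    superposed = embedding(ids).mean(dim=1)                       # (num_bags, d)
    targets = torch.zeros(len(ids), vocab_size).scatter_(1, ids, 1.0)
    targets = targets / targets.sum(dim=1, keepdim=True)          # multi-hot -> distribution
    return superposed[:-1], targets[1:]                           # predict the *next* bag

# Toy usage with a random embedding table and a stand-in output head.
vocab, d = 100, 16
emb = torch.nn.Embedding(vocab, d)
inputs, targets = superpose_and_targets(torch.randint(0, vocab, (64,)), emb, vocab)
logits = inputs @ torch.randn(d, vocab)
loss = F.cross_entropy(logits, targets)       # soft-label cross-entropy over bags
print(inputs.shape, targets.shape, round(loss.item(), 3))
```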
## Sovereign AI Scaling Ladder
Every model trained in Canada. Every weight learned from random init.
| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **BOREAL-250M** | 250M | Dense | 32K | Planned — seeking compute |
| **[BOREAL-2B](https://huggingface.co/GestaltLabs/BOREAL-2B)** | 2B | Dense DeltaNet | 64K | Planned |
| **[BOREAL-10B-MoE](https://huggingface.co/GestaltLabs/BOREAL-10B-MoE)** | ~10B / ~2B active | DeltaNet + MoE | 256K | Planned |
## Expectations
BOREAL-250M is an architecture validation tool. Expect:
- Coherent text generation
- Above-random benchmarks: 35–40% HellaSwag, 55–60% ARC-Easy
- Clean scaling curves through 200B+ tokens
- Long-context advantage over pure Transformer baselines at 8K+
For a model you'd actually use, see BOREAL-2B.
## License
Apache 2.0. Built for Canadian researchers, startups, and institutions.
No strings. No API keys. No foreign dependency.
## Author
Built by [DJLougen](https://huggingface.co/DJLougen) / [GestaltLabs](https://huggingface.co/GestaltLabs).
PhD candidate in visual neuroscience, University of Toronto.
Pretraining on a DGX Spark in a Toronto apartment.
Compute self-funded. No institutional backing. No cloud credits.
Just a thesis to finish and a conviction that Canada should
own its own models.
[☕ Support sovereign AI on Ko-fi](https://ko-fi.com/djlougen)