Upload README.md with huggingface_hub

README.md CHANGED
@@ -1,135 +1,131 @@
----
-language:
-- en
-license: apache-2.0
-library_name: transformers
-tags:
-- boreal
-- deltanet
-- hybrid
-- linear-attention
-- swiglu
-- rmsnorm
-- rope
-- gqa
-- pretraining
-Trained in Toronto. Compute self-funded. Architecture decisions informed by
-Qwen3.5, DeepSeek-V4, Nemotron 3, and Nous Research's TST.
-
-[☕ Support on Ko-fi](https://ko-fi.com/djlougen)

---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- pretraining
- tst
- crucible
- ddm
- submodular
- data-curation
- sovereign-ai
- canadian-ai
- community
- canada
pipeline_tag: text-generation
base_model: GestaltLabs/BOREAL-2B
---

# BOREAL-2B – Canadian Community Release

**B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers

Built in Toronto. Apache 2.0. No API key. No foreign dependency.

A 2-billion-parameter dense hybrid language model pretrained from scratch on 500B–2T tokens. BOREAL-2B is the first model in the BOREAL family intended for actual downstream use: the one you download, fine-tune, quantize, and build on. It carries forward the Gated DeltaNet architecture validated by BOREAL-250M and scales it to a size where benchmarks become meaningful.

It targets competitive performance against Qwen3-1.7B and SmolLM2-1.7B while offering a native 64K context, 4x what pure Transformers at this scale can practically support.
## Architecture

| Component | Detail |
|-----------|--------|
| **Type** | Dense hybrid: Gated DeltaNet + GQA |
| **Parameters** | ~2B |
| **Hidden size** | 2,048 |
| **Layers** | 32 (24 DeltaNet + 8 full attention) |
| **Ratio** | 3:1 linear-to-full attention |
| **Full attention** | GQA: 16 query heads, 4 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| **Conv kernel** | 4 |
| **FFN** | SwiGLU, intermediate=6,144 |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 65,536 tokens native, extensible to 256K |
| **MTP** | 1 multi-token prediction head |
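
The hybrid stack is unlikely to be a stock `transformers` class, so loading presumably goes through remote code. A minimal sketch, assuming the repo id from the card metadata and that the custom DeltaNet blocks ship as `trust_remote_code` modules:

```python
# Minimal loading sketch. Assumptions: the repo id below matches this card's
# metadata, and the hybrid DeltaNet architecture ships as remote code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GestaltLabs/BOREAL-2B"
tokenizer = AutoTokenizer.from_pretrained(model_id)  # Qwen3 tokenizer, 151,936 vocab
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # BF16 weights, matching training precision
    trust_remote_code=True,       # assumed: custom hybrid blocks
)

inputs = tokenizer("The boreal forest", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```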

## Training

| Parameter | Value |
|-----------|-------|
| **Data tokens** | 500B–2T |
| **Corpus** | FineWeb-Edu + StarCoder2 + OpenWebMath + curated multilingual |
| **Method** | Token Superposition Training (Nous Research) |
| **TST config** | s=4, r=0.5 |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 3e-4 |
| **Schedule** | Cosine decay to 10% of peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states |
| **Location** | Toronto, Ontario, Canada |
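
The table fixes the peak LR and the 10% floor but not the warmup, so the sketch below leaves warmup as a hypothetical parameter. At ~4M tokens per step, the 500B-token lower bound works out to roughly 125,000 optimizer steps.

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 3e-4,
          floor_ratio: float = 0.10, warmup_steps: int = 2_000) -> float:
    """Cosine decay from peak_lr to 10% of peak, per the table above.
    warmup_steps is an assumption; the card does not specify a warmup."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps               # assumed linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    floor = peak_lr * floor_ratio
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))

# 500B tokens at ~4M tokens/step ~= 125,000 steps
print(lr_at(0, 125_000), lr_at(125_000, 125_000))          # 0.0 and 3e-5
```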

### Data Pipeline

The corpus is curated with **Crucible** (RUPS skyline weighting plus EESD submodular diversity selection with a formal (1 - 1/e) approximation guarantee) and the **DDM analyzer**, which models reasoning as Ornstein–Uhlenbeck evidence accumulation. This is the same pipeline that produced Harmonic-9B and Ornstein-27B, where 799 DDM-curated rows outperformed datasets 20–100x larger.
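
Crucible's code is not part of this release; the (1 - 1/e) figure is the classic guarantee for greedy maximization of a monotone submodular objective. A generic facility-location sketch of that algorithm class, with hypothetical random embeddings standing in for EESD's actual features:

```python
import numpy as np

def greedy_facility_location(sim: np.ndarray, k: int) -> list[int]:
    """Pick k rows maximizing coverage f(S) = sum_j max_{i in S} sim[i, j].
    f is monotone submodular, so greedy attains >= (1 - 1/e) of the optimum.
    Illustrative stand-in for Crucible's EESD selection, not its actual code."""
    selected: list[int] = []
    best_cover = np.zeros(sim.shape[1])       # current max similarity per column
    for _ in range(k):
        # marginal coverage gain of adding each candidate row
        gains = np.maximum(sim, best_cover).sum(axis=1) - best_cover.sum()
        gains[selected] = -np.inf             # never re-pick a selected row
        i = int(np.argmax(gains))
        selected.append(i)
        best_cover = np.maximum(best_cover, sim[i])
    return selected

# Toy usage: cosine similarities between 1,000 candidate documents
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
subset = greedy_facility_location(emb @ emb.T, k=50)
```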

### Training Phases

```
Phase 1 (TST):       First 50% of tokens in superposition mode
Phase 2 (Recovery):  Remaining 50% as standard autoregressive NTP
Phase 3 (Extension): Mid-training at 64K context, YaRN scaling
Phase 4 (Anneal):    Crucible-selected high-quality data, DDM loss weights
```
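
For Phase 3's YaRN extension, 256K over the 64K native window gives a scaling factor of 4. A sketch using the standard `transformers` `rope_scaling` dict; whether BOREAL's remote code reads this field is an assumption:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Hypothetical 64K -> 256K extension via YaRN. Assumes BOREAL's remote code
# honors transformers' standard rope_scaling config, which is unverified.
config = AutoConfig.from_pretrained("GestaltLabs/BOREAL-2B", trust_remote_code=True)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                              # 262,144 / 65,536
    "original_max_position_embeddings": 65536,
}
model = AutoModelForCausalLM.from_pretrained(
    "GestaltLabs/BOREAL-2B", config=config, trust_remote_code=True
)
```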

## Expected Performance

| Benchmark | Target | Comparison |
|-----------|--------|------------|
| HellaSwag | 55–62% | Qwen3-1.7B: ~58% |
| ARC-Easy | 65–72% | Qwen3-1.7B: ~68% |
| PIQA | 72–78% | Qwen3-1.7B: ~75% |
| WinoGrande | 58–64% | Qwen3-1.7B: ~60% |
| MMLU (5-shot) | 28–35% | Qwen3-1.7B: ~32% |

BOREAL-2B targets parity with Qwen3-1.7B while supporting 4x the native context length and using roughly half the inference memory at long context.
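
A back-of-envelope check on that memory claim, assuming BF16 weights and cache, ignoring activations, and comparing against a hypothetical variant in which all 32 layers use the same GQA attention:

```python
# Only the 8 GQA layers keep a per-token KV cache; DeltaNet layers hold a
# fixed-size recurrent state, treated here as negligible.
BYTES = 2                                        # BF16
tokens = 65_536                                  # native context
kv_per_token_per_layer = 2 * 4 * 256 * BYTES     # K+V, 4 KV heads, head_dim=256

weights = 2e9 * BYTES                            # ~2B params
hybrid = weights + 8 * kv_per_token_per_layer * tokens   # 8 attention layers
full = weights + 32 * kv_per_token_per_layer * tokens    # all 32 layers attend

print(f"hybrid ~{hybrid / 2**30:.1f} GiB, "
      f"full-attention ~{full / 2**30:.1f} GiB, "
      f"ratio {hybrid / full:.2f}")              # ~5.7 vs ~11.7 GiB, ~0.49
```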

## The BOREAL Family

Every model trained in Canada. Every weight learned from random init.

| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **[BOREAL-250M](https://huggingface.co/GestaltLabs/BOREAL-250M)** | 250M | Dense | 32K | Seeking compute |
| **BOREAL-2B** | 2B | Dense | 64K | Seeking compute |
| **[BOREAL-10B-MoE](https://huggingface.co/GestaltLabs/BOREAL-10B-MoE)** | ~10B / ~2B active | DeltaNet + MoE | 256K | Cluster required |

## License

Apache 2.0. Built for Canadian researchers, startups, and institutions.
No strings. No API keys. No foreign dependency.

## Author

Built by [DJLougen](https://huggingface.co/DJLougen) / [GestaltLabs](https://huggingface.co/GestaltLabs).
University of Toronto. Toronto, Canada.

[☕ Support sovereign AI on Ko-fi](https://ko-fi.com/djlougen)