DJLougen committed on
Commit ca37c15 · verified · 1 Parent(s): 6284224

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +131 -135
README.md CHANGED
@@ -1,135 +1,131 @@
1
- ---
2
- language:
3
- - en
4
- license: apache-2.0
5
- library_name: transformers
6
- tags:
7
- - boreal
8
- - deltanet
9
- - hybrid
10
- - linear-attention
11
- - swiglu
12
- - rmsnorm
13
- - rope
14
- - gqa
15
- - pretraining
16
- - release
17
- - community
18
- - canada
19
- pipeline_tag: text-generation
20
- base_model: DJLougen/BOREAL-2B
21
- ---
22
-
23
- # BOREAL-2B
24
-
25
- **B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers
26
-
27
- A 2-billion-parameter dense hybrid language model pretrained from scratch on
28
- 500B–2T tokens. BOREAL-2B is the community release, the first model in the
29
- family intended for actual downstream use. It carries forward the Gated
30
- DeltaNet architecture validated by BOREAL-250M and scales it to a size where
31
- benchmarks become meaningful.
32
-
33
- This is the model you download, fine-tune, quantize, and build on. BOREAL-2B
34
- targets competitive performance against Qwen3-1.7B and SmolLM2-1.7B while
35
- offering native long-context support that pure Transformers at this scale
36
- can't match.
37
-
38
- ## Architecture
39
-
40
- | Component | Detail |
41
- |-----------|--------|
42
- | **Type** | Dense hybrid: Gated DeltaNet + GQA |
43
- | **Parameters** | ~2B |
44
- | **Hidden size** | 2,048 |
45
- | **Layers** | 32 (24 DeltaNet + 8 full attention) |
46
- | **Ratio** | 3:1 linear-to-full attention |
47
- | **Full attention** | GQA: 16 query heads, 4 KV heads, head_dim=256 |
48
- | **DeltaNet** | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
49
- | **Conv kernel** | 4 |
50
- | **FFN** | SwiGLU, intermediate=6,144 |
51
- | **Norm** | RMSNorm, eps=1e-6 |
52
- | **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
53
- | **Output gate** | Swish-gated attention outputs |
54
- | **Vocab** | 151,936 (Qwen3 tokenizer) |
55
- | **Context** | 65,536 tokens native, extensible to 256K |
56
- | **MTP** | 1 multi-token prediction head |
57
-
58
- ## Training
59
-
60
- | Parameter | Value |
61
- |-----------|-------|
62
- | **Data tokens** | 500B–2T (overtrained regime) |
63
- | **Corpus** | FineWeb-Edu + StarCoder2 + OpenWebMath + curated multilingual |
64
- | **TST** | Token Superposition Training, s=4, r=0.5 |
65
- | **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
66
- | **Peak LR** | 3e-4 (from BOREAL-250M MuP transfer) |
67
- | **Schedule** | Cosine decay to 10% of peak |
68
- | **Weight decay** | 0.1 |
69
- | **Batch size** | ~4M tokens/step |
70
- | **Precision** | BF16 weights, FP32 DeltaNet states |
71
- | **Hardware** | 1–2× H200 or equivalent |
72
-
73
- ### Training Phases
74
-
75
- ```
76
- Phase 1 (TST, s=4): First 50% of tokens in superposition mode
77
- Multi-hot CE objective, 4x data throughput
78
-
79
- Phase 2 (Recovery): Remaining 50% as standard autoregressive NTP
80
- Model recovers token-level precision
81
-
82
- Phase 3 (MSE): Mid-training extension at 64K context (~50B tokens)
83
- YaRN RoPE scaling to target context length
84
- ```
85
-
86
- ## Expected Performance
87
-
88
- | Benchmark | Target | Comparison |
89
- |-----------|--------|-------------|
90
- | HellaSwag | 55–62% | Qwen3-1.7B: ~58% |
91
- | ARC-Easy | 65–72% | Qwen3-1.7B: ~68% |
92
- | PIQA | 72–78% | Qwen3-1.7B: ~75% |
93
- | WinoGrande | 58–64% | Qwen3-1.7B: ~60% |
94
- | MMLU (5-shot) | 28–35% | Qwen3-1.7B: ~32% |
95
-
96
- BOREAL-2B targets parity with Qwen3-1.7B on core benchmarks while supporting
97
- 4x the native context length (64K vs 16K) and using roughly half the inference
98
- memory at long context due to DeltaNet's fixed-size recurrent state.
99
-
100
- ## What Makes It Different
101
-
102
- **Not a fine-tune.** Unlike most models at this scale (which are LoRA adapters
103
- on larger base models), BOREAL-2B is pretrained from scratch. Every weight was
104
- learned on the BOREAL architecture from random initialization. This means:
105
-
106
- - Full control over the data mixture and training curriculum
107
- - No inherited biases or ablations from upstream models
108
- - Validated scaling laws that transfer to BOREAL-10B-MoE
109
- - Clean Apache 2.0 lineage without entanglement
110
-
111
- **Real long context.** Pure Transformer models at 2B params struggle with
112
- contexts beyond 8K–16K due to KV cache memory pressure. BOREAL-2B processes
113
- 75% of layers in O(n) time with a fixed-size state, so 64K context costs
114
- roughly the same as 16K context on a Transformer baseline.
115
-
116
- ## The BOREAL Family
117
-
118
- | Model | Params | Type | Context | Status |
119
- |-------|--------|------|---------|--------|
120
- | **[BOREAL-250M](https://huggingface.co/DJLougen/BOREAL-250M)** | 250M | Dense DeltaNet | 32K | Architecture validation |
121
- | **BOREAL-2B** | 2B | Dense DeltaNet | 64K | Community release |
122
- | **[BOREAL-10B-MoE](https://huggingface.co/DJLougen/BOREAL-10B-MoE)** | ~10B / ~2B active | DeltaNet + MoE | 256K | Target model |
123
-
124
- ## License
125
-
126
- Apache 2.0.
127
-
128
- ## Author
129
-
130
- Developed by [DJLougen](https://huggingface.co/DJLougen).
131
-
132
- Trained in Toronto. Compute self-funded. Architecture decisions informed by
133
- Qwen3.5, DeepSeek-V4, Nemotron 3, and Nous Research's TST.
134
-
135
- [β˜• Support on Ko-fi](https://ko-fi.com/djlougen)
 
1
+ ---
2
+ language:
3
+ - en
4
+ license: apache-2.0
5
+ library_name: transformers
6
+ tags:
7
+ - boreal
8
+ - deltanet
9
+ - hybrid
10
+ - linear-attention
11
+ - swiglu
12
+ - rmsnorm
13
+ - rope
14
+ - gqa
15
+ - pretraining
16
+ - tst
17
+ - crucible
18
+ - ddm
19
+ - submodular
20
+ - data-curation
21
+ - sovereign-ai
22
+ - canadian-ai
23
+ - community
24
+ - canada
25
+ pipeline_tag: text-generation
26
+ base_model: GestaltLabs/BOREAL-2B
27
+ ---
28
+
29
+ # BOREAL-2B: Canadian Community Release
30
+
31
+ **B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers
32
+
33
+ Built in Toronto. Apache 2.0. No API key. No foreign dependency.
34
+
35
+ A 2-billion-parameter dense hybrid language model pretrained from scratch on
36
+ 500B–2T tokens. BOREAL-2B is the first model in the BOREAL family intended for
37
+ actual downstream use: the one you download, fine-tune, quantize, and build on.
38
+ It carries forward the Gated DeltaNet architecture validated by BOREAL-250M and
39
+ scales it to a size where benchmarks become meaningful.
40
+
41
+ BOREAL-2B targets competitive performance against Qwen3-1.7B and SmolLM2-1.7B while
42
+ offering native 64K context, 4x what pure Transformers at this scale can
43
+ practically support.
44
+
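+ A minimal loading sketch, assuming the checkpoint is published under the
+ `GestaltLabs/BOREAL-2B` repo id named in this card's metadata and that the
+ hybrid DeltaNet layers ship as custom code (hence `trust_remote_code=True`);
+ neither detail is confirmed here.
+
+ ```python
+ # Minimal usage sketch, not an official example. The repo id and the need for
+ # trust_remote_code are assumptions based on this card, not confirmed facts.
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "GestaltLabs/BOREAL-2B"  # repo id taken from the card's metadata
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,   # matches the BF16 weights noted under Training
+     trust_remote_code=True,       # custom hybrid layers assumed to need this
+ )
+
+ prompt = "The boreal forest stretches across"
+ inputs = tokenizer(prompt, return_tensors="pt")
+ output = model.generate(**inputs, max_new_tokens=64)
+ print(tokenizer.decode(output[0], skip_special_tokens=True))
+ ```
+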
45
+ ## Architecture
46
+
47
+ | Component | Detail |
48
+ |-----------|--------|
49
+ | **Type** | Dense hybrid: Gated DeltaNet + GQA |
50
+ | **Parameters** | ~2B |
51
+ | **Hidden size** | 2,048 |
52
+ | **Layers** | 32 (24 DeltaNet + 8 full attention) |
53
+ | **Ratio** | 3:1 linear-to-full attention |
54
+ | **Full attention** | GQA: 16 query heads, 4 KV heads, head_dim=256 |
55
+ | **DeltaNet** | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
56
+ | **Conv kernel** | 4 |
57
+ | **FFN** | SwiGLU, intermediate=6,144 |
58
+ | **Norm** | RMSNorm, eps=1e-6 |
59
+ | **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
60
+ | **Output gate** | Swish-gated attention outputs |
61
+ | **Vocab** | 151,936 (Qwen3 tokenizer) |
62
+ | **Context** | 65,536 tokens native, extensible to 256K |
63
+ | **MTP** | 1 multi-token prediction head |
64
+
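+ A small sketch of the 3:1 interleave from the table above. The even spacing
+ (every 4th layer is full attention) is an assumption for illustration; the
+ card does not state where the 8 GQA layers sit.
+
+ ```python
+ # Illustrative 3:1 layer interleave: every 4th layer is full (GQA) attention.
+ # The actual layer ordering in the released config may differ.
+ n_layers = 32
+ layer_types = [
+     "full_attention" if (i + 1) % 4 == 0 else "deltanet"
+     for i in range(n_layers)
+ ]
+ assert layer_types.count("deltanet") == 24
+ assert layer_types.count("full_attention") == 8
+ ```
+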
65
+ ## Training
66
+
67
+ | Parameter | Value |
68
+ |-----------|-------|
69
+ | **Data tokens** | 500B–2T |
70
+ | **Corpus** | FineWeb-Edu + StarCoder2 + OpenWebMath + curated multilingual |
71
+ | **Method** | Token Superposition Training (Nous Research) |
72
+ | **TST config** | s=4, r=0.5 |
73
+ | **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
74
+ | **Peak LR** | 3e-4 |
75
+ | **Schedule** | Cosine decay to 10% peak |
76
+ | **Weight decay** | 0.1 |
77
+ | **Batch size** | ~4M tokens/step |
78
+ | **Precision** | BF16 weights, FP32 DeltaNet states |
79
+ | **Location** | Toronto, Ontario, Canada |
80
+
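+ The optimizer and schedule rows translate directly into code; a sketch
+ follows, with warmup and total step counts as placeholders since the card
+ does not state them.
+
+ ```python
+ # AdamW + cosine decay to 10% of peak, per the table above. `warmup` and
+ # `total_steps` are illustrative placeholders, not values from this card.
+ import math
+ import torch
+
+ params = torch.nn.Linear(8, 8).parameters()   # stand-in for model parameters
+ optimizer = torch.optim.AdamW(params, lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)
+
+ def lr_at(step: int, total_steps: int, peak: float = 3e-4,
+           floor_frac: float = 0.10, warmup: int = 2000) -> float:
+     """Linear warmup, then cosine decay from peak down to 10% of peak."""
+     if step < warmup:
+         return peak * step / warmup
+     t = (step - warmup) / max(1, total_steps - warmup)
+     return peak * floor_frac + 0.5 * peak * (1 - floor_frac) * (1 + math.cos(math.pi * t))
+ ```
+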
81
+ ### Data Pipeline
82
+
83
+ Built on **Crucible**, which combines RUPS skyline weighting with EESD submodular
84
+ diversity selection (formal (1 - 1/e) guarantee), and the **DDM analyzer**, which
85
+ models reasoning as Ornstein-Uhlenbeck evidence accumulation. This is the same
86
+ pipeline that produced Harmonic-9B and Ornstein-27B, where 799 DDM-curated
87
+ rows outperformed datasets 20–100x larger.
88
+
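+ The (1 - 1/e) figure is the standard guarantee for greedy maximization of a
+ monotone submodular objective. The sketch below shows that greedy pattern with
+ a generic facility-location score; it is not the Crucible/EESD implementation,
+ whose internals are not described here.
+
+ ```python
+ # Generic greedy facility-location selection, shown only to illustrate the
+ # (1 - 1/e)-approximate greedy pattern. Not the Crucible/EESD code.
+ import numpy as np
+
+ def greedy_select(sim: np.ndarray, k: int) -> list[int]:
+     """sim[i, j] = similarity of candidate i to corpus item j; pick k rows."""
+     selected: list[int] = []
+     covered = np.zeros(sim.shape[1])   # best coverage of each corpus item so far
+     for _ in range(k):
+         gains = np.maximum(sim, covered).sum(axis=1) - covered.sum()
+         gains[selected] = -np.inf      # never re-pick an already chosen candidate
+         best = int(np.argmax(gains))
+         selected.append(best)
+         covered = np.maximum(covered, sim[best])
+     return selected
+ ```
+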
89
+ ### Training Phases
90
+
91
+ ```
92
+ Phase 1 (TST): First 50% of tokens in superposition mode
93
+ Phase 2 (Recovery): Remaining 50% as standard autoregressive NTP
94
+ Phase 3 (Extension): Mid-training at 64K context, YaRN scaling
95
+ Phase 4 (Anneal): Crucible-selected high-quality data, DDM loss weights
96
+ ```
97
+
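+ One plausible reading of the Phase 1 "superposition mode" objective with
+ s = 4: each position predicts the set of its next s tokens with a multi-label
+ loss. This is a guess at the shape of the objective, not Nous Research's
+ actual TST formulation, and it ignores the r = 0.5 parameter entirely.
+
+ ```python
+ # Hypothetical multi-hot objective for Phase 1 (s = 4). Illustration only;
+ # the real TST loss is not specified in this card.
+ import torch
+ import torch.nn.functional as F
+
+ def multi_hot_targets(token_ids: torch.Tensor, s: int, vocab_size: int) -> torch.Tensor:
+     """token_ids: (T,) int tensor -> (T - s, vocab_size) multi-hot targets."""
+     T = token_ids.shape[0]
+     targets = torch.zeros(T - s, vocab_size)
+     for t in range(T - s):
+         targets[t, token_ids[t + 1 : t + 1 + s]] = 1.0   # next s tokens, order-free
+     return targets
+
+ def superposition_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
+     # logits: (T - s, vocab_size); multi-label BCE over the vocabulary.
+     return F.binary_cross_entropy_with_logits(logits, targets)
+ ```
+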
98
+ ## Expected Performance
99
+
100
+ | Benchmark | Target | Comparison |
101
+ |-----------|--------|-------------|
102
+ | HellaSwag | 55–62% | Qwen3-1.7B: ~58% |
103
+ | ARC-Easy | 65–72% | Qwen3-1.7B: ~68% |
104
+ | PIQA | 72–78% | Qwen3-1.7B: ~75% |
105
+ | WinoGrande | 58–64% | Qwen3-1.7B: ~60% |
106
+ | MMLU (5-shot) | 28–35% | Qwen3-1.7B: ~32% |
107
+
108
+ BOREAL-2B targets parity with Qwen3-1.7B while supporting 4x the native context
109
+ length and using roughly half the inference memory at long context.
110
+
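+ The memory claim can be sanity-checked with back-of-envelope KV-cache
+ arithmetic from the architecture table. The baseline below is a same-shape
+ model with full attention in all 32 layers, which is an assumption, not the
+ Qwen3-1.7B comparison the card actually makes; DeltaNet layers add only a
+ fixed-size state on top of the hybrid figure.
+
+ ```python
+ # KV-cache bytes at 64K context in BF16, using the GQA numbers from the table.
+ kv_heads, head_dim, bytes_per = 4, 256, 2
+ per_token_per_layer = 2 * kv_heads * head_dim * bytes_per   # K and V
+ tokens = 65_536
+
+ hybrid = per_token_per_layer * 8 * tokens      # only 8 layers keep a KV cache
+ baseline = per_token_per_layer * 32 * tokens   # hypothetical all-attention model
+ print(f"{hybrid / 2**30:.1f} GiB vs {baseline / 2**30:.1f} GiB")   # ~2.0 vs ~8.0
+ ```
+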
111
+ ## The BOREAL Family
112
+
113
+ Every model trained in Canada. Every weight learned from random init.
114
+
115
+ | Model | Params | Type | Context | Status |
116
+ |-------|--------|------|---------|--------|
117
+ | **[BOREAL-250M](https://huggingface.co/GestaltLabs/BOREAL-250M)** | 250M | Dense DeltaNet | 32K | Seeking compute |
118
+ | **BOREAL-2B** | 2B | Dense DeltaNet | 64K | Seeking compute |
119
+ | **[BOREAL-10B-MoE](https://huggingface.co/GestaltLabs/BOREAL-10B-MoE)** | ~10B / ~2B active | DeltaNet + MoE | 256K | Cluster required |
120
+
121
+ ## License
122
+
123
+ Apache 2.0. Built for Canadian researchers, startups, and institutions.
124
+ No strings. No API keys. No foreign dependency.
125
+
126
+ ## Author
127
+
128
+ Built by [DJLougen](https://huggingface.co/DJLougen) / [GestaltLabs](https://huggingface.co/GestaltLabs).
129
+ University of Toronto. Toronto, Canada.
130
+
131
+ [β˜• Support sovereign AI on Ko-fi](https://ko-fi.com/djlougen)