---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- boreal
- deltanet
- hybrid
- moe
- mixture-of-experts
- linear-attention
- swiglu
- rmsnorm
- rope
- gqa
- deepseek-routing
- hash-routing
- pretraining
- canada
pipeline_tag: text-generation
base_model: DJLougen/BOREAL-10B-MoE
---
![BOREAL](https://huggingface.co/GestaltLabs/BOREAL-10B-MoE/resolve/main/Boreal.png)
# BOREAL-10B-MoE
**B**alanced **O**rthogonal **R**ecurrent **E**xpert **A**ttention **L**ayers
The target: a ~10-billion-parameter Mixture-of-Experts hybrid language model
with ~2 billion active parameters per token, trained from scratch on 15–20
trillion tokens using Token Superposition Training (TST).
BOREAL-10B-MoE combines the Gated DeltaNet architecture validated by Qwen3.5/3.6
with DeepSeek-V4's hash-based expert routing. The result: a model that punches
at Qwen3.5-9B levels with ~2B active parameters, 256K native context, and
inference throughput competitive with models 4–5x its active size.
## Architecture
| Component | Detail |
|-----------|--------|
| **Type** | Hybrid MoE: Gated DeltaNet + GQA + sparse experts |
| **Total parameters** | ~10B |
| **Active parameters** | ~2B per token |
| **Hidden size** | 2,560 |
| **Layers** | 40 total: 10 full GQA attention + 30 Gated DeltaNet (10 of the 40 use MoE FFN, 30 use dense FFN) |
| **Full attention** | GQA: 20 query heads, 4 KV heads, head_dim=256 |
| **DeltaNet** | Gated linear attention: 8 QK heads, 32 V heads, head_dim=128 |
| **Conv kernel** | 4 |
| **Routed experts** | 128 total, 8 active per token |
| **Expert FFN** | SwiGLU, intermediate=768 per expert |
| **Shared expert** | 1 always-active expert, intermediate=1,536 |
| **Expert routing** | Sigmoid scoring + noaux_tc (zero auxiliary loss) |
| **Dense FFN** | SwiGLU, intermediate=7,680 (non-MoE layers) |
| **Norm** | RMSNorm, eps=1e-6 |
| **Position** | RoPE, theta=10M, partial_rotary_factor=0.25 |
| **Output gate** | Swish-gated attention and DeltaNet outputs |
| **Vocab** | 151,936 (Qwen3 tokenizer) |
| **Context** | 262,144 tokens native (256K) |
| **MTP** | 1 multi-token prediction head |
### Layer Layout
```
Layer 0: Full GQA attention + dense FFN ← first layer always full attention
Layer 1: Gated DeltaNet + dense FFN
Layer 2: Gated DeltaNet + dense FFN
Layer 3: Gated DeltaNet + dense FFN
Layer 4: Full GQA attention + dense FFN ← every 4th layer
Layer 5: Gated DeltaNet + MoE FFN (128 experts)
Layer 6: Gated DeltaNet + dense FFN
Layer 7: Gated DeltaNet + MoE FFN (128 experts)
Layer 8: Full GQA attention + dense FFN
Layer 9: Gated DeltaNet + MoE FFN (128 experts)
...
Layer 36: Full GQA attention + dense FFN
Layer 37: Gated DeltaNet + MoE FFN (128 experts)
Layer 38: Gated DeltaNet + dense FFN
Layer 39: Gated DeltaNet + dense FFN ← last layer dense
```
MoE layers are interleaved after an initial run of dense layers (layers 0–4 in
the layout above), creating a gradient-rich environment where expert
specialization can emerge naturally alongside the DeltaNet's recurrent state
accumulation.
### Expert Routing: DeepSeek-V4 Style
**Sigmoid scoring.** Unlike softmax routing, where experts compete for a single
shared probability mass, sigmoid scoring gives each expert an independent
affinity score, so multiple experts can activate without suppressing one another.
Each expert effectively decides on its own whether it can help with the current token.
**noaux_tc.** No auxiliary loss for load balancing. Instead, each expert carries a
bias term on its routing score that is adjusted during training to keep the load
balanced across experts. This avoids the quality degradation that auxiliary
load-balancing losses impose on the main language modeling objective.
**Fine-grained experts.** 128 experts with small FFN dims (768) rather than
fewer large experts. More experts means more specialization paths. At 8 active
per token, the model blends 6.25% of expert capacity per forward pass.
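
A minimal PyTorch sketch of this routing scheme, assuming a plain linear scorer. The bias update below is a generic "boost underused experts" rule standing in for noaux_tc, whose exact update rule is not spelled out in this card:

```python
import torch
import torch.nn as nn

class SigmoidTopKRouter(nn.Module):
    """Sigmoid-scored top-k routing with a bias-based, auxiliary-loss-free
    balancer. A sketch of the idea described above, not the exact noaux_tc rule."""

    def __init__(self, hidden_size=2560, num_experts=128, top_k=8, bias_rate=1e-3):
        super().__init__()
        self.top_k = top_k
        self.bias_rate = bias_rate
        self.scorer = nn.Linear(hidden_size, num_experts, bias=False)
        # Per-expert bias used only to pick experts; updated outside backprop.
        self.register_buffer("load_bias", torch.zeros(num_experts))

    def forward(self, hidden):                               # (tokens, hidden)
        scores = torch.sigmoid(self.scorer(hidden).float())  # router math in FP32
        # The bias shifts which experts get *selected*; the combine weights
        # still come from the unbiased scores.
        _, expert_idx = torch.topk(scores + self.load_bias, self.top_k, dim=-1)
        weights = torch.gather(scores, -1, expert_idx)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        if self.training:
            # Nudge the bias up for underused experts and down for overused
            # ones, based on how often each expert was picked in this batch.
            load = torch.zeros_like(self.load_bias)
            load.scatter_add_(0, expert_idx.reshape(-1),
                              torch.ones_like(expert_idx, dtype=load.dtype).reshape(-1))
            self.load_bias += self.bias_rate * (load.mean() - load)
        return expert_idx, weights                           # both (tokens, top_k)
```
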
**Shared expert.** One expert is always active with 2x the FFN capacity of
routed experts (1,536 dim). This acts as a dense fallback: knowledge that
every token needs regardless of routing decisions. Proven effective by both
DeepSeek-V3 and Nemotron 3.
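
Continuing that sketch, this is roughly how an MoE FFN layer could combine the 128 routed experts with the always-on shared expert. Expert sizes follow the table above; `SigmoidTopKRouter` is the routing sketch from the previous section, and the per-expert loop is written for clarity rather than speed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, hidden_size, intermediate):
        super().__init__()
        self.gate = nn.Linear(hidden_size, intermediate, bias=False)
        self.up = nn.Linear(hidden_size, intermediate, bias=False)
        self.down = nn.Linear(intermediate, hidden_size, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class BorealMoEFFNSketch(nn.Module):
    """128 small routed experts (intermediate=768) plus one always-active
    shared expert with twice the capacity (intermediate=1536)."""

    def __init__(self, hidden_size=2560, num_experts=128, top_k=8):
        super().__init__()
        self.router = SigmoidTopKRouter(hidden_size, num_experts, top_k)
        self.experts = nn.ModuleList(
            [SwiGLU(hidden_size, 768) for _ in range(num_experts)])
        self.shared_expert = SwiGLU(hidden_size, 1536)

    def forward(self, hidden):                          # (tokens, hidden)
        expert_idx, weights = self.router(hidden)       # each (tokens, top_k)
        routed = torch.zeros_like(hidden)
        # Naive per-expert dispatch for readability; real kernels batch this.
        for e in expert_idx.unique().tolist():
            tok, slot = (expert_idx == e).nonzero(as_tuple=True)
            routed[tok] += weights[tok, slot].unsqueeze(-1) * self.experts[e](hidden[tok])
        # The shared expert sees every token and is simply added on top.
        return self.shared_expert(hidden) + routed
```
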
**Hash routing (planned).** DeepSeek-V4-Pro introduces hash-based candidate
selection with Sinkhorn balancing (num_hash_layers=3, hc_mult=4,
hc_sinkhorn_iters=20). Instead of scoring all 128 experts for every token, a
learned hash function narrows candidates before the final top-k selection. This
is the planned routing upgrade for BOREAL-10B-MoE, reducing routing overhead
from O(E) to O(log E).
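
Since that mechanism is only planned and its details are not public, the following is a conceptual sketch of hash-narrowed candidate selection rather than DeepSeek-V4's actual design: a few learned projections each pick a bucket of experts, the union of those buckets becomes the candidate set, and sigmoid top-k runs over candidates only. The bucket assignment, the hard argmax, and the omission of Sinkhorn balancing are all simplifying assumptions.

```python
import torch
import torch.nn as nn

class HashNarrowedRouterSketch(nn.Module):
    """Conceptual sketch only: candidate narrowing before top-k selection.
    num_hash_layers / hc_mult follow the names in the text; everything else
    is an assumption."""

    def __init__(self, hidden_size=2560, num_experts=128, top_k=8,
                 num_hash_layers=3, hc_mult=4):
        super().__init__()
        self.top_k = top_k
        bucket_size = hc_mult * top_k                 # 32 experts per bucket
        num_buckets = num_experts // bucket_size      # 4 buckets of 32
        self.hashes = nn.ModuleList(
            [nn.Linear(hidden_size, num_buckets, bias=False)
             for _ in range(num_hash_layers)])
        # Static expert-to-bucket map: bucket b owns a contiguous slice.
        self.register_buffer(
            "bucket_experts",
            torch.arange(num_experts).reshape(num_buckets, bucket_size))
        self.scorer = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, hidden):                        # (tokens, hidden)
        # For clarity this still scores every expert; the point of the real
        # mechanism is to score only the narrowed candidate set.
        scores = torch.sigmoid(self.scorer(hidden))   # (tokens, num_experts)
        candidates = torch.zeros_like(scores, dtype=torch.bool)
        for hash_proj in self.hashes:
            # Hard bucket choice per hash layer; training this selection would
            # need a straight-through trick or auxiliary objective (omitted).
            buckets = hash_proj(hidden).argmax(dim=-1)                 # (tokens,)
            candidates.scatter_(1, self.bucket_experts[buckets], True)
        scores = scores.masked_fill(~candidates, float("-inf"))
        weights, expert_idx = torch.topk(scores, self.top_k, dim=-1)
        return expert_idx, weights / weights.sum(dim=-1, keepdim=True)
```
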
## Training
| Parameter | Value |
|-----------|-------|
| **Data tokens** | 15–20 trillion |
| **Corpus** | Web text (50%), code (20%), STEM/academic (15%), multilingual (15%) |
| **TST** | Token Superposition Training, s=4, r=0.5 |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Peak LR** | 4.5e-4 (MoE requires higher LR than dense) |
| **Schedule** | Cosine decay to 10% of peak |
| **Weight decay** | 0.1 |
| **Batch size** | ~4M tokens/step |
| **Precision** | BF16 weights, FP32 DeltaNet states, FP32 router logits |
| **Hardware** | 256–512 H100/H200 GPUs (target) |
### Training Phases
```
Phase 1 (TST, 7.5T tokens): Superposition mode, s=4 bags
    Multi-hot CE on all DeltaNet and full-attn layers
    MoE layers active with standard routing
Phase 2 (Recovery, 7.5T tokens): Standard autoregressive NTP
    Model recovers token-level precision
    Expert specialization deepens
Phase 3 (Context Extension): Progressive 32K → 128K → 256K (~500B tokens)
    YaRN RoPE scaling
    Midtraining on long-document data
Phase 4 (Annealing, ~500B): High-quality data upsample
    Decaying LR to 5e-6
    Final quality polish
```
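
The superposition objective itself, sketched under stated assumptions: the card describes multi-hot cross-entropy over bags of s=4 tokens, so this toy snippet builds bag targets and the matching loss. The exact TST recipe (including the role of r=0.5) lives in the cited paper and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def superposition_targets(token_ids, s=4, vocab_size=151_936):
    """Multi-hot 'bag' targets: for each position, the set of the next s tokens.
    A toy sketch of the idea described above, not the paper's exact recipe."""
    n = token_ids.shape[0]
    targets = torch.zeros(n - s, vocab_size)
    rows = torch.arange(n - s)
    for offset in range(1, s + 1):
        targets[rows, token_ids[offset:offset + n - s]] = 1.0
    # Normalize each row into a distribution over its bag members.
    return targets / targets.sum(dim=-1, keepdim=True)

def multi_hot_ce(logits, targets):
    """Cross-entropy of model logits against the soft multi-hot targets."""
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# A 10-token sequence gives 6 training positions, each predicting a bag of
# the following 4 tokens instead of a single next token.
tokens = torch.randint(0, 151_936, (10,))
bags = superposition_targets(tokens)      # shape (6, 151936)
loss = multi_hot_ce(torch.randn(6, 151_936), bags)
```
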
## Expected Performance
| Benchmark | Target | Comparison |
|-----------|--------|-------------|
| MMLU-Pro | 45–50% | Qwen3-8B: ~42% |
| HellaSwag | 78–84% | Qwen3-8B: ~79% |
| ARC-Challenge | 65–72% | Qwen3-8B: ~66% |
| GPQA Diamond | 35–40% | Qwen3-8B: ~34% |
| HumanEval (coding) | 55–65% | Qwen3-8B: ~58% |
| MATH (reasoning) | 35–45% | Qwen3-8B: ~38% |
Target: match or exceed Qwen3-8B on core benchmarks while using 4x fewer
active parameters and supporting 256K native context. The MoE architecture
extracts more quality per active parameter, and TST extracts more signal
per training token.
### Inference Efficiency
```
At 256K context, batch=1:
Qwen3-8B (pure Transformer, 8B active):
  KV cache = 2 × 36 layers × 8 KV heads × 128 dim × 256K × 2 bytes ≈ 37 GB
BOREAL-10B-MoE (DeltaNet hybrid, ~2B active):
  Full attn KV = 2 × 10 layers × 4 KV heads × 256 dim × 256K × 2 bytes = 10 GB
  DeltaNet states = 30 layers × 2 × 32 V heads × 128² dims × 4 bytes ≈ 126 MB
  Total KV cache ≈ 10 GB
Result: 3.7x smaller KV cache at 256K context.
~4x fewer FLOPs per generated token.
```
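
The arithmetic above, as a small script so it can be re-checked or re-run at other context lengths (GiB vs GB rounding accounts for the small differences from the figures quoted in the block):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per=2):
    # K and V caches: one vector per layer, per KV head, per position.
    return 2 * layers * kv_heads * head_dim * context * bytes_per

ctx = 262_144  # 256K tokens

qwen3_8b   = kv_cache_bytes(layers=36, kv_heads=8, head_dim=128, context=ctx)
boreal_gqa = kv_cache_bytes(layers=10, kv_heads=4, head_dim=256, context=ctx)
# DeltaNet layers keep a fixed-size recurrent state instead of a growing
# KV cache (FP32 states, hence 4 bytes; formula as quoted above).
boreal_delta = 30 * 2 * 32 * 128 * 128 * 4

print(f"Qwen3-8B KV cache:   {qwen3_8b / 2**30:.1f} GiB")     # 36.0 GiB
print(f"BOREAL full-attn KV: {boreal_gqa / 2**30:.1f} GiB")    # 10.0 GiB
print(f"BOREAL DeltaNet:     {boreal_delta / 2**20:.0f} MiB")  # 120 MiB
```

The DeltaNet state stays constant as context grows, while both KV caches scale linearly with it.
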
## The BOREAL Family
| Model | Params | Type | Context | Status |
|-------|--------|------|---------|--------|
| **[BOREAL-250M](https://huggingface.co/DJLougen/BOREAL-250M)** | 250M | Dense DeltaNet | 32K | Architecture validation |
| **[BOREAL-2B](https://huggingface.co/DJLougen/BOREAL-2B)** | 2B | Dense DeltaNet | 64K | Community release |
| **BOREAL-10B-MoE** | ~10B / ~2B active | DeltaNet + MoE | 256K | Target model |
## How It Was Built
### Architecture Decisions
```
DeltaNet hybrid: Validated by Qwen3.5/3.6 (May/Nov 2025)
3:1 linear ratio: Qwen3.5 proved this ratio for <10B models
head_dim=256: Qwen3.5 and DeepSeek-V4 both moved to larger head dims
partial_rotary=0.25: 75% of each head is position-free, DeltaNet pathway
Swish output gates: Qwen3.6 addition, prevents attention blowup
noaux_tc routing: DeepSeek-V3 proved auxiliary-loss-free load balancing
Fine-grained MoE: 128 small experts > fewer large experts
Shared expert: 1 always-active 2x-capacity expert = dense fallback
TST training: Nous Research (arXiv:2605.06546), 1.5–2.5x speedup
```
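
One of the less common choices in that list is partial_rotary=0.25. The sketch below shows what it means mechanically, using generic rotary-embedding code rather than the model's own implementation: only the first quarter of each attention head's dimensions is rotated, and the rest is left untouched.

```python
import torch

def apply_partial_rope(q, positions, head_dim=256, partial_rotary_factor=0.25,
                       theta=10_000_000.0):
    """Rotate only the first `partial_rotary_factor` fraction of each head.
    q: (..., seq, head_dim); positions: (seq,). Generic sketch, not model code."""
    rot_dim = int(head_dim * partial_rotary_factor)            # 64 of 256 dims
    inv_freq = 1.0 / theta ** (torch.arange(0, rot_dim, 2).float() / rot_dim)
    angles = positions.float()[:, None] * inv_freq[None, :]    # (seq, rot_dim/2)
    cos, sin = angles.cos(), angles.sin()

    q_rot, q_pass = q[..., :rot_dim], q[..., rot_dim:]         # rotary / pass-through
    q1, q2 = q_rot[..., 0::2], q_rot[..., 1::2]                # interleaved pairs
    rotated = torch.stack((q1 * cos - q2 * sin,
                           q1 * sin + q2 * cos), dim=-1).flatten(-2)
    # The remaining dimensions carry no positional signal at all.
    return torch.cat((rotated, q_pass), dim=-1)

q = torch.randn(1, 20, 1024, 256)      # (batch, query heads, seq, head_dim)
q_rope = apply_partial_rope(q, torch.arange(1024))
```

With head_dim=256 and a factor of 0.25, 64 dimensions per head get rotated and 192 stay position-free, which is the "75% of each head is position-free" note above.
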
### Data Philosophy
BOREAL follows a data-quality-first approach. Pretraining data is the
differentiator; everyone uses the same architectures now. Key principles:
- **No LLM-generated pretraining data.** Only human-authored text in the
base corpus. LLM-generated data is reserved for post-training.
- **Structural curation.** Quality filtering goes beyond perplexity scoring
to measure reasoning depth, self-correction density, and information content.
- **Curriculum annealing.** High-quality data concentrated in the final
annealing phase rather than diluted across the full run.
## Post-Training Pipeline (Planned)
```
SFT:    2–5M high-quality instruction pairs
        Agentic reasoning, code, math, multilingual
GRPO:   Multi-reward RL with GDPO normalization
        Format + tool-use + reasoning depth + self-consistency + diversity
Budget: Thinking budget mechanism (Qwen3 innovation)
        Dynamic compute allocation per query complexity
```
## License
Apache 2.0.
## Author
Developed by [DJLougen](https://huggingface.co/DJLougen).
The BOREAL family is pretrained from scratch: no fine-tuning, no distillation,
no inherited weights. Born in Toronto. Trained with Canadian stubbornness.
[☕ Support on Ko-fi](https://ko-fi.com/djlougen)