Outlier-40B
⚠️ Legacy model — superseded by Outlier-40B-V3.2
This is an early Outlier release. It is kept publicly available for historical reference and reproducibility. New users should prefer the V3.2 model linked above.
What this was
Outlier-40B was an early Outlier release based on Qwen2.5-14B-Instruct, with 14.8B total parameters (a legacy dense model; the "40B" label predates the current naming scheme). It has been superseded by the V3.2 architecture, which brings significant improvements in training methodology and runtime efficiency.
What's new in V3.2
- Zero-delta expert initialization (faster convergence)
- CAKLD distillation training
- Three-tier paged runtime
- Cross-layer expert prefetch
- Alpha-only TTT for personalization
See Outlier-40B-V3.2 for the latest.
Architecture
Outlier uses a shared expert + ternary delta expert architecture:
- Shared expert: The full base model serves as a shared dense expert
- Ternary delta experts: Additional experts stored at 1.58 bits/weight using ternary quantization ({-1, 0, +1})
- Dense-Sparse-Dense (DSD) layer pattern: Alternating dense and sparse layers for efficient compute
- Zero-delta initialization: Experts initialized to zero so training begins from the base model
- Top-2 routing: Each token activates the shared expert plus the top-2 ternary delta experts
- Three-tier paged runtime: GPU → CPU → disk paging for consumer hardware deployment
- Cross-layer expert prefetch: Prefetches next-layer experts during current-layer compute
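The shared-expert-plus-delta design above can be sketched in a few lines. This is a minimal illustration, not the model's actual implementation: the module name, dimensions, threshold, and the straight-through-style ternarizer are all assumptions made for the example. It shows the key property of zero-delta initialization: because every delta expert starts at zero, the layer's output is exactly the base (shared) expert's output at step 0, and training begins from the base model.

```python
import torch
import torch.nn as nn

class TernaryDeltaMoE(nn.Module):
    """Hypothetical sketch of a shared + ternary delta expert layer.

    Names and shapes are illustrative assumptions, not the released code.
    """

    def __init__(self, d_model=64, d_ff=128, num_experts=4, top_k=2):
        super().__init__()
        # Shared expert: stands in for the base model's dense FFN.
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.router = nn.Linear(d_model, num_experts)
        # Zero-delta initialization: experts start as exact zeros.
        self.delta_up = nn.Parameter(torch.zeros(num_experts, d_model, d_ff))
        self.delta_down = nn.Parameter(torch.zeros(num_experts, d_ff, d_model))
        self.top_k = top_k

    @staticmethod
    def ternarize(w, threshold=0.05):
        # Map weights to {-1, 0, +1} times a per-tensor scale (~1.58 bits/weight).
        scale = w.abs().mean().clamp(min=1e-8)
        return torch.sign(w) * (w.abs() > threshold * scale).float() * scale

    def forward(self, x):  # x: [tokens, d_model]
        out = self.shared(x)  # shared dense expert is always active
        gates = torch.softmax(self.router(x), dim=-1)
        topv, topi = gates.topk(self.top_k, dim=-1)  # top-2 delta experts per token
        for k in range(self.top_k):
            idx, g = topi[:, k], topv[:, k:k + 1]
            up = self.ternarize(self.delta_up[idx])      # [tokens, d_model, d_ff]
            down = self.ternarize(self.delta_down[idx])  # [tokens, d_ff, d_model]
            h = torch.einsum("td,tdf->tf", x, up).relu()
            out = out + g * torch.einsum("tf,tfd->td", h, down)
        return out
```

With the deltas at zero, `ternarize` returns zeros, so the routed contribution vanishes and the layer reproduces the shared expert exactly; the deltas only add capacity as training moves them away from zero.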
License
Apache 2.0. The base model (Qwen2.5-14B-Instruct) was created by Alibaba Cloud and is used under its original license terms.
Built by
Matt Kerr · Kerr & Company LLC · Grand Rapids, MI