Outlier-10B-V3.2

Ternary mixture-of-experts overlay on Qwen/Qwen2.5-7B-Instruct. 23B total effective parameters, 10B active per forward pass.

TL;DR

  • Architecture: Outlier ternary MoE overlay on frozen Qwen 2.5 7B base
  • Parameters: 23B total, 10B active per forward (sparse routing)
  • MMLU: ~76% — [CLAIM]
  • License: Apache 2.0

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Outlier-Ai/Outlier-10B-V3.2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, trust_remote_code=True, torch_dtype="auto"
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tok.decode(out[0], skip_special_tokens=True))

For consumer Apple Silicon inference, use the MLX or GGUF tiers instead of loading the full-precision checkpoint.

Benchmarks

Metric   Value   Provenance
MMLU     ~76%    [CLAIM]: historical smoke test only (limit=570; 12,498 of 14,042 samples evaluated). Full-sample re-verification pending.

Rule 66 provenance labels:

  • [VERIFIED] — full source JSON with config.limit=None, n-samples complete, model_args present, reproducible from commit SHA.
  • [INCOMPLETE] — number exists on disk but provenance fields are stripped; cannot be cited publicly.
  • [CLAIM] — historical smoke-test value pending full re-verification on cluster.
  • [PENDING] — benchmark scheduled; results expected by a specific date.
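The label rules above can be written down as a small decision function. This is an illustrative sketch only: the field names (`config_limit`, `n_samples`, `model_args`, `commit_sha`, `status`) are assumed stand-ins for whatever the eval-result JSON actually contains, not a documented schema.

```python
def provenance_label(result: dict, expected_n: int) -> str:
    """Classify one benchmark record into a Rule 66 provenance label.

    Field names are hypothetical; adapt to the real result schema.
    """
    # Scheduled but not yet run.
    if result.get("status") == "scheduled":
        return "[PENDING]"
    # Full provenance: no sample limit, complete n, model_args and commit SHA.
    if (
        result.get("config_limit") is None
        and result.get("n_samples") == expected_n
        and result.get("model_args")
        and result.get("commit_sha")
    ):
        return "[VERIFIED]"
    # A number exists on disk but provenance fields are stripped.
    if result.get("value") is not None and not result.get("model_args"):
        return "[INCOMPLETE]"
    # Anything else: historical smoke-test value awaiting re-verification.
    return "[CLAIM]"
```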

Notes

Known issue: config.json references Qwen2MoEForCausalLM in auto_map but modeling_outlier_moe.py defines OutlierMoEForCausalLM. Load with trust_remote_code=True or use Outlier-70B-V3.2 for production.
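Until a fixed config ships, the mismatch can also be worked around by editing the downloaded config.json so that auto_map points at the class the modeling file actually defines. A minimal sketch, assuming the affected target string is the one named above (inspect your local config.json first; the exact auto_map keys may differ):

```python
import json

def patch_auto_map(config_path: str) -> dict:
    """Rewrite stale auto_map targets in a downloaded config.json.

    One-time local workaround; assumes the file has already been
    fetched (e.g. via huggingface_hub snapshot_download).
    """
    with open(config_path) as f:
        config = json.load(f)
    auto_map = config.get("auto_map", {})
    for key, target in auto_map.items():
        # Point the stale class name at the class modeling_outlier_moe.py defines.
        if target.endswith("Qwen2MoEForCausalLM"):
            auto_map[key] = target.replace(
                "Qwen2MoEForCausalLM", "OutlierMoEForCausalLM"
            )
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return config
```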

Architecture

  • Base backbone: Qwen/Qwen2.5-7B-Instruct (frozen during distillation)
  • MoE overlay: ternary delta experts ({-1, 0, +1} + per-row fp16 scale) with top-K routing
  • Expert layers: varies by variant
  • Experts per layer: 8 routed + 1 shared
  • Top-k routing: 2
  • Context: inherits Qwen 2.5's 32,768 tokens
  • Expert paging: three-tier memory (SRAM / DRAM / NVMe) on 70B+

Ternary-weight arithmetic ({-1, 0, +1}) reduces a matmul to a stream of additions and subtractions — no multiplications — enabling consumer hardware to run flagship-scale models at usable speeds.
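A small NumPy illustration of that claim: for a ternary row, the dot product reduces to "sum the inputs where the weight is +1, subtract where it is -1", followed by the single per-row scale. Shapes and values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(8).astype(np.float32)            # input activations
w = rng.integers(-1, 2, size=(4, 8)).astype(np.int8)     # ternary weight rows
scale = rng.standard_normal(4).astype(np.float32)        # per-row fp16-style scale

# Reference path: ordinary dense matmul.
dense = scale * (w.astype(np.float32) @ x)

# Multiplication-free path for the inner loop: add where w = +1,
# subtract where w = -1, skip where w = 0, then apply the row scale.
mul_free = np.array(
    [x[row == 1].sum() - x[row == -1].sum() for row in w],
    dtype=np.float32,
) * scale

assert np.allclose(dense, mul_free)
```

The inner loop needs no multiplications; the one remaining multiply per output element is the row scale.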

Patents filed

Three provisional patents filed April 2026 (61 claims total) covering ternary MoE weight composition, expert paging, and specialist merging techniques.

Known limitations

  • Calibration and full-sample MMLU re-verification are queued for cluster time; numbers labeled [CLAIM] above are historical smoke-test values awaiting verification.
  • Outlier's ternary MoE overlay is research-grade; use the consumer tier (Nano / Lite / Compact / Max) for production local inference.
  • Qwen 2.5 tokenizer + chat template apply; no custom tokenizer.
  • English-tuned. Multilingual performance inherits the base model and is not separately optimized.

Citation

@misc{outlier2026,
  author       = {Kerr, Matt},
  title        = {Outlier: Ternary Mixture-of-Experts for Consumer Hardware},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Outlier-Ai}}
}
