Outlier-150B-V3.2

Ternary MoE overlay on Qwen/Qwen2.5-72B-Instruct. 150B total parameters, 70B active per forward pass.

Status: [VERIFIED] Β· Role: research / server-scale Β· Shipping Mac tier? No β€” use Lite / Compact / Max in the Outlier desktop app for local inference on your Mac.

What it is

Outlier is an Apple-of-local-AI platform. The shipping desktop app runs curated Qwen tiers today; on the research side we train ternary mixture-of-experts overlays on top of frozen Qwen bases to push MMLU-per-GB at larger scales. This repo holds the research checkpoint β€” the one used for the numbers below.

Three feature bullets:

  • Overlay on a frozen Qwen/Qwen2.5-72B-Instruct backbone β€” shared full-precision path acts as the quality anchor; ternary experts {βˆ’1, 0, +1} specialize by domain with per-row fp16 scales
  • Alpha-fix refinement — 70B-active MoE routing with learned per-expert scalar gates (a 15 KB overlay that recovered the V4 regressions and added +1.61pp at 70B)
  • Apache 2.0 β€” weights, code, and distributed runtimes are all Apache 2.0 throughout the chain
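The overlay composition described above can be sketched in a few lines. This is an illustrative NumPy model only, not the actual weight-loading code: a frozen full-precision base matrix, a ternary delta whose entries are {−1, 0, +1}, and a learned per-row fp16 scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen full-precision base weight (the shared quality anchor).
W_base = rng.standard_normal((4, 8)).astype(np.float32)

# Ternary expert delta: every entry is -1, 0, or +1.
T = rng.integers(-1, 2, size=(4, 8)).astype(np.float32)

# Learned per-row scale, stored in fp16.
scales = np.abs(rng.standard_normal((4, 1))).astype(np.float16)

# Effective weight for a routed expert: base plus scaled ternary delta.
W_eff = W_base + scales.astype(np.float32) * T

print(W_eff.shape)  # (4, 8)
```

Because the delta is ternary plus one scale per row, each expert costs roughly 1.6 bits per parameter on top of the shared base, which is what makes many specialized experts affordable.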

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Outlier-Ai/Outlier-150B-V3.2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",  # shard across available devices; 150B will not fit on a single GPU
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tok.decode(out[0], skip_special_tokens=True))

For Apple Silicon local inference, use the shipping MLX tiers (Lite / Compact / Max) in the Outlier desktop app instead.

Benchmarks (MMLU)

| Metric | Value | n | Stderr | Harness | Date | Status |
|---|---|---|---|---|---|---|
| MMLU (this model) | 84.46% | 14,042 | ~0.29% | v0.4.9.1 | 2026-04-13 | [VERIFIED] |

MMLU vs. base Qwen (honest comparison):

| Outlier | Base Qwen | Delta |
|---|---|---|
| 84.46% (150B V3.2) | 86.1% (Qwen 2.5 72B) | −1.64pp |

Read: Outlier MoE overlays underperform or tie base Qwen on raw MMLU. The product thesis is MMLU per GB of RAM, not raw MMLU β€” see GROUND_TRUTH v12 Β§2.6.
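The MMLU-per-GB framing can be made concrete with a toy calculation. The RAM budgets below are hypothetical placeholders chosen for illustration, not measured footprints for either model:

```python
def mmlu_per_gb(mmlu_pct: float, ram_gb: float) -> float:
    """Quality-per-footprint figure of merit: MMLU points per GB of RAM."""
    return mmlu_pct / ram_gb

# Hypothetical RAM budgets, for illustration only.
base_qwen = mmlu_per_gb(86.1, 145.0)  # e.g. a dense 72B at full precision
overlay = mmlu_per_gb(84.46, 40.0)    # e.g. a ternary overlay under paging

print(f"{base_qwen:.2f} vs {overlay:.2f} MMLU points/GB")
```

Under this metric a small raw-MMLU regression can still be a large win if the memory footprint shrinks by more; that is the trade the overlay is designed to make.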

Provenance labels (Rule 66):

  • [VERIFIED] β€” full source JSON, config.limit=None, complete n-samples, model_args present, reproducible from commit SHA
  • [SUPERSEDED YYYY-MM-DD] β€” replaced by newer measurement; retained for audit
  • [INCOMPLETE] β€” number exists on disk but provenance fields are stripped
  • [CLAIM] β€” reported but not independently confirmed

Secondary benchmarks (cloud, Day 13 [VERIFIED])

| Model | HellaSwag | ARC-C | ARC-E | Winogrande | TruthfulQA |
|---|---|---|---|---|---|
| 150B V3.2 | 77.0% | 68.5% | 90.0% | 85.5% | 69.19% |

Harness: lm-evaluation-harness v0.4.9.1, n=14,042. Source: ~/v4_cloud_sprint_day13/sprint003_artifacts/results/.

Architecture

  • Base backbone: Qwen/Qwen2.5-72B-Instruct β€” frozen during distillation
  • Overlay: ternary delta experts ({βˆ’1, 0, +1} + per-row fp16 scale), top-k routing
  • Experts per layer: 8 routed + 1 shared, top-k = 2
  • Context: inherits Qwen/Qwen2.5's 32,768 tokens
  • Total / active params: 150B / 70B

Ternary arithmetic reduces a matmul to a stream of additions and subtractions β€” no multiplications β€” which is what makes overlays at this scale feasible to run outside a datacenter (once the ssd_stream engine is wired; that's a v1.5+ sprint, tracked in the registry).
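The multiply-free property can be seen directly in a naive scalar sketch: with ternary weights, each dot product collapses to additions (+1), subtractions (−1), and skips (0), with a single multiply per output row to apply the scale.

```python
def ternary_matvec(T, x, scales):
    """Naive ternary matrix-vector product: no multiplies in the inner loop."""
    out = []
    for row, s in zip(T, scales):
        acc = 0.0
        for t, xi in zip(row, x):
            if t == 1:
                acc += xi      # +1: add
            elif t == -1:
                acc -= xi      # -1: subtract
            # 0: skip entirely -- zeros cost nothing
        out.append(s * acc)    # one multiply per row, for the per-row scale
    return out

T = [[1, 0, -1], [-1, 1, 1]]
x = [2.0, 3.0, 4.0]
print(ternary_matvec(T, x, [0.5, 2.0]))  # [-1.0, 10.0]
```

A real kernel would vectorize this, but the arithmetic structure is the same: add/subtract streams plus one scale per row.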

What we are not claiming

  • We do NOT match frontier cloud models (GPT-5, Claude Opus 4.7, Gemini 3 Pro) on pure MMLU
  • We are NOT beating base Qwen on raw MMLU at most scales (we tie 70B, regress 10B / 40B / 150B)
  • We are NOT currently shipping MoE tiers in the desktop app β€” the app ships curated Qwen (Nano / Lite / Compact / Max). MoE tiers ship when the ssd_stream engine is wired + the scale beats Qwen at the same RAM budget
  • We are NOT claiming "Outlier-branded MMLU" on raw Qwen shipping tiers β€” the numbers above apply to THIS overlay checkpoint only (Rule 138)

Known limitations

  • Overlay weights load via transformers with trust_remote_code=True β€” outlier-engine is the reference runtime (separate package)
  • Chat template: inherits Qwen/Qwen2.5-72B-Instruct (no custom tokenizer)
  • English-tuned. Multilingual behavior inherits the base model; not separately optimized
  • Server-scale MoE path requires ssd_stream paging to fit on consumer RAM β€” NOT yet wired ([EXISTS BUT UNWIRED] per GROUND_TRUTH v12 Β§11)
  • Expert paging + AWQ base quantization land in v1.5 engine sprint

Patents filed

Three U.S. provisional patents filed April 2026 (61 claims total):

  • Ternary MoE weight composition on frozen bases (#64/026,886)
  • Expert paging + memory hierarchy (#64/030,368)
  • Specialist branch-train-mix merging for binary-weight experts (#64/034,028)

Non-provisional deadline: April 3–9, 2027.

Citation

@misc{outlier2026,
  author       = {Kerr, Matt},
  title        = {Outlier: A Local AI Platform for Apple Silicon},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Outlier-Ai}}
}

License

Apache 2.0 throughout β€” base weights (Qwen team), overlay weights, and distributed runtime. See LICENSE.

Attribution

Base model by the Qwen team at Alibaba, released under Apache 2.0. Outlier adds MLX / GGUF / quantization work on top and distributes under the same license. All credit for capability belongs to the upstream Qwen team β€” we make it fast and easy to run on Mac.
