Outlier-70B-V3.3

Ternary MoE overlay on Qwen/Qwen2.5-32B-Instruct. 68B total parameters, 32B active per forward pass.

Status: [VERIFIED] · Role: research / server-scale · Shipping Mac tier? No; use Lite / Compact / Max in the Outlier desktop app for local inference on your Mac.

What it is

Outlier is an Apple-of-local-AI platform. The shipping desktop app runs curated Qwen tiers today; on the research side we train ternary mixture-of-experts overlays on top of frozen Qwen bases to push MMLU-per-GB at larger scales. This repo holds the research checkpoint used for the numbers below.

Highlights:

  • Overlay on a frozen Qwen/Qwen2.5-32B-Instruct backbone: the shared full-precision path acts as the quality anchor, while ternary experts {−1, 0, +1} specialize by domain with per-row fp16 scales
  • Alpha-fix refinement: 32B-active MoE routing with learned per-expert scalar gates (a 15 KB overlay that recovered V4 regressions and added +1.61pp on 70B)
  • Apache 2.0: weights, code, and distributed runtimes are Apache 2.0 throughout the chain
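The per-row fp16 scale in the first bullet can be sketched as a simple dequantization step. This is a minimal illustration, not the actual overlay loader; the function and array names are hypothetical:

```python
import numpy as np

def dequantize_ternary(ternary: np.ndarray, row_scale: np.ndarray) -> np.ndarray:
    """Expand a ternary expert matrix {-1, 0, +1} into an effective
    floating-point delta using one learned fp16 scale per output row."""
    assert set(np.unique(ternary)).issubset({-1, 0, 1})
    return row_scale[:, None] * ternary.astype(np.float16)

# Hypothetical 4x3 expert delta; each row carries its own fp16 scale.
t = np.array([[1, 0, -1], [0, 1, 1], [-1, -1, 0], [1, 1, -1]])
s = np.array([0.5, 0.25, 1.0, 0.125], dtype=np.float16)
delta = dequantize_ternary(t, s)
# The frozen backbone weight plus a routed sum of such deltas gives
# the expert's effective weight.
```

Storage stays tiny because each entry needs ~1.6 bits plus one fp16 scale per row, while the frozen backbone supplies the full-precision quality anchor.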

Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Outlier-Ai/Outlier-70B-V3.3"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    trust_remote_code=True,
    torch_dtype="auto",
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tok.decode(out[0], skip_special_tokens=True))
```

For Apple Silicon local inference, use the shipping MLX tiers (Lite / Compact / Max) in the Outlier desktop app instead.

Benchmarks (MMLU)

| Metric | Value | n | Stderr | Harness | Date | Status |
|--------|-------|---|--------|---------|------|--------|
| MMLU (this model) | 83.10% | 14,042 | ~0.30% | v0.4.9.1 | 2026-04-13 | [VERIFIED] |

MMLU vs. base Qwen (honest comparison):

| Outlier | Base Qwen | Delta |
|---------|-----------|-------|
| 83.10% (70B V3.3) | 83.3% (Qwen 2.5 32B) | −0.20pp (tied) |

Read: Outlier MoE overlays underperform or tie base Qwen on raw MMLU. The product thesis is MMLU per GB of RAM, not raw MMLU; see GROUND_TRUTH v12 §2.6.
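That figure of merit is just accuracy divided by memory footprint. A toy calculation; the GB figures below are illustrative placeholders, not measured footprints:

```python
def mmlu_per_gb(mmlu_pct: float, ram_gb: float) -> float:
    """Quality-per-memory figure of merit: MMLU points per GB of RAM."""
    return mmlu_pct / ram_gb

# Illustrative only: ternary expert storage is a small fraction of fp16,
# so a near-tie on raw MMLU in fewer GB still wins on this metric.
dense = mmlu_per_gb(83.3, 64.0)    # hypothetical fp16-class footprint
overlay = mmlu_per_gb(83.1, 24.0)  # hypothetical ternary-overlay footprint
assert overlay > dense
```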

Provenance labels (Rule 66):

  • [VERIFIED]: full source JSON, config.limit=None, complete n-samples, model_args present, reproducible from commit SHA
  • [SUPERSEDED YYYY-MM-DD]: replaced by a newer measurement; retained for audit
  • [INCOMPLETE]: number exists on disk but provenance fields are stripped
  • [CLAIM]: reported but not independently confirmed

Secondary benchmarks (cloud, Day 13 [VERIFIED])

| Model | HellaSwag | ARC-C | ARC-E | Winogrande | TruthfulQA |
|-------|-----------|-------|-------|------------|------------|
| 70B V3.3 | 85.95% | 73.46% | 91.62% | 81.29% | 67.12% |

Harness: lm-evaluation-harness v0.4.9.1, n=14,042. Source: ~/v4_cloud_sprint_day13/sprint003_artifacts/results/.

Architecture

  • Base backbone: Qwen/Qwen2.5-32B-Instruct (frozen during distillation)
  • Overlay: ternary delta experts ({−1, 0, +1} + per-row fp16 scale), top-k routing
  • Experts per layer: 8 routed + 1 shared, top-k = 2
  • Context: inherits Qwen/Qwen2.5's 32,768 tokens
  • Total / active params: 68B / 32B
  • Alpha-fix overlay: 280 per-expert scalar gates, 18 min on one B200, +1.61pp MMLU on 70B (V3.2 81.49% → V3.3 83.10%). 15 KB overlay file.
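A minimal sketch of the routing described above: 8 routed experts with top-k = 2, an always-on shared expert, and a learned per-expert scalar gate in the spirit of the alpha-fix. All names are hypothetical and the expert functions are toys, not the real overlay:

```python
import numpy as np

def moe_forward(x, router_logits, expert_fns, shared_fn, alpha, k=2):
    """Route a token through the top-k of 8 experts plus one shared expert.
    alpha[i] is a learned scalar gate rescaling expert i's contribution."""
    top = np.argsort(router_logits)[-k:]   # indices of the k largest logits
    w = np.exp(router_logits[top])
    w /= w.sum()                           # softmax over the selected experts
    y = shared_fn(x)                       # shared full-precision path, always active
    for wi, i in zip(w, top):
        y = y + wi * alpha[i] * expert_fns[i](x)
    return y

# Toy demo: experts that just scale the input by 1..8.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
shared = lambda x: x
alpha = np.ones(8)                         # alpha-fix gates, here all 1.0
logits = np.array([0.1, 3.0, 0.2, 2.0, 0.0, 0.0, 0.0, 0.0])
out = moe_forward(np.ones(4), logits, experts, shared, alpha)  # experts 1 and 3 win
```

The point of the scalar gates is cheapness: one float per expert per layer is enough to rebalance expert contributions after distillation, which is why the whole alpha-fix overlay fits in ~15 KB.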

Ternary arithmetic reduces a matmul to a stream of additions and subtractions (no multiplications), which is what makes overlays at this scale feasible to run outside a datacenter once the ssd_stream engine is wired; that's a v1.5+ sprint, tracked in the registry.
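The claim can be made concrete: with ternary weights, each output row is one sum of selected inputs minus another, scaled once at the end. A pure-Python sketch for illustration only (this is not the ssd_stream engine):

```python
def ternary_matvec(rows, scales, x):
    """y[i] = scales[i] * (sum of x[j] where rows[i][j] == +1
                           minus sum of x[j] where rows[i][j] == -1).
    The inner loop uses only adds and subtracts; the single multiply
    per row applies the fp16 scale."""
    y = []
    for row, s in zip(rows, scales):
        acc = 0.0
        for wij, xj in zip(row, x):
            if wij == 1:
                acc += xj      # +1 weight: add the input
            elif wij == -1:
                acc -= xj      # -1 weight: subtract the input
            # 0 weight: skip entirely, so sparsity is free
        y.append(s * acc)
    return y

w = [[1, 0, -1], [0, 1, 1]]
s = [0.5, 2.0]
x = [3.0, 4.0, 1.0]
print(ternary_matvec(w, s, x))  # 0.5*(3 - 1) and 2.0*(4 + 1) -> [1.0, 10.0]
```

A real kernel would vectorize this with bit-packed weights, but the arithmetic structure, accumulate-then-scale with no per-element multiply, is the same.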

What we are not claiming

  • We do NOT match frontier cloud models (GPT-5, Claude Opus 4.7, Gemini 3 Pro) on pure MMLU
  • We are NOT beating base Qwen on raw MMLU at most scales (we tie 70B, regress 10B / 40B / 150B)
  • We are NOT currently shipping MoE tiers in the desktop app; the app ships curated Qwen (Nano / Lite / Compact / Max). MoE tiers ship when the ssd_stream engine is wired and the scale beats Qwen at the same RAM budget
  • We are NOT claiming "Outlier-branded MMLU" on raw Qwen shipping tiers; the numbers above apply to THIS overlay checkpoint only (Rule 138)

Known limitations

  • Overlay weights load via transformers with trust_remote_code=True; outlier-engine is the reference runtime (separate package)
  • Chat template: inherits Qwen/Qwen2.5-32B-Instruct (no custom tokenizer)
  • English-tuned. Multilingual behavior inherits the base model; not separately optimized
  • Server-scale MoE path requires ssd_stream paging to fit in consumer RAM; NOT yet wired ([EXISTS BUT UNWIRED] per GROUND_TRUTH v12 §11)
  • Expert paging and AWQ base quantization land in the v1.5 engine sprint

Patents filed

Three U.S. provisional patents filed April 2026 (61 claims total):

  • Ternary MoE weight composition on frozen bases (#64/026,886)
  • Expert paging + memory hierarchy (#64/030,368)
  • Specialist branch-train-mix merging for binary-weight experts (#64/034,028)

Non-provisional deadline: April 3–9, 2027.

Citation

```bibtex
@misc{outlier2026,
  author       = {Kerr, Matt},
  title        = {Outlier: A Local AI Platform for Apple Silicon},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Outlier-Ai}}
}
```

License

Apache 2.0 throughout: base weights (Qwen team), overlay weights, and distributed runtime. See LICENSE.

Attribution

Base model by the Qwen team at Alibaba, released under Apache 2.0. Outlier adds MLX / GGUF / quantization work on top and distributes under the same license. All credit for capability belongs to the upstream Qwen team; we make it fast and easy to run on Mac.
