Outlier-70B-V3.3

Ternary MoE overlay on Qwen/Qwen2.5-32B-Instruct. 68B total parameters, 32B active per forward pass.

Status: [VERIFIED] · Role: research / server-scale · Shipping Mac tier? No; use Lite / Compact / Max in the Outlier desktop app for local inference on your Mac.

What it is

Outlier is an Apple-of-local-AI platform. The shipping desktop app runs curated Qwen tiers today; on the research side we train ternary mixture-of-experts overlays on top of frozen Qwen bases to push MMLU-per-GB at larger scales. This repo holds the research checkpoint used for the numbers below.

Highlights:

  • Overlay on a frozen Qwen/Qwen2.5-32B-Instruct backbone: the shared full-precision path acts as the quality anchor, while ternary experts {−1, 0, +1} specialize by domain with per-row fp16 scales
  • Alpha-fix refinement: 32B-active MoE routing with learned per-expert scalar gates (a 15 KB overlay that recovered V4 regressions and added +1.61pp on 70B)
  • Apache 2.0: weights, code, and distributed runtimes are Apache 2.0 throughout the chain
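The per-row fp16 scale in the first bullet can be sketched as a simple dequantization step. This is a minimal illustration, not the actual overlay loader; the function and array names are hypothetical:

```python
import numpy as np

def dequantize_ternary(ternary: np.ndarray, row_scale: np.ndarray) -> np.ndarray:
    """Expand a ternary expert matrix {-1, 0, +1} into an effective
    floating-point delta using one learned fp16 scale per output row."""
    assert set(np.unique(ternary)).issubset({-1, 0, 1})
    return row_scale[:, None] * ternary.astype(np.float16)

# Hypothetical 4x3 expert delta; each row carries its own fp16 scale.
t = np.array([[1, 0, -1], [0, 1, 1], [-1, -1, 0], [1, 1, -1]])
s = np.array([0.5, 0.25, 1.0, 0.125], dtype=np.float16)
delta = dequantize_ternary(t, s)
# The frozen backbone weight plus a routed sum of such deltas gives
# the expert's effective weight.
```

Storage stays tiny because each entry needs ~1.6 bits plus one fp16 scale per row, while the frozen backbone supplies the full-precision quality anchor.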

Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Outlier-Ai/Outlier-70B-V3.3"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    trust_remote_code=True,
    torch_dtype="auto",
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tok.decode(out[0], skip_special_tokens=True))
```

For Apple Silicon local inference, use the shipping MLX tiers (Lite / Compact / Max) in the Outlier desktop app instead.

Benchmarks (MMLU)

| Metric | Value | n | Stderr | Harness | Date | Status |
|--------|-------|---|--------|---------|------|--------|
| MMLU (this model) | 83.10% | 14,042 | ~0.30% | v0.4.9.1 | 2026-04-13 | [VERIFIED] |

MMLU vs. base Qwen (honest comparison):

| Outlier | Base Qwen | Delta |
|---------|-----------|-------|
| 83.10% (70B V3.3) | 83.3% (Qwen 2.5 32B) | −0.20pp (tied) |

Read: Outlier MoE overlays underperform or tie base Qwen on raw MMLU. The product thesis is MMLU per GB of RAM, not raw MMLU; see GROUND_TRUTH v12 §2.6.
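That figure of merit is just accuracy divided by memory footprint. A toy calculation; the GB figures below are illustrative placeholders, not measured footprints:

```python
def mmlu_per_gb(mmlu_pct: float, ram_gb: float) -> float:
    """Quality-per-memory figure of merit: MMLU points per GB of RAM."""
    return mmlu_pct / ram_gb

# Illustrative only: ternary expert storage is a small fraction of fp16,
# so a near-tie on raw MMLU in fewer GB still wins on this metric.
dense = mmlu_per_gb(83.3, 64.0)    # hypothetical fp16-class footprint
overlay = mmlu_per_gb(83.1, 24.0)  # hypothetical ternary-overlay footprint
assert overlay > dense
```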

Provenance labels (Rule 66):

  • [VERIFIED]: full source JSON, config.limit=None, complete n-samples, model_args present, reproducible from commit SHA
  • [SUPERSEDED YYYY-MM-DD]: replaced by a newer measurement; retained for audit
  • [INCOMPLETE]: number exists on disk but provenance fields are stripped
  • [CLAIM]: reported but not independently confirmed

Secondary benchmarks (cloud, Day 13 [VERIFIED])

| Model | HellaSwag | ARC-C | ARC-E | Winogrande | TruthfulQA |
|-------|-----------|-------|-------|------------|------------|
| 70B V3.3 | 85.95% | 73.46% | 91.62% | 81.29% | 67.12% |

Harness: lm-evaluation-harness v0.4.9.1, n=14,042. Source: ~/v4_cloud_sprint_day13/sprint003_artifacts/results/.

Architecture

  • Base backbone: Qwen/Qwen2.5-32B-Instruct (frozen during distillation)
  • Overlay: ternary delta experts ({−1, 0, +1} + per-row fp16 scale), top-k routing
  • Experts per layer: 8 routed + 1 shared, top-k = 2
  • Context: inherits Qwen/Qwen2.5's 32,768 tokens
  • Total / active params: 68B / 32B
  • Alpha-fix overlay: 280 per-expert scalar gates, 18 min on one B200, +1.61pp MMLU on 70B (V3.2 81.49% → V3.3 83.10%). 15 KB overlay file.
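A minimal sketch of the routing described above: 8 routed experts with top-k = 2, an always-on shared expert, and a learned per-expert scalar gate in the spirit of the alpha-fix. All names are hypothetical and the expert functions are toys, not the real overlay:

```python
import numpy as np

def moe_forward(x, router_logits, expert_fns, shared_fn, alpha, k=2):
    """Route a token through the top-k of 8 experts plus one shared expert.
    alpha[i] is a learned scalar gate rescaling expert i's contribution."""
    top = np.argsort(router_logits)[-k:]   # indices of the k largest logits
    w = np.exp(router_logits[top])
    w /= w.sum()                           # softmax over the selected experts
    y = shared_fn(x)                       # shared full-precision path, always active
    for wi, i in zip(w, top):
        y = y + wi * alpha[i] * expert_fns[i](x)
    return y

# Toy demo: experts that just scale the input by 1..8.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
shared = lambda x: x
alpha = np.ones(8)                         # alpha-fix gates, here all 1.0
logits = np.array([0.1, 3.0, 0.2, 2.0, 0.0, 0.0, 0.0, 0.0])
out = moe_forward(np.ones(4), logits, experts, shared, alpha)  # experts 1 and 3 win
```

The point of the scalar gates is cheapness: one float per expert per layer is enough to rebalance expert contributions after distillation, which is why the whole alpha-fix overlay fits in ~15 KB.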

Ternary arithmetic reduces a matmul to a stream of additions and subtractions (no multiplications), which is what makes overlays at this scale feasible to run outside a datacenter once the ssd_stream engine is wired; that's a v1.5+ sprint, tracked in the registry.
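The claim can be made concrete: with ternary weights, each output row is one sum of selected inputs minus another, scaled once at the end. A pure-Python sketch for illustration only (this is not the ssd_stream engine):

```python
def ternary_matvec(rows, scales, x):
    """y[i] = scales[i] * (sum of x[j] where rows[i][j] == +1
                           minus sum of x[j] where rows[i][j] == -1).
    The inner loop uses only adds and subtracts; the single multiply
    per row applies the fp16 scale."""
    y = []
    for row, s in zip(rows, scales):
        acc = 0.0
        for wij, xj in zip(row, x):
            if wij == 1:
                acc += xj      # +1 weight: add the input
            elif wij == -1:
                acc -= xj      # -1 weight: subtract the input
            # 0 weight: skip entirely, so sparsity is free
        y.append(s * acc)
    return y

w = [[1, 0, -1], [0, 1, 1]]
s = [0.5, 2.0]
x = [3.0, 4.0, 1.0]
print(ternary_matvec(w, s, x))  # 0.5*(3 - 1) and 2.0*(4 + 1) -> [1.0, 10.0]
```

A real kernel would vectorize this with bit-packed weights, but the arithmetic structure, accumulate-then-scale with no per-element multiply, is the same.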

What we are not claiming

  • We do NOT match frontier cloud models (GPT-5, Claude Opus 4.7, Gemini 3 Pro) on pure MMLU
  • We are NOT beating base Qwen on raw MMLU at most scales (we tie 70B, regress 10B / 40B / 150B)
  • We are NOT currently shipping MoE tiers in the desktop app; the app ships curated Qwen (Nano / Lite / Compact / Max). MoE tiers ship when the ssd_stream engine is wired and the scale beats Qwen at the same RAM budget
  • We are NOT claiming "Outlier-branded MMLU" on raw Qwen shipping tiers; the numbers above apply to THIS overlay checkpoint only (Rule 138)

Known limitations

  • Overlay weights load via transformers with trust_remote_code=True; outlier-engine is the reference runtime (separate package)
  • Chat template: inherits Qwen/Qwen2.5-32B-Instruct (no custom tokenizer)
  • English-tuned. Multilingual behavior inherits the base model; not separately optimized
  • Server-scale MoE path requires ssd_stream paging to fit in consumer RAM; NOT yet wired ([EXISTS BUT UNWIRED] per GROUND_TRUTH v12 §11)
  • Expert paging and AWQ base quantization land in the v1.5 engine sprint

Patents filed

Three U.S. provisional patents filed April 2026 (61 claims total):

  • Ternary MoE weight composition on frozen bases (#64/026,886)
  • Expert paging + memory hierarchy (#64/030,368)
  • Specialist branch-train-mix merging for binary-weight experts (#64/034,028)

Non-provisional deadline: April 3–9, 2027.

Citation

```bibtex
@misc{outlier2026,
  author       = {Kerr, Matt},
  title        = {Outlier: A Local AI Platform for Apple Silicon},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Outlier-Ai}}
}
```

License

Apache 2.0 throughout: base weights (Qwen team), overlay weights, and distributed runtime. See LICENSE.

Attribution

Base model by the Qwen team at Alibaba, released under Apache 2.0. Outlier adds MLX / GGUF / quantization work on top and distributes under the same license. All credit for capability belongs to the upstream Qwen team; we make it fast and easy to run on Mac.
