Outlier-70B-V3.2

Ternary mixture-of-experts overlay on Qwen/Qwen2.5-32B-Instruct. 70B total effective parameters, 32B active per forward pass.

TL;DR

  • Architecture: Outlier ternary MoE overlay on frozen Qwen 2.5 32B base
  • Parameters: 70B total, 32B active per forward (sparse routing)
  • MMLU: 81.49% — [VERIFIED]
  • License: Apache 2.0

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Outlier-Ai/Outlier-70B-V3.2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, trust_remote_code=True, torch_dtype="auto"
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=200)[0]))

For consumer Apple Silicon inference, use the MLX or GGUF tiers instead of this repository.

Benchmarks

Metric | Value | Provenance
MMLU | 81.49% | [VERIFIED] — full-sample, n=14,042, stderr 0.00314; source: cloud_sprint_day12/results/70b_v3_2_mmlu_full.json (62 task results, full provenance retained)

Rule 66 provenance labels:

  • [VERIFIED] — full source JSON with config.limit=None, n-samples complete, model_args present, reproducible from commit SHA.
  • [INCOMPLETE] — number exists on disk but provenance fields are stripped; cannot be cited publicly.
  • [CLAIM] — historical smoke-test value pending full re-verification on cluster.
  • [PENDING] — benchmark scheduled; results expected by a specific date.
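The four labels above reduce to a simple decision over the fields of a results file. A minimal sketch of that decision, assuming illustrative field names (config.limit, n_samples_complete, model_args, commit_sha, provenance_stripped) rather than the project's actual schema:

```python
def provenance_label(result: dict) -> str:
    """Classify a benchmark result per the Rule 66 scheme.

    Field names here are illustrative assumptions, not the real layout
    of the result JSON files referenced in the table above.
    """
    cfg = result.get("config", {})
    fully_verified = (
        cfg.get("limit", "absent") is None   # full-sample run, no limit
        and result.get("n_samples_complete")  # all samples present
        and "model_args" in cfg               # model args recorded
        and "commit_sha" in result            # reproducible from commit
    )
    if fully_verified:
        return "[VERIFIED]"
    if "value" not in result:
        return "[PENDING]"    # scheduled, no number yet
    if result.get("provenance_stripped"):
        return "[INCOMPLETE]"  # number exists, provenance gone
    return "[CLAIM]"           # historical value awaiting re-verification

verified = {
    "config": {"limit": None, "model_args": "pretrained=..."},
    "n_samples_complete": True, "commit_sha": "deadbeef", "value": 0.8149,
}
print(provenance_label(verified))
```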

Architecture

  • Base backbone: Qwen/Qwen2.5-32B-Instruct (frozen during distillation)
  • MoE overlay: ternary delta experts ({-1, 0, +1} + per-row fp16 scale) with top-K routing
  • Expert layers: varies by variant
  • Experts per layer: 8 routed + 1 shared
  • Top-k routing: 2
  • Context: inherits Qwen 2.5's 32,768 tokens
  • Expert paging: three-tier memory (SRAM / DRAM / NVMe) on 70B+
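The routing scheme above (8 routed experts plus 1 always-on shared expert, top-2 selection) can be sketched in NumPy. This is a stand-in for the real overlay: the router weights and shapes are made up for illustration, and only the gating math is shown.

```python
import numpy as np

def route_top2(hidden, router_w, k=2):
    """Select top-k routed experts per token and softmax their gates."""
    logits = hidden @ router_w                       # (tokens, n_routed)
    topk = np.argsort(logits, axis=-1)[:, -k:]       # indices of top-k experts
    gates = np.take_along_axis(logits, topk, axis=-1)
    gates = np.exp(gates - gates.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)            # normalize over the k chosen
    return topk, gates

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 16))    # 4 tokens, toy d_model=16
router_w = rng.standard_normal((16, 8))  # router over 8 routed experts
topk, gates = route_top2(hidden, router_w)
# The shared expert always applies; routed experts are mixed by gate:
#   out = shared(h) + sum_i gates[i] * expert[topk[i]](h)
```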

Ternary-weight arithmetic ({-1, 0, +1}) reduces a matmul to a stream of additions and subtractions — no multiplications — enabling consumer hardware to run flagship-scale models at usable speeds.
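The multiply-free claim is easy to verify numerically: with weights restricted to {-1, 0, +1}, a matvec is just "add the inputs where w=+1, subtract where w=-1", followed by one per-row scale. A minimal NumPy sketch (not the production kernel):

```python
import numpy as np

def ternary_matvec(w_ternary, scale, x):
    """Multiply-free matvec: w in {-1, 0, +1}, per-row scale factor."""
    pos = (w_ternary == 1)
    neg = (w_ternary == -1)
    # Additions where w=+1, subtractions where w=-1 — no multiplies
    # inside the accumulation, only the final per-row scale.
    acc = np.where(pos, x, 0.0).sum(axis=1) - np.where(neg, x, 0.0).sum(axis=1)
    return scale * acc

rng = np.random.default_rng(1)
w = rng.integers(-1, 2, size=(8, 32))             # ternary weight matrix
scale = rng.standard_normal(8).astype(np.float16).astype(np.float64)
x = rng.standard_normal(32)
# Matches the ordinary dense matmul, since w = pos - neg:
assert np.allclose(ternary_matvec(w, scale, x), scale * (w @ x))
```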

Patents filed

Three provisional patents filed April 2026 (61 claims total) covering ternary MoE weight composition, expert paging, and specialist merging techniques.

Known limitations

  • Calibration and full-sample MMLU re-verification are queued for cluster time; any numbers labeled [CLAIM] are historical smoke-test values awaiting verification.
  • Outlier's ternary MoE overlay is research-grade — use the consumer tier (Nano / Lite / Compact / Max) for production local inference.
  • Qwen 2.5 tokenizer + chat template apply; no custom tokenizer.
  • English-tuned. Multilingual performance inherits the base model and is not separately optimized.

Citation

@misc{outlier2026,
  author       = {Kerr, Matt},
  title        = {Outlier: Ternary Mixture-of-Experts for Consumer Hardware},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Outlier-Ai}}
}
