Qwen3.5-9B Q4_K_M GGUF

A 4-bit quantized GGUF of Qwen/Qwen3.5-9B, optimized for on-device iOS inference via llama.cpp. The most capable model you can run on an iPhone.

| Property | Value |
|---|---|
| Parameters | 9 billion |
| Quantization | Q4_K_M (4-bit, medium quality) |
| File Size | 5.3 GB |
| Context Window | 262,144 tokens (native) |
| Architecture | Hybrid Gated DeltaNet + Attention |
| License | Apache 2.0 |
| Languages | 201 languages/dialects |

Running It

This model is available directly in the HaploAI iOS app (v1.18+). Download it from the model selection page.

Key Features

  • Best-in-Class On-Device AI: Matches or beats models 9-13x its size
  • Thinking Mode: <think>...</think> chain-of-thought reasoning
  • Hybrid Architecture: Gated DeltaNet + Attention for efficient, high-quality inference
  • Natively Multimodal: Trained with early vision fusion
  • Massive Context: 262K native, extendable to 1M+ with YaRN
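
The YaRN extension mentioned above works by scaling RoPE to cover a longer context than the model saw in training; the scaling factor is simply the ratio of target to native context length. A minimal sketch, assuming a 1,048,576-token (2^20) target for the "1M+" figure:

```python
NATIVE_CTX = 262_144        # native context window
TARGET_CTX = 1_048_576      # assumed "1M+" target (2**20 tokens)

# YaRN scales RoPE position frequencies by the ratio of the
# extended context to the context used during training.
scale = TARGET_CTX / NATIVE_CTX
print(scale)  # → 4.0
```

In llama.cpp this corresponds to the `--rope-scaling yarn` family of flags; consult its documentation for the exact options your build supports.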

Benchmarks

Qwen3.5-9B vs Models 9-13x Larger

*(Chart: 9B vs Large Models)*

The 9B model beats Qwen3-80B on GPQA Diamond (81.7 vs 77.2), IFEval (91.5 vs 88.9), and HMMT math (83.2 vs 73.7). It also outperforms GPT-OSS-120B on MMLU-Pro (82.5 vs 80.8) and GPQA Diamond (81.7 vs 80.1).

MMLU-Pro: Size Class Comparison

| Model | Params | MMLU-Pro |
|---|---|---|
| Qwen3.5-9B | 9B | 82.5 |
| Qwen3-30B | 30B | 80.9 |
| GPT-OSS-120B | 120B | 80.8 |
| Qwen3.5-4B | 4B | 79.1 |
| Gemma2-9B | 9B | ~55* |
| Phi-4-mini | 3.8B | 52.8 |

On-Device Inference Speed

*(Chart: Speed Comparison)*

Vision Capabilities

*(Chart: Vision Benchmarks)*

The 9B model outperforms the dedicated Qwen3-VL-30B (3x its size) on MMMU, MMMU-Pro, MathVision, OmniDocBench, and VideoMME.

Full Benchmark Table

| Benchmark | Qwen3.5-9B | Qwen3-30B | Qwen3-80B | GPT-OSS-120B |
|---|---|---|---|---|
| MMLU-Pro | 82.5 | 80.9 | 82.7 | 80.8 |
| MMLU-Redux | 91.1 | 91.4 | 92.5 | 91.0 |
| GPQA Diamond | 81.7 | 73.4 | 77.2 | 80.1 |
| IFEval | 91.5 | 88.9 | 88.9 | - |
| HMMT Feb 25 | 83.2 | 63.1 | 73.7 | 76.7 |
| HMMT Nov 25 | 82.9 | 73.8 | 81.2 | 81.8 |
| LiveCodeBench v6 | 65.6 | 66.0 | 68.7 | 82.7 |
| BFCL-V4 (Tool Use) | 66.1 | 42.4 | - | - |
| C-Eval | 88.2 | 87.4 | 89.7 | 76.2 |
| SuperGPQA | 58.2 | 56.8 | 60.8 | 54.6 |

Vision Benchmarks

| Benchmark | Qwen3.5-9B | Qwen3-VL-30B | GPT-5-Nano |
|---|---|---|---|
| MMMU | 78.4 | 76.0 | 75.8 |
| MMMU-Pro | 70.1 | 63.0 | 57.2 |
| MathVision | 78.9 | 65.7 | 62.2 |
| OmniDocBench | 87.7 | 86.8 | 55.9 |
| VideoMME | 84.5 | 79.9 | 71.7 |
| OSWorld | 41.8 | 30.6 | - |

Device Compatibility

| Device | RAM | Compatible | Speed |
|---|---|---|---|
| iPhone 16 Pro Max | 8 GB | Yes | ~22-28 tok/s |
| iPhone 16 Pro | 8 GB | Yes | ~20-25 tok/s |
| iPhone 16 / 15 Pro | 8 GB | Possible (tight) | ~15-20 tok/s |
| iPhone 15 and older | ≤6 GB | Not recommended | - |
| iPad Pro (M-series) | 8-16 GB | Yes | ~25-40 tok/s |
| Mac (Apple Silicon) | 16+ GB | Yes | ~30-50 tok/s |

Note: The 9B model at 5.3 GB requires devices with 8 GB+ RAM. For older devices, use the Qwen3.5-4B instead.
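
The fit rule in the note above can be sketched as a simple budget check: model weights plus runtime overhead (KV cache, activations) must fit in the fraction of physical RAM iOS actually grants an app. The 1.0 GB overhead and 0.8 usable fraction below are illustrative assumptions, not measured figures:

```python
# Rough check of whether the 5.3 GB Q4_K_M file fits on a given device.
MODEL_SIZE_GB = 5.3
OVERHEAD_GB = 1.0   # assumed: KV cache + activations + runtime overhead

def fits(device_ram_gb: float, usable_fraction: float = 0.8) -> bool:
    """iOS grants an app only part of physical RAM (fraction assumed here)."""
    usable = device_ram_gb * usable_fraction
    return MODEL_SIZE_GB + OVERHEAD_GB <= usable

print(fits(8))  # → True  (iPhone 16 Pro class)
print(fits(6))  # → False (iPhone 15 and older)
```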

Usage

With llama.cpp

```bash
# Download
huggingface-cli download jc-builds/Qwen3.5-9B-Q4_K_M-GGUF Qwen3.5-9B-Q4_K_M.gguf

# Run (with thinking mode)
./llama-cli -m Qwen3.5-9B-Q4_K_M.gguf -p "Prove that there are infinitely many primes." -ngl 99
```

With Ollama

```bash
ollama run qwen3.5:9b
```

Prompt Format

Uses ChatML format:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```
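
The template can be assembled programmatically. A minimal sketch in Python (the `to_chatml` helper is ours for illustration; it is not part of llama.cpp or any Qwen library):

```python
def to_chatml(messages: list[dict]) -> str:
    """Render {role, content} dicts into the ChatML format above,
    leaving the prompt open for the assistant's turn."""
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    out.append("<|im_start|>assistant\n")  # open assistant turn
    return "\n".join(out)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```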

Thinking Mode

```
<|im_start|>assistant
<think>
Let me reason through this carefully...
First, assume there are finitely many primes p1, p2, ..., pn.
Consider N = p1 * p2 * ... * pn + 1.
N is not divisible by any pi, so either N is prime or has a prime factor not in our list.
This contradicts our assumption.
</think>
There are infinitely many primes. Here is Euclid's classic proof...
```
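
When thinking mode is on, client code usually wants to separate the reasoning trace from the final answer. A minimal sketch (the `split_thinking` function is ours; llama.cpp simply returns the raw text including the tags):

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split model output into (reasoning, answer) around <think>...</think>."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()   # no thinking block emitted
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

reasoning, answer = split_thinking(
    "<think>\nAssume finitely many primes...\n</think>\nThere are infinitely many primes."
)
print(answer)  # → There are infinitely many primes.
```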

Architecture Details

Qwen3.5 introduces a hybrid Gated DeltaNet + Gated Attention architecture:

  • 3:1 ratio: 3 layers of Gated DeltaNet (linear attention) per 1 layer of full softmax attention
  • 32 total layers: 8 blocks x (3 DeltaNet + 1 Attention)
  • Hidden dimension: 4,096
  • Near-constant memory: DeltaNet layers maintain bounded memory
  • GQA: 16 query heads, 4 KV heads for attention layers
  • FFN intermediate: 12,288
  • RoPE: theta=10,000,000 with YaRN extension
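
The memory advantage of the 3:1 layout falls out of simple arithmetic: only the 8 full-attention layers keep a KV cache that grows with context, while the DeltaNet layers hold a fixed-size state. A sketch of the cache size, assuming fp16 entries and a per-head dimension of 256 (hidden 4,096 / 16 query heads — an assumption, not a published figure):

```python
# KV-cache estimate for the hybrid layout: 8 blocks x (3 DeltaNet + 1 attention)
# = 32 layers, of which only the 8 attention layers cache K/V per token.
TOTAL_LAYERS = 32
ATTN_LAYERS = TOTAL_LAYERS // 4   # one attention layer per 4-layer block
KV_HEADS = 4                       # GQA: 16 query heads share 4 KV heads
HEAD_DIM = 256                     # assumed: hidden 4096 / 16 query heads
BYTES = 2                          # fp16

def kv_cache_gb(context_tokens: int, attn_layers: int = ATTN_LAYERS) -> float:
    per_token = attn_layers * KV_HEADS * HEAD_DIM * BYTES * 2  # K and V
    return context_tokens * per_token / 1e9

print(f"{kv_cache_gb(32_768):.2f} GB at 32K context (8 attention layers)")
print(f"{kv_cache_gb(32_768, 32):.2f} GB if all 32 layers used full attention")
```

Under these assumptions the hybrid layout cuts the growing part of the cache by 4x, which is what makes long contexts plausible on an 8 GB phone.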

Why Qwen3.5-9B?

This model represents a paradigm shift in on-device AI:

  1. 9B params that beat 80B: On GPQA Diamond, IFEval, and math benchmarks
  2. Hybrid attention is the future: DeltaNet layers provide near-constant memory, enabling huge context on mobile
  3. Natively multimodal: No separate vision encoder needed for basic image understanding
  4. 201 languages: Broadest language support in its class
  5. Apache 2.0: Fully open, commercially usable
