# Qwen3.5-9B Q4_K_M GGUF

4-bit quantized GGUF of Qwen/Qwen3.5-9B, optimized for on-device iOS inference via llama.cpp. One of the most capable models that can run on an iPhone.
| Property | Value |
|---|---|
| Parameters | 9 billion |
| Quantization | Q4_K_M (4-bit, medium quality) |
| File Size | 5.3 GB |
| Context Window | 262,144 tokens (native) |
| Architecture | Hybrid Gated DeltaNet + Attention |
| License | Apache 2.0 |
| Languages | 201 languages/dialects |
## Running It

This model is available directly in the HaploAI iOS app (v1.18+). Download it from the model selection page.
## Key Features

- Best-in-Class On-Device AI: Matches or beats models 9-13x its size
- Thinking Mode: `<think>...</think>` chain-of-thought reasoning
- Hybrid Architecture: Gated DeltaNet + Attention for efficient, high-quality inference
- Natively Multimodal: Trained with early vision fusion
- Massive Context: 262K tokens native, extendable to 1M+ with YaRN
## Benchmarks

### Qwen3.5-9B vs Models 9-13x Larger
The 9B model beats Qwen3-80B on GPQA Diamond (81.7 vs 77.2), IFEval (91.5 vs 88.9), and HMMT math (83.2 vs 73.7). It also outperforms GPT-OSS-120B on MMLU-Pro (82.5 vs 80.8) and GPQA Diamond (81.7 vs 80.1).
### MMLU-Pro: Size Class Comparison
| Model | Params | MMLU-Pro |
|---|---|---|
| Qwen3.5-9B | 9B | 82.5 |
| Qwen3-30B | 30B | 80.9 |
| GPT-OSS-120B | 120B | 80.8 |
| Qwen3.5-4B | 4B | 79.1 |
| Gemma2-9B | 9B | ~55* |
| Phi-4-mini | 3.8B | 52.8 |
### Vision Capabilities
The 9B model outperforms the dedicated Qwen3-VL-30B (3x its size) on MMMU, MMMU-Pro, MathVision, OmniDocBench, and VideoMME.
### Full Benchmark Table
| Benchmark | Qwen3.5-9B | Qwen3-30B | Qwen3-80B | GPT-OSS-120B |
|---|---|---|---|---|
| MMLU-Pro | 82.5 | 80.9 | 82.7 | 80.8 |
| MMLU-Redux | 91.1 | 91.4 | 92.5 | 91.0 |
| GPQA Diamond | 81.7 | 73.4 | 77.2 | 80.1 |
| IFEval | 91.5 | 88.9 | 88.9 | - |
| HMMT Feb 25 | 83.2 | 63.1 | 73.7 | 76.7 |
| HMMT Nov 25 | 82.9 | 73.8 | 81.2 | 81.8 |
| LiveCodeBench v6 | 65.6 | 66.0 | 68.7 | 82.7 |
| BFCL-V4 (Tool Use) | 66.1 | 42.4 | - | - |
| C-Eval | 88.2 | 87.4 | 89.7 | 76.2 |
| SuperGPQA | 58.2 | 56.8 | 60.8 | 54.6 |
### Vision Benchmarks
| Benchmark | Qwen3.5-9B | Qwen3-VL-30B | GPT-5-Nano |
|---|---|---|---|
| MMMU | 78.4 | 76.0 | 75.8 |
| MMMU-Pro | 70.1 | 63.0 | 57.2 |
| MathVision | 78.9 | 65.7 | 62.2 |
| OmniDocBench | 87.7 | 86.8 | 55.9 |
| VideoMME | 84.5 | 79.9 | 71.7 |
| OSWorld | 41.8 | 30.6 | - |
## Device Compatibility
| Device | RAM | Compatible | Speed |
|---|---|---|---|
| iPhone 16 Pro Max | 8 GB | Yes | ~22-28 tok/s |
| iPhone 16 Pro | 8 GB | Yes | ~20-25 tok/s |
| iPhone 16 / 15 Pro | 8 GB | Possible (tight) | ~15-20 tok/s |
| iPhone 15 and older | 6 GB | Not recommended | - |
| iPad Pro (M-series) | 8-16 GB | Yes | ~25-40 tok/s |
| Mac (Apple Silicon) | 16+ GB | Yes | ~30-50 tok/s |
Note: At 5.3 GB, the 9B model requires a device with 8 GB+ RAM. On older devices, use Qwen3.5-4B instead.
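The RAM guidance above can be sketched as a simple fit check. This is an illustrative heuristic only, not HaploAI's actual logic: the 80% usable-RAM fraction and 1 GB of headroom for KV cache and runtime buffers are assumptions.

```python
# Rough check of whether a GGUF model fits on a device (illustrative
# heuristic; the 0.8 usable fraction and 1 GB headroom are assumptions).
def fits_in_ram(model_gb: float, device_ram_gb: float,
                usable_fraction: float = 0.8,
                headroom_gb: float = 1.0) -> bool:
    # Model weights plus headroom must fit in the RAM the OS leaves usable.
    return model_gb + headroom_gb <= device_ram_gb * usable_fraction

print(fits_in_ram(5.3, 8))  # 8 GB iPhone 16 Pro class: fits, barely
print(fits_in_ram(5.3, 6))  # 6 GB iPhone 15 and older: does not fit
```

Under these assumptions the 5.3 GB file clears an 8 GB device with only ~0.1 GB to spare, which matches the "Possible (tight)" rating for base iPhone 16 hardware.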
## Usage

### With llama.cpp

```bash
# Download
huggingface-cli download jc-builds/Qwen3.5-9B-Q4_K_M-GGUF Qwen3.5-9B-Q4_K_M.gguf

# Run (with thinking mode)
./llama-cli -m Qwen3.5-9B-Q4_K_M.gguf -p "Prove that there are infinitely many primes." -ngl 99
```
### With Ollama

```bash
ollama run qwen3.5:9b
```
## Prompt Format

Uses the ChatML format:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```
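If you are templating prompts yourself rather than relying on the GGUF's built-in chat template, the format above can be assembled with a few lines of Python. This is a minimal sketch; the function name is ours, and the token strings are the standard ChatML delimiters shown above.

```python
# Minimal ChatML prompt builder (sketch). Takes OpenAI-style
# {"role", "content"} messages and emits the template shown above,
# leaving the assistant turn open so the model generates the reply.
def build_chatml(messages: list[dict[str, str]]) -> str:
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

In practice, llama.cpp and Ollama apply this template automatically when you use their chat endpoints; manual assembly is only needed for raw completion calls.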
### Thinking Mode

```
<|im_start|>assistant
<think>
Let me reason through this carefully...
First, assume there are finitely many primes p1, p2, ..., pn.
Consider N = p1 * p2 * ... * pn + 1.
N is not divisible by any pi, so either N is prime or has a prime factor not in our list.
This contradicts our assumption.
</think>
There are infinitely many primes. Here is Euclid's classic proof...
```
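A client app usually hides the reasoning and shows only the final answer. The split can be sketched like this; it assumes (as in the example above) a single `<think>...</think>` block at the start of the output, and the function name is ours.

```python
import re

# Split a model response into (hidden reasoning, visible answer).
# Assumes at most one <think>...</think> block at the start of the
# output; if none is present, the whole text is the answer.
def split_thinking(text: str) -> tuple[str, str]:
    m = re.match(r"\s*<think>(.*?)</think>\s*", text, flags=re.DOTALL)
    if not m:
        return "", text
    return m.group(1).strip(), text[m.end():]

reasoning, answer = split_thinking(
    "<think>\nAssume finitely many primes...\n</think>\n"
    "There are infinitely many primes."
)
print(answer)  # the visible part only
```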
## Architecture Details

Qwen3.5 introduces a hybrid Gated DeltaNet + Gated Attention architecture:

- 3:1 ratio: 3 layers of Gated DeltaNet (linear attention) per 1 layer of full softmax attention
- 32 total layers: 8 blocks x (3 DeltaNet + 1 Attention)
- Hidden dimension: 4,096
- Near-constant memory: DeltaNet layers maintain bounded memory
- GQA: 16 query heads, 4 KV heads for attention layers
- FFN intermediate size: 12,288
- RoPE: `theta = 10,000,000` with YaRN extension
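The memory advantage of the hybrid layout falls out of the numbers above: only the 8 full-attention layers grow a KV cache with context length, while DeltaNet layers keep a fixed-size state. A back-of-the-envelope calculation (assuming head dimension = hidden / query heads = 256 and an fp16 cache; these are inferences from the listed figures, not published values):

```python
# Layer layout and per-token KV-cache cost implied by the specs above:
# 32 layers = 8 blocks of (3 DeltaNet + 1 attention), GQA with 4 KV heads.
NUM_LAYERS = 32
BLOCK = ["deltanet", "deltanet", "deltanet", "attention"]
layers = BLOCK * (NUM_LAYERS // len(BLOCK))

attn_layers = layers.count("attention")   # only these grow a KV cache
kv_heads = 4
head_dim = 4096 // 16                     # hidden / query heads = 256 (assumed)

# K and V (factor 2), fp16 (2 bytes), per attention layer, per token.
kv_bytes_per_token = attn_layers * 2 * kv_heads * head_dim * 2

print(attn_layers)          # -> 8
print(kv_bytes_per_token)   # -> 32768 (32 KiB per token)
```

Under these assumptions, a full 262K-token context costs roughly 8 GB of fp16 KV cache for the 8 attention layers alone, which is why mobile runtimes pair the hybrid design with KV-cache quantization and shorter practical contexts.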
## Why Qwen3.5-9B?
This model represents a paradigm shift in on-device AI:
- 9B params that beat 80B: On GPQA Diamond, IFEval, and math benchmarks
- Hybrid attention is the future: DeltaNet layers provide near-constant memory, enabling huge context on mobile
- Natively multimodal: No separate vision encoder needed for basic image understanding
- 201 languages: Broadest language support in its class
- Apache 2.0: Fully open, commercially usable
## Credits
- Original model: Qwen/Qwen3.5-9B by Alibaba Cloud
- GGUF conversion: Unsloth
- Quantization: Q4_K_M via llama.cpp
- Optimized for iOS: jc-builds