# Qwen3.5-4B Q4_K_M GGUF

A 4-bit quantized GGUF build of Qwen/Qwen3.5-4B, optimized for on-device iOS inference via llama.cpp.

| Property | Value |
| --- | --- |
| Parameters | 4 billion |
| Quantization | Q4_K_M (4-bit, medium quality) |
| File Size | 2.6 GB |
| Context Window | 262,144 tokens (native) |
| Architecture | Hybrid Gated DeltaNet + Attention |
| License | Apache 2.0 |
| Languages | 201 languages/dialects |

## Key Features

- **Thinking Mode:** Outputs `<think>...</think>` tags for chain-of-thought reasoning
- **Hybrid Architecture:** 3:1 ratio of Gated DeltaNet (linear attention) to full softmax attention, enabling near-constant memory usage
- **Natively Multimodal:** Trained with early vision fusion (vision encoder available separately)
- **Massive Context:** 262K native, extendable to 1M+ with YaRN
- **On-Device Optimized:** Designed for mobile and edge deployment

## Benchmarks

### MMLU-Pro: Qwen3.5-4B vs Competitors

*[Chart: MMLU-Pro comparison]*

The 4B model scores 79.1 on MMLU-Pro, dramatically outperforming Phi-4-mini (52.8), Qwen2.5-3B (44.7), Llama-3.2-3B (39.2), and Mistral-3B (35.3).

### Qwen3.5-4B vs Qwen3-30B (7.5x larger)

*[Chart: 4B vs 30B benchmark comparison]*

Despite being 7.5x smaller, the 4B model matches or beats Qwen3-30B on GPQA Diamond, IFEval, HMMT math, and tool use benchmarks.

### On-Device Inference Speed

*[Chart: on-device inference speed comparison]*

### Vision Capabilities

*[Chart: vision benchmark comparison]*

Both the 4B and 9B models outperform the dedicated Qwen3-VL-30B vision model on most benchmarks.

### Full Benchmark Table

| Benchmark | Qwen3.5-4B | Qwen3-30B | Qwen3-80B |
| --- | --- | --- | --- |
| MMLU-Pro | 79.1 | 80.9 | 82.7 |
| MMLU-Redux | 88.8 | 91.4 | 92.5 |
| GPQA Diamond | 76.2 | 73.4 | 77.2 |
| IFEval | 89.8 | 88.9 | 88.9 |
| HMMT Feb 25 | 74.0 | 63.1 | 73.7 |
| LiveCodeBench v6 | 55.8 | 66.0 | 68.7 |
| BFCL-V4 (Tool Use) | 50.3 | 42.4 | - |
| MMMU (Vision) | 77.6 | - | - |
| C-Eval | 85.1 | 87.4 | 89.7 |

## Device Compatibility

| Device | RAM | Compatible | Speed |
| --- | --- | --- | --- |
| iPhone 16 Pro / Pro Max | 8 GB | Yes | ~30-38 tok/s |
| iPhone 16 / 16e | 8 GB | Yes | ~28-35 tok/s |
| iPhone 15 Pro / Pro Max | 8 GB | Yes | ~28-38 tok/s |
| iPhone 15 / 14 Pro | 6 GB | Marginal | ~15-20 tok/s |
| iPhone 14 and older | 4-6 GB | Not recommended | - |
| iPad Pro (M-series) | 8-16 GB | Yes | ~35-45 tok/s |
| Mac (Apple Silicon) | 8+ GB | Yes | ~40-60 tok/s |
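As a rough rule of thumb, the model fits when the 2.6 GB weight file plus the KV cache stays under the RAM the OS will actually grant the app. A back-of-the-envelope sketch; the usable-RAM fraction and per-token KV size below are illustrative assumptions, not measured values for this model:

```python
# Rough memory-fit check: weights + KV-cache estimate vs. device RAM.
# Because only 8 of the 32 layers are full attention (the rest are DeltaNet
# with bounded state), the KV cache stays small; 16 KiB/token is a rough
# fp16 estimate, and the 0.5 usable-RAM fraction is an iOS-budget guess.

MODEL_SIZE_GB = 2.6  # Q4_K_M file size from the table above

def fits_in_ram(ram_gb: float, context_tokens: int,
                kv_bytes_per_token: int = 16 * 1024,
                usable_fraction: float = 0.5) -> bool:
    """Return True if weights + KV cache fit within the usable RAM budget."""
    kv_gb = context_tokens * kv_bytes_per_token / 1024**3
    return MODEL_SIZE_GB + kv_gb <= ram_gb * usable_fraction
```

Under these assumptions an 8 GB iPhone comfortably fits an 8K context, while a 4 GB device does not, which matches the compatibility table above.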

## Usage

### With llama.cpp

```bash
# Download
huggingface-cli download jc-builds/Qwen3.5-4B-Q4_K_M-GGUF Qwen3.5-4B-Q4_K_M.gguf

# Run (with thinking mode)
./llama-cli -m Qwen3.5-4B-Q4_K_M.gguf -p "What is the sum of the first 100 prime numbers?" -ngl 99
```
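llama.cpp also ships `llama-server`, which exposes an OpenAI-compatible HTTP API. A minimal client sketch; the port, temperature, and function names here are our assumptions, not part of this model card:

```python
# Minimal client for llama.cpp's OpenAI-compatible server. Start it first,
# e.g.: ./llama-server -m Qwen3.5-4B-Q4_K_M.gguf -ngl 99 --port 8080
# The host/port and temperature are illustrative assumptions.
import json
import urllib.request

def build_payload(prompt: str, temperature: float = 0.6) -> dict:
    """Build a /v1/chat/completions request body for a single user turn."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(prompt: str, host: str = "http://localhost:8080") -> str:
    """Send the prompt and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```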

### With Ollama

```bash
ollama run qwen3.5:4b
```
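Ollama also serves a local REST API whose generate endpoint streams newline-delimited JSON, each line carrying a `response` fragment until a final line with `"done": true`. A small helper to reassemble the fragments; the sample input in the test is illustrative:

```python
# Reassemble the text fragments from an Ollama NDJSON response stream.
import json

def collect_stream(ndjson_lines):
    """Concatenate the 'response' fragments, stopping at the 'done' line."""
    out = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        out.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(out)
```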

### In HaploAI (iOS)

This model is available directly in the HaploAI iOS app (v1.18+). Download it from the model selection page.

## Prompt Format

Uses ChatML format:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```
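When driving the model through a raw completion API (rather than a chat endpoint that applies the template for you), the ChatML string can be assembled with a small helper; a sketch, with the function name being ours:

```python
# Render a list of {"role", "content"} messages into the ChatML format above.
def to_chatml(messages, add_generation_prompt=True):
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        out.append("<|im_start|>assistant\n")
    return "\n".join(out)
```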

### Thinking Mode

The model emits its chain-of-thought reasoning inside `<think>` tags:

```
<|im_start|>assistant
<think>
Let me work through this step by step...
</think>
The answer is 42.
```
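Apps that only want to display the final answer can strip the reasoning block first. A minimal sketch, assuming the `<think>` block appears at most once, at the start of the reply:

```python
# Split a model reply into (thinking, answer) parts.
import re

def split_thinking(text):
    """Return the <think>...</think> contents and the remaining answer."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    thinking = m.group(1).strip()
    answer = text[m.end():].strip()
    return thinking, answer
```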

## Architecture Details

Qwen3.5 introduces a novel hybrid Gated DeltaNet + Gated Attention architecture:

- **3:1 ratio:** 3 layers of Gated DeltaNet (linear attention) per 1 layer of full softmax attention
- **32 total layers:** 8 blocks x (3 DeltaNet + 1 Attention)
- **Near-constant memory:** DeltaNet layers maintain bounded memory independent of sequence length
- **GQA:** 16 query heads, 4 KV heads for attention layers
- **RoPE:** theta=10,000,000 with YaRN extension to 1M+ tokens
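The 8 x (3 DeltaNet + 1 Attention) interleaving above can be sketched as:

```python
# Generate the layer-type sequence for the hybrid stack described above:
# each block is three Gated DeltaNet layers followed by one full-attention
# layer, repeated 8 times for 32 layers total.
def layer_pattern(blocks=8, deltanet_per_block=3):
    layers = []
    for _ in range(blocks):
        layers += ["deltanet"] * deltanet_per_block + ["attention"]
    return layers
```

Only the 8 `"attention"` layers keep a growing KV cache, which is why memory stays near-constant as the context grows.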

## Credits
