# Qwen3.5-35B-A3B-IQK-RPi5-16GB — Custom IQK Quantization for Raspberry Pi 5 (16GB)

- **Quantized by:** mtrpires
- **Base model:** Qwen/Qwen3.5-35B-A3B
- **Runtime:** ik_llama.cpp
- **Target hardware:** Raspberry Pi 5 16GB (Cortex-A76)
## Why this quantization exists

Qwen3.5-35B-A3B is a Mixture-of-Experts model with 35B total parameters but only 3.3B active per token (256 experts; 8 routed plus 1 shared active per token). This architecture makes it remarkably efficient at inference time: in benchmarks it surpasses Qwen3-235B-A22B despite activating far fewer parameters, scoring 69.2% on SWE-bench Verified.

The problem: every standard quantization of this model exceeds 15GB on disk and requires roughly 17GB of RAM, putting it out of reach of a Raspberry Pi 5 16GB with conventional tools.
This quantization uses a mixed IQK recipe — asymmetric quantization that exploits the MoE architecture's properties: attention and shared expert layers (always active, quality-critical) are kept at high precision, while routed expert layers (rarely co-activated, quality-resilient) are compressed aggressively using ARM NEON-optimized IQK kernels from ik_llama.cpp.
The result: 11.38GB on disk, ~12.5GB in RAM — fits comfortably in 16GB.
## Quantization recipe

| Tensor group | Quantization | Rationale |
|---|---|---|
| Attention layers (all) | q8_0 | Always active per token; critical for reasoning |
| Shared experts (always active) | q8_0 | Backbone of the MoE; activated every token |
| Routed experts, layers 0–7 | iq4_ks | Early layers are more sensitive; IQK NEON optimized |
| Routed experts, layers 8+ | iq2_ks | Bulk of the model; rarely co-activated; IQK NEON optimized |
| Token embeddings / output | q6_K | Used every token; good quality/size balance |
The `IQ4_KS` and `IQ2_KS` quantization types are exclusive to ik_llama.cpp and have hand-written ARM NEON assembly kernels that make them faster than `Q4_K_M` on Cortex-A76 despite the lower bit depth.
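As a back-of-the-envelope sanity check, the average bit-width implied by the 11.38 GiB file size and the nominal 35B parameter count can be computed with a one-liner (a rough figure that lumps all tensors together):

```bash
# Effective bits per weight implied by an 11.38 GiB file and 35B parameters.
awk 'BEGIN { printf "%.2f bits/weight\n", 11.38 * 2^30 * 8 / 35e9 }'
# → 2.79 bits/weight
```

The sub-3-bit average is expected: most of the weight mass sits in routed experts near the 2-bit level, while the always-active paths stay at 8 bits.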
- **Source:** Q8_0 GGUF from bartowski
- **Imatrix:** ubergarm (calibrated on chat, coding, and long-context tool-calling data)
## Benchmark results

Tested on a Raspberry Pi 5 16GB with ik_llama.cpp (HEAD, March 2026), using `-t 3 --threads-batch 4 -c 16384 --mlock`:
| Test | Result |
|---|---|
| Prompt processing (pp512) | 30.85 ± 0.03 t/s |
| Token generation (tg50) | 4.53 ± 0.00 t/s |
| Model size | 11.38 GiB |
| RAM usage (with 16K context) | ~13GB |
| Context window used | 16,384 tokens |
4.5 t/s generation is a comfortable reading speed for a personal-assistant workload. Prompt processing benefits from `--threads-batch 4` (+26% vs. 3 threads).
**ARM NEON status:** `__ARM_FEATURE_DOTPROD` confirmed active. The `HAVE_FANCY_SIMD is NOT defined` warning in ik_llama.cpp output refers exclusively to AVX-512 (x86) and can be safely ignored on ARM.
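To double-check this on your own board, the dot-product extension shows up as the `asimddp` flag in `/proc/cpuinfo` on Linux/ARM (a quick read-only check; on non-ARM machines it simply prints "no"):

```bash
# Report whether the ARM dot-product extension (asimddp) is advertised.
if grep -q asimddp /proc/cpuinfo 2>/dev/null; then
  echo "dotprod: yes"
else
  echo "dotprod: no"
fi
```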
## How to run

### Requirements
- Raspberry Pi 5 16GB (8GB will not work — insufficient RAM)
- ik_llama.cpp compiled from source
- Active cooling recommended for sustained inference
### Compile ik_llama.cpp on the Pi

```bash
git clone https://github.com/ikawrakow/ik_llama.cpp --depth=1
cd ik_llama.cpp
cmake -B build \
  -DGGML_NATIVE=ON \
  -DGGML_BLAS=ON \
  -DGGML_BLAS_VENDOR=OpenBLAS \
  -DGGML_LTO=ON \
  -DBUILD_SHARED_LIBS=OFF \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j4
```
Note: `-DGGML_NATIVE=ON` automatically detects Cortex-A76 capabilities (ARM NEON, dot product, fp16), so explicit `-march=` flags are not needed.
### Run as server

```bash
./build/bin/llama-server \
  -m Qwen3.5-35B-A3B-IQK-RPi5-16GB.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  -t 3 \
  --threads-batch 4 \
  -c 16384 \
  --mlock \
  --log-disable
```
### Run benchmark

```bash
./build/bin/llama-bench \
  -m Qwen3.5-35B-A3B-IQK-RPi5-16GB.gguf \
  -t 3 -n 50
```
### OpenAI-compatible API

The server exposes an OpenAI-compatible API at `http://localhost:8080/v1`, compatible with any client that supports OpenAI's chat completions endpoint.
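A minimal request looks like this (a sketch; it assumes the server from the previous section is already running on port 8080):

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```

The response follows the standard chat completions schema, with the generated text under `choices[0].message.content`.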
## How this quantization was generated
The quantization was produced on a MacBook Pro M4 Pro (48GB) using ik_llama.cpp compiled CPU-only (Metal is not supported by ik_llama.cpp).
```bash
llama-quantize \
  --imatrix imatrix-Qwen3.5-35B-A3B-BF16.dat \
  --allow-requantize \
  --attn-q-type q8_0 \
  --attn-k-type q8_0 \
  --attn-v-type q8_0 \
  --attn-qkv-type q8_0 \
  --attn-output-type q8_0 \
  --token-embedding-type q6_K \
  --output-tensor-type q6_K \
  --custom-q "ffn_down_shexp=q8_0,ffn_gate_shexp=q8_0,ffn_up_shexp=q8_0" \
  --custom-q "blk\.[0-7]\.ffn_down_exps=iq4_ks,blk\.[0-7]\.ffn_gate_exps=iq4_ks,blk\.[0-7]\.ffn_up_exps=iq4_ks" \
  --custom-q "ffn_down_exps=iq2_ks,ffn_gate_exps=iq2_ks,ffn_up_exps=iq2_ks" \
  source-Q8_0.gguf \
  output.gguf \
  iq2_ks \
  12
```
Source GGUF: bartowski's Q8_0 (~48GB), chosen because the original BF16 weights (~72GB) exceed the MacBook's 48GB of RAM, and because re-quantizing from Q8_0 introduces less than 0.05 PPL degradation vs. starting from BF16.
## Known limitations

- **Context window:** 16K is the practical limit on a 16GB RPi5 with this model. 32K is possible but leaves very little system RAM headroom.
- **Speed:** 4.5 t/s generation is usable for a personal assistant but not for interactive applications requiring fast responses.
- **IQK quants:** `IQ2_KS` and `IQ4_KS` are ik_llama.cpp exclusives; this GGUF will not load correctly in standard llama.cpp or Ollama.
- **8GB RPi5:** will not work. The model requires ~12.5GB of RAM at 16K context.
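The context-window limit follows from simple arithmetic on the figures above, assuming KV-cache memory grows roughly linearly with context length (an estimate, not a measurement):

```bash
# Model ~11.38 GiB resident, ~12.5 GiB total at 16K context.
awk 'BEGIN {
  model = 11.38; kv16 = 12.5 - model
  for (ctx = 16; ctx <= 32; ctx += 16) {
    used = model + kv16 * ctx / 16
    printf "%dK context: ~%.1f GiB used, ~%.1f GiB free\n", ctx, used, 16 - used
  }
}'
# → 16K context: ~12.5 GiB used, ~3.5 GiB free
# → 32K context: ~13.6 GiB used, ~2.4 GiB free
```

Roughly 2.4 GiB left at 32K is workable only with a minimal OS footprint, hence the "very little headroom" caveat.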
## Compared to standard quantizations
| Quantization | File size | Est. RAM | Fits 16GB RPi5? | Source |
|---|---|---|---|---|
| Q3_K_S (bartowski) | 15.3 GB | ~17.5 GB | ❌ | Standard |
| Q3_K_M (bartowski) | 16.4 GB | ~18.9 GB | ❌ | Standard |
| UD-IQ4_XS (Unsloth) | 17.5 GB | ~20 GB | ❌ | Standard |
| Q4_K_M (bartowski) | 22.0 GB | ~25 GB | ❌ | Standard |
| IQK-RPi5 (this) | 11.38 GB | ~12.5 GB | ✅ | Custom |
## Use case context
This quantization was developed as part of a hybrid AI personal assistant running on a Raspberry Pi 5, where the local model acts as a privacy-preserving executor for tool use, code generation, and document processing — with cloud models (Gemini Flash/Pro) handling planning and review. The 35B-A3B was chosen over the 9B dense model specifically for its superior reasoning and coding capabilities despite the memory constraints.
## Credits
- Base model: Qwen Team (Alibaba Cloud)
- Runtime: ikawrakow (ik_llama.cpp)
- Imatrix: ubergarm
- Source GGUF: bartowski
## License
This quantization inherits the base model license: Apache 2.0. See Qwen3.5-35B-A3B for details.