Qwen3.5-35B-A3B-IQK-RPi5-16GB — Custom IQK Quantization for Raspberry Pi 5 (16GB)

Quantized by: mtrpires
Base model: Qwen/Qwen3.5-35B-A3B
Runtime: ik_llama.cpp
Target hardware: Raspberry Pi 5 16GB (Cortex-A76)


Why this quantization exists

The Qwen3.5-35B-A3B is a Mixture-of-Experts model with 35B total parameters but only 3.3B active per token (256 experts, 8 routed + 1 shared active). This architecture makes it remarkably efficient at inference time — in benchmarks it surpasses Qwen3-235B-A22B despite activating far fewer parameters, scoring 69.2% on SWE-bench Verified.

The problem: every standard quantization of this model exceeds 15GB on disk, requiring ~17GB+ of RAM. This puts it out of reach for the Raspberry Pi 5 16GB using conventional tools.

This quantization uses a mixed IQK recipe — asymmetric quantization that exploits the MoE architecture's properties: attention and shared expert layers (always active, quality-critical) are kept at high precision, while routed expert layers (rarely co-activated, quality-resilient) are compressed aggressively using ARM NEON-optimized IQK kernels from ik_llama.cpp.

The result: 11.38GB on disk, ~12.5GB in RAM — fits comfortably in 16GB.
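A back-of-envelope size check illustrates why the asymmetric recipe fits. Note that the parameter split below and the approximate bits-per-weight (bpw) figures are assumptions for illustration, not published numbers for this model:

```python
# Rough on-disk size estimate for the mixed IQK recipe.
# ASSUMPTIONS (not from this card): the parameter split between tensor
# groups and the approximate effective bpw of each quant type.
GIB = 1024**3

def gguf_gib(params: float, bpw: float) -> float:
    """Approximate size in GiB for `params` weights stored at `bpw` bits each."""
    return params * bpw / 8 / GIB

attn_and_shared = 2.5e9   # hypothetical: kept at q8_0 (~8.5 bpw incl. scales)
embeddings      = 0.8e9   # hypothetical: q6_K (~6.6 bpw)
routed_experts  = 31.7e9  # hypothetical: iq4_ks/iq2_ks mix averaging ~2.3 bpw

total = (gguf_gib(attn_and_shared, 8.5)
         + gguf_gib(embeddings, 6.6)
         + gguf_gib(routed_experts, 2.3))
print(f"estimated size: {total:.1f} GiB")  # lands near the actual 11.38 GiB
```

The point of the exercise: because ~90% of the weights sit in rarely co-activated routed experts, pushing only those below 3 bpw cuts the file nearly in half versus a uniform 4-bit quant.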


Quantization recipe

| Tensor group | Quantization | Rationale |
|---|---|---|
| Attention layers (all) | q8_0 | Always active per token; critical for reasoning |
| Shared experts (always active) | q8_0 | Backbone of the MoE; activated every token |
| Routed experts, layers 0–7 | iq4_ks | Early layers more sensitive; IQK NEON-optimized |
| Routed experts, layers 8+ | iq2_ks | Bulk of the model; rarely co-activated; IQK NEON-optimized |
| Token embeddings / output | q6_K | Used every token; good quality/size balance |

The IQ4_KS and IQ2_KS quantization types are exclusive to ik_llama.cpp and have hand-written ARM NEON assembly kernels that make them faster than Q4_K_M on Cortex-A76 despite the lower bit depth.

Source: Q8_0 GGUF from bartowski
Imatrix: ubergarm (calibrated on chat, coding and long-context tool-calling data)


Benchmark results

Tested on Raspberry Pi 5 16GB with ik_llama.cpp (HEAD, March 2026), -t 3 --threads-batch 4 -c 16384 --mlock:

| Test | Result |
|---|---|
| Prompt processing (pp512) | 30.85 ± 0.03 t/s |
| Token generation (tg50) | 4.53 ± 0.00 t/s |
| Model size | 11.38 GiB |
| RAM usage (with 16K context) | ~13 GB |
| Context window used | 16,384 tokens |

4.5 t/s generation is a comfortable reading speed for a personal-assistant workload. Prompt processing benefits from --threads-batch 4 (+26% vs. 3 threads).
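To put the generation speed in reading terms, a rough conversion (the ~0.75 words-per-token ratio is a common approximation for English text, not a measured value for this model):

```python
# Convert token throughput to approximate words per minute.
tokens_per_sec = 4.53    # tg50 result from the benchmark above
words_per_token = 0.75   # rough English average; an assumption
wpm = tokens_per_sec * words_per_token * 60
print(f"~{wpm:.0f} words/minute")  # ~204 wpm, within typical silent-reading pace
```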

ARM NEON status: __ARM_FEATURE_DOTPROD confirmed active. The "HAVE_FANCY_SIMD is NOT defined" warning in ik_llama.cpp output refers exclusively to AVX-512 (x86) and can be safely ignored on ARM.


How to run

Requirements

  • Raspberry Pi 5 16GB (8GB will not work — insufficient RAM)
  • ik_llama.cpp compiled from source
  • Active cooling recommended for sustained inference

Compile ik_llama.cpp on the Pi

```shell
git clone https://github.com/ikawrakow/ik_llama.cpp --depth=1
cd ik_llama.cpp
cmake -B build \
  -DGGML_NATIVE=ON \
  -DGGML_BLAS=ON \
  -DGGML_BLAS_VENDOR=OpenBLAS \
  -DGGML_LTO=ON \
  -DBUILD_SHARED_LIBS=OFF \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j4
```

Note: -DGGML_NATIVE=ON automatically detects Cortex-A76 capabilities (ARM NEON, dot product, fp16). The explicit -march= flags are not needed.

Run as server

```shell
./build/bin/llama-server \
  -m Qwen3.5-35B-A3B-IQK-RPi5-16GB.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  -t 3 \
  --threads-batch 4 \
  -c 16384 \
  --mlock \
  --log-disable
```

Run benchmark

```shell
./build/bin/llama-bench \
  -m Qwen3.5-35B-A3B-IQK-RPi5-16GB.gguf \
  -t 3 -n 50
```

OpenAI-compatible API

The server exposes an OpenAI-compatible API at http://localhost:8080/v1. Compatible with any client that supports OpenAI's chat completions endpoint.
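A minimal sketch of a chat-completions call against that endpoint using only the Python standard library. The URL matches the --host/--port values above; the model field is typically ignored by a single-model llama-server instance:

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       url: str = "http://127.0.0.1:8080/v1/chat/completions"):
    """Build an OpenAI-style chat-completions request (constructed, not sent)."""
    payload = {
        "model": "Qwen3.5-35B-A3B-IQK-RPi5-16GB",  # ignored by single-model servers
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Write a haiku about a Raspberry Pi.")
# With the server running, send it and read the reply:
# resp = urllib.request.urlopen(req)
# print(json.load(resp)["choices"][0]["message"]["content"])
```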


How this quantization was generated

The quantization was produced on a MacBook Pro M4 Pro (48GB) using ik_llama.cpp compiled CPU-only (Metal is not supported by ik_llama.cpp).

```shell
llama-quantize \
  --imatrix imatrix-Qwen3.5-35B-A3B-BF16.dat \
  --allow-requantize \
  --attn-q-type q8_0 \
  --attn-k-type q8_0 \
  --attn-v-type q8_0 \
  --attn-qkv-type q8_0 \
  --attn-output-type q8_0 \
  --token-embedding-type q6_K \
  --output-tensor-type q6_K \
  --custom-q "ffn_down_shexp=q8_0,ffn_gate_shexp=q8_0,ffn_up_shexp=q8_0" \
  --custom-q "blk\.[0-7]\.ffn_down_exps=iq4_ks,blk\.[0-7]\.ffn_gate_exps=iq4_ks,blk\.[0-7]\.ffn_up_exps=iq4_ks" \
  --custom-q "ffn_down_exps=iq2_ks,ffn_gate_exps=iq2_ks,ffn_up_exps=iq2_ks" \
  source-Q8_0.gguf \
  output.gguf \
  iq2_ks \
  12
```

Source GGUF: bartowski Q8_0 (~48GB) chosen because the original BF16 weights (~72GB) exceed the MacBook's 48GB RAM, and re-quantizing from Q8_0 introduces less than 0.05 PPL degradation vs. starting from BF16.


Known limitations

  • Context window: 16K is the practical limit on 16GB RPi5 with this model. 32K is possible but leaves very little system RAM headroom.
  • Speed: 4.5 t/s generation is usable for a personal assistant but not for interactive applications requiring fast responses.
  • IQK quants: IQ2_KS and IQ4_KS are ik_llama.cpp exclusive — this GGUF will not load correctly in standard llama.cpp or Ollama.
  • 8GB RPi5: Will not work. The model needs ~13GB of RAM at 16K context.
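A rough RAM budget makes the 16K practical limit concrete. The architecture numbers below (layer count, GQA KV-head count, head dimension) are assumptions for illustration; they are not published in this card:

```python
# Rough RAM budget at 16K context: weights + FP16 KV cache.
# ASSUMPTIONS (not from this card): 48 layers, 4 KV heads (GQA),
# head_dim 128, 2-byte (FP16) cache entries.
GIB = 1024**3
layers, kv_heads, head_dim, ctx, bytes_per = 48, 4, 128, 16384, 2

kv_cache = layers * 2 * kv_heads * head_dim * ctx * bytes_per  # K and V
weights_gib = 11.38  # on-disk size, resident via --mlock

total = weights_gib + kv_cache / GIB
print(f"KV cache: {kv_cache / GIB:.2f} GiB, total: ~{total:.1f} GiB")
```

Under these assumptions the KV cache costs about 1.5 GiB at 16K, putting the total near the ~13GB observed in the benchmark; doubling to 32K doubles the cache and leaves almost nothing for the OS.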

Compared to standard quantizations

| Quantization | File size | Est. RAM | Fits 16GB RPi5? | Source |
|---|---|---|---|---|
| Q3_K_S (bartowski) | 15.3 GB | ~17.5 GB | No | Standard |
| Q3_K_M (bartowski) | 16.4 GB | ~18.9 GB | No | Standard |
| UD-IQ4_XS (Unsloth) | 17.5 GB | ~20 GB | No | Standard |
| Q4_K_M (bartowski) | 22.0 GB | ~25 GB | No | Standard |
| IQK-RPi5 (this) | 11.38 GB | ~12.5 GB | Yes | Custom |

Use case context

This quantization was developed as part of a hybrid AI personal assistant running on a Raspberry Pi 5, where the local model acts as a privacy-preserving executor for tool use, code generation, and document processing — with cloud models (Gemini Flash/Pro) handling planning and review. The 35B-A3B was chosen over the 9B dense model specifically for its superior reasoning and coding capabilities despite the memory constraints.


Credits

  • bartowski: source Q8_0 GGUF
  • ubergarm: imatrix calibration data
  • ikawrakow: ik_llama.cpp and the IQK quantization types
  • Qwen team: base model (Qwen3.5-35B-A3B)

License

This quantization inherits the base model license: Apache 2.0. See Qwen3.5-35B-A3B for details.
