# Qwen3.5-35B-A3B-IQK-RPi5-16GB — Custom IQK Quantization for Raspberry Pi 5 (16GB)

- **Quantized by:** mtrpires
- **Base model:** Qwen/Qwen3.5-35B-A3B
- **Runtime:** ik_llama.cpp
- **Target hardware:** Raspberry Pi 5 16GB (Cortex-A76)
## Why this quantization exists

Qwen3.5-35B-A3B is a Mixture-of-Experts model with 35B total parameters but only 3.3B active per token (256 experts; 8 routed plus 1 shared active per token). This architecture makes it remarkably efficient at inference time: in benchmarks it surpasses Qwen3-235B-A22B despite activating far fewer parameters, scoring 69.2% on SWE-bench Verified.

The problem: every standard quantization of this model exceeds 15GB on disk and requires roughly 17GB of RAM, putting it out of reach of a Raspberry Pi 5 16GB with conventional tools.
This quantization uses a mixed IQK recipe — asymmetric quantization that exploits the MoE architecture's properties: attention and shared expert layers (always active, quality-critical) are kept at high precision, while routed expert layers (rarely co-activated, quality-resilient) are compressed aggressively using ARM NEON-optimized IQK kernels from ik_llama.cpp.
The result: 11.38GB on disk, ~12.5GB in RAM — fits comfortably in 16GB.
## Quantization recipe

| Tensor group | Quantization | Rationale |
|---|---|---|
| Attention layers (all) | q8_0 | Always active per token; critical for reasoning |
| Shared experts (always active) | q8_0 | Backbone of the MoE; activated every token |
| Routed experts, layers 0–7 | iq4_ks | Early layers are more sensitive; IQK NEON optimized |
| Routed experts, layers 8+ | iq2_ks | Bulk of the model; rarely co-activated; IQK NEON optimized |
| Token embeddings / output | q6_K | Used every token; good quality/size balance |
The `IQ4_KS` and `IQ2_KS` quantization types are exclusive to ik_llama.cpp and have hand-written ARM NEON assembly kernels that make them faster than `Q4_K_M` on Cortex-A76 despite the lower bit depth.
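As a back-of-the-envelope sanity check, the average bit-width implied by the 11.38 GiB file size and the nominal 35B parameter count can be computed with a one-liner (a rough figure that lumps all tensors together):

```bash
# Effective bits per weight implied by an 11.38 GiB file and 35B parameters.
awk 'BEGIN { printf "%.2f bits/weight\n", 11.38 * 2^30 * 8 / 35e9 }'
# → 2.79 bits/weight
```

The sub-3-bit average is expected: most of the weight mass sits in routed experts near the 2-bit level, while the always-active paths stay at 8 bits.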
- **Source:** Q8_0 GGUF from bartowski
- **Imatrix:** ubergarm (calibrated on chat, coding, and long-context tool-calling data)
## Benchmark results

Tested on a Raspberry Pi 5 16GB with ik_llama.cpp (HEAD, March 2026), using `-t 3 --threads-batch 4 -c 16384 --mlock`:
| Test | Result |
|---|---|
| Prompt processing (pp512) | 30.85 ± 0.03 t/s |
| Token generation (tg50) | 4.53 ± 0.00 t/s |
| Model size | 11.38 GiB |
| RAM usage (with 16K context) | ~13GB |
| Context window used | 16,384 tokens |
4.5 t/s generation is a comfortable reading speed for a personal-assistant workload. Prompt processing benefits from `--threads-batch 4` (+26% vs. 3 threads).
**ARM NEON status:** `__ARM_FEATURE_DOTPROD` confirmed active. The `HAVE_FANCY_SIMD is NOT defined` warning in ik_llama.cpp output refers exclusively to AVX-512 (x86) and can be safely ignored on ARM.
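To double-check this on your own board, the dot-product extension shows up as the `asimddp` flag in `/proc/cpuinfo` on Linux/ARM (a quick read-only check; on non-ARM machines it simply prints "no"):

```bash
# Report whether the ARM dot-product extension (asimddp) is advertised.
if grep -q asimddp /proc/cpuinfo 2>/dev/null; then
  echo "dotprod: yes"
else
  echo "dotprod: no"
fi
```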
## How to run

### Requirements
- Raspberry Pi 5 16GB (8GB will not work — insufficient RAM)
- ik_llama.cpp compiled from source
- Active cooling recommended for sustained inference
### Compile ik_llama.cpp on the Pi

```bash
git clone https://github.com/ikawrakow/ik_llama.cpp --depth=1
cd ik_llama.cpp
cmake -B build \
  -DGGML_NATIVE=ON \
  -DGGML_BLAS=ON \
  -DGGML_BLAS_VENDOR=OpenBLAS \
  -DGGML_LTO=ON \
  -DBUILD_SHARED_LIBS=OFF \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j4
```
Note: `-DGGML_NATIVE=ON` automatically detects Cortex-A76 capabilities (ARM NEON, dot product, fp16), so explicit `-march=` flags are not needed.
### Run as server

```bash
./build/bin/llama-server \
  -m Qwen3.5-35B-A3B-IQK-RPi5-16GB.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  -t 3 \
  --threads-batch 4 \
  -c 16384 \
  --mlock \
  --log-disable
```
### Run benchmark

```bash
./build/bin/llama-bench \
  -m Qwen3.5-35B-A3B-IQK-RPi5-16GB.gguf \
  -t 3 -n 50
```
### OpenAI-compatible API

The server exposes an OpenAI-compatible API at `http://localhost:8080/v1`, compatible with any client that supports OpenAI's chat completions endpoint.
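A minimal request looks like this (a sketch; it assumes the server from the previous section is already running on port 8080):

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```

The response follows the standard chat completions schema, with the generated text under `choices[0].message.content`.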
## How this quantization was generated
The quantization was produced on a MacBook Pro M4 Pro (48GB) using ik_llama.cpp compiled CPU-only (Metal is not supported by ik_llama.cpp).
```bash
llama-quantize \
  --imatrix imatrix-Qwen3.5-35B-A3B-BF16.dat \
  --allow-requantize \
  --attn-q-type q8_0 \
  --attn-k-type q8_0 \
  --attn-v-type q8_0 \
  --attn-qkv-type q8_0 \
  --attn-output-type q8_0 \
  --token-embedding-type q6_K \
  --output-tensor-type q6_K \
  --custom-q "ffn_down_shexp=q8_0,ffn_gate_shexp=q8_0,ffn_up_shexp=q8_0" \
  --custom-q "blk\.[0-7]\.ffn_down_exps=iq4_ks,blk\.[0-7]\.ffn_gate_exps=iq4_ks,blk\.[0-7]\.ffn_up_exps=iq4_ks" \
  --custom-q "ffn_down_exps=iq2_ks,ffn_gate_exps=iq2_ks,ffn_up_exps=iq2_ks" \
  source-Q8_0.gguf \
  output.gguf \
  iq2_ks \
  12
```
Source GGUF: bartowski's Q8_0 (~48GB), chosen because the original BF16 weights (~72GB) exceed the MacBook's 48GB of RAM, and because re-quantizing from Q8_0 introduces less than 0.05 PPL degradation vs. starting from BF16.
## Known limitations

- **Context window:** 16K is the practical limit on a 16GB RPi5 with this model. 32K is possible but leaves very little system RAM headroom.
- **Speed:** 4.5 t/s generation is usable for a personal assistant but not for interactive applications requiring fast responses.
- **IQK quants:** `IQ2_KS` and `IQ4_KS` are ik_llama.cpp exclusives; this GGUF will not load correctly in standard llama.cpp or Ollama.
- **8GB RPi5:** will not work. The model requires ~12.5GB of RAM at 16K context.
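The context-window limit follows from simple arithmetic on the figures above, assuming KV-cache memory grows roughly linearly with context length (an estimate, not a measurement):

```bash
# Model ~11.38 GiB resident, ~12.5 GiB total at 16K context.
awk 'BEGIN {
  model = 11.38; kv16 = 12.5 - model
  for (ctx = 16; ctx <= 32; ctx += 16) {
    used = model + kv16 * ctx / 16
    printf "%dK context: ~%.1f GiB used, ~%.1f GiB free\n", ctx, used, 16 - used
  }
}'
# → 16K context: ~12.5 GiB used, ~3.5 GiB free
# → 32K context: ~13.6 GiB used, ~2.4 GiB free
```

Roughly 2.4 GiB left at 32K is workable only with a minimal OS footprint, hence the "very little headroom" caveat.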
## Compared to standard quantizations
| Quantization | File size | Est. RAM | Fits 16GB RPi5? | Source |
|---|---|---|---|---|
| Q3_K_S (bartowski) | 15.3 GB | ~17.5 GB | ❌ | Standard |
| Q3_K_M (bartowski) | 16.4 GB | ~18.9 GB | ❌ | Standard |
| UD-IQ4_XS (Unsloth) | 17.5 GB | ~20 GB | ❌ | Standard |
| Q4_K_M (bartowski) | 22.0 GB | ~25 GB | ❌ | Standard |
| IQK-RPi5 (this) | 11.38 GB | ~12.5 GB | ✅ | Custom |
## Use case context
This quantization was developed as part of a hybrid AI personal assistant running on a Raspberry Pi 5, where the local model acts as a privacy-preserving executor for tool use, code generation, and document processing — with cloud models (Gemini Flash/Pro) handling planning and review. The 35B-A3B was chosen over the 9B dense model specifically for its superior reasoning and coding capabilities despite the memory constraints.
## Credits
- Base model: Qwen Team (Alibaba Cloud)
- Runtime: ikawrakow (ik_llama.cpp)
- Imatrix: ubergarm
- Source GGUF: bartowski
## License
This quantization inherits the base model license: Apache 2.0. See Qwen3.5-35B-A3B for details.