qwen3.5-35b-a3b-compacted-GGUF

GGUF quantizations of continuum-ai/qwen3.5-35b-a3b-compacted, a compacted MoE model pruned from Qwen3.5-35B-A3B (89 experts removed, 30% smaller) while preserving reasoning quality.

All low-bit quants (Q3, Q2, IQ*) are calibrated with an importance matrix for best quality at each size.

Available Quantizations

| Filename | Quant Type | Size | Notes |
|---|---|---|---|
| qwen3.5-35b-a3b-compacted-Q8_0.gguf | Q8_0 | 24G | Best quality, near-lossless |
| qwen3.5-35b-a3b-compacted-Q6_K.gguf | Q6_K | 18G | Excellent quality |
| qwen3.5-35b-a3b-compacted-Q5_K_M.gguf | Q5_K_M | 16G | Great quality |
| qwen3.5-35b-a3b-compacted-Q5_K_S.gguf | Q5_K_S | 16G | Great quality, slightly smaller |
| qwen3.5-35b-a3b-compacted-Q4_K_M.gguf | Q4_K_M | 14G | Recommended - best balance |
| qwen3.5-35b-a3b-compacted-Q4_K_S.gguf | Q4_K_S | 13G | Good balance |
| qwen3.5-35b-a3b-compacted-IQ4_XS.gguf | IQ4_XS | 12G | imatrix, compact 4-bit |
| qwen3.5-35b-a3b-compacted-Q3_K_L.gguf | Q3_K_L | 12G | imatrix |
| qwen3.5-35b-a3b-compacted-Q3_K_M.gguf | Q3_K_M | 11G | imatrix |
| qwen3.5-35b-a3b-compacted-IQ3_M.gguf | IQ3_M | 9.9G | imatrix, good low-bit |
| qwen3.5-35b-a3b-compacted-IQ3_S.gguf | IQ3_S | 9.7G | imatrix |
| qwen3.5-35b-a3b-compacted-Q3_K_S.gguf | Q3_K_S | 9.7G | imatrix |
| qwen3.5-35b-a3b-compacted-IQ3_XXS.gguf | IQ3_XXS | 8.7G | imatrix |
| qwen3.5-35b-a3b-compacted-Q2_K.gguf | Q2_K | 8.3G | imatrix, low quality |
| qwen3.5-35b-a3b-compacted-IQ2_M.gguf | IQ2_M | 7.5G | imatrix, aggressive |
| qwen3.5-35b-a3b-compacted-IQ2_S.gguf | IQ2_S | 6.9G | imatrix, very aggressive |
| qwen3.5-35b-a3b-compacted-IQ2_XXS.gguf | IQ2_XXS | 6.2G | imatrix, extreme |
| qwen3.5-35b-a3b-compacted-IQ1_M.gguf | IQ1_M | 5.4G | imatrix, maximum compression |

How to Use

With llama.cpp

llama-cli -m qwen3.5-35b-a3b-compacted-Q4_K_M.gguf -p "Hello" -ngl 999
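The command above expects the GGUF file to be present locally. One way to fetch a single file, assuming the huggingface_hub CLI is installed:

huggingface-cli download cahlen/qwen3.5-35b-a3b-compacted-GGUF qwen3.5-35b-a3b-compacted-Q4_K_M.gguf --local-dir .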

With llama.cpp server

llama-server -m qwen3.5-35b-a3b-compacted-Q4_K_M.gguf -c 4096 -ngl 999
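Once the server is up (it listens on port 8080 by default), you can query its OpenAI-compatible chat endpoint; a minimal sketch:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 128}'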

With Ollama

ollama run hf.co/cahlen/qwen3.5-35b-a3b-compacted-GGUF:Q4_K_M
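If you would rather serve a locally downloaded GGUF than pull from the Hub, Ollama can also import one via a Modelfile; a minimal sketch (the model name qwen3.5-compacted is just an example):

echo "FROM ./qwen3.5-35b-a3b-compacted-Q4_K_M.gguf" > Modelfile
ollama create qwen3.5-compacted -f Modelfile
ollama run qwen3.5-compacted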

With LM Studio

Download any GGUF file above and load it in LM Studio.

Choosing a Quant

| Your VRAM | Recommended quant | Size |
|---|---|---|
| 24GB+ (RTX 4090/5090) | Q8_0 or Q6_K | 24G / 18G |
| 16GB (RTX 4080/5080) | Q4_K_M or Q5_K_S | 14G / 16G |
| 12GB (RTX 4070/3060 12GB) | IQ4_XS or Q3_K_L | 12G |
| 8GB (RTX 4060/3060 8GB) | IQ3_M or Q2_K | 9.9G / 8.3G |
| 6GB (RTX 4050/3050) | IQ2_M or IQ2_S | 7.5G / 6.9G |
| CPU only (16GB+ RAM) | IQ2_XXS or IQ1_M | 6.2G / 5.4G |
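These pairings leave headroom beyond the file itself: as a rough rule of thumb (not measured here), budget the GGUF size plus roughly 1-2 GB for the KV cache and compute buffers at a 4K context, so the 14G Q4_K_M lands around 15-16 GB in practice and is therefore matched with 16GB cards rather than 12GB ones.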

About the Source Model

This is a compacted version of Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled, created by continuum-ai using Plasticity Compaction, a technique that prunes underutilized MoE experts based on runtime activation profiling:

  • 256 experts reduced to 167 (-35%)
  • 67GB reduced to 47GB BF16 (-30%)
  • Chain-of-thought reasoning and code generation quality preserved

Perplexity Evaluation (WikiText-2)

Lower is better. BF16 is the unquantized baseline.

| Quant | Size | Perplexity | vs BF16 |
|---|---|---|---|
| BF16 (baseline) | 47G | 9.7245 | -- |
| Q8_0 | 24G | 9.7568 | +0.33% |
| Q5_K_M | 16G | 9.7974 | +0.75% |
| Q4_K_M | 14G | 9.9398 | +2.21% |
| Q3_K_M | 11G | 10.2903 | +5.82% |
| IQ3_M | 9.9G | 10.3416 | +6.34% |
| Q2_K | 8.3G | 11.5866 | +19.1% |
| IQ2_M | 7.5G | 11.7276 | +20.6% |
| IQ1_M | 5.4G | 18.3670 | +88.9% |
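These numbers should be reproducible with llama.cpp's perplexity tool against the WikiText-2 test split; a sketch, assuming a local copy of the raw test file:

llama-perplexity -m qwen3.5-35b-a3b-compacted-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw -ngl 999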

Key takeaways:

  • Q8_0 through Q4_K_M: Negligible quality loss (≤2.21%), safe for all use cases
  • Q3_K_M / IQ3_M: Moderate degradation (~6%), good for constrained hardware
  • Q2_K / IQ2_M: Noticeable degradation (~20%), acceptable for casual use
  • IQ1_M: Significant quality loss, only for extreme VRAM constraints

Quantization Details

  • Quantized by: cahlen
  • Importance matrix: Generated from WikiText-2 (200 chunks) on NVIDIA RTX 5090 (workflow sketched below)
  • Tool: llama.cpp
  • Hardware: NVIDIA RTX 5090 32GB / Intel Core Ultra 9 285K / 188GB RAM
  • GGUF metadata: 23B params, qwen35moe architecture
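For reference, the imatrix-calibrated quants here follow the standard llama.cpp workflow: compute an importance matrix from calibration text, then pass it to the quantizer. A minimal sketch, assuming a local BF16 GGUF of the source model and a WikiText-2 calibration file (both file names are assumptions):

llama-imatrix -m qwen3.5-35b-a3b-compacted-BF16.gguf -f wikitext-2-raw/wiki.train.raw -o imatrix.dat -ngl 999
llama-quantize --imatrix imatrix.dat qwen3.5-35b-a3b-compacted-BF16.gguf qwen3.5-35b-a3b-compacted-IQ3_M.gguf IQ3_M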