qwen3.5-35b-a3b-compacted-GGUF

GGUF quantizations of continuum-ai/qwen3.5-35b-a3b-compacted, a compacted MoE model pruned from Qwen3.5-35B-A3B (89 experts removed, 30% smaller) while preserving reasoning quality.

All low-bit quants (Q3, Q2, IQ*) are calibrated with an importance matrix for best quality at each size.

Available Quantizations

| Filename | Quant Type | Size | Notes |
|---|---|---|---|
| qwen3.5-35b-a3b-compacted-Q8_0.gguf | Q8_0 | 24G | Best quality, near-lossless |
| qwen3.5-35b-a3b-compacted-Q6_K.gguf | Q6_K | 18G | Excellent quality |
| qwen3.5-35b-a3b-compacted-Q5_K_M.gguf | Q5_K_M | 16G | Great quality |
| qwen3.5-35b-a3b-compacted-Q5_K_S.gguf | Q5_K_S | 16G | Great quality, slightly smaller |
| qwen3.5-35b-a3b-compacted-Q4_K_M.gguf | Q4_K_M | 14G | Recommended - best balance |
| qwen3.5-35b-a3b-compacted-Q4_K_S.gguf | Q4_K_S | 13G | Good balance |
| qwen3.5-35b-a3b-compacted-IQ4_XS.gguf | IQ4_XS | 12G | imatrix, compact 4-bit |
| qwen3.5-35b-a3b-compacted-Q3_K_L.gguf | Q3_K_L | 12G | imatrix |
| qwen3.5-35b-a3b-compacted-Q3_K_M.gguf | Q3_K_M | 11G | imatrix |
| qwen3.5-35b-a3b-compacted-IQ3_M.gguf | IQ3_M | 9.9G | imatrix, good low-bit |
| qwen3.5-35b-a3b-compacted-IQ3_S.gguf | IQ3_S | 9.7G | imatrix |
| qwen3.5-35b-a3b-compacted-Q3_K_S.gguf | Q3_K_S | 9.7G | imatrix |
| qwen3.5-35b-a3b-compacted-IQ3_XXS.gguf | IQ3_XXS | 8.7G | imatrix |
| qwen3.5-35b-a3b-compacted-Q2_K.gguf | Q2_K | 8.3G | imatrix, low quality |
| qwen3.5-35b-a3b-compacted-IQ2_M.gguf | IQ2_M | 7.5G | imatrix, aggressive |
| qwen3.5-35b-a3b-compacted-IQ2_S.gguf | IQ2_S | 6.9G | imatrix, very aggressive |
| qwen3.5-35b-a3b-compacted-IQ2_XXS.gguf | IQ2_XXS | 6.2G | imatrix, extreme |
| qwen3.5-35b-a3b-compacted-IQ1_M.gguf | IQ1_M | 5.4G | imatrix, maximum compression |

How to Use

With llama.cpp

llama-cli -m qwen3.5-35b-a3b-compacted-Q4_K_M.gguf -p "Hello" -ngl 999
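The command above expects the GGUF file to be present locally. One way to fetch a single file, assuming the huggingface_hub CLI is installed:

huggingface-cli download cahlen/qwen3.5-35b-a3b-compacted-GGUF qwen3.5-35b-a3b-compacted-Q4_K_M.gguf --local-dir .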

With llama.cpp server

llama-server -m qwen3.5-35b-a3b-compacted-Q4_K_M.gguf -c 4096 -ngl 999
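Once the server is up (it listens on port 8080 by default), you can query its OpenAI-compatible chat endpoint; a minimal sketch:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 128}'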

With Ollama

ollama run hf.co/cahlen/qwen3.5-35b-a3b-compacted-GGUF:Q4_K_M
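If you would rather serve a locally downloaded GGUF than pull from the Hub, Ollama can also import one via a Modelfile; a minimal sketch (the model name qwen3.5-compacted is just an example):

echo "FROM ./qwen3.5-35b-a3b-compacted-Q4_K_M.gguf" > Modelfile
ollama create qwen3.5-compacted -f Modelfile
ollama run qwen3.5-compacted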

With LM Studio

Download any GGUF file above and load it in LM Studio.

Choosing a Quant

| Your VRAM | Recommended quant | Size |
|---|---|---|
| 24GB+ (RTX 4090/5090) | Q8_0 or Q6_K | 24G / 18G |
| 16GB (RTX 4080/5080) | Q4_K_M or Q5_K_S | 14G / 16G |
| 12GB (RTX 4070/3060 12GB) | IQ4_XS or Q3_K_L | 12G |
| 8GB (RTX 4060/3060 8GB) | IQ3_M or Q2_K | 9.9G / 8.3G |
| 6GB (RTX 4050/3050) | IQ2_M or IQ2_S | 7.5G / 6.9G |
| CPU only (16GB+ RAM) | IQ2_XXS or IQ1_M | 6.2G / 5.4G |
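These pairings leave headroom beyond the file itself: as a rough rule of thumb (not measured here), budget the GGUF size plus roughly 1-2 GB for the KV cache and compute buffers at a 4K context, so the 14G Q4_K_M lands around 15-16 GB in practice and is therefore matched with 16GB cards rather than 12GB ones.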

About the Source Model

This is a compacted version of Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled, created by continuum-ai using Plasticity Compaction, a technique that prunes underutilized MoE experts based on runtime activation profiling:

  • 256 experts reduced to 167 (-35%)
  • 67GB reduced to 47GB BF16 (-30%)
  • Chain-of-thought reasoning and code generation quality preserved

Perplexity Evaluation (WikiText-2)

Lower is better. BF16 is the unquantized baseline.

| Quant | Size | Perplexity | vs BF16 |
|---|---|---|---|
| BF16 (baseline) | 47G | 9.7245 | -- |
| Q8_0 | 24G | 9.7568 | +0.33% |
| Q5_K_M | 16G | 9.7974 | +0.75% |
| Q4_K_M | 14G | 9.9398 | +2.21% |
| Q3_K_M | 11G | 10.2903 | +5.82% |
| IQ3_M | 9.9G | 10.3416 | +6.34% |
| Q2_K | 8.3G | 11.5866 | +19.1% |
| IQ2_M | 7.5G | 11.7276 | +20.6% |
| IQ1_M | 5.4G | 18.3670 | +88.9% |
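These numbers should be reproducible with llama.cpp's perplexity tool against the WikiText-2 test split; a sketch, assuming a local copy of the raw test file:

llama-perplexity -m qwen3.5-35b-a3b-compacted-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw -ngl 999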

Key takeaways:

  • Q8_0 through Q4_K_M: Negligible quality loss (≤2.21%), safe for all use cases
  • Q3_K_M / IQ3_M: Moderate degradation (~6%), good for constrained hardware
  • Q2_K / IQ2_M: Noticeable degradation (~20%), acceptable for casual use
  • IQ1_M: Significant quality loss, only for extreme VRAM constraints

Quantization Details

  • Quantized by: cahlen
  • Importance matrix: Generated from WikiText-2 (200 chunks) on NVIDIA RTX 5090 (workflow sketched below)
  • Tool: llama.cpp
  • Hardware: NVIDIA RTX 5090 32GB / Intel Core Ultra 9 285K / 188GB RAM
  • GGUF metadata: 23B params, qwen35moe architecture
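For reference, the imatrix-calibrated quants here follow the standard llama.cpp workflow: compute an importance matrix from calibration text, then pass it to the quantizer. A minimal sketch, assuming a local BF16 GGUF of the source model and a WikiText-2 calibration file (both file names are assumptions):

llama-imatrix -m qwen3.5-35b-a3b-compacted-BF16.gguf -f wikitext-2-raw/wiki.train.raw -o imatrix.dat -ngl 999
llama-quantize --imatrix imatrix.dat qwen3.5-35b-a3b-compacted-BF16.gguf qwen3.5-35b-a3b-compacted-IQ3_M.gguf IQ3_M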