# qwen3.5-35b-a3b-compacted-GGUF

GGUF quantizations of continuum-ai/qwen3.5-35b-a3b-compacted, a compacted MoE model pruned from Qwen3.5-35B-A3B (89 experts removed, 30% smaller) while preserving reasoning quality.
All low-bit quants (Q3, Q2, IQ*) are calibrated with an importance matrix for best quality at each size.
## Available Quantizations
| Filename | Quant Type | Size | Notes |
|---|---|---|---|
| qwen3.5-35b-a3b-compacted-Q8_0.gguf | Q8_0 | 24G | Best quality, near-lossless |
| qwen3.5-35b-a3b-compacted-Q6_K.gguf | Q6_K | 18G | Excellent quality |
| qwen3.5-35b-a3b-compacted-Q5_K_M.gguf | Q5_K_M | 16G | Great quality |
| qwen3.5-35b-a3b-compacted-Q5_K_S.gguf | Q5_K_S | 16G | Great quality, slightly smaller |
| qwen3.5-35b-a3b-compacted-Q4_K_M.gguf | Q4_K_M | 14G | Recommended - best balance |
| qwen3.5-35b-a3b-compacted-Q4_K_S.gguf | Q4_K_S | 13G | Good balance |
| qwen3.5-35b-a3b-compacted-IQ4_XS.gguf | IQ4_XS | 12G | imatrix, compact 4-bit |
| qwen3.5-35b-a3b-compacted-Q3_K_L.gguf | Q3_K_L | 12G | imatrix |
| qwen3.5-35b-a3b-compacted-Q3_K_M.gguf | Q3_K_M | 11G | imatrix |
| qwen3.5-35b-a3b-compacted-IQ3_M.gguf | IQ3_M | 9.9G | imatrix, good low-bit |
| qwen3.5-35b-a3b-compacted-IQ3_S.gguf | IQ3_S | 9.7G | imatrix |
| qwen3.5-35b-a3b-compacted-Q3_K_S.gguf | Q3_K_S | 9.7G | imatrix |
| qwen3.5-35b-a3b-compacted-IQ3_XXS.gguf | IQ3_XXS | 8.7G | imatrix |
| qwen3.5-35b-a3b-compacted-Q2_K.gguf | Q2_K | 8.3G | imatrix, low quality |
| qwen3.5-35b-a3b-compacted-IQ2_M.gguf | IQ2_M | 7.5G | imatrix, aggressive |
| qwen3.5-35b-a3b-compacted-IQ2_S.gguf | IQ2_S | 6.9G | imatrix, very aggressive |
| qwen3.5-35b-a3b-compacted-IQ2_XXS.gguf | IQ2_XXS | 6.2G | imatrix, extreme |
| qwen3.5-35b-a3b-compacted-IQ1_M.gguf | IQ1_M | 5.4G | imatrix, maximum compression |
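When scripting over these files, the quant tag can be recovered from the filename, since every file follows the `<model>-<QUANT>.gguf` pattern above. A minimal sketch (the helper name `quant_type` is illustrative, not part of any tool):

```python
def quant_type(filename: str) -> str:
    """Extract the quant tag from a GGUF filename.

    The tag is the last dash-separated field before the .gguf
    extension; underscores inside the tag (e.g. Q4_K_M) are kept.
    """
    stem = filename.rsplit(".", 1)[0]       # drop ".gguf"
    return stem.rsplit("-", 1)[1]           # take the final "-XXX" field

print(quant_type("qwen3.5-35b-a3b-compacted-Q4_K_M.gguf"))  # Q4_K_M
```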
## How to Use

### With llama.cpp
`llama-cli -m qwen3.5-35b-a3b-compacted-Q4_K_M.gguf -p "Hello" -ngl 999`
### With llama.cpp server
`llama-server -m qwen3.5-35b-a3b-compacted-Q4_K_M.gguf -c 4096 -ngl 999`
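Once running, `llama-server` exposes an OpenAI-compatible HTTP API (by default on port 8080). The sketch below only builds a chat request body for its `/v1/chat/completions` endpoint; the model name is illustrative (llama-server serves whatever model it was launched with), and actually sending the request is left as a comment:

```python
import json

# Build an OpenAI-style chat request for a local llama-server instance.
payload = {
    "model": "qwen3.5-35b-a3b-compacted-Q4_K_M",  # illustrative; server ignores/defaults this
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 128,
}
body = json.dumps(payload)

# POST `body` to http://localhost:8080/v1/chat/completions with
# Content-Type: application/json (e.g. via curl or the requests library).
```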
### With Ollama
`ollama run hf.co/cahlen/qwen3.5-35b-a3b-compacted-GGUF:Q4_K_M`
### With LM Studio
Download any GGUF file above and load it in LM Studio.
## Choosing a Quant
| Your VRAM | Recommended | Size |
|---|---|---|
| 24GB+ (RTX 4090/5090) | Q8_0 or Q6_K | 24G / 18G |
| 16GB (RTX 4080/5080) | Q4_K_M or Q5_K_S | 14G / 16G |
| 12GB (RTX 4070/3060 12GB) | IQ4_XS or Q3_K_L | 12G / 12G |
| 8GB (RTX 4060/3060 8GB) | IQ3_M or Q2_K | 9.9G / 8.3G |
| 6GB (RTX 4050/3050) | IQ2_M or IQ2_S | 7.5G / 6.9G |
| CPU only (16GB+ RAM) | IQ2_XXS or IQ1_M | 6.2G / 5.4G |
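The table above can be mirrored programmatically. The sketch below picks the largest quant that fits a VRAM budget; the sizes come from the table, but the 0.8 headroom factor (reserving room for KV cache and runtime overhead) is an assumption, not a measured value, so it is stricter than the table for some tiers:

```python
# File sizes in GiB, from the quantization table above.
QUANT_SIZES_GIB = {
    "Q8_0": 24, "Q6_K": 18, "Q5_K_M": 16, "Q5_K_S": 16,
    "Q4_K_M": 14, "Q4_K_S": 13, "IQ4_XS": 12, "Q3_K_L": 12,
    "Q3_K_M": 11, "IQ3_M": 9.9, "Q2_K": 8.3, "IQ2_M": 7.5,
    "IQ2_S": 6.9, "IQ2_XXS": 6.2, "IQ1_M": 5.4,
}

def pick_quant(vram_gib: float, headroom: float = 0.8) -> str:
    """Return the largest quant whose file fits within headroom * VRAM."""
    budget = vram_gib * headroom
    fitting = {q: s for q, s in QUANT_SIZES_GIB.items() if s <= budget}
    if not fitting:
        raise ValueError("no quant fits; consider partial CPU offload")
    return max(fitting, key=fitting.get)

print(pick_quant(24))  # Q6_K
```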
## About the Source Model
This is a compacted version of Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled, created by continuum-ai using Plasticity Compaction, a technique that prunes underutilized MoE experts based on runtime activation profiling:
- 256 experts reduced to 167 (-35%)
- BF16 weights reduced from 67GB to 47GB (-30%)
- Chain-of-thought reasoning and code generation quality preserved
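The quoted reductions follow directly from the raw numbers; a quick arithmetic check:

```python
# Sanity-check the compaction figures quoted above.
experts_before, experts_after = 256, 167
size_before_gb, size_after_gb = 67, 47

experts_removed = experts_before - experts_after            # 89 experts
expert_cut = experts_removed / experts_before               # ~35%
size_cut = (size_before_gb - size_after_gb) / size_before_gb  # ~30%

print(f"experts removed: {experts_removed} ({expert_cut:.0%})")
print(f"size reduction: {size_cut:.0%}")
```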
## Perplexity Evaluation (WikiText-2)
Lower is better. BF16 is the unquantized baseline.
| Quant | Size | Perplexity | vs BF16 |
|---|---|---|---|
| BF16 (baseline) | 47G | 9.7245 | -- |
| Q8_0 | 24G | 9.7568 | +0.33% |
| Q5_K_M | 16G | 9.7974 | +0.75% |
| Q4_K_M | 14G | 9.9398 | +2.21% |
| Q3_K_M | 11G | 10.2903 | +5.82% |
| IQ3_M | 9.9G | 10.3416 | +6.34% |
| Q2_K | 8.3G | 11.5866 | +19.1% |
| IQ2_M | 7.5G | 11.7276 | +20.6% |
| IQ1_M | 5.4G | 18.3670 | +88.9% |
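The "vs BF16" column can be recomputed directly from the perplexity values (copied from the table above):

```python
# Recompute relative perplexity increase over the BF16 baseline.
baseline = 9.7245  # BF16 perplexity
ppl = {
    "Q8_0": 9.7568, "Q5_K_M": 9.7974, "Q4_K_M": 9.9398,
    "Q3_K_M": 10.2903, "IQ3_M": 10.3416, "Q2_K": 11.5866,
    "IQ2_M": 11.7276, "IQ1_M": 18.3670,
}
deltas = {q: (p / baseline - 1) * 100 for q, p in ppl.items()}

for q, d in deltas.items():
    print(f"{q}: +{d:.2f}%")
```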
Key takeaways:
- Q8_0 through Q4_K_M: negligible quality loss (<2.2%), safe for all use cases
- Q3_K_M / IQ3_M: moderate degradation (~6%), good for constrained hardware
- Q2_K / IQ2_M: noticeable degradation (~20%), usable for casual use
- IQ1_M: significant quality loss, only for extreme VRAM constraints
## Quantization Details
- Quantized by: cahlen
- Importance matrix: Generated from WikiText-2 (200 chunks) on NVIDIA RTX 5090
- Tool: llama.cpp
- Hardware: NVIDIA RTX 5090 32GB / Intel Core Ultra 9 285K / 188GB RAM