# Qwen3.5-35B-A3B-heretic-v2-eq-v1-GGUF
GGUF quantizations of nivvis/Qwen3.5-35B-A3B-heretic-v2-eq-v1 for use with llama.cpp, Ollama, LM Studio, and other GGUF-compatible runtimes.
See the bf16 model card for full details on the model, training, and EQ-Bench results.
## Files
| File | Quant | Size | Notes |
|---|---|---|---|
| `*-F16.gguf-00001-of-00009` ... `00009` | F16 | ~65GB (9 shards) | Full precision, lossless conversion from bf16 |
| `*-Q4_K_M.gguf-00001-of-00003` ... `00003` | Q4_K_M | ~20GB (3 shards) | **Recommended**, best quality/size balance |
| `*-mmproj-F16.gguf` | F16 | 858MB | Vision projector (required for image input) |
llama.cpp auto-detects split shards; just point it at the first file (`-00001-of-*`).
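To fetch only the quant you need (rather than the full repo), the Hugging Face CLI's glob filters work well. A sketch, assuming the repo id from the title above and that `huggingface_hub[cli]` is installed:

```shell
# Download just the Q4_K_M shards plus the vision projector.
# Repo id assumed from this model card's title.
huggingface-cli download nivvis/Qwen3.5-35B-A3B-heretic-v2-eq-v1-GGUF \
  --include "*Q4_K_M*" "*mmproj*" \
  --local-dir .
```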
## EQ-Bench 3
**Rubric Score: 83.85** (judge: `claude-3.7-sonnet`), measured on the bf16 source model.
| Model | Active Params | EQ-Bench Score |
|---|---|---|
| Qwen3.5-35B-A3B-heretic-v2-eq-v1 (ours) | 3B | 83.85 |
| Qwen3.5-27B dense | 27B | 83.05 |
| Qwen3-235B-A22B | 22B | 80.90 |
| QwQ-32B | 32B | 79.90 |
| Qwen3.5-35B-A3B (baseline) | 3B | 77.85 |
| Qwen3-32B | 32B | 74.30 |
| Qwen3-30B-A3B | 3B | 66.00 |
Note: EQ-Bench scores are from the bf16 model. Q4_K_M quantization may slightly affect quality.
Note on judge model: Public EQ-Bench 3 leaderboard scores for this family of models use `claude-3.7-sonnet` as the judge, so we use the same for comparability. We plan to publish updated benchmarks with newer judge models (including Opus) in the future.
## Usage
### llama.cpp
```shell
llama-cli \
  -m qwen35-35b-heretic-v2-eq-v1-Q4_K_M.gguf-00001-of-00003.gguf \
  --mmproj qwen35-35b-heretic-v2-eq-v1-mmproj-F16.gguf \
  -p "My best friend got the promotion I wanted. I said congrats but feel terrible. What do I do?" \
  -n 512
```
### llama-server
```shell
llama-server \
  -m qwen35-35b-heretic-v2-eq-v1-Q4_K_M.gguf-00001-of-00003.gguf \
  --mmproj qwen35-35b-heretic-v2-eq-v1-mmproj-F16.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  --jinja
```
`--jinja` enables tool calling via the bundled chat template; `-ngl 99` offloads all layers to the GPU.
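Once llama-server is up, any OpenAI-compatible client can talk to it. A minimal stdlib-only Python sketch, assuming the host/port from the flags above (llama-server exposes the standard `/v1/chat/completions` route):

```python
import json
import urllib.request

# Endpoint assumed from the --host/--port flags in the llama-server command above.
url = "http://localhost:8080/v1/chat/completions"

payload = {
    "messages": [
        {"role": "user",
         "content": "My best friend got the promotion I wanted. "
                    "I said congrats but feel terrible. What do I do?"},
    ],
    "max_tokens": 512,
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# With the server running, uncomment to send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```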
## Performance (single RTX 5090)
- 169 t/s generation
- 211 t/s prompt processing
## Conversion Details
- Converted with `convert_hf_to_gguf.py` from llama.cpp (commit `ecbcb7ea9`)
- Quantized with `llama-quantize` to Q4_K_M
- Vision projector kept at F16 (should not be quantized)
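The steps above correspond roughly to the following commands; paths and output names are illustrative, and `llama-quantize` must be built from the llama.cpp source tree:

```shell
# 1. Convert the bf16 HF checkpoint to a full-precision GGUF
python convert_hf_to_gguf.py /path/to/Qwen3.5-35B-A3B-heretic-v2-eq-v1 \
  --outfile qwen35-35b-heretic-v2-eq-v1-F16.gguf \
  --outtype f16

# 2. Quantize to Q4_K_M (the mmproj vision projector is left at F16)
./llama-quantize qwen35-35b-heretic-v2-eq-v1-F16.gguf \
  qwen35-35b-heretic-v2-eq-v1-Q4_K_M.gguf Q4_K_M
```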
## License
Apache 2.0
Base model: Qwen/Qwen3.5-35B-A3B-Base