# Qwen3.5-35B-A3B-heretic-v2-eq-v1-GGUF

GGUF quantizations of nivvis/Qwen3.5-35B-A3B-heretic-v2-eq-v1 for use with llama.cpp, Ollama, LM Studio, and other GGUF-compatible runtimes.

See the bf16 model card for full details on the model, training, and EQ-Bench results.

## Files

| File | Quant | Size | Notes |
|---|---|---|---|
| `*-F16.gguf-00001-of-00009` ... `00009` | F16 | ~65GB (9 shards) | Full precision — lossless conversion from bf16 |
| `*-Q4_K_M.gguf-00001-of-00003` ... `00003` | Q4_K_M | ~20GB (3 shards) | Recommended — best quality/size balance |
| `*-mmproj-F16.gguf` | F16 | 858MB | Vision projector (required for image input) |

llama.cpp auto-detects split shards — just point it to the first file (`-00001-of-*`).
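For Ollama (mentioned above as a supported runtime), a minimal Modelfile sketch might look like the following. The filename is taken from the usage examples below and is illustrative — adjust it to the shard you downloaded. Note that Ollama's handling of multi-shard GGUFs has varied by version; if import fails, merge the shards first with llama.cpp's `llama-gguf-split --merge`.

```
FROM ./qwen35-35b-heretic-v2-eq-v1-Q4_K_M.gguf-00001-of-00003.gguf
PARAMETER temperature 0.7
```

Then create and run the model with `ollama create qwen35-heretic -f Modelfile` followed by `ollama run qwen35-heretic`.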

## EQ-Bench 3

Rubric Score: 83.85 (judge: claude-3.7-sonnet) — measured on the bf16 source model.

| Model | Active Params | EQ-Bench Score |
|---|---|---|
| Qwen3.5-35B-A3B-heretic-v2-eq-v1 (ours) | 3B | 83.85 |
| Qwen3.5-27B dense | 27B | 83.05 |
| Qwen3-235B-A22B | 22B | 80.90 |
| QwQ-32B | 32B | 79.90 |
| Qwen3.5-35B-A3B (baseline) | 3B | 77.85 |
| Qwen3-32B | 32B | 74.30 |
| Qwen3-30B-A3B | 3B | 66.00 |

Note: EQ-Bench scores are from the bf16 model. Q4_K_M quantization may slightly affect quality.

Note on judge model: Public EQ-Bench 3 leaderboard scores for this family of models use claude-3.7-sonnet as the judge, so we use the same for comparability. We plan to publish updated benchmarks with newer judge models (including Opus) in the future.

## Usage

### llama.cpp

```sh
llama-cli \
  -m qwen35-35b-heretic-v2-eq-v1-Q4_K_M.gguf-00001-of-00003.gguf \
  --mmproj qwen35-35b-heretic-v2-eq-v1-mmproj-F16.gguf \
  -p "My best friend got the promotion I wanted. I said congrats but feel terrible. What do I do?" \
  -n 512
```

### llama-server

```sh
llama-server \
  -m qwen35-35b-heretic-v2-eq-v1-Q4_K_M.gguf-00001-of-00003.gguf \
  --mmproj qwen35-35b-heretic-v2-eq-v1-mmproj-F16.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  --jinja
```

`--jinja` enables tool calling via the bundled chat template. `-ngl 99` offloads all layers to the GPU.
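Once running, llama-server exposes an OpenAI-compatible `/v1/chat/completions` endpoint. A minimal Python sketch of a client, assuming the host/port from the command above (the `model` name is informational here; calling `send` requires a running server):

```python
import json
import urllib.request


def build_chat_request(prompt: str, n_predict: int = 512) -> dict:
    # OpenAI-style chat payload accepted by llama-server's /v1/chat/completions
    return {
        "model": "qwen35-35b-heretic-v2-eq-v1",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": n_predict,
    }


def send(payload: dict, host: str = "localhost", port: int = 8080) -> dict:
    # Requires the llama-server process from above to be running
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_chat_request(
    "My best friend got the promotion I wanted. What do I do?"
)
print(json.dumps(payload, indent=2))
```

Any OpenAI-compatible client library pointed at `http://localhost:8080/v1` should work the same way.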

### Performance (single RTX 5090)

- 169 t/s generation
- 211 t/s prompt processing

## Conversion Details

- Converted with `convert_hf_to_gguf.py` from llama.cpp (commit ecbcb7ea9)
- Quantized with `llama-quantize` to Q4_K_M
- Vision projector kept at F16 (should not be quantized)
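The steps above correspond roughly to the following commands. This is a sketch: the input path, output filenames, and your llama.cpp checkout location are assumptions to adjust for your setup.

```sh
# 1. Convert the bf16 HF checkpoint to GGUF at F16
python convert_hf_to_gguf.py /path/to/Qwen3.5-35B-A3B-heretic-v2-eq-v1 \
  --outtype f16 --outfile model-F16.gguf

# 2. Quantize the F16 GGUF to Q4_K_M
./llama-quantize model-F16.gguf model-Q4_K_M.gguf Q4_K_M
```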

## License

Apache 2.0
