Qwen3.5-35B-A3B Chimere v2 -- RAMP GGUF (transitional)

Chimere v2 is an intermediate Claude Opus 4.6 distillation of Qwen3.5-35B-A3B, sitting between v1 (best code + tools) and v3 (best instructions + reasoning).

Most users should pick v3 (best overall) or v1 (best code). v2 is kept here for reproducibility and for the small set of users who benefit from its specific profile.

RAMP quantization (per-tensor quality overrides + imatrix), ~15 GB, fits in 16 GB of VRAM.
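The RAMP quants in this repo combine per-tensor type overrides with the shipped imatrix.dat. A minimal sketch of how a similar quant could be reproduced with stock llama.cpp tooling — the BF16 input path and the specific override pattern below are illustrative assumptions, not the actual RAMP recipe, and `--tensor-type` requires a recent llama-quantize build:

```shell
# Quantize a BF16 GGUF with an importance matrix (hypothetical paths).
# --imatrix weights the quantization error by activation importance;
# --tensor-type applies a per-tensor quality override (here: keep
# attention output tensors at a higher-precision type).
llama-quantize \
    --imatrix imatrix.dat \
    --tensor-type attn_output=q6_k \
    chimere-v2-bf16.gguf chimere-v2-ramp.gguf q4_k_m
```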

Why v2 exists

v2 was the first iteration of the RAMP-quantized Chimere line, distilled before the v3 dataset was finalized. It carries:

  • A stronger code / tool-calling profile than v3 (close to v1)
  • A weaker instruction-following / GSM8K reasoning profile than v3
  • Speed measured at ~64 tok/s on RTX 5060 Ti (older backend; v3 reaches ~80 tok/s on the same hardware via the chimere-server stack)

If you need agentic and tool-calling quality close to v1, but also want some of v3's reasoning improvements, v2 is the in-between point. Otherwise, pick v3.

Benchmark Results

Agent benchmarks (HumanEval / BFCL / IFEval)

From benchmarks/agent_ramp_v2.json:

| Benchmark | v2 RAMP (this repo) | v1 RAMP | v3 RAMP | Notes |
|---|---|---|---|---|
| HumanEval (30 problems, executed) | 93.3% (28/30) | 97% | 83% | v2 close to v1 on code |
| BFCL tool-calling (20 questions) | 90.0% (18/20) | 90% | 75% | v2 = v1 on tools |
| IFEval (15 instruction tests) | 86.7% (13/15) | 67% | 100% | v2 better than v1 but worse than v3 |
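The percentages above are plain pass rates (problems passed / problems attempted); for example, the HumanEval entry:

```shell
# 28 of 30 HumanEval problems passed
awk 'BEGIN { printf "%.1f%%\n", 100 * 28 / 30 }'
# prints 93.3%
```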

Formal benchmarks (full GSM8K + code + tools)

From benchmarks/formal_ramp_v2.json. Note: the formal eval harness used at v2 time had broken parsing for GSM8K / tools / code; those numbers are framework artifacts, not model performance. The agent bench above is the reliable signal.

| Metric | Value | Status |
|---|---|---|
| Speed (RTX 5060 Ti, llama.cpp) | 63.9 tok/s | ✅ measured |
| GSM8K (1319 problems) | 0.2% | ❌ harness crash, ignore |
| Tools (10 problems) | 0% | ❌ harness crash, ignore |
| Code (5 problems) | 0% | ❌ harness crash, ignore |

For a clean GSM8K / code / tool eval on the Chimere line, see v3 GGUF.

Which version to use?

| Use case | Recommended |
|---|---|
| General agentic use, balanced profile | v3 RAMP |
| Code generation and tool-calling, instruction-following less critical | v1 RAMP |
| Code + tools with slightly better instruction-following than v1 | v2 (this repo) |
| Re-quantization or fine-tuning | BF16 weights |

Usage

The same chimere-server runtime that hosts v3 also runs v2 (same qwen35moe architecture, same RAMP quant family). See the v3 model card for the full chimere-server quick start; just swap the GGUF path.
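Either runtime exposes an OpenAI-compatible HTTP API, so a quick smoke test needs only curl. A minimal sketch against the stock llama-server invocation below (port 8081 as configured; prompt and max_tokens are arbitrary):

```shell
# Query the running server's OpenAI-compatible chat endpoint.
curl http://localhost:8081/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [{"role": "user", "content": "Write a Python hello world."}],
          "max_tokens": 128
        }'
```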

```shell
# Stock llama.cpp / llama-server (no chimere features)
llama-server \
    -m chimere-v2-ramp.gguf \
    -ngl 99 --n-cpu-moe 4 -c 32768 \
    --flash-attn on --jinja --port 8081

# For 16 GB VRAM, add KV cache quantization:
# -ctk q8_0 -ctv q4_0
```
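The KV-cache note matters because at a 32768-token context the f16 cache alone costs gigabytes on top of the ~15 GB model. A back-of-the-envelope sketch — the layer / KV-head / head-dim values below are illustrative assumptions, not the published Qwen3.5-35B-A3B config, and q8_0 is treated as roughly 1 byte per element:

```shell
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem
layers=48 kv_heads=4 head_dim=128 ctx=32768
f16_bytes=$(( 2 * layers * kv_heads * head_dim * ctx * 2 ))  # f16: 2 bytes/elem
q8_bytes=$((  2 * layers * kv_heads * head_dim * ctx * 1 ))  # q8_0: ~1 byte/elem
echo "f16 KV cache:   $(( f16_bytes / 1024 / 1024 )) MiB"
echo "q8_0 KV cache: ~$(( q8_bytes / 1024 / 1024 )) MiB"
```

Under these assumed numbers the f16 cache is 3072 MiB versus roughly half that at q8_0, which is the difference between fitting and not fitting in 16 GB alongside the weights.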

Files

| File | Size | Description |
|---|---|---|
| chimere-v2-ramp.gguf | ~15 GB | v2 RAMP GGUF (code + tools focus, transitional) |
| imatrix.dat | 184 MB | Importance matrix used for quantization |
| benchmarks/agent_ramp_v2.json | <1 KB | HumanEval / BFCL / IFEval results |
| benchmarks/agent_bf16_v2.json | <1 KB | Same evals on the BF16 reference |
| benchmarks/formal_ramp_v2.json | <1 KB | Speed + (broken) GSM8K / tools / code harness |

Citation

@misc{chimere-v2-2026,
  title={Chimere v2: Intermediate Claude Opus 4.6 Distillation of Qwen3.5-35B-A3B MoE},
  author={Kevletesteur},
  year={2026},
  url={https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-v2-GGUF}
}