# Qwen3.5-35B-A3B Chimere v2 -- RAMP GGUF (transitional)

Chimere v2 is an intermediate Claude Opus 4.6 distillation of Qwen3.5-35B-A3B, sitting between v1 (best code + tools) and v3 (best instructions + reasoning).

Most users should pick v3 (best overall) or v1 (best code). v2 is kept here for reproducibility and for the small set of users who benefit from its specific profile.

RAMP quantization (per-tensor quality overrides + imatrix), ~15 GB, fits in 16 GB of VRAM.
## Why v2 exists
v2 was the first iteration of the RAMP-quantized Chimere line, distilled before the v3 dataset was finalized. It carries:
- A stronger code / tool-calling profile than v3 (close to v1)
- A weaker instruction-following / GSM8K reasoning profile than v3
- Speed measured at ~64 tok/s on RTX 5060 Ti (older backend; v3 reaches ~80 tok/s on the same hardware via the chimere-server stack)
If you need agentic and tool-calling quality close to v1, but also want some of v3's instruction-following improvements, v2 is the in-between point. Otherwise, pick v3.
## Benchmark Results

### Agent benchmarks (HumanEval / BFCL / IFEval)

From `benchmarks/agent_ramp_v2.json`:
| Benchmark | v2 RAMP (this repo) | v1 RAMP | v3 RAMP | Notes |
|---|---|---|---|---|
| HumanEval (30 problems, executed) | 93.3% (28/30) | 97% | 83% | v2 close to v1 on code |
| BFCL tool-calling (20 questions) | 90.0% (18/20) | 90% | 75% | v2 = v1 on tools |
| IFEval (15 instruction tests) | 86.7% (13/15) | 67% | 100% | v2 better than v1 but worse than v3 |
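For reference, the v2 percentage columns above are straight pass rates over the stated problem counts:

```python
# Reproduce the v2 RAMP percentages from the raw pass counts in the table.
results = {
    "HumanEval": (28, 30),
    "BFCL": (18, 20),
    "IFEval": (13, 15),
}
for name, (passed, total) in results.items():
    pct = 100 * passed / total
    print(f"{name}: {pct:.1f}% ({passed}/{total})")
# HumanEval: 93.3%, BFCL: 90.0%, IFEval: 86.7%
```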
### Formal benchmarks (full GSM8K + code + tools)

From `benchmarks/formal_ramp_v2.json`. Note: the formal eval harness used at v2 time had broken parsing for GSM8K / tools / code, so those numbers are framework artifacts, not model performance. The agent benchmarks above are the reliable signal.
| Metric | Value | Status |
|---|---|---|
| Speed (RTX 5060 Ti, llama.cpp) | 63.9 tok/s | ✅ measured |
| GSM8K (1319 problems) | 0.2% | ❌ harness crash, ignore |
| Tools (10 problems) | 0% | ❌ harness crash, ignore |
| Code (5 problems) | 0% | ❌ harness crash, ignore |
For a clean GSM8K / code / tool eval on the Chimere line, see v3 GGUF.
## Which version to use?
| Use case | Recommended |
|---|---|
| General agentic, balanced | v3 RAMP |
| Code generation, tool-calling, instruction-following not critical | v1 RAMP |
| Code + tools with somewhat better instruction-following than v1 | v2 (this repo) |
| Re-quantization or fine-tuning | BF16 weights |
## Usage

The same chimere-server runtime that hosts v3 also runs v2 (same `qwen35moe` architecture, same RAMP quant family). See the v3 model card for the full chimere-server quick start; just swap in this repo's GGUF path.
```bash
# Stock llama.cpp / llama-server (no chimere features)
llama-server \
  -m chimere-v2-ramp.gguf \
  -ngl 99 --n-cpu-moe 4 -c 32768 \
  --flash-attn on --jinja --port 8081

# For 16 GB VRAM, add KV cache quantization:
#   -ctk q8_0 -ctv q4_0
```
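The `-ctk`/`-ctv` flags trade KV-cache memory for precision. A back-of-the-envelope sizing sketch follows; the layer count, KV-head count, and head dimension below are illustrative placeholders, not the actual Qwen3.5-35B-A3B config, and the byte costs assume llama.cpp's q8_0 (~8.5 bits/value) and q4_0 (~4.5 bits/value) block formats:

```python
# Rough KV-cache size estimate for a given context length.
# ASSUMPTION: n_layers=48, n_kv_heads=8, head_dim=128 are placeholder
# values for illustration only, NOT the real model architecture.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len,
                   bytes_per_k, bytes_per_v):
    per_token = n_layers * n_kv_heads * head_dim * (bytes_per_k + bytes_per_v)
    return per_token * ctx_len

CTX = 32768
fp16  = kv_cache_bytes(48, 8, 128, CTX, 2.0, 2.0)        # f16 K and V
mixed = kv_cache_bytes(48, 8, 128, CTX, 1.0625, 0.5625)  # q8_0 K, q4_0 V

print(f"f16 KV cache:   {fp16 / 2**30:.2f} GiB")
print(f"mixed KV cache: {mixed / 2**30:.2f} GiB")
```

With these placeholder dimensions the mixed q8_0/q4_0 cache is roughly 40% the size of the f16 cache, which is what makes a 32k context workable on a 16 GB card.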
## Files

| File | Size | Description |
|---|---|---|
| `chimere-v2-ramp.gguf` | ~15 GB | v2 RAMP GGUF (code + tools focus, transitional) |
| `imatrix.dat` | 184 MB | Importance matrix used for quantization |
| `benchmarks/agent_ramp_v2.json` | <1 KB | HumanEval / BFCL / IFEval results |
| `benchmarks/agent_bf16_v2.json` | <1 KB | Same evals on the BF16 reference |
| `benchmarks/formal_ramp_v2.json` | <1 KB | Speed + (broken) GSM8K / tools / code harness output |
## Related
- Chimere v3 GGUF — Recommended for most users (best instructions + reasoning)
- Chimere v1 GGUF — Best code + tool-calling
- BF16 full weights — For re-quantization or fine-tuning
- chimere-server — Official Rust runtime
- ik_llama.cpp fork — Backend with Mamba-2 + Nemotron-H backport (PR #1593)
## Citation

```bibtex
@misc{chimere-v2-2026,
  title={Chimere v2: Intermediate Claude Opus 4.6 Distillation of Qwen3.5-35B-A3B MoE},
  author={Kevletesteur},
  year={2026},
  url={https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-v2-GGUF}
}
```