# Qwen3.5-35B-A3B Chimere v2 -- RAMP GGUF (transitional)

Chimere v2 is an intermediate Claude Opus 4.6 distillation of Qwen3.5-35B-A3B, sitting between v1 (best code + tools) and v3 (best instructions + reasoning).

Most users should pick v3 (best overall) or v1 (best code). v2 is kept here for reproducibility and for the small set of users who benefit from its specific profile.

RAMP quantization (per-tensor quality overrides + imatrix), ~15 GB, fits in 16 GB of VRAM.
## Why v2 exists
v2 was the first iteration of the RAMP-quantized Chimere line, distilled before the v3 dataset was finalized. It carries:
- A stronger code / tool-calling profile than v3 (close to v1)
- A weaker instruction-following / GSM8K reasoning profile than v3
- Speed measured at ~64 tok/s on RTX 5060 Ti (older backend; v3 reaches ~80 tok/s on the same hardware via the chimere-server stack)
If you need agentic and tool-calling quality close to v1, but also want some of v3's instruction-following improvements, v2 is the in-between point. Otherwise, pick v3.
## Benchmark Results

### Agent benchmarks (HumanEval / BFCL / IFEval)

From `benchmarks/agent_ramp_v2.json`:
| Benchmark | v2 RAMP (this repo) | v1 RAMP | v3 RAMP | Notes |
|---|---|---|---|---|
| HumanEval (30 problems, executed) | 93.3% (28/30) | 97% | 83% | v2 close to v1 on code |
| BFCL tool-calling (20 questions) | 90.0% (18/20) | 90% | 75% | v2 = v1 on tools |
| IFEval (15 instruction tests) | 86.7% (13/15) | 67% | 100% | v2 better than v1 but worse than v3 |
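For reference, the v2 percentage columns above are straight pass rates over the stated problem counts:

```python
# Reproduce the v2 RAMP percentages from the raw pass counts in the table.
results = {
    "HumanEval": (28, 30),
    "BFCL": (18, 20),
    "IFEval": (13, 15),
}
for name, (passed, total) in results.items():
    pct = 100 * passed / total
    print(f"{name}: {pct:.1f}% ({passed}/{total})")
# HumanEval: 93.3%, BFCL: 90.0%, IFEval: 86.7%
```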
### Formal benchmarks (full GSM8K + code + tools)

From `benchmarks/formal_ramp_v2.json`. Note: the formal eval harness used at v2 time had broken parsing for GSM8K / tools / code, so those numbers are framework artifacts, not model performance. The agent benchmarks above are the reliable signal.
| Metric | Value | Status |
|---|---|---|
| Speed (RTX 5060 Ti, llama.cpp) | 63.9 tok/s | ✅ measured |
| GSM8K (1319 problems) | 0.2% | ❌ harness crash, ignore |
| Tools (10 problems) | 0% | ❌ harness crash, ignore |
| Code (5 problems) | 0% | ❌ harness crash, ignore |
For a clean GSM8K / code / tool eval on the Chimere line, see v3 GGUF.
## Which version to use?
| Use case | Recommended |
|---|---|
| General agentic, balanced | v3 RAMP |
| Code generation, tool-calling, instruction-following not critical | v1 RAMP |
| Code + tools with somewhat better instruction-following than v1 | v2 (this repo) |
| Re-quantization or fine-tuning | BF16 weights |
## Usage

The same chimere-server runtime that hosts v3 also runs v2 (same `qwen35moe` architecture, same RAMP quant family). See the v3 model card for the full chimere-server quick start; just swap in this repo's GGUF path.
```bash
# Stock llama.cpp / llama-server (no chimere features)
llama-server \
  -m chimere-v2-ramp.gguf \
  -ngl 99 --n-cpu-moe 4 -c 32768 \
  --flash-attn on --jinja --port 8081

# For 16 GB VRAM, add KV cache quantization:
#   -ctk q8_0 -ctv q4_0
```
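The `-ctk`/`-ctv` flags trade KV-cache memory for precision. A back-of-the-envelope sizing sketch follows; the layer count, KV-head count, and head dimension below are illustrative placeholders, not the actual Qwen3.5-35B-A3B config, and the byte costs assume llama.cpp's q8_0 (~8.5 bits/value) and q4_0 (~4.5 bits/value) block formats:

```python
# Rough KV-cache size estimate for a given context length.
# ASSUMPTION: n_layers=48, n_kv_heads=8, head_dim=128 are placeholder
# values for illustration only, NOT the real model architecture.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len,
                   bytes_per_k, bytes_per_v):
    per_token = n_layers * n_kv_heads * head_dim * (bytes_per_k + bytes_per_v)
    return per_token * ctx_len

CTX = 32768
fp16  = kv_cache_bytes(48, 8, 128, CTX, 2.0, 2.0)        # f16 K and V
mixed = kv_cache_bytes(48, 8, 128, CTX, 1.0625, 0.5625)  # q8_0 K, q4_0 V

print(f"f16 KV cache:   {fp16 / 2**30:.2f} GiB")
print(f"mixed KV cache: {mixed / 2**30:.2f} GiB")
```

With these placeholder dimensions the mixed q8_0/q4_0 cache is roughly 40% the size of the f16 cache, which is what makes a 32k context workable on a 16 GB card.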
## Files

| File | Size | Description |
|---|---|---|
| `chimere-v2-ramp.gguf` | ~15 GB | v2 RAMP GGUF (code + tools focus, transitional) |
| `imatrix.dat` | 184 MB | Importance matrix used for quantization |
| `benchmarks/agent_ramp_v2.json` | <1 KB | HumanEval / BFCL / IFEval results |
| `benchmarks/agent_bf16_v2.json` | <1 KB | Same evals on the BF16 reference |
| `benchmarks/formal_ramp_v2.json` | <1 KB | Speed + (broken) GSM8K / tools / code harness output |
## Related
- Chimere v3 GGUF — Recommended for most users (best instructions + reasoning)
- Chimere v1 GGUF — Best code + tool-calling
- BF16 full weights — For re-quantization or fine-tuning
- chimere-server — Official Rust runtime
- ik_llama.cpp fork — Backend with Mamba-2 + Nemotron-H backport (PR #1593)
## Citation

```bibtex
@misc{chimere-v2-2026,
  title={Chimere v2: Intermediate Claude Opus 4.6 Distillation of Qwen3.5-35B-A3B MoE},
  author={Kevletesteur},
  year={2026},
  url={https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-v2-GGUF}
}
```