Qwen3.5-35B-A3B Chimere v1 -- RAMP GGUF

Chimere v1: Claude Opus 4.6 distillation of Qwen3.5-35B-A3B, optimized for code generation and tool-calling.

RAMP quantization (per-tensor quality overrides + imatrix) -- 15 GB, fits 16 GB VRAM, ~80 tok/s on RTX 5060 Ti.

Looking for v3 (strongest instruction following and reasoning)? See Chimere v3 GGUF.

Benchmark Results

v1 strengths: code and tool-calling

| Benchmark | v1 RAMP (this repo) | v3 RAMP | Base Qwen3.5-35B-A3B | Notes |
|---|---|---|---|---|
| HumanEval (30 problems, executed) | 97% | 83% | -- | Best code gen |
| BFCL tool-calling (20 questions) | 90% | 75% | 67.3% | +23 pts vs base |
| IFEval (15 instruction tests) | 67% | 100% | ~91.9% | v3 better here |
| Edge cases (15 adversarial tests) | 87% | 100% | -- | v3 better here |
| GSM8K CoT 8-shot (1,319 questions) | 52.2% | 84.0% | -- | v3 better here |
| Speed (RTX 5060 Ti 16 GB) | ~80 tok/s | ~80 tok/s | -- | |

Qualitative agentic tests

| Scenario | v1 | v3 | Max |
|---|---|---|---|
| Cybersecurity incident response (multi-tool chain) | 4 | 4 | 10 |
| ML pipeline architecture (RAG, 10K users, $50K budget) | 8 | 8 | 10 |
| Rust MoE runtime optimization (async prefetch, CUDA) | 7 | 8 | 10 |
| Total | 19 | 20 | 30 |

Honest assessment

- Strengths: Tool-calling (+23 pts vs base), code gen (97%), fast inference (80 tok/s in 15 GB)
- Weaknesses: Instruction following weaker than v3 (67% vs 100% IFEval), math reasoning degraded (52% GSM8K)
- Why: The v1 dataset was heavily weighted toward BFCL tool-calling and Opus code traces. Use v3 for instruction-following or reasoning tasks.

Which version to use?

| Use case | Recommended | Why |
|---|---|---|
| Code generation, debugging | v1 (this repo) | 97% HumanEval |
| Tool-calling, function calling | v1 (this repo) | 90% BFCL (+23 pts vs base) |
| Instruction following, formatting | v3 | 100% IFEval, 100% edge cases |
| Math reasoning | v3 | 84% GSM8K (vs 52% v1) |
| Re-quantization or fine-tuning | BF16 weights | Full precision |

Best of both worlds: Use A-LoRA routing -- an intent classifier selects the appropriate LoRA at runtime. Code/tools queries use v1, instruction/reasoning queries use v3. See Chimere ODO.
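The routing idea above can be sketched as a minimal keyword-based intent classifier. This is illustrative only: the model names and keyword lists are assumptions, not the actual Chimere ODO classifier.

```python
# Minimal sketch of intent-based model routing (illustrative only --
# the real A-LoRA / Chimere ODO classifier is not reproduced here).

CODE_TOOL_KEYWORDS = {"code", "function", "debug", "tool", "api", "implement"}
REASONING_KEYWORDS = {"explain", "prove", "solve", "format", "summarize", "math"}

def route(query: str) -> str:
    """Return which model variant should serve the query."""
    words = set(query.lower().split())
    code_hits = len(words & CODE_TOOL_KEYWORDS)
    reason_hits = len(words & REASONING_KEYWORDS)
    # v1 wins ties: it is the stronger default for agentic workloads.
    return "chimere-v1" if code_hits >= reason_hits else "chimere-v3"

print(route("debug this python function"))    # chimere-v1
print(route("solve this math word problem"))  # chimere-v3
```

A production router would replace the keyword sets with a small trained classifier, but the dispatch logic stays the same shape.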

Usage

```bash
# llama.cpp / llama-server
llama-server \
    -m Qwen3.5-35B-A3B-Chimere-Distilled-RAMP-v3.gguf \
    -ngl 99 --n-cpu-moe 4 -c 32768 \
    --flash-attn on --jinja --port 8081

# For 16 GB VRAM (RTX 5060 Ti / RTX 4080),
# add KV cache quantization to save VRAM:
# -ctk q8_0 -ctv q4_0
```

Recommended sampling parameters

| Mode | temp | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking (default) | 1.0 | 0.95 | 20 | 0.0 |
| Thinking + code/tools | 0.6 | 0.95 | 20 | 0.0 |
| No-think | 0.7 | 0.8 | 20 | 0.0 |
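llama-server exposes an OpenAI-compatible `/v1/chat/completions` endpoint, so the table above maps directly onto request fields (`top_k` is a llama-server extension to the OpenAI schema). A sketch of building such a request for the code/tools profile; the port matches the launch command above, and actually sending the request is left as a comment:

```python
import json

# Recommended settings from the table above, keyed by mode.
SAMPLING = {
    "thinking":      {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "thinking-code": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "no-think":      {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "presence_penalty": 0.0},
}

def build_payload(prompt: str, mode: str = "thinking-code") -> dict:
    """Build an OpenAI-compatible chat payload with the recommended sampling."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING[mode],
    }

payload = build_payload("Write a Rust function that reverses a linked list.")
print(json.dumps(payload, indent=2))
# POST this to http://localhost:8081/v1/chat/completions once the server
# from the Usage section is running, e.g. requests.post(url, json=payload).
```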

RAMP Quantization Details

Custom per-tensor quality overrides -- critical paths get higher precision. Overall: ~3.78 BPW.

| Tensor | Quant | BPW | Rationale |
|---|---|---|---|
| attn_v (value) | Q8_0 | 8.0 | Most critical -- errors cause hallucinations |
| ssm_alpha, ssm_d | Q8_0 | 8.0 | GDN recurrent params, tiny but hypersensitive |
| attn_k (key) | Q6_K | 6.5 | Important for attention routing |
| ssm_dt | Q6_K | 6.5 | GDN timestep |
| token_embd, output | Q6_K | 6.5 | Shared embeddings |
| attn_q, attn_output | Q5_K | 5.5 | More tolerant |
| ssm_in, ssm_out | Q5_K | 5.5 | SSM projections |
| 256 MoE experts (FFN) | IQ3_S | 3.44 | 80% of params, high MoE redundancy |
- imatrix: Generated on the BF16 model (B200, 192 GB VRAM), 200 calibration chunks
- Result: 15 GB with no measured quality loss vs BF16 on the agentic benchmarks above
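As a quick sanity check, the ~3.78 BPW average is consistent with the ~15 GB file size for a 35B-parameter model:

```python
params = 35e9  # total parameter count
bpw = 3.78     # average bits per weight after RAMP quantization

# bits -> bytes -> GiB
size_gib = params * bpw / 8 / 2**30
print(f"{size_gib:.1f} GiB")  # ~15.4 GiB, matching the ~15 GB file
```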

Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-35B-A3B (MoE, 256 experts) |
| Method | SFT (BF16), LoRA r=64, completion-only loss |
| Dataset | 9,763 samples (37% BFCL v3 ground truth, 59% Opus reasoning traces, 4% gold samples) |
| Epochs | 1 (611 steps, batch size 16) |
| Training GPU | NVIDIA B200 |
| Training cost | ~$5 |
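"Completion-only loss" in the table means prompt tokens are excluded from the training loss, so the model is only penalized on its own completions. A minimal sketch of the label masking, using the common convention where label -100 is ignored by the loss (illustrative, not the actual training code):

```python
IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss by default

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids to labels, masking the prompt tokens so only
    the completion contributes to the SFT loss."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

# Example: a 4-token prompt followed by a 3-token completion.
tokens = [101, 7592, 2088, 102, 345, 678, 901]
print(mask_prompt_labels(tokens, prompt_len=4))
# [-100, -100, -100, -100, 345, 678, 901]
```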

Files

| File | Size | Description |
|---|---|---|
| Qwen3.5-35B-A3B-Chimere-Distilled-RAMP-v3.gguf | 15 GB | v1 RAMP GGUF (code + tools focus) |
| imatrix.dat | 184 MB | Importance matrix used for quantization |
| benchmark_results.json | -- | Mini-benchmark results (JSON) |

Related

Citation

```bibtex
@misc{chimere-v1-2026,
  title={Chimere v1: Claude Opus 4.6 Distillation of Qwen3.5-35B-A3B MoE for Code and Tool-Calling},
  author={Kevletesteur},
  year={2026},
  url={https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF}
}
```