# Qwen3.5-35B-A3B Chimere v1 -- RAMP GGUF

Chimere v1: a Claude Opus 4.6 distillation of Qwen3.5-35B-A3B, optimized for code generation and tool-calling.

RAMP quantization (per-tensor quality overrides + imatrix) -- 15 GB, fits in 16 GB VRAM, ~80 tok/s on an RTX 5060 Ti.

Looking for v3 (best instruction following + reasoning)? See Chimere v3 GGUF.
## Benchmark Results

### v1 strengths: code and tool-calling
| Benchmark | v1 RAMP (this repo) | v3 RAMP | Base Qwen3.5-35B-A3B | Notes |
|---|---|---|---|---|
| HumanEval (30 problems, executed) | 97% | 83% | -- | Best code gen |
| BFCL tool-calling (20 questions) | 90% | 75% | 67.3% | +23 pts vs base |
| IFEval (15 instruction tests) | 67% | 100% | ~91.9% | v3 better here |
| Edge cases (15 adversarial tests) | 87% | 100% | -- | v3 better here |
| GSM8K CoT 8-shot (1,319 qs) | 52.2% | 84.0% | -- | v3 better here |
| Speed (RTX 5060 Ti 16 GB) | ~80 tok/s | ~80 tok/s | -- | |
### Qualitative agentic tests

| Scenario | v1 | v3 | Max |
|---|---|---|---|
| Cybersecurity incident response (multi-tool chain) | 4 | 4 | 10 |
| ML pipeline architecture (RAG, 10K users, $50K budget) | 8 | 8 | 10 |
| Rust MoE runtime optimization (async prefetch, CUDA) | 7 | 8 | 10 |
| Total | 19 | 20 | 30 |
### Honest assessment
- Strengths: Tool-calling (+23 pts vs base), code gen (97%), fast inference (80 tok/s in 15 GB)
- Weaknesses: Instruction following weaker than v3 (67% vs 100% IFEval), math reasoning degraded (52% GSM8K)
- Why: v1 dataset was heavily weighted toward BFCL tool-calling and Opus code traces. Use v3 for instruction-following or reasoning tasks.
## Which version to use?
| Use case | Recommended | Why |
|---|---|---|
| Code generation, debugging | v1 (this repo) | 97% HumanEval |
| Tool-calling, function calling | v1 (this repo) | 90% BFCL (+23 pts vs base) |
| Instruction following, formatting | v3 | 100% IFEval, 100% edge cases |
| Math reasoning | v3 | 84% GSM8K (vs 52% v1) |
| Re-quantization or fine-tuning | BF16 weights | Full precision |
Best of both worlds: Use A-LoRA routing -- an intent classifier selects the appropriate LoRA at runtime. Code/tools queries use v1, instruction/reasoning queries use v3. See Chimere ODO.
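The routing idea can be sketched as a minimal keyword-based classifier. This is a hypothetical illustration, not the actual Chimere ODO classifier: the keyword list, endpoint URLs, and ports are all assumptions.

```python
# Minimal sketch of intent-based routing between two model endpoints.
# Keywords, endpoints, and ports are illustrative assumptions; Chimere ODO
# uses its own intent classifier.

CODE_TOOL_KEYWORDS = ("function", "code", "debug", "api", "tool", "json", "refactor")

ENDPOINTS = {
    "v1": "http://localhost:8081/v1/chat/completions",  # code + tool-calling
    "v3": "http://localhost:8082/v1/chat/completions",  # instructions + reasoning
}

def route(prompt: str) -> str:
    """Return the endpoint of the variant best suited to the prompt."""
    text = prompt.lower()
    if any(kw in text for kw in CODE_TOOL_KEYWORDS):
        return ENDPOINTS["v1"]
    return ENDPOINTS["v3"]

print(route("Write a Python function that parses JSON"))  # v1 endpoint
print(route("Explain each step of the proof"))            # v3 endpoint
```

A real router would use a trained intent classifier rather than keyword matching, but the dispatch structure is the same.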
## Usage
```bash
# llama.cpp / llama-server
llama-server \
  -m Qwen3.5-35B-A3B-Chimere-Distilled-RAMP-v3.gguf \
  -ngl 99 --n-cpu-moe 4 -c 32768 \
  --flash-attn on --jinja --port 8081

# For 16 GB VRAM (RTX 5060 Ti / RTX 4080), add KV cache
# quantization to save VRAM:
#   -ctk q8_0 -ctv q4_0
```
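Once the server is up, llama-server exposes an OpenAI-compatible `/v1/chat/completions` endpoint. A minimal stdlib-only client sketch (the port matches the command above; the prompt and `max_tokens` value are arbitrary):

```python
import json
import urllib.request

URL = "http://localhost:8081/v1/chat/completions"

def build_request(prompt: str, temperature: float = 0.6) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for llama-server."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": 0.95,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Write a Python function that reverses a string.")
# Uncomment to send (requires the server to be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
print(req.get_full_url())
```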
### Recommended sampling parameters
| Mode | temp | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking (default) | 1.0 | 0.95 | 20 | 0.0 |
| Thinking + code/tools | 0.6 | 0.95 | 20 | 0.0 |
| No-think | 0.7 | 0.8 | 20 | 0.0 |
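These presets can be kept as a small lookup merged into request bodies. A sketch; the mode keys are just labels for the table rows above:

```python
# Sampling presets from the table above, keyed by mode.
SAMPLING_PRESETS = {
    "thinking":      {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "thinking_code": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "no_think":      {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "presence_penalty": 0.0},
}

def request_body(prompt: str, mode: str = "thinking_code") -> dict:
    """Merge a sampling preset into an OpenAI-style chat request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING_PRESETS[mode],
    }

body = request_body("Refactor this function.", mode="thinking_code")
print(body["temperature"])  # 0.6
```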
## RAMP Quantization Details
Custom per-tensor quality overrides -- critical paths get higher precision. Overall: ~3.78 BPW.
| Tensor | Quant | BPW | Rationale |
|---|---|---|---|
| attn_v (value) | Q8_0 | 8.0 | Most critical -- errors cause hallucinations |
| ssm_alpha, ssm_d | Q8_0 | 8.0 | GDN recurrent params, tiny but hypersensitive |
| attn_k (key) | Q6_K | 6.5 | Important for attention routing |
| ssm_dt | Q6_K | 6.5 | GDN timestep |
| token_embd, output | Q6_K | 6.5 | Shared embeddings |
| attn_q, attn_output | Q5_K | 5.5 | More tolerant |
| ssm_in, ssm_out | Q5_K | 5.5 | SSM projections |
| 256 MoE experts (FFN) | IQ3_S | 3.44 | 80% of params, high MoE redundancy |
- imatrix: Generated on BF16 model (B200, 192 GB VRAM), 200 calibration chunks
- Result: 15 GB with no measured quality loss on the agentic benchmarks above vs BF16
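The overall bits-per-weight of a mixed-precision scheme like this is a parameter-weighted mean of the per-tensor BPW values. A sketch with hypothetical parameter shares -- the actual shares depend on Qwen3.5-35B-A3B's tensor shapes, so the result here only illustrates the calculation, not the quoted ~3.78 BPW:

```python
# Weighted-mean BPW for a mixed per-tensor quantization scheme.
# The parameter shares below are hypothetical illustrations, not the
# actual Qwen3.5-35B-A3B tensor breakdown.
tensor_groups = [
    # (share of total params, bits per weight)
    (0.80, 3.44),  # MoE expert FFNs at IQ3_S
    (0.12, 5.5),   # attn_q/attn_output/SSM projections at Q5_K
    (0.06, 6.5),   # attn_k, ssm_dt, embeddings at Q6_K
    (0.02, 8.0),   # attn_v and GDN recurrent params at Q8_0
]

overall_bpw = sum(share * bpw for share, bpw in tensor_groups)
print(f"{overall_bpw:.2f} BPW")  # ~3.96 with these assumed shares
```

Because the IQ3_S experts dominate the parameter count, the high-precision overrides on attention and SSM tensors cost relatively little overall size.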
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-35B-A3B (MoE, 256 experts) |
| Method | SFT BF16 LoRA r64, completion-only loss |
| Dataset | 9,763 samples (37% BFCL v3 ground truth + 59% Opus reasoning traces + 4% gold samples) |
| Epochs | 1 (611 steps, batch 16) |
| Training GPU | NVIDIA B200 |
| Training cost | ~$5 |
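Completion-only loss means prompt tokens are masked out of the loss so gradients come only from the model's completions. A minimal sketch of the label masking, using the Hugging Face convention of `-100` for ignored positions (the token IDs are made up):

```python
IGNORE_INDEX = -100  # HF convention: positions labeled -100 are excluded from the loss

def completion_only_labels(prompt_ids: list[int], completion_ids: list[int]) -> list[int]:
    """Mask prompt tokens so the loss is computed only on the completion."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)

prompt = [101, 7592, 2088]      # made-up token IDs for the prompt
completion = [2023, 2003, 102]  # made-up token IDs for the completion

labels = completion_only_labels(prompt, completion)
print(labels)  # [-100, -100, -100, 2023, 2003, 102]
```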
## Files

| File | Size | Description |
|---|---|---|
| Qwen3.5-35B-A3B-Chimere-Distilled-RAMP-v3.gguf | 15 GB | v1 RAMP GGUF (code + tools focus) |
| imatrix.dat | 184 MB | Importance matrix used for quantization |
| benchmark_results.json | -- | Mini-benchmark results (JSON) |
## Related

- Chimere v3 GGUF -- best instruction following + reasoning
- BF16 full weights -- for re-quantization or fine-tuning
- LoRA adapter -- for further training
- GitHub: Chimere
- GitHub: Chimere ODO
## Citation

```bibtex
@misc{chimere-v1-2026,
  title={Chimere v1: Claude Opus 4.6 Distillation of Qwen3.5-35B-A3B MoE for Code and Tool-Calling},
  author={Kevletesteur},
  year={2026},
  url={https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF}
}
```