# Qwen3.5-35B-A3B Chimere v1 -- RAMP GGUF

Chimere v1: a Claude Opus 4.6 distillation of Qwen3.5-35B-A3B, optimized for code generation and tool-calling.

RAMP quantization (per-tensor quality overrides + imatrix) -- 15 GB, fits in 16 GB VRAM, ~80 tok/s on an RTX 5060 Ti.

Looking for v3 (best instruction following + reasoning)? See Chimere v3 GGUF.
## Benchmark Results

### v1 strengths: code and tool-calling
| Benchmark | v1 RAMP (this repo) | v3 RAMP | Base Qwen3.5-35B-A3B | Notes |
|---|---|---|---|---|
| HumanEval (30 problems, executed) | 97% | 83% | -- | Best code gen |
| BFCL tool-calling (20 questions) | 90% | 75% | 67.3% | +23 pts vs base |
| IFEval (15 instruction tests) | 67% | 100% | ~91.9% | v3 better here |
| Edge cases (15 adversarial tests) | 87% | 100% | -- | v3 better here |
| GSM8K CoT 8-shot (1,319 qs) | 52.2% | 84.0% | -- | v3 better here |
| Speed (RTX 5060 Ti 16 GB) | ~80 tok/s | ~80 tok/s | -- | |
### Qualitative agentic tests

| Scenario | v1 | v3 | Max |
|---|---|---|---|
| Cybersecurity incident response (multi-tool chain) | 4 | 4 | 10 |
| ML pipeline architecture (RAG, 10K users, $50K budget) | 8 | 8 | 10 |
| Rust MoE runtime optimization (async prefetch, CUDA) | 7 | 8 | 10 |
| Total | 19 | 20 | 30 |
### Honest assessment
- Strengths: Tool-calling (+23 pts vs base), code gen (97%), fast inference (80 tok/s in 15 GB)
- Weaknesses: Instruction following weaker than v3 (67% vs 100% IFEval), math reasoning degraded (52% GSM8K)
- Why: v1 dataset was heavily weighted toward BFCL tool-calling and Opus code traces. Use v3 for instruction-following or reasoning tasks.
## Which version to use?
| Use case | Recommended | Why |
|---|---|---|
| Code generation, debugging | v1 (this repo) | 97% HumanEval |
| Tool-calling, function calling | v1 (this repo) | 90% BFCL (+23 pts vs base) |
| Instruction following, formatting | v3 | 100% IFEval, 100% edge cases |
| Math reasoning | v3 | 84% GSM8K (vs 52% v1) |
| Re-quantization or fine-tuning | BF16 weights | Full precision |
Best of both worlds: Use A-LoRA routing -- an intent classifier selects the appropriate LoRA at runtime. Code/tools queries use v1, instruction/reasoning queries use v3. See Chimere ODO.
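The routing idea can be sketched as a minimal keyword-based classifier. This is a hypothetical illustration, not the actual Chimere ODO classifier: the keyword list, endpoint URLs, and ports are all assumptions.

```python
# Minimal sketch of intent-based routing between two model endpoints.
# Keywords, endpoints, and ports are illustrative assumptions; Chimere ODO
# uses its own intent classifier.

CODE_TOOL_KEYWORDS = ("function", "code", "debug", "api", "tool", "json", "refactor")

ENDPOINTS = {
    "v1": "http://localhost:8081/v1/chat/completions",  # code + tool-calling
    "v3": "http://localhost:8082/v1/chat/completions",  # instructions + reasoning
}

def route(prompt: str) -> str:
    """Return the endpoint of the variant best suited to the prompt."""
    text = prompt.lower()
    if any(kw in text for kw in CODE_TOOL_KEYWORDS):
        return ENDPOINTS["v1"]
    return ENDPOINTS["v3"]

print(route("Write a Python function that parses JSON"))  # v1 endpoint
print(route("Explain each step of the proof"))            # v3 endpoint
```

A real router would use a trained intent classifier rather than keyword matching, but the dispatch structure is the same.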
## Usage
```bash
# llama.cpp / llama-server
llama-server \
  -m Qwen3.5-35B-A3B-Chimere-Distilled-RAMP-v3.gguf \
  -ngl 99 --n-cpu-moe 4 -c 32768 \
  --flash-attn on --jinja --port 8081

# For 16 GB VRAM (RTX 5060 Ti / RTX 4080), add KV cache
# quantization to save VRAM:
#   -ctk q8_0 -ctv q4_0
```
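Once the server is up, llama-server exposes an OpenAI-compatible `/v1/chat/completions` endpoint. A minimal stdlib-only client sketch (the port matches the command above; the prompt and `max_tokens` value are arbitrary):

```python
import json
import urllib.request

URL = "http://localhost:8081/v1/chat/completions"

def build_request(prompt: str, temperature: float = 0.6) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for llama-server."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": 0.95,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Write a Python function that reverses a string.")
# Uncomment to send (requires the server to be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
print(req.get_full_url())
```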
### Recommended sampling parameters
| Mode | temp | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking (default) | 1.0 | 0.95 | 20 | 0.0 |
| Thinking + code/tools | 0.6 | 0.95 | 20 | 0.0 |
| No-think | 0.7 | 0.8 | 20 | 0.0 |
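These presets can be kept as a small lookup merged into request bodies. A sketch; the mode keys are just labels for the table rows above:

```python
# Sampling presets from the table above, keyed by mode.
SAMPLING_PRESETS = {
    "thinking":      {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "thinking_code": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "no_think":      {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "presence_penalty": 0.0},
}

def request_body(prompt: str, mode: str = "thinking_code") -> dict:
    """Merge a sampling preset into an OpenAI-style chat request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING_PRESETS[mode],
    }

body = request_body("Refactor this function.", mode="thinking_code")
print(body["temperature"])  # 0.6
```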
## RAMP Quantization Details
Custom per-tensor quality overrides -- critical paths get higher precision. Overall: ~3.78 BPW.
| Tensor | Quant | BPW | Rationale |
|---|---|---|---|
| attn_v (value) | Q8_0 | 8.0 | Most critical -- errors cause hallucinations |
| ssm_alpha, ssm_d | Q8_0 | 8.0 | GDN recurrent params, tiny but hypersensitive |
| attn_k (key) | Q6_K | 6.5 | Important for attention routing |
| ssm_dt | Q6_K | 6.5 | GDN timestep |
| token_embd, output | Q6_K | 6.5 | Shared embeddings |
| attn_q, attn_output | Q5_K | 5.5 | More tolerant |
| ssm_in, ssm_out | Q5_K | 5.5 | SSM projections |
| 256 MoE experts (FFN) | IQ3_S | 3.44 | 80% of params, high MoE redundancy |
- imatrix: Generated on BF16 model (B200, 192 GB VRAM), 200 calibration chunks
- Result: 15 GB with no measured quality loss on the agentic benchmarks above vs BF16
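The overall bits-per-weight of a mixed-precision scheme like this is a parameter-weighted mean of the per-tensor BPW values. A sketch with hypothetical parameter shares -- the actual shares depend on Qwen3.5-35B-A3B's tensor shapes, so the result here only illustrates the calculation, not the quoted ~3.78 BPW:

```python
# Weighted-mean BPW for a mixed per-tensor quantization scheme.
# The parameter shares below are hypothetical illustrations, not the
# actual Qwen3.5-35B-A3B tensor breakdown.
tensor_groups = [
    # (share of total params, bits per weight)
    (0.80, 3.44),  # MoE expert FFNs at IQ3_S
    (0.12, 5.5),   # attn_q/attn_output/SSM projections at Q5_K
    (0.06, 6.5),   # attn_k, ssm_dt, embeddings at Q6_K
    (0.02, 8.0),   # attn_v and GDN recurrent params at Q8_0
]

overall_bpw = sum(share * bpw for share, bpw in tensor_groups)
print(f"{overall_bpw:.2f} BPW")  # ~3.96 with these assumed shares
```

Because the IQ3_S experts dominate the parameter count, the high-precision overrides on attention and SSM tensors cost relatively little overall size.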
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-35B-A3B (MoE, 256 experts) |
| Method | SFT BF16 LoRA r64, completion-only loss |
| Dataset | 9,763 samples (37% BFCL v3 ground truth + 59% Opus reasoning traces + 4% gold samples) |
| Epochs | 1 (611 steps, batch 16) |
| Training GPU | NVIDIA B200 |
| Training cost | ~$5 |
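Completion-only loss means prompt tokens are masked out of the loss so gradients come only from the model's completions. A minimal sketch of the label masking, using the Hugging Face convention of `-100` for ignored positions (the token IDs are made up):

```python
IGNORE_INDEX = -100  # HF convention: positions labeled -100 are excluded from the loss

def completion_only_labels(prompt_ids: list[int], completion_ids: list[int]) -> list[int]:
    """Mask prompt tokens so the loss is computed only on the completion."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)

prompt = [101, 7592, 2088]      # made-up token IDs for the prompt
completion = [2023, 2003, 102]  # made-up token IDs for the completion

labels = completion_only_labels(prompt, completion)
print(labels)  # [-100, -100, -100, 2023, 2003, 102]
```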
## Files

| File | Size | Description |
|---|---|---|
| Qwen3.5-35B-A3B-Chimere-Distilled-RAMP-v3.gguf | 15 GB | v1 RAMP GGUF (code + tools focus) |
| imatrix.dat | 184 MB | Importance matrix used for quantization |
| benchmark_results.json | -- | Mini-benchmark results (JSON) |
## Related

- Chimere v3 GGUF -- best instruction following + reasoning
- BF16 full weights -- for re-quantization or fine-tuning
- LoRA adapter -- for further training
- GitHub: Chimere
- GitHub: Chimere ODO
## Citation

```bibtex
@misc{chimere-v1-2026,
  title={Chimere v1: Claude Opus 4.6 Distillation of Qwen3.5-35B-A3B MoE for Code and Tool-Calling},
  author={Kevletesteur},
  year={2026},
  url={https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF}
}
```