# Qwen3.5-35B-A3B Chimere Distilled -- LoRA Adapter

LoRA adapter from the Chimere distillation (Claude Opus 4.6 distilled into Qwen3.5-35B-A3B).

**This is not a standalone model.** It must be loaded on top of the base model `unsloth/Qwen3.5-35B-A3B`.
## Why is this adapter 44.7 GB?
This LoRA targets all 256 MoE experts in addition to the standard attention projections. In a typical dense model, an r=64 LoRA is 1-5 GB. But Qwen3.5-35B-A3B has 256 experts per MoE layer, each with gate/up/down projections -- so the adapter includes LoRA weights for every expert, resulting in the large size. This is intentional: distilling Opus reasoning into a MoE architecture requires adapting the expert routing.
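A back-of-envelope calculation shows how expert count dominates the adapter size. The model dimensions below (hidden size, per-expert intermediate size, layer count) are illustrative assumptions, not the official Qwen3.5-35B-A3B config; the point is the scaling, not the exact total:

```python
def lora_params(d_in, d_out, r):
    """LoRA adds two factors: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

r = 64
hidden = 4096       # assumed hidden size
moe_inter = 1024    # assumed per-expert intermediate size
layers = 48         # assumed layer count
experts = 256

# Attention projections (q, k, v, o), roughly hidden x hidden each
attn = 4 * lora_params(hidden, hidden, r) * layers

# Per expert: gate/up (hidden -> inter) and down (inter -> hidden)
per_expert = (2 * lora_params(hidden, moe_inter, r)
              + lora_params(moe_inter, hidden, r))
moe = per_expert * experts * layers

total_bf16_gb = (attn + moe) * 2 / 1e9  # 2 bytes per BF16 weight
print(f"attention-only: {attn * 2 / 1e9:.2f} GB, "
      f"with all experts: {total_bf16_gb:.2f} GB")
```

Even with these rough numbers, the expert projections contribute two orders of magnitude more parameters than the attention projections, which is why this adapter is tens of gigabytes rather than the 1-5 GB typical of a dense-model r=64 LoRA.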
## When to use this repo
| Goal | Use this? | Alternative |
|---|---|---|
| Continue fine-tuning from Chimere checkpoint | Yes | Or use BF16 merged weights |
| Merge into BF16 for custom quantization | Yes | Or download pre-merged BF16 |
| A-LoRA routing (swap adapters at runtime) | Yes | -- |
| Run inference directly | No | Use v1 GGUF or v3 GGUF |
## Usage

### Load with PEFT
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "unsloth/Qwen3.5-35B-A3B"
adapter_id = "Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-LoRA"

# Load base model (the tokenizer ships with the adapter repo)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype="bfloat16",
    device_map="auto",
)

# Apply the LoRA adapter
model = PeftModel.from_pretrained(model, adapter_id)

messages = [{"role": "user", "content": "Write a Python function to parse JSON."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,  # required for temperature/top_p/top_k to take effect
    temperature=0.7,
    top_p=0.8,
    top_k=20,
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
### Merge LoRA into base model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

adapter_id = "Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-LoRA"

base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen3.5-35B-A3B",
    torch_dtype="bfloat16",
    device_map="cpu",  # merge on CPU to avoid VRAM issues
)
model = PeftModel.from_pretrained(base_model, adapter_id)

merged = model.merge_and_unload()
merged.save_pretrained("./chimere-merged-bf16")

# Save the tokenizer alongside the merged weights so the output
# directory is self-contained for quantization or inference
AutoTokenizer.from_pretrained(adapter_id).save_pretrained("./chimere-merged-bf16")
```
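For intuition, `merge_and_unload` folds each LoRA pair back into the base weight as `W + (alpha / r) * B @ A`. A minimal NumPy sketch of that computation, with illustrative dimensions (not the real model shapes):

```python
import numpy as np

# LoRA merge: W_eff = W + (alpha / r) * B @ A.
# With alpha = r (ratio 1.0, as in this adapter) the scale is exactly 1,
# so merging reduces to W + B @ A.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 4, 4

W = rng.standard_normal((d_out, d_in))  # frozen base weight
A = rng.standard_normal((r, d_in))      # trained down-projection
B = np.zeros((d_out, r))                # up-projection (zero at init)

scale = alpha / r                       # = 1.0 here
W_eff = W + scale * (B @ A)

# At initialization B is zero, so the adapter is a no-op on the base model
assert np.allclose(W_eff, W)
```

After training, `B` is nonzero and `W_eff` differs from `W`; the merged checkpoint simply stores `W_eff` so no adapter loading is needed downstream.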
### Continue fine-tuning with Unsloth
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-LoRA",
    max_seq_length=32768,
    dtype=None,  # auto-detect (bfloat16 on supported GPUs)
    load_in_4bit=True,
)

# The LoRA is already applied -- continue training with your dataset
# ...
```
## LoRA Configuration
| Parameter | Value |
|---|---|
| Base model | unsloth/Qwen3.5-35B-A3B |
| PEFT type | LoRA |
| Rank (r) | 64 |
| Alpha | 64 (alpha/r ratio = 1.0) |
| Dropout | 0 |
| Bias | none |
| Task | CAUSAL_LM |
| PEFT version | 0.18.1 |
### Target modules

- Attention projections: `q_proj`, `k_proj`, `v_proj`, `o_proj`
- MoE expert projections (all 256 experts): `gate_proj`, `up_proj`, `down_proj`
- MoE fused expert tensors: `mlp.experts.gate_up_proj`, `mlp.experts.down_proj`
This broad targeting (attention + all MoE experts) is why the adapter is 44.7 GB -- it adapts the routing behavior of every expert.
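The settings above are what `adapter_config.json` in this repo encodes. A sketch of the equivalent configuration as a plain dict (the per-expert fused tensor names are omitted here for brevity; the full list is in the shipped config file), which could be passed to `peft.LoraConfig(**lora_kwargs)` when reproducing the setup:

```python
# Mirror of the "LoRA Configuration" and "Target modules" sections above,
# expressed as keyword arguments for peft.LoraConfig.
lora_kwargs = {
    "r": 64,
    "lora_alpha": 64,      # alpha/r ratio = 1.0
    "lora_dropout": 0.0,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": [
        # attention projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        # MoE expert projections (matched in every one of the 256 experts)
        "gate_proj", "up_proj", "down_proj",
    ],
}
```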
## Files

| File | Description |
|---|---|
| `adapter_model.safetensors` | LoRA adapter weights (~44.7 GB) |
| `adapter_config.json` | LoRA configuration (rank, alpha, target modules) |
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer (same as base model) |
| `chat_template.jinja` | Chat template (Qwen3.5 format) |
| `processor_config.json` | Processor config |
## Training Details
| Parameter | Value |
|---|---|
| Method | SFT BF16 LoRA r64, completion-only loss |
| Dataset | 9,763 samples (37% BFCL v3 ground truth + 59% Opus reasoning traces + 4% gold) |
| Distillation source | Claude Opus 4.6 reasoning traces |
| Epochs | 1 (611 steps, batch 16) |
| GPU | NVIDIA B200 |
| Cost | ~$5 |
| Framework | Unsloth + PEFT 0.18.1 |
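The step count in the table follows directly from the dataset size and batch size:

```python
import math

samples, batch = 9763, 16
steps = math.ceil(samples / batch)  # one epoch over the dataset
print(steps)  # 611
```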
## Related
- Chimere v1 GGUF -- v1 RAMP quantized, ready for inference
- Chimere v3 GGUF -- v3 RAMP quantized, ready for inference
- BF16 merged weights -- Pre-merged BF16 (no adapter loading needed)
- Base model: Qwen3.5-35B-A3B (official)
- Base model: unsloth/Qwen3.5-35B-A3B (Unsloth optimized, used for training)
- GitHub: Chimere
- GitHub: Chimere ODO
## Citation

```bibtex
@misc{chimere-distilled-2026,
  title={Chimere: Claude Opus 4.6 Distillation of Qwen3.5-35B-A3B MoE for Agentic Local Inference},
  author={Kevletesteur},
  year={2026},
  url={https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-LoRA}
}
```