# Qwen3.5-9B Claude-Distill

A fine-tuned version of Qwen/Qwen3.5-9B, trained through knowledge distillation from Claude. The model was trained with full-parameter fine-tuning on curated Claude reasoning traces.
## Model Highlights
- Claude-Distilled Reasoning: Trained on high-quality chain-of-thought reasoning traces distilled from Claude Opus
- Multi-Domain Coverage: Math, logic, coding, creative writing, STEM, and multi-turn reasoning
- Dense Architecture: Based on Qwen/Qwen3.5-9B with 9B parameters
- Multimodal Capable: Inherits vision-language capabilities from Qwen3.5
## Model Description
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-9B |
| Model Type | Causal Language Model with Vision Encoder |
| Parameters | 9B |
| Languages | English, Chinese |
| License | Apache 2.0 |
| Developer | Kassadin88 |
## Training Data

The model was distilled from Claude outputs using the following datasets:
| Dataset | Samples | Description |
|---|---|---|
| Claude Opus 4.5 High Reasoning | 250 | High reasoning depth samples |
| Claude Opus 4.6 Reasoning | 9,633 | Math, logic puzzles, multi-step instructions with CoT |
| Claude Opus 4.6 High Reasoning | 757 | Coding and creative writing with adaptive reasoning |
| Claude Opus 4.6 Extended Reasoning | 500 | Extended reasoning across STEM and practical domains |
| Claude Opus 4.6 Extended Reasoning 887x | 887 | Tool calling, bullshit detection, multi-turn traces |
| Claude Sonnet & Opus 4.6 Reasoning | 524 | Natural human-written prompts from Reddit & Stack Overflow |
| Opus 4.6 Reasoning Filtered | 2,326 | Filtered reasoning traces (refusals removed) |
Total: ~14.9K samples
### Data Composition
| Domain | Percentage | Description |
|---|---|---|
| Math & Logic | ~40% | Multi-step problem solving with chain-of-thought |
| Coding | ~25% | Code generation, debugging, and algorithm design |
| STEM | ~15% | Science, engineering, and extended reasoning |
| Creative Writing | ~10% | Adaptive reasoning for creative tasks |
| Multi-turn / Tool Use | ~10% | Tool calling, clarification, and dialogue |
## Benchmark Results
For detailed benchmark results and model architecture, please refer to the original Qwen/Qwen3.5-9B model card.
## Quickstart

For the full usage guide, please refer to the original Qwen/Qwen3.5-9B model card.

### Using with vLLM
```shell
vllm serve Kassadin88/Qwen3.5-9B-Claude-distill \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --trust-remote-code \
  --reasoning-parser qwen3
```
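Once the server is running, vLLM exposes an OpenAI-compatible Chat Completions endpoint. A minimal client sketch using only the Python standard library (the port and model name match the command above; the `max_tokens` and `temperature` values are illustrative assumptions, not tuned recommendations):

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "Kassadin88/Qwen3.5-9B-Claude-distill") -> dict:
    """Build an OpenAI-style chat-completions payload for the vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,    # illustrative generation cap
        "temperature": 0.6,   # illustrative sampling choice
    }


def send_chat_request(payload: dict, host: str = "http://localhost:8000") -> dict:
    """POST the payload to the server's OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Usage (requires the server started as shown above):
# reply = send_chat_request(build_chat_request("Hello, how are you?"))
# print(reply["choices"][0]["message"]["content"])
```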
### Using with SGLang
```shell
python -m sglang.launch_server \
  --model-path Kassadin88/Qwen3.5-9B-Claude-distill \
  --port 8000 \
  --tp-size 2 \
  --mem-fraction-static 0.8 \
  --context-length 32768 \
  --reasoning-parser qwen3
```
### Using with Hugging Face Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Kassadin88/Qwen3.5-9B-Claude-distill"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "user", "content": "Hello, how are you?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
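Since the training data includes multi-turn traces, the model can be used conversationally by appending each reply back into `messages` before the next `generate` call. A sketch of that message-list pattern (the `append_turn` helper name is illustrative, not part of any library):

```python
def append_turn(messages: list, assistant_reply: str, next_user_message: str) -> list:
    """Extend a chat history with the model's reply and the next user turn."""
    messages = list(messages)  # avoid mutating the caller's list
    messages.append({"role": "assistant", "content": assistant_reply})
    messages.append({"role": "user", "content": next_user_message})
    return messages


history = [{"role": "user", "content": "Hello, how are you?"}]
history = append_turn(history, "I'm doing well, thanks!", "Great - can you solve 12 * 13?")
# Re-apply the chat template to `history` and call model.generate as above.
```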
## Usage Tips

### For Reasoning Tasks
```python
messages = [
    {"role": "user", "content": "Solve step by step: What is the sum of all prime numbers less than 100?"}
]
# Model will use chain-of-thought reasoning from Claude distillation
```
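The expected answer to this prompt can be verified independently; a short sketch that sums the primes below 100 with a sieve (the correct total is 1060):

```python
def primes_below(n: int) -> list:
    """Sieve of Eratosthenes: all primes strictly below n."""
    sieve = [True] * n
    sieve[0:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            # Mark all multiples of p starting at p*p as composite
            sieve[p * p::p] = [False] * len(sieve[p * p::p])
    return [i for i, is_prime in enumerate(sieve) if is_prime]


# Sum of all primes less than 100
assert sum(primes_below(100)) == 1060
```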
### For Coding Tasks
```python
messages = [
    {"role": "user", "content": "Implement a binary search tree with insert, delete, and find operations in Python."}
]
# Model benefits from Claude's coding reasoning traces
```
### Enabling / Disabling Thinking
```python
# Enable thinking mode (recommended for reasoning tasks)
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)

# Disable thinking mode (for simple tasks, faster inference)
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
```
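When thinking mode is enabled, Qwen3-style models emit their reasoning inside a `<think>...</think>` block before the final answer. A minimal helper for separating the two, assuming that tag convention (adjust if this model's chat template differs):

```python
def split_thinking(text: str) -> tuple:
    """Split generated text into (reasoning, final_answer).

    Assumes the Qwen3 convention of wrapping chain-of-thought in a
    single <think>...</think> block at the start of the output.
    """
    open_tag, close_tag = "<think>", "</think>"
    start = text.find(open_tag)
    end = text.find(close_tag)
    if start == -1 or end == -1:
        # No thinking block present (e.g. enable_thinking=False)
        return "", text.strip()
    reasoning = text[start + len(open_tag):end].strip()
    answer = text[end + len(close_tag):].strip()
    return reasoning, answer


# Example:
# reasoning, answer = split_thinking(response)
```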
## Limitations

- The model is distilled from Claude and may inherit biases present in the training data
- The distillation dataset is relatively small (~14.9K samples), which may limit generalization
- The model should not be used for medical, legal, or financial advice without expert verification
- Reasoning capability is bounded by the quality and diversity of the distillation data
## Citation

```bibtex
@misc{qwen3.5-9b-claude-distill,
  author    = {Kassadin88},
  title     = {Qwen3.5-9B Claude-Distill: A Claude-Distilled Fine-Tuned Model},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Kassadin88/Qwen3.5-9B-Claude-distill}
}
```
## Acknowledgments
- Base Model: Qwen Team for Qwen3.5
- Training Data: Various Claude Opus reasoning datasets on HuggingFace
- Training Framework: DeepSpeed
Note: This model is intended for research and educational purposes. Please use responsibly.