⚠️ CRITICAL: Ollama Inference Flag Required
If you serve this model via Ollama with the qwen3.5 renderer (the standard recommended setup), you MUST pass `"think": false` in the `/api/chat` request body for chat, instruction following, and tool use:

```shell
curl -X POST http://localhost:11434/api/chat \
  -d '{"model": "...", "think": false, "messages": [...], "stream": false}'
```

Without this flag, the renderer auto-injects `<think>` tags into every chat completion. On longer prompts the model can stay inside the `<think>` block past the response budget, never emit `</think>`, and produce zero answer tokens on 25-46% of requests.

Set `"think": true` (or omit the field) only when you DO want chain-of-thought reasoning (math, planning, complex multi-step tasks). This is Qwen3 dual-mode operation per https://qwenlm.github.io/blog/qwen3/.

See the dataset `cudabenchmarktest/r9-research-framework_OLLAMA_INFERENCE_WARNING.md` for the full explanation.
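The same request can be made from Python with only the standard library. This is a sketch: the endpoint and response shape follow Ollama's documented `/api/chat` contract, and the model name is a placeholder you must replace.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default Ollama endpoint


def build_chat_request(model, messages, think=False):
    """Build an /api/chat payload; think=False disables <think> injection."""
    return {
        "model": model,
        "think": think,
        "messages": messages,
        "stream": False,
    }


def chat(model, messages):
    """Send a non-streaming chat request and return the assistant's reply text."""
    payload = build_chat_request(model, messages)
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

Because `think` defaults to `False` in the helper, plain chat calls get the safe configuration without remembering the flag each time.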
Qwen3.5-9B R7 Research (FP16)
Fine-tuned Qwen3.5-9B with distilled reasoning from research-backed datasets. Text-only model (no vision encoder). For the vision-capable version, see cudabenchmarktest/qwen3.5-9b-r7-research-vision.
This is the pre-GGUF FP16 safetensors checkpoint. For quantized GGUF versions ready for Ollama, see robit/qwen3.5-9b-r7-research on Ollama.
Capabilities
- Thinking — produces structured reasoning in `<think>` blocks
- Tool calling — structured function calls when given tool definitions
- Instruction following — concise answers, format constraints, system prompt adherence
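For tool calling, the model's output has to be parsed back into structured calls. The sketch below assumes Qwen-style formatting, where each call is wrapped in `<tool_call>` tags containing JSON of the form `{"name": ..., "arguments": {...}}`; adjust the pattern if your serving stack renders tool calls differently.

```python
import json
import re

# Assumed Qwen-style tool-call wrapper: <tool_call>{...JSON...}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return a list of parsed tool-call dicts found in a completion."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # skip malformed JSON rather than crash on a bad generation
    return calls


completion = (
    "Let me check the weather.\n"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
)
print(extract_tool_calls(completion))
# → [{'name': 'get_weather', 'arguments': {'city': 'Paris'}}]
```

Skipping malformed JSON instead of raising keeps an agent loop alive when the model occasionally emits a truncated call.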
Eval Results
| Benchmark | Score |
|---|---|
| Diverse stochastic eval (38 tests, 9 categories) | 86.8% |
| Base qwen3.5:9b on same eval | 79.0% |
Training Details
- Method: LoRA SFT (r=32, alpha=64, LR=1e-4, 1 epoch, completion-only loss masking)
- Hardware: 3x NVIDIA A100 80GB (DDP)
- Data: Additive mix of 4043 samples:
- bespokelabs/Bespoke-Stratos-17k — DeepSeek-R1 reasoning traces (2000 samples)
- allenai/tulu-3-sft-mixture — instruction diversity (1358 samples)
- Open-Orca/SlimOrca — curated GPT-4 instructions (451 samples)
- PrimeIntellect/SYNTHETIC-1-SFT-Data — verified math/code/STEM (312 samples)
- Static format-constrained, conversational, and code examples (remaining)
- Best eval loss: 0.4950
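The stated hyperparameters map onto a peft `LoraConfig` roughly as follows. This is a sketch, not the training code: `lora_dropout` and `target_modules` are not stated on this card and are assumptions here — see the linked robit-man/fine_tuning_suite repository for the actual configuration.

```python
from peft import LoraConfig

# r and lora_alpha come from the card (r=32, alpha=64);
# dropout and target modules are assumed for illustration.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,                                        # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
```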
Training Suite
Full training pipeline, evaluation scripts, and documentation: robit-man/fine_tuning_suite
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "cudabenchmarktest/qwen3.5-9b-r7-research",
    torch_dtype="auto",
    device_map="auto",  # place on GPU if available (requires accelerate)
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("cudabenchmarktest/qwen3.5-9b-r7-research")

messages = [{"role": "user", "content": "What is the capital of France?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
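Since the model can emit `<think>` reasoning before the final reply, a small post-processing helper is handy. This sketch assumes the output follows Qwen3's `<think>...</think>` convention described in the warning above.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Split a completion into (reasoning, answer) parts.

    Returns an empty reasoning string when no <think> block is present.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


reasoning, answer = split_reasoning(
    "<think>Paris is the capital of France.</think>\nThe capital of France is Paris."
)
print(answer)
# → The capital of France is Paris.
```

Keeping the reasoning string around (rather than discarding it) is useful for logging and for debugging runaway `<think>` blocks.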
License
Apache 2.0 (inherited from Qwen3.5-9B). Training data licenses vary by source.