⚠️ CRITICAL: Ollama Inference Flag Required
If you serve this model via Ollama with the qwen3.5 renderer (the standard recommended setup), you MUST pass `"think": false` in the `/api/chat` request body for chat, instruction following, and tool use:

```shell
curl -X POST http://localhost:11434/api/chat \
  -d '{"model": "...", "think": false, "messages": [...], "stream": false}'
```

Without this flag, the renderer auto-injects `<think>` tags into every chat completion. On longer prompts the model can stay inside the `<think>` block past the response budget, never emit `</think>`, and produce zero answer tokens on 25-46% of requests.

Set `"think": true` (or omit the field) only when you DO want chain-of-thought reasoning (math, planning, complex multi-step tasks). This is Qwen3 dual-mode operation per https://qwenlm.github.io/blog/qwen3/.

See the dataset `cudabenchmarktest/r9-research-framework_OLLAMA_INFERENCE_WARNING.md` for the full explanation.
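The same request can be made from Python with only the standard library. This is a sketch: the endpoint and response shape follow Ollama's documented `/api/chat` contract, and the model name is a placeholder you must replace.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default Ollama endpoint


def build_chat_request(model, messages, think=False):
    """Build an /api/chat payload; think=False disables <think> injection."""
    return {
        "model": model,
        "think": think,
        "messages": messages,
        "stream": False,
    }


def chat(model, messages):
    """Send a non-streaming chat request and return the assistant's reply text."""
    payload = build_chat_request(model, messages)
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

Because `think` defaults to `False` in the helper, plain chat calls get the safe configuration without remembering the flag each time.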
Qwen3.5-9B R7 Research (FP16)
Fine-tuned Qwen3.5-9B with distilled reasoning from research-backed datasets. Text-only model (no vision encoder). For the vision-capable version, see cudabenchmarktest/qwen3.5-9b-r7-research-vision.
This is the pre-GGUF FP16 safetensors checkpoint. For quantized GGUF versions ready for Ollama, see robit/qwen3.5-9b-r7-research on Ollama.
Capabilities
- Thinking — produces structured reasoning in `<think>` blocks
- Tool calling — structured function calls when given tool definitions
- Instruction following — concise answers, format constraints, system prompt adherence
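For tool calling, the model's output has to be parsed back into structured calls. The sketch below assumes Qwen-style formatting, where each call is wrapped in `<tool_call>` tags containing JSON of the form `{"name": ..., "arguments": {...}}`; adjust the pattern if your serving stack renders tool calls differently.

```python
import json
import re

# Assumed Qwen-style tool-call wrapper: <tool_call>{...JSON...}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return a list of parsed tool-call dicts found in a completion."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # skip malformed JSON rather than crash on a bad generation
    return calls


completion = (
    "Let me check the weather.\n"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
)
print(extract_tool_calls(completion))
# → [{'name': 'get_weather', 'arguments': {'city': 'Paris'}}]
```

Skipping malformed JSON instead of raising keeps an agent loop alive when the model occasionally emits a truncated call.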
Eval Results
| Benchmark | Score |
|---|---|
| Diverse stochastic eval (38 tests, 9 categories) | 86.8% |
| Base qwen3.5:9b on same eval | 79.0% |
Training Details
- Method: LoRA SFT (r=32, alpha=64, LR=1e-4, 1 epoch, completion-only loss masking)
- Hardware: 3x NVIDIA A100 80GB (DDP)
- Data: Additive mix of 4043 samples:
- bespokelabs/Bespoke-Stratos-17k — DeepSeek-R1 reasoning traces (2000 samples)
- allenai/tulu-3-sft-mixture — instruction diversity (1358 samples)
- Open-Orca/SlimOrca — curated GPT-4 instructions (451 samples)
- PrimeIntellect/SYNTHETIC-1-SFT-Data — verified math/code/STEM (312 samples)
- Static format-constrained, conversational, and code examples (remaining)
- Best eval loss: 0.4950
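The stated hyperparameters map onto a peft `LoraConfig` roughly as follows. This is a sketch, not the training code: `lora_dropout` and `target_modules` are not stated on this card and are assumptions here — see the linked robit-man/fine_tuning_suite repository for the actual configuration.

```python
from peft import LoraConfig

# r and lora_alpha come from the card (r=32, alpha=64);
# dropout and target modules are assumed for illustration.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,                                        # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
```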
Training Suite
Full training pipeline, evaluation scripts, and documentation: robit-man/fine_tuning_suite
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "cudabenchmarktest/qwen3.5-9b-r7-research",
    torch_dtype="auto",
    device_map="auto",  # place on GPU if available (requires accelerate)
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("cudabenchmarktest/qwen3.5-9b-r7-research")

messages = [{"role": "user", "content": "What is the capital of France?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
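Since the model can emit `<think>` reasoning before the final reply, a small post-processing helper is handy. This sketch assumes the output follows Qwen3's `<think>...</think>` convention described in the warning above.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Split a completion into (reasoning, answer) parts.

    Returns an empty reasoning string when no <think> block is present.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


reasoning, answer = split_reasoning(
    "<think>Paris is the capital of France.</think>\nThe capital of France is Paris."
)
print(answer)
# → The capital of France is Paris.
```

Keeping the reasoning string around (rather than discarding it) is useful for logging and for debugging runaway `<think>` blocks.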
License
Apache 2.0 (inherited from Qwen3.5-9B). Training data licenses vary by source.