⚠️ CRITICAL: Ollama Inference Flag Required

If you serve this model via Ollama with the qwen3.5 RENDERER (the standard recommended setup), you MUST pass "think": false in the /api/chat request body for chat, instruction following, and tool use.

curl -X POST http://localhost:11434/api/chat \
  -d '{"model": "...", "think": false, "messages": [...], "stream": false}'

Without this flag, the renderer auto-injects <think> tags into every chat completion. On longer prompts the model can stay inside the <think> block past the response budget, never emit </think>, and produce zero answer tokens on 25-46% of requests.

Set think: true (or omit the field) only when you DO want chain-of-thought reasoning (math, planning, complex multi-step tasks). This is Qwen3 dual-mode operation; see https://qwenlm.github.io/blog/qwen3/.
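The same request from Python, mirroring the curl call above (the model tag is a placeholder; the optional requests-based send is shown commented out):

```python
import json

def build_chat_request(model, messages, think=False, stream=False):
    """Return the JSON body for Ollama's /api/chat endpoint with thinking disabled by default."""
    return {"model": model, "think": think, "messages": messages, "stream": stream}

# Placeholder model tag -- substitute the name you used with `ollama create`.
payload = build_chat_request(
    "qwen3.5-9b-r7-research-vision",
    [{"role": "user", "content": "List three prime numbers."}],
)
body = json.dumps(payload)

# To send it (requires a running Ollama server):
# import requests
# resp = requests.post("http://localhost:11434/api/chat", data=body)
# print(resp.json()["message"]["content"])
```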

See _OLLAMA_INFERENCE_WARNING.md in the cudabenchmarktest/r9-research-framework dataset for the full explanation.


Qwen3.5-9B R7 Research Vision (FP16)

Fine-tuned Qwen3.5-9B with distilled reasoning and full vision support. 760 tensors (427 text + 333 vision) — vision tower preserved byte-for-byte from base via splice. For the text-only version, see cudabenchmarktest/qwen3.5-9b-r7-research.

This is the pre-GGUF FP16 safetensors checkpoint. For quantized GGUF versions ready for Ollama, see robit/qwen3.5-9b-r7-research-vision on Ollama.

Capabilities

  • Vision — image understanding (reads text, describes scenes, answers visual questions)
  • Thinking — produces structured reasoning in <think> blocks
  • Tool calling — structured function calls when given tool definitions
  • Instruction following — concise answers, format constraints, system prompt adherence
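To illustrate the tool-calling path, here is a hedged sketch of an Ollama /api/chat request carrying a tool definition. The get_weather tool and its schema are hypothetical examples, not part of this model card; the "tools" field follows Ollama's standard function-calling format:

```python
# Hypothetical tool definition in Ollama's /api/chat "tools" format.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Request body; tool use should run with "think": false per the warning above.
tool_request = {
    "model": "qwen3.5-9b-r7-research-vision",   # placeholder tag
    "think": False,
    "stream": False,
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [get_weather_tool],
}
# If the model decides to call a tool, the structured call arrives under
# response["message"]["tool_calls"] rather than as answer text.
```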

Eval Results

Benchmark                              Score
------------------------------------   -----
Diverse stochastic eval (38 tests)     86.8%
Vision probe (rendered text: "42")     PASS
Tool calling (structured tool_calls)   PASS
Thinking (produces thinking field)     PASS

Training Details

  • Method: LoRA SFT (r=32, alpha=64, LR=1e-4, 1 epoch), vision spliced from base
  • Hardware: 3x NVIDIA A100 80GB (DDP)
  • Vision preservation: Trained text weights spliced into base multimodal model (Qwen3_5ForConditionalGeneration). All 333 vision tensors are byte-for-byte identical to the base model. Vision was not fine-tuned.
  • Data: Additive mix of 4043 samples
  • Best eval loss: 0.4950
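The listed hyperparameters can be collected into a plain config sketch (the actual training script lives in the training suite below; only the values stated above are from the card):

```python
# Training hyperparameters from the card, expressed as a plain config dict.
lora_sft_config = {
    "method": "LoRA SFT",
    "lora_r": 32,
    "lora_alpha": 64,
    "learning_rate": 1e-4,
    "num_epochs": 1,
    "num_samples": 4043,
    "best_eval_loss": 0.4950,
}

# Effective LoRA scaling factor (alpha / r) implied by these values:
scaling = lora_sft_config["lora_alpha"] / lora_sft_config["lora_r"]  # 2.0
```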

Training Suite

Full training pipeline, evaluation scripts, and documentation: robit-man/fine_tuning_suite

Usage

from transformers import Qwen3_5ForConditionalGeneration, AutoProcessor

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "cudabenchmarktest/qwen3.5-9b-r7-research-vision",
    torch_dtype="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("cudabenchmarktest/qwen3.5-9b-r7-research-vision")

# Text-only
messages = [{"role": "user", "content": "What is 2+2?"}]
text = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor.tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.tokenizer.decode(outputs[0], skip_special_tokens=True))
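For image input, a sketch following the common Qwen multimodal processor flow (the content-key names and the image path are assumptions; adapt to the processor's chat template if it differs):

```python
def build_vision_messages(question):
    """Chat messages with one image part and one text part, Qwen multimodal style."""
    return [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    }]

def run_vision_query(model, processor, image_path, question, max_new_tokens=512):
    """Run one image+text turn; expects model and processor loaded as above."""
    from PIL import Image
    image = Image.open(image_path)
    messages = build_vision_messages(question)
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(outputs[0], skip_special_tokens=True)

# e.g. print(run_vision_query(model, processor, "photo.jpg", "What text appears in this image?"))
```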

GGUF Export Pipeline

To convert this model to a vision-capable GGUF for Ollama:

  1. Filter linear_attn tensors out of the LoRA adapter
  2. convert_lora_to_gguf.py -> LoRA GGUF
  3. llama-export-lora -> merge into base qwen3.5:9b Q4_K_M GGUF (preserves 883 tensors including vision)
  4. llama-quantize -> Q4_K_M
  5. ollama create with RENDERER qwen3.5 + PARSER qwen3.5
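The steps above can be sketched as a small Python driver. File names are hypothetical and the exact llama.cpp flag spellings are assumptions; check each tool's --help before running:

```python
import subprocess

# Hypothetical file names for the pipeline artifacts.
BASE_GGUF = "qwen3.5-9b-base-Q4_K_M.gguf"   # base multimodal GGUF (883 tensors)
LORA_GGUF = "r7-research-lora.gguf"          # output of convert_lora_to_gguf.py
MERGED_GGUF = "merged.gguf"
FINAL_GGUF = "qwen3.5-9b-r7-research-vision-Q4_K_M.gguf"

steps = [
    # 2. LoRA safetensors -> LoRA GGUF (after filtering linear_attn tensors)
    ["python", "convert_lora_to_gguf.py", "adapter_filtered/", "--outfile", LORA_GGUF],
    # 3. Merge the LoRA into the base GGUF, preserving the vision tensors
    ["llama-export-lora", "-m", BASE_GGUF, "--lora", LORA_GGUF, "-o", MERGED_GGUF],
    # 4. Re-quantize the merged model
    ["llama-quantize", MERGED_GGUF, FINAL_GGUF, "Q4_K_M"],
    # 5. Register with Ollama (the Modelfile sets RENDERER/PARSER qwen3.5)
    ["ollama", "create", "qwen3.5-9b-r7-research-vision", "-f", "Modelfile"],
]

# To execute (requires llama.cpp binaries and Ollama on PATH):
# for cmd in steps:
#     subprocess.run(cmd, check=True)
```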

See INSTRUCTIONS.md for full details.

License

Apache 2.0 (inherited from Qwen3.5-9B). Training data licenses vary by source.
