⚠️ CRITICAL: Ollama Inference Flag Required

If you serve this model via Ollama with the qwen3.5 RENDERER (the standard recommended setup), you MUST pass "think": false in the /api/chat request body for chat, instruction following, and tool use.

curl -X POST http://localhost:11434/api/chat \
  -d '{"model": "...", "think": false, "messages": [...], "stream": false}'

Without this flag, the renderer auto-injects <think> tags into every chat completion. On longer prompts the model can stay inside the <think> block past the response budget, never emit </think>, and produce zero answer tokens on 25-46% of requests.

Set think: true (or omit the field) only when you DO want chain-of-thought reasoning (math, planning, complex multi-step tasks). This is Qwen3 dual-mode operation; see https://qwenlm.github.io/blog/qwen3/.
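The same request from Python, mirroring the curl call above (the model tag is a placeholder; the optional requests-based send is shown commented out):

```python
import json

def build_chat_request(model, messages, think=False, stream=False):
    """Return the JSON body for Ollama's /api/chat endpoint with thinking disabled by default."""
    return {"model": model, "think": think, "messages": messages, "stream": stream}

# Placeholder model tag -- substitute the name you used with `ollama create`.
payload = build_chat_request(
    "qwen3.5-9b-r7-research-vision",
    [{"role": "user", "content": "List three prime numbers."}],
)
body = json.dumps(payload)

# To send it (requires a running Ollama server):
# import requests
# resp = requests.post("http://localhost:11434/api/chat", data=body)
# print(resp.json()["message"]["content"])
```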

See _OLLAMA_INFERENCE_WARNING.md in the cudabenchmarktest/r9-research-framework dataset for the full explanation.


Qwen3.5-9B R7 Research Vision (FP16)

Fine-tuned Qwen3.5-9B with distilled reasoning and full vision support. 760 tensors (427 text + 333 vision) — vision tower preserved byte-for-byte from base via splice. For the text-only version, see cudabenchmarktest/qwen3.5-9b-r7-research.

This is the pre-GGUF FP16 safetensors checkpoint. For quantized GGUF versions ready for Ollama, see robit/qwen3.5-9b-r7-research-vision on Ollama.

Capabilities

  • Vision — image understanding (reads text, describes scenes, answers visual questions)
  • Thinking — produces structured reasoning in <think> blocks
  • Tool calling — structured function calls when given tool definitions
  • Instruction following — concise answers, format constraints, system prompt adherence
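To illustrate the tool-calling path, here is a hedged sketch of an Ollama /api/chat request carrying a tool definition. The get_weather tool and its schema are hypothetical examples, not part of this model card; the "tools" field follows Ollama's standard function-calling format:

```python
# Hypothetical tool definition in Ollama's /api/chat "tools" format.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Request body; tool use should run with "think": false per the warning above.
tool_request = {
    "model": "qwen3.5-9b-r7-research-vision",   # placeholder tag
    "think": False,
    "stream": False,
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [get_weather_tool],
}
# If the model decides to call a tool, the structured call arrives under
# response["message"]["tool_calls"] rather than as answer text.
```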

Eval Results

Benchmark                              Score
------------------------------------   -----
Diverse stochastic eval (38 tests)     86.8%
Vision probe (rendered text: "42")     PASS
Tool calling (structured tool_calls)   PASS
Thinking (produces thinking field)     PASS

Training Details

  • Method: LoRA SFT (r=32, alpha=64, LR=1e-4, 1 epoch), vision spliced from base
  • Hardware: 3x NVIDIA A100 80GB (DDP)
  • Vision preservation: Trained text weights spliced into base multimodal model (Qwen3_5ForConditionalGeneration). All 333 vision tensors are byte-for-byte identical to the base model. Vision was not fine-tuned.
  • Data: Additive mix of 4043 samples
  • Best eval loss: 0.4950
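The listed hyperparameters can be collected into a plain config sketch (the actual training script lives in the training suite below; only the values stated above are from the card):

```python
# Training hyperparameters from the card, expressed as a plain config dict.
lora_sft_config = {
    "method": "LoRA SFT",
    "lora_r": 32,
    "lora_alpha": 64,
    "learning_rate": 1e-4,
    "num_epochs": 1,
    "num_samples": 4043,
    "best_eval_loss": 0.4950,
}

# Effective LoRA scaling factor (alpha / r) implied by these values:
scaling = lora_sft_config["lora_alpha"] / lora_sft_config["lora_r"]  # 2.0
```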

Training Suite

Full training pipeline, evaluation scripts, and documentation: robit-man/fine_tuning_suite

Usage

from transformers import Qwen3_5ForConditionalGeneration, AutoProcessor

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "cudabenchmarktest/qwen3.5-9b-r7-research-vision",
    torch_dtype="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("cudabenchmarktest/qwen3.5-9b-r7-research-vision")

# Text-only
messages = [{"role": "user", "content": "What is 2+2?"}]
text = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor.tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.tokenizer.decode(outputs[0], skip_special_tokens=True))
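For image input, a sketch following the common Qwen multimodal processor flow (the content-key names and the image path are assumptions; adapt to the processor's chat template if it differs):

```python
def build_vision_messages(question):
    """Chat messages with one image part and one text part, Qwen multimodal style."""
    return [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    }]

def run_vision_query(model, processor, image_path, question, max_new_tokens=512):
    """Run one image+text turn; expects model and processor loaded as above."""
    from PIL import Image
    image = Image.open(image_path)
    messages = build_vision_messages(question)
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(outputs[0], skip_special_tokens=True)

# e.g. print(run_vision_query(model, processor, "photo.jpg", "What text appears in this image?"))
```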

GGUF Export Pipeline

To convert this model to a vision-capable GGUF for Ollama:

  1. Filter linear_attn tensors out of the LoRA adapter
  2. convert_lora_to_gguf.py -> LoRA GGUF
  3. llama-export-lora -> merge into base qwen3.5:9b Q4_K_M GGUF (preserves 883 tensors including vision)
  4. llama-quantize -> Q4_K_M
  5. ollama create with RENDERER qwen3.5 + PARSER qwen3.5
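The steps above can be sketched as a small Python driver. File names are hypothetical and the exact llama.cpp flag spellings are assumptions; check each tool's --help before running:

```python
import subprocess

# Hypothetical file names for the pipeline artifacts.
BASE_GGUF = "qwen3.5-9b-base-Q4_K_M.gguf"   # base multimodal GGUF (883 tensors)
LORA_GGUF = "r7-research-lora.gguf"          # output of convert_lora_to_gguf.py
MERGED_GGUF = "merged.gguf"
FINAL_GGUF = "qwen3.5-9b-r7-research-vision-Q4_K_M.gguf"

steps = [
    # 2. LoRA safetensors -> LoRA GGUF (after filtering linear_attn tensors)
    ["python", "convert_lora_to_gguf.py", "adapter_filtered/", "--outfile", LORA_GGUF],
    # 3. Merge the LoRA into the base GGUF, preserving the vision tensors
    ["llama-export-lora", "-m", BASE_GGUF, "--lora", LORA_GGUF, "-o", MERGED_GGUF],
    # 4. Re-quantize the merged model
    ["llama-quantize", MERGED_GGUF, FINAL_GGUF, "Q4_K_M"],
    # 5. Register with Ollama (the Modelfile sets RENDERER/PARSER qwen3.5)
    ["ollama", "create", "qwen3.5-9b-r7-research-vision", "-f", "Modelfile"],
]

# To execute (requires llama.cpp binaries and Ollama on PATH):
# for cmd in steps:
#     subprocess.run(cmd, check=True)
```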

See INSTRUCTIONS.md for full details.

License

Apache 2.0 (inherited from Qwen3.5-9B). Training data licenses vary by source.
