# Qwen3.5-35B-A3B – Verbose Image Captioning (SFT v2)
Fine-tuned Qwen3.5-35B-A3B for detailed image captioning with structured XML spatial reasoning.
## What it does
Given an image, the model generates:
- A `<think>` block containing an XML spatial analysis (`<visual_parsing_graph>`) covering scene composition, entity mapping, interactions, and lighting
- A detailed flowing prose description
The XML reasoning acts as chain-of-thought, grounding the prose in what the model actually observes.
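A hypothetical sketch of the output shape (the `<think>` wrapper and `<visual_parsing_graph>` root come from the description above; the child tag names and all content here are illustrative assumptions, not actual model output):

```xml
<think>
<visual_parsing_graph>
  <scene_composition>outdoor forest clearing, golden-hour light</scene_composition>
  <entities>
    <entity id="1">red fox character, standing, center frame</entity>
  </entities>
  <interactions>entity 1 gazes toward the viewer</interactions>
  <lighting>warm backlight, soft rim highlights on fur</lighting>
</visual_parsing_graph>
</think>
A red fox character stands at the center of a sunlit forest clearing...
```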
## Files
| File | Size | Description |
|---|---|---|
| `qwen35-sft-v2-f16.gguf` | 65 GB | Full precision (f16) |
| `qwen35-sft-v2-Q4_K_M.gguf` | 20 GB | Quantized (Q4_K_M, 4.88 BPW) |
| `qwen35-sft-v2-mmproj-f16.gguf` | 858 MB | Vision projector (required) |
## Usage (llama.cpp)
```bash
llama-mtmd-cli \
  -m qwen35-sft-v2-Q4_K_M.gguf \
  --mmproj qwen35-sft-v2-mmproj-f16.gguf \
  -p "Describe this image in extremely high detail. Include a structured spatial analysis as XML followed by a flowing prose description." \
  --image your_image.jpg \
  -ngl 99 -n 2048 --repeat-penalty 1.15
```
Note: `--repeat-penalty 1.15` is recommended to prevent tag repetition loops in the XML output.
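For batch captioning, the CLI invocation above can be wrapped from Python. A minimal sketch, assuming `llama-mtmd-cli` is on your `PATH` and the GGUF files sit in the working directory (the helper names are hypothetical):

```python
import subprocess
from pathlib import Path

# Default prompt from the "Prompt format" section below.
PROMPT = ("Describe this image in extremely high detail. Include a structured "
          "spatial analysis as XML followed by a flowing prose description.")

def build_caption_cmd(model, mmproj, image, prompt=PROMPT,
                      n_gpu_layers=99, n_predict=2048, repeat_penalty=1.15):
    """Assemble the llama-mtmd-cli argument list shown above."""
    return [
        "llama-mtmd-cli",
        "-m", str(model),
        "--mmproj", str(mmproj),
        "-p", prompt,
        "--image", str(image),
        "-ngl", str(n_gpu_layers),
        "-n", str(n_predict),
        "--repeat-penalty", str(repeat_penalty),
    ]

def caption_dir(model, mmproj, image_dir):
    """Caption every .jpg/.jpeg/.png in image_dir; returns {path: output}."""
    captions = {}
    for img in sorted(Path(image_dir).iterdir()):
        if img.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        result = subprocess.run(build_caption_cmd(model, mmproj, img),
                                capture_output=True, text=True, check=True)
        captions[img] = result.stdout
    return captions
```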
## Training details
- Base model: Qwen3.5-35B-A3B (MoE, 35B total / 3B active)
- Method: LoRA (r=32, alpha=32) via Unsloth
- Trainable params: 1.9B / 37B (5.15%)
- Dataset: ~8.9k image-caption pairs with XML spatial analysis + prose
- Epochs: 2
- Final loss: 0.719
- Precision: bf16
- Hardware: NVIDIA H200 (141 GB)
- Training time: ~2 hours
## Prompt format
The model responds best to this specific instruction:
```
Describe this image in extremely high detail. Include a structured spatial analysis as XML followed by a flowing prose description.
```
Other prompts will still produce captions, but without the XML structure.
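When post-processing model output, the reasoning block can be separated from the prose caption. A minimal sketch, assuming the `<think>...</think>` wrapper described above (`split_caption` is a hypothetical helper, not part of any shipped tooling):

```python
import re

def split_caption(output: str):
    """Split model output into (reasoning, prose).

    Returns the contents of the <think>...</think> block and everything
    after it; if no block is found, reasoning is None.
    """
    m = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if not m:
        return None, output.strip()
    reasoning = m.group(1).strip()
    prose = output[m.end():].strip()
    return reasoning, prose
```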
## Limitations
- Tag repetition loops can occur without a repeat penalty; use `--repeat-penalty 1.15`
- May occasionally hallucinate character names or minor color details
- Optimized for furry/anthro art; performance on other domains is untested