Qwen3.5-35B-A3B – Verbose Image Captioning (SFT v2)

Fine-tuned Qwen3.5-35B-A3B for detailed image captioning with structured XML spatial reasoning.

What it does

Given an image, the model generates:

  1. A <think> block with XML spatial analysis (<visual_parsing_graph>) covering scene composition, entity mapping, interactions, and lighting
  2. A detailed flowing prose description

The XML reasoning acts as chain-of-thought, grounding the prose in what the model actually observes.
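Downstream code will usually want to separate the reasoning block from the caption. A minimal sketch in Python (the helper name and regex are illustrative, not part of the model or its tooling):

```python
import re

def split_caption(output: str):
    """Split model output into the <think> reasoning block and the prose caption."""
    match = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        prose = output[match.end():].strip()
    else:
        # No reasoning block (e.g. a non-standard prompt was used):
        # treat the whole output as prose.
        reasoning, prose = "", output.strip()
    return reasoning, prose

# Toy example of the expected output shape:
sample = "<think><visual_parsing_graph>...</visual_parsing_graph></think>\nA quiet forest scene."
reasoning, prose = split_caption(sample)
```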

Files

File                           Size    Description
qwen35-sft-v2-f16.gguf         65 GB   Full precision (f16)
qwen35-sft-v2-Q4_K_M.gguf      20 GB   Quantized (Q4_K_M, 4.88 BPW)
qwen35-sft-v2-mmproj-f16.gguf  858 MB  Vision projector (required)

Usage (llama.cpp)

llama-mtmd-cli \
  -m qwen35-sft-v2-Q4_K_M.gguf \
  --mmproj qwen35-sft-v2-mmproj-f16.gguf \
  -p "Describe this image in extremely high detail. Include a structured spatial analysis as XML followed by a flowing prose description." \
  --image your_image.jpg \
  -ngl 99 -n 2048 --repeat-penalty 1.15

Note: --repeat-penalty 1.15 is recommended to prevent tag repetition loops in the XML output.
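For interactive use, the same GGUF pair can instead be served with llama.cpp's llama-server, which also accepts an --mmproj projector. A sketch (verify the flags against your llama.cpp build; multimodal server support is relatively recent):

```shell
# Serve the model with the vision projector on port 8080
llama-server \
  -m qwen35-sft-v2-Q4_K_M.gguf \
  --mmproj qwen35-sft-v2-mmproj-f16.gguf \
  -ngl 99 \
  --port 8080
```

Images can then be sent as base64-encoded image_url parts through the OpenAI-compatible /v1/chat/completions endpoint, using the same prompt text as above.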

Training details

  • Base model: Qwen3.5-35B-A3B (MoE, 35B total / 3B active)
  • Method: LoRA (r=32, alpha=32) via Unsloth
  • Trainable params: 1.9B / 37B (5.15%)
  • Dataset: ~8.9k image-caption pairs with XML spatial analysis + prose
  • Epochs: 2
  • Final loss: 0.719
  • Precision: bf16
  • Hardware: NVIDIA H200 (141 GB)
  • Training time: ~2 hours

Prompt format

The model responds best to this specific instruction:

Describe this image in extremely high detail. Include a structured spatial analysis as XML followed by a flowing prose description.

Other prompts will still produce captions, but without the XML structure.

Limitations

  • Tag repetition loops can occur without a repeat penalty; use --repeat-penalty 1.15
  • May occasionally hallucinate character names or minor color details
  • Optimized for furry/anthro art; performance on other domains not tested
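
If you do run without a repeat penalty, degenerate outputs are easy to flag heuristically. A sketch (the function name and threshold are arbitrary choices, not part of the model's tooling):

```python
import re
from collections import Counter

def has_tag_loop(text: str, threshold: int = 10) -> bool:
    """Heuristic: flag output in which any single XML tag name repeats suspiciously often."""
    # Capture tag names from both opening and closing tags.
    tags = re.findall(r"</?([A-Za-z_][\w-]*)", text)
    counts = Counter(tags)
    return any(n >= threshold for n in counts.values())
```

Outputs that trip the check can simply be regenerated with the recommended penalty.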