# Qwen3.5-35B-A3B – Verbose Image Captioning (SFT v2)
Fine-tuned Qwen3.5-35B-A3B for detailed image captioning with structured XML spatial reasoning.
## What it does
Given an image, the model generates:
- A `<think>` block containing an XML spatial analysis (`<visual_parsing_graph>`) covering scene composition, entity mapping, interactions, and lighting
- A detailed flowing prose description
The XML reasoning acts as chain-of-thought, grounding the prose in what the model actually observes.
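A hypothetical sketch of the output shape (the `<think>` wrapper and `<visual_parsing_graph>` root come from the description above; the child tag names and all content here are illustrative assumptions, not actual model output):

```xml
<think>
<visual_parsing_graph>
  <scene_composition>outdoor forest clearing, golden-hour light</scene_composition>
  <entities>
    <entity id="1">red fox character, standing, center frame</entity>
  </entities>
  <interactions>entity 1 gazes toward the viewer</interactions>
  <lighting>warm backlight, soft rim highlights on fur</lighting>
</visual_parsing_graph>
</think>
A red fox character stands at the center of a sunlit forest clearing...
```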
## Files
| File | Size | Description |
|---|---|---|
| `qwen35-sft-v2-f16.gguf` | 65 GB | Full precision (f16) |
| `qwen35-sft-v2-Q4_K_M.gguf` | 20 GB | Quantized (Q4_K_M, 4.88 BPW) |
| `qwen35-sft-v2-mmproj-f16.gguf` | 858 MB | Vision projector (required) |
## Usage (llama.cpp)
```bash
llama-mtmd-cli \
  -m qwen35-sft-v2-Q4_K_M.gguf \
  --mmproj qwen35-sft-v2-mmproj-f16.gguf \
  -p "Describe this image in extremely high detail. Include a structured spatial analysis as XML followed by a flowing prose description." \
  --image your_image.jpg \
  -ngl 99 -n 2048 --repeat-penalty 1.15
```
Note: `--repeat-penalty 1.15` is recommended to prevent tag repetition loops in the XML output.
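For batch captioning, the CLI invocation above can be wrapped from Python. A minimal sketch, assuming `llama-mtmd-cli` is on your `PATH` and the GGUF files sit in the working directory (the helper names are hypothetical):

```python
import subprocess
from pathlib import Path

# Default prompt from the "Prompt format" section below.
PROMPT = ("Describe this image in extremely high detail. Include a structured "
          "spatial analysis as XML followed by a flowing prose description.")

def build_caption_cmd(model, mmproj, image, prompt=PROMPT,
                      n_gpu_layers=99, n_predict=2048, repeat_penalty=1.15):
    """Assemble the llama-mtmd-cli argument list shown above."""
    return [
        "llama-mtmd-cli",
        "-m", str(model),
        "--mmproj", str(mmproj),
        "-p", prompt,
        "--image", str(image),
        "-ngl", str(n_gpu_layers),
        "-n", str(n_predict),
        "--repeat-penalty", str(repeat_penalty),
    ]

def caption_dir(model, mmproj, image_dir):
    """Caption every .jpg/.jpeg/.png in image_dir; returns {path: output}."""
    captions = {}
    for img in sorted(Path(image_dir).iterdir()):
        if img.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        result = subprocess.run(build_caption_cmd(model, mmproj, img),
                                capture_output=True, text=True, check=True)
        captions[img] = result.stdout
    return captions
```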
## Training details
- Base model: Qwen3.5-35B-A3B (MoE, 35B total / 3B active)
- Method: LoRA (r=32, alpha=32) via Unsloth
- Trainable params: 1.9B / 37B (5.15%)
- Dataset: ~8.9k image-caption pairs with XML spatial analysis + prose
- Epochs: 2
- Final loss: 0.719
- Precision: bf16
- Hardware: NVIDIA H200 (141 GB)
- Training time: ~2 hours
## Prompt format
The model responds best to this specific instruction:
```
Describe this image in extremely high detail. Include a structured spatial analysis as XML followed by a flowing prose description.
```
Other prompts will still produce captions, but without the XML structure.
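When post-processing model output, the reasoning block can be separated from the prose caption. A minimal sketch, assuming the `<think>...</think>` wrapper described above (`split_caption` is a hypothetical helper, not part of any shipped tooling):

```python
import re

def split_caption(output: str):
    """Split model output into (reasoning, prose).

    Returns the contents of the <think>...</think> block and everything
    after it; if no block is found, reasoning is None.
    """
    m = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if not m:
        return None, output.strip()
    reasoning = m.group(1).strip()
    prose = output[m.end():].strip()
    return reasoning, prose
```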
## Limitations
- Tag repetition loops can occur without a repeat penalty; use `--repeat-penalty 1.15`
- May occasionally hallucinate character names or minor color details
- Optimized for furry/anthro art; performance on other domains is untested