Qwen Visual Design Judge

A fine-tuned Qwen3.5-0.8B model that judges visual design quality between image pairs.

🎯 Performance

Metric                         Score
-----------------------------  -----
Overall accuracy               82%
High-agreement pairs (≥80%)    90.9%
Low-agreement pairs (<80%)     79.5%

Matches GPT-4.1 performance while being ~1000x cheaper to run locally!

πŸ“Š Training

  • Base model: Qwen/Qwen3.5-0.8B
  • Training data: 40K synthetic preference pairs labeled by GPT-4.1
  • Domains: Landing pages, websites, mobile UI, graphics
  • Epochs: 1
  • Hardware: NVIDIA T4 GPU (~13 hours)
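Each training example is a pair of images plus a label for which one GPT-4.1 preferred. A minimal sketch of what one such JSONL record might look like (the field names here are hypothetical; the actual training format is not published):

```python
import json

# Hypothetical schema for one synthetic preference pair;
# the real field names used during training may differ.
pair = {
    "image_a": "landing_page_001.png",
    "image_b": "landing_page_002.png",
    "domain": "landing_pages",   # one of the four listed domains
    "label": "A",                # which image GPT-4.1 judged better
}

line = json.dumps(pair)  # one line of a JSONL training file
print(line)
```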

πŸš€ Usage

import torch
from transformers import Qwen3_5ForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned judge in bfloat16 on the GPU
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "DillonNys/qwen-visual-design-judge",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained("DillonNys/qwen-visual-design-judge")

def judge_pair(img_a: str, img_b: str) -> str:
    prompt = """You are an expert visual design judge. Compare these two images and determine which has better visual design quality.

Consider: layout, typography, color harmony, visual hierarchy, spacing, and overall aesthetic appeal.

Respond with ONLY "A" or "B" to indicate the better design."""

    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "text", "text": "\n\nImage A:"},
            {"type": "image", "image": img_a},
            {"type": "text", "text": "\n\nImage B:"},
            {"type": "image", "image": img_b},
            {"type": "text", "text": "\n\nWhich is better? Answer A or B:"},
        ],
    }]
    
    # Render the chat template and collect the image tensors
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to("cuda")

    with torch.no_grad():
        # Greedy decoding; the model is trained to emit a single "A" or "B"
        output_ids = model.generate(**inputs, max_new_tokens=8, do_sample=False)

    # Decode only the newly generated tokens, skipping the prompt
    response = processor.decode(output_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()
    return "A" if "A" in response.upper() else "B"

# Example
winner = judge_pair("design_a.png", "design_b.png")
print(f"Better design: {winner}")
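Pairwise LLM judges can show position bias (favoring whichever image is shown first). A common mitigation, not part of this model card, is to query the judge in both orders and only trust verdicts that agree. A sketch, assuming any `judge_pair`-style callable:

```python
from typing import Callable, Optional

def judge_debiased(judge: Callable[[str, str], str],
                   img_a: str, img_b: str) -> Optional[str]:
    """Query the judge in both orders; return a verdict only if they agree.

    `judge(x, y)` returns "A" if x is preferred, "B" if y is preferred.
    Returns None when the two orderings disagree (position bias detected).
    """
    first = judge(img_a, img_b)              # verdict in the original order
    second = judge(img_b, img_a)             # verdict with the order swapped
    # Map the swapped verdict back onto the original A/B labels
    second_mapped = "A" if second == "B" else "B"
    return first if first == second_mapped else None

# A judge that always answers "A" is pure position bias -> inconsistent:
print(judge_debiased(lambda x, y: "A", "a.png", "b.png"))  # None
```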

πŸ“ Citation

If you use this model, please cite:

@misc{qwen-visual-design-judge,
  author = {Dillon Nys},
  title = {Qwen Visual Design Judge},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/DillonNys/qwen-visual-design-judge}
}

πŸ™ Acknowledgments

  • Qwen team for the excellent base model
  • OpenAI for GPT-4.1 used in synthetic labeling
  • The Vibe Arena community for preference data