Qwen Visual Design Judge

A fine-tuned Qwen3.5-0.8B model that judges visual design quality between image pairs.

🎯 Performance

Metric                         Score
-----------------------------  -----
Overall accuracy               82%
High-agreement pairs (≥80%)    90.9%
Low-agreement pairs (<80%)     79.5%

Matches GPT-4.1 performance while being ~1000x cheaper to run locally!

πŸ“Š Training

  • Base model: Qwen/Qwen3.5-0.8B
  • Training data: 40K synthetic preference pairs labeled by GPT-4.1
  • Domains: Landing pages, websites, mobile UI, graphics
  • Epochs: 1
  • Hardware: NVIDIA T4 GPU (~13 hours)
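Each training example is a pair of images plus a label for which one GPT-4.1 preferred. A minimal sketch of what one such JSONL record might look like (the field names here are hypothetical; the actual training format is not published):

```python
import json

# Hypothetical schema for one synthetic preference pair;
# the real field names used during training may differ.
pair = {
    "image_a": "landing_page_001.png",
    "image_b": "landing_page_002.png",
    "domain": "landing_pages",   # one of the four listed domains
    "label": "A",                # which image GPT-4.1 judged better
}

line = json.dumps(pair)  # one line of a JSONL training file
print(line)
```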

πŸš€ Usage

import torch
from transformers import Qwen3_5ForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned judge in bfloat16 on the GPU
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "DillonNys/qwen-visual-design-judge",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained("DillonNys/qwen-visual-design-judge")

def judge_pair(img_a: str, img_b: str) -> str:
    prompt = """You are an expert visual design judge. Compare these two images and determine which has better visual design quality.

Consider: layout, typography, color harmony, visual hierarchy, spacing, and overall aesthetic appeal.

Respond with ONLY "A" or "B" to indicate the better design."""

    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "text", "text": "\n\nImage A:"},
            {"type": "image", "image": img_a},
            {"type": "text", "text": "\n\nImage B:"},
            {"type": "image", "image": img_b},
            {"type": "text", "text": "\n\nWhich is better? Answer A or B:"},
        ],
    }]
    
    # Render the chat template and collect the image tensors
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to("cuda")

    with torch.no_grad():
        # Greedy decoding; the model is trained to emit a single "A" or "B"
        output_ids = model.generate(**inputs, max_new_tokens=8, do_sample=False)

    # Decode only the newly generated tokens, skipping the prompt
    response = processor.decode(output_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()
    return "A" if "A" in response.upper() else "B"

# Example
winner = judge_pair("design_a.png", "design_b.png")
print(f"Better design: {winner}")
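Pairwise LLM judges can show position bias (favoring whichever image is shown first). A common mitigation, not part of this model card, is to query the judge in both orders and only trust verdicts that agree. A sketch, assuming any `judge_pair`-style callable:

```python
from typing import Callable, Optional

def judge_debiased(judge: Callable[[str, str], str],
                   img_a: str, img_b: str) -> Optional[str]:
    """Query the judge in both orders; return a verdict only if they agree.

    `judge(x, y)` returns "A" if x is preferred, "B" if y is preferred.
    Returns None when the two orderings disagree (position bias detected).
    """
    first = judge(img_a, img_b)              # verdict in the original order
    second = judge(img_b, img_a)             # verdict with the order swapped
    # Map the swapped verdict back onto the original A/B labels
    second_mapped = "A" if second == "B" else "B"
    return first if first == second_mapped else None

# A judge that always answers "A" is pure position bias -> inconsistent:
print(judge_debiased(lambda x, y: "A", "a.png", "b.png"))  # None
```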

πŸ“ Citation

If you use this model, please cite:

@misc{qwen-visual-design-judge,
  author = {Dillon Nys},
  title = {Qwen Visual Design Judge},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/DillonNys/qwen-visual-design-judge}
}

πŸ™ Acknowledgments

  • Qwen team for the excellent base model
  • OpenAI for GPT-4.1 used in synthetic labeling
  • The Vibe Arena community for preference data