AdResyze LoRA — Ad Layout Understanding

A LoRA adapter for Qwen2.5-VL-7B-Instruct, fine-tuned for advertisement layout understanding. Given an ad image, the model returns structured JSON describing the visual elements it detects, their bounding boxes, the dominant colors, the aspect ratio, and a platform guess.

Model Description

This LoRA adapter teaches Qwen2.5-VL the anatomy of advertisements: it identifies logos, headlines, CTAs, product images, and backgrounds, each with bounding-box coordinates.

Built by Gopi Aitham (Scaleup-Solutions.in) as the AI core of AdResyze, a tool that takes one ad image and produces 5 platform-ready formats (Instagram Story, Instagram Feed, Facebook Feed, LinkedIn Banner, Google Display) in under 60 seconds.

Training Details

Parameter	Value
Base model	Qwen/Qwen2.5-VL-7B-Instruct
Fine-tuning method	LoRA (Low-Rank Adaptation)
LoRA rank	16
LoRA alpha	32
LoRA target modules	q_proj, v_proj
Training samples	260
Eval samples	29
Epochs	3
Learning rate	2e-4
Training time	~21 minutes
Hardware	RunPod A100 SXM 80GB
Framework	LLaMA-Factory
Initial loss	0.2359
Final loss	0.1375
Eval success rate	9/10
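
To put the rank and alpha values in context, the sketch below computes the LoRA scaling factor (alpha / rank) and a rough count of trainable adapter parameters for q_proj and v_proj. The layer dimensions are assumptions that this card does not state (hidden size 3584, 28 decoder layers, 4 key/value heads of dimension 128 for the base language model), and the count assumes adapters attach only to language-model layers with those names. Check both against the published base-model config before relying on the number.

```python
def lora_scaling(rank: int, alpha: int) -> float:
    """Factor applied to the low-rank update: alpha / rank."""
    return alpha / rank


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted linear layer.

    LoRA adds two matrices: A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


RANK, ALPHA = 16, 32   # from the training table above

# ASSUMED dimensions, not stated in this card
HIDDEN = 3584          # hidden size of the language model
KV_DIM = 4 * 128       # key/value heads * head dimension (v_proj output)
NUM_LAYERS = 28        # decoder layers

per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
total = per_layer * NUM_LAYERS

print(f"scaling = {lora_scaling(RANK, ALPHA)}")
print(f"trainable adapter parameters (estimate) = {total:,}")
```

Under these assumptions the adapter is a few million parameters, a small fraction of the 7B base model.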

Dataset

The dataset contains 289 annotated Indian brand advertisements collected from Meta Ad Library, split into 260 training and 29 evaluation samples. Annotations include:

Element type: logo | headline | cta | product | background | other
Bounding boxes: [x1, y1, x2, y2] format
Dominant colors (hex)
Aspect ratio
Platform guess

Dataset (annotation JSONs only, no images): builditwithgk/adresyze-ad-layouts
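
A quick sanity check on annotation records could look like the sketch below. It assumes each record uses the same keys as the model output shown in this card (elements, type, bbox, dominant_colors). The published dataset files may be organized differently, and validate_annotation is a hypothetical helper, not part of the dataset tooling.

```python
import re

# Element types listed in the annotation scheme above
ELEMENT_TYPES = {"logo", "headline", "cta", "product", "background", "other"}
HEX_COLOR = re.compile(r"^#[0-9a-fA-F]{6}$")


def validate_annotation(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    for i, element in enumerate(record.get("elements", [])):
        if element.get("type") not in ELEMENT_TYPES:
            problems.append(f"elements[{i}]: unknown type {element.get('type')!r}")
        bbox = element.get("bbox", [])
        if len(bbox) != 4:
            problems.append(f"elements[{i}]: bbox must have 4 values")
        else:
            x1, y1, x2, y2 = bbox
            # [x1, y1] is the top-left corner, [x2, y2] the bottom-right
            if not (x1 < x2 and y1 < y2):
                problems.append(f"elements[{i}]: bbox corners out of order")
    for color in record.get("dominant_colors", []):
        if not (isinstance(color, str) and HEX_COLOR.match(color)):
            problems.append(f"dominant_colors: {color!r} is not a #RRGGBB hex value")
    return problems


good = {
    "elements": [{"type": "headline", "bbox": [50, 398, 507, 445]}],
    "dominant_colors": ["#7B2FBE", "#ffffff"],
}
bad = {
    "elements": [{"type": "banner", "bbox": [10, 10, 5, 20]}],
    "dominant_colors": ["purple"],
}
print(validate_annotation(good))
print(validate_annotation(bad))
```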

Model Output

For any input ad image, the model returns:

{
  "elements": [
    {"type": "headline", "bbox": [50, 398, 507, 445], "priority": 1, "must_preserve": true},
    {"type": "logo",     "bbox": [156, 305, 183, 341], "priority": 1, "must_preserve": true},
    {"type": "cta",      "bbox": [112, 109, 530, 368], "priority": 1, "must_preserve": true},
    {"type": "background","bbox": [0, 0, 588, 588],   "priority": 1, "must_preserve": true}
  ],
  "dominant_colors": ["#7B2FBE", "#ffffff"],
  "aspect_ratio": "1:1",
  "platform_guess": "instagram"
}

bbox format: [x1, y1, x2, y2] — top-left to bottom-right coordinates.
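
As an illustration of working with this format, the sketch below derives a box's width and height and converts it to coordinates normalized by the image size. It assumes the values are pixel coordinates with the origin at the top-left of the image. The card states only the corner order, and the 588 x 588 image size is inferred from the background box in the sample output.

```python
def bbox_size(bbox):
    """Width and height of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bbox
    return x2 - x1, y2 - y1


def normalize_bbox(bbox, image_width, image_height):
    """Scale box corners to the 0..1 range relative to the image size."""
    x1, y1, x2, y2 = bbox
    return (x1 / image_width, y1 / image_height,
            x2 / image_width, y2 / image_height)


def area_fraction(bbox, image_width, image_height):
    """Share of the image area covered by the box."""
    width, height = bbox_size(bbox)
    return (width * height) / (image_width * image_height)


headline = [50, 398, 507, 445]    # from the sample output above
background = [0, 0, 588, 588]     # implies a 588 x 588 image (assumption)

print(bbox_size(headline))
print(normalize_bbox(background, 588, 588))
print(round(area_fraction(headline, 588, 588), 4))
```

Normalized boxes are convenient when the same layout has to be compared across images of different resolutions.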

Usage

from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor
from peft import PeftModel
from PIL import Image
import torch, json

# Load base model + LoRA
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)
model = PeftModel.from_pretrained(model, "builditwithgk/adresyze-lora")
model.eval()

processor = Qwen2_5_VLProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Load your ad image
image = Image.open("your_ad.jpg").convert("RGB")

PROMPT = """Analyze this advertisement image. Return ONLY a JSON object:
{
  "elements": [{"type": "logo|product|cta|headline|background|other",
    "bbox": [x1, y1, x2, y2], "priority": 1, "must_preserve": true}],
  "dominant_colors": ["#hex1", "#hex2"],
  "aspect_ratio": "1:1|9:16|4:5|1.91:1|other",
  "platform_guess": "instagram|facebook|linkedin|google_display|other"
}
Return ONLY the JSON. No explanation."""

messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text",  "text": PROMPT}
]}]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Move inputs to wherever device_map="auto" placed the model
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512, do_sample=False)

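# Decode only the newly generated tokens so the echoed prompt is excluded
generated = output[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
# Strip Markdown code fences if the model wrapped the JSON in them
response = response.replace("```json", "").replace("```", "")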

result = json.loads(response[response.find("{"):response.rfind("}")+1])
print(result)
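
The find/rfind slicing in the final parsing step fails if the model emits stray braces before or after the JSON. A more defensive alternative, sketched below, walks through each opening brace and returns the first span that parses as a JSON object, using the standard library's json.JSONDecoder.raw_decode. The helper name extract_first_json_object is hypothetical and not part of the AdResyze code.

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Return the first parseable JSON object embedded in text.

    Raises ValueError if no opening brace starts a valid JSON object.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at index `start`
            # and ignores whatever follows it
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model response")


reply = 'Here is the layout: {"aspect_ratio": "1:1", "elements": []} Let me know {if} unclear.'
print(extract_first_json_object(reply))
```

In the usage snippet, this would stand in for the json.loads call on the sliced response.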

Requirements

pip install transformers==4.56.1 peft accelerate qwen-vl-utils pillow torch

Limitations

Trained on 260 samples — works best on standard digital ad formats
Bounding boxes are approximate — suitable for layout understanding, not pixel-perfect segmentation
Optimized for Indian brand ad conventions
Base model license applies: Qwen2.5-VL is Apache 2.0

Full Pipeline

This model is the AI core of AdResyze — a complete ad resizing pipeline:

Ad Image → Qwen2.5-VL + LoRA → Layout JSON → Pillow Executor → 5 Platform Formats

The executor scales the entire composition intelligently per platform, with 3 background fill styles (solid color, blur stretch, mirror edge) and SSIM-based logo safety checking.
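
The card does not say how the executor computes its scaling, so the following is only a sketch of one plausible approach: a uniform "contain" fit that scales the source ad to sit fully inside the target canvas, centers it, and remaps the layout boxes accordingly. The leftover bands are the area a background fill style would have to cover. The 1080 x 1920 target is an assumed Instagram Story size, and both function names are hypothetical.

```python
def fit_scale_and_offset(src_w, src_h, dst_w, dst_h):
    """Uniform scale and centering offset that fit the source inside the target."""
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    offset_x = (dst_w - new_w) // 2
    offset_y = (dst_h - new_h) // 2
    return scale, offset_x, offset_y


def map_bbox(bbox, scale, offset_x, offset_y):
    """Move an [x1, y1, x2, y2] box from source to target coordinates."""
    x1, y1, x2, y2 = bbox
    return [round(x1 * scale) + offset_x, round(y1 * scale) + offset_y,
            round(x2 * scale) + offset_x, round(y2 * scale) + offset_y]


# 588 x 588 source (inferred from the sample output), assumed 1080 x 1920 target
scale, dx, dy = fit_scale_and_offset(588, 588, 1080, 1920)
print(dx, dy)
print(map_bbox([0, 0, 588, 588], scale, dx, dy))
```

With a square source on a tall canvas, the scaled ad spans the full width and leaves equal bands above and below it.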

Pipeline stack: Modal (inference) → n8n (orchestration) → FastAPI/Pillow (executor) → Supabase (storage + DB)

Citation

If you use this model or dataset in your work, please cite:

@misc{adresyze2026,
  author = {builditwithgk},
  title  = {AdResyze: Fine-tuned Qwen2.5-VL for Advertisement Layout Understanding},
  year   = {2026},
  url    = {https://huggingface.co/builditwithgk/adresyze-lora}
}

License

Apache 2.0 — same as the base model. Commercial use permitted.
