AdResyze LoRA: Ad Layout Understanding
A LoRA fine-tune of Qwen2.5-VL-7B-Instruct for advertisement layout understanding. Given an ad image, the model returns a structured JSON object describing the visual elements, their bounding boxes, the dominant colors, the aspect ratio, and a platform guess.
Model Description
This LoRA adapter teaches Qwen2.5-VL to understand the anatomy of advertisements: it identifies logos, headlines, CTAs, product images, and backgrounds, along with their bounding-box coordinates.
Built as the AI core of AdResyze by Gopi Aitham, Scaleup-Solutions.in. AdResyze is a tool that takes one ad image and produces 5 platform-ready formats (Instagram Story, Instagram Feed, Facebook Feed, LinkedIn Banner, Google Display) in under 60 seconds.
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-VL-7B-Instruct |
| Fine-tuning method | LoRA (Low-Rank Adaptation) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA target modules | q_proj, v_proj |
| Training samples | 260 |
| Eval samples | 29 |
| Epochs | 3 |
| Learning rate | 2e-4 |
| Training time | ~21 minutes |
| Hardware | RunPod A100 SXM 80GB |
| Framework | LLaMA-Factory |
| Initial loss | 0.2359 |
| Final loss | 0.1375 |
| Eval success rate | 9/10 |
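For readers less familiar with LoRA, the rank and alpha values in the table determine two quantities: the scaling applied to the low-rank update (alpha / rank) and the number of trainable parameters per adapted projection (rank × (d_in + d_out)). The sketch below computes both, assuming standard LoRA scaling rather than a variant such as rsLoRA. The layer dimensions in the example are placeholders for illustration, not values taken from this model.

```python
def lora_scaling(rank: int, alpha: int) -> float:
    """Scaling factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_params(rank: int, d_in: int, d_out: int) -> int:
    """Trainable parameters for one adapted linear layer.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


# Rank and alpha as listed in the training table.
RANK, ALPHA = 16, 32
print(lora_scaling(RANK, ALPHA))        # 2.0

# Hypothetical 1024 -> 1024 projection, for illustration only.
print(lora_params(RANK, 1024, 1024))    # 32768
```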
Dataset
Built from 289 annotated Indian brand advertisements collected from the Meta Ad Library (260 for training, 29 for evaluation). Annotations include:
- Element type: logo | headline | cta | product | background | other
- Bounding boxes: [x1, y1, x2, y2] format
- Dominant colors (hex)
- Aspect ratio
- Platform guess
Dataset (annotation JSONs only, no images): builditwithgk/adresyze-ad-layouts
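As a rough illustration of the annotation fields listed above, the following sketch checks a single element record. It assumes the field names `type` and `bbox` match the model output schema; the exact layout of the dataset files is not documented here, and `validate_element` is a hypothetical helper, not part of the dataset tooling.

```python
ELEMENT_TYPES = {"logo", "headline", "cta", "product", "background", "other"}


def validate_element(element: dict) -> list:
    """Return a list of problems found in one annotated element (empty if none)."""
    problems = []
    if element.get("type") not in ELEMENT_TYPES:
        problems.append(f"unknown type: {element.get('type')!r}")
    bbox = element.get("bbox")
    is_four_numbers = (
        isinstance(bbox, list)
        and len(bbox) == 4
        and all(isinstance(v, (int, float)) for v in bbox)
    )
    if not is_four_numbers:
        problems.append("bbox must be a list of four numbers")
    else:
        x1, y1, x2, y2 = bbox
        # [x1, y1] is the top-left corner, [x2, y2] the bottom-right corner.
        if x1 >= x2 or y1 >= y2:
            problems.append("bbox must satisfy x1 < x2 and y1 < y2")
    return problems


print(validate_element({"type": "logo", "bbox": [156, 305, 183, 341]}))  # []
print(validate_element({"type": "banner", "bbox": [10, 10, 5, 20]}))
```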
Model Output
For any input ad image, the model returns:
{
  "elements": [
    {"type": "headline", "bbox": [50, 398, 507, 445], "priority": 1, "must_preserve": true},
    {"type": "logo", "bbox": [156, 305, 183, 341], "priority": 1, "must_preserve": true},
    {"type": "cta", "bbox": [112, 109, 530, 368], "priority": 1, "must_preserve": true},
    {"type": "background", "bbox": [0, 0, 588, 588], "priority": 1, "must_preserve": true}
  ],
  "dominant_colors": ["#7B2FBE", "#ffffff"],
  "aspect_ratio": "1:1",
  "platform_guess": "instagram"
}
bbox format: [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner.
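Downstream code often needs a box's width and height, or coordinates that do not depend on resolution. The helpers below are a minimal sketch of both conversions. They assume the bbox values are pixel coordinates in the same resolution as the image dimensions passed in, which the example output suggests but this card does not state explicitly. The function names are illustrative.

```python
def bbox_size(bbox):
    """Width and height of a [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bbox
    return x2 - x1, y2 - y1


def normalize_bbox(bbox, image_width, image_height):
    """Convert pixel coordinates to fractions of the image size (0.0 to 1.0)."""
    x1, y1, x2, y2 = bbox
    return [x1 / image_width, y1 / image_height,
            x2 / image_width, y2 / image_height]


# Headline and background boxes from the example output.
print(bbox_size([50, 398, 507, 445]))                # (457, 47)
print(normalize_bbox([0, 0, 588, 588], 588, 588))    # [0.0, 0.0, 1.0, 1.0]
```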
Usage
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor
from peft import PeftModel
from PIL import Image
import torch, json
# Load base model + LoRA
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)
model = PeftModel.from_pretrained(model, "builditwithgk/adresyze-lora")
model.eval()
processor = Qwen2_5_VLProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
# Load your ad image
image = Image.open("your_ad.jpg").convert("RGB")
PROMPT = """Analyze this advertisement image. Return ONLY a JSON object:
{
"elements": [{"type": "logo|product|cta|headline|background|other",
"bbox": [x1, y1, x2, y2], "priority": 1, "must_preserve": true}],
"dominant_colors": ["#hex1", "#hex2"],
"aspect_ratio": "1:1|9:16|4:5|1.91:1|other",
"platform_guess": "instagram|facebook|linkedin|google_display|other"
}
Return ONLY the JSON. No explanation."""
# Build a single-turn chat message containing the image and the prompt
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": PROMPT}
]}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")
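# Greedy decoding keeps the JSON output deterministic
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# The decoded text contains the prompt followed by the model's reply
response = processor.batch_decode(output, skip_special_tokens=True)[0]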
response = response.replace("```json", "").replace("```", "")
# Keep only the text after the final "assistant" turn marker
if "assistant" in response:
    response = response.split("assistant")[-1]
result = json.loads(response[response.find("{"):response.rfind("}")+1])
print(result)
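The parsing step above assumes the response always contains a complete JSON object. With `max_new_tokens=512`, a long element list could be cut off mid-object, in which case `json.loads` raises an exception. The sketch below wraps the same extraction logic in a hypothetical helper, `extract_json`, that returns `None` instead of raising. It is one possible way to harden the step, not part of the released code.

```python
import json

# Markdown code-fence marker, built indirectly so this example stays readable.
FENCE = "`" * 3


def extract_json(response: str):
    """Parse the outermost {...} object in a model response, or return None."""
    cleaned = response.replace(FENCE + "json", "").replace(FENCE, "")
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end <= start:
        return None  # no complete object found
    try:
        return json.loads(cleaned[start:end + 1])
    except json.JSONDecodeError:
        return None  # truncated or otherwise malformed JSON


print(extract_json(FENCE + 'json\n{"aspect_ratio": "1:1"}\n' + FENCE))
print(extract_json('{"elements": [{"type": "logo", "bbox": [1, 2'))
```

A caller can then retry with a larger `max_new_tokens` or skip the image when the result is `None`.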
Requirements
pip install transformers==4.56.1 peft accelerate qwen-vl-utils pillow torch
Limitations
- Trained on only 260 samples, so it works best on standard digital ad formats
- Bounding boxes are approximate: suitable for layout understanding, not pixel-perfect segmentation
- Optimized for Indian brand ad conventions
- Base model license applies: Qwen2.5-VL is Apache 2.0
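Because the predicted boxes are approximate, a consumer of the JSON may want to clip them to the image bounds before cropping or pasting. The following is a minimal sketch of such a guard; `clamp_bbox` is a hypothetical helper and is not confirmed to be part of AdResyze.

```python
def clamp_bbox(bbox, image_width, image_height):
    """Clip a [x1, y1, x2, y2] box to the image; return None if nothing remains."""
    x1, y1, x2, y2 = bbox
    x1, x2 = max(0, min(x1, image_width)), max(0, min(x2, image_width))
    y1, y2 = max(0, min(y1, image_height)), max(0, min(y2, image_height))
    if x2 <= x1 or y2 <= y1:
        return None  # box lies entirely outside the image or has no area
    return [x1, y1, x2, y2]


# Illustrative boxes for a 588 x 588 image.
print(clamp_bbox([-10, 20, 600, 300], 588, 588))   # [0, 20, 588, 300]
print(clamp_bbox([700, 10, 800, 50], 588, 588))    # None
```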
Full Pipeline
This model is the AI core of AdResyze, a complete ad resizing pipeline:
Ad Image → Qwen2.5-VL + LoRA → Layout JSON → Pillow Executor → 5 Platform Formats
The executor scales the entire composition intelligently per platform, with 3 background fill styles (solid color, blur stretch, mirror edge) and SSIM-based logo safety checking.
Pipeline stack: Modal (inference) → n8n (orchestration) → FastAPI/Pillow (executor) → Supabase (storage + DB)
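The executor itself is not published in this card. As a rough idea of the geometry involved in scaling a whole composition onto a new canvas, the sketch below fits a source image inside a target size while preserving its aspect ratio, then maps a layout bbox into canvas coordinates. The leftover bands are what a background fill style would cover. The function names and the 1080 x 1920 story canvas are assumptions for illustration only.

```python
def fit_into_canvas(src_w, src_h, dst_w, dst_h):
    """Scale (src_w, src_h) to fit inside (dst_w, dst_h), keeping aspect ratio.

    Returns (scale, new_w, new_h, offset_x, offset_y) with the image centred.
    """
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    return scale, new_w, new_h, (dst_w - new_w) // 2, (dst_h - new_h) // 2


def map_bbox(bbox, scale, offset_x, offset_y):
    """Move a source-image [x1, y1, x2, y2] box into canvas coordinates."""
    x1, y1, x2, y2 = bbox
    return [round(x1 * scale) + offset_x, round(y1 * scale) + offset_y,
            round(x2 * scale) + offset_x, round(y2 * scale) + offset_y]


# A 588 x 588 square ad placed on an assumed 1080 x 1920 story canvas.
scale, new_w, new_h, off_x, off_y = fit_into_canvas(588, 588, 1080, 1920)
print(new_w, new_h, off_x, off_y)                        # 1080 1080 0 420
print(map_bbox([0, 0, 588, 588], scale, off_x, off_y))   # [0, 420, 1080, 1500]
```

In this example the 420-pixel bands above and below the scaled ad are the regions that a solid color, blur stretch, or mirror edge fill would need to cover.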
Citation
If you use this model or dataset in your work, please cite:
@misc{adresyze2026,
  author = {builditwithgk},
  title  = {AdResyze: Fine-tuned Qwen2.5-VL for Advertisement Layout Understanding},
  year   = {2026},
  url    = {https://huggingface.co/builditwithgk/adresyze-lora}
}
License
Apache 2.0, the same license as the base model. Commercial use is permitted.