AdResyze LoRA: Ad Layout Understanding

Fine-tuned Qwen2.5-VL-7B-Instruct with LoRA for advertisement layout understanding. Given any ad image, the model returns a structured JSON describing all visual elements, their bounding boxes, dominant colors, aspect ratio, and platform guess.

Model Description

This LoRA adapter teaches Qwen2.5-VL to understand the anatomy of advertisements by identifying logos, headlines, CTAs, product images, and backgrounds along with their bounding-box coordinates.

Built by Gopi Aitham (Scaleup-Solutions.in) as the AI core of AdResyze, a tool that takes one ad image and produces 5 platform-ready formats (Instagram Story, Instagram Feed, Facebook Feed, LinkedIn Banner, Google Display) in under 60 seconds.

Training Details

| Parameter | Value |
| --- | --- |
| Base model | Qwen/Qwen2.5-VL-7B-Instruct |
| Fine-tuning method | LoRA (Low-Rank Adaptation) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA target modules | q_proj, v_proj |
| Training samples | 260 |
| Eval samples | 29 |
| Epochs | 3 |
| Learning rate | 2e-4 |
| Training time | ~21 minutes |
| Hardware | RunPod A100 SXM 80GB |
| Framework | LLaMA-Factory |
| Initial loss | 0.2359 |
| Final loss | 0.1375 |
| Eval success rate | 9/10 |
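
For readers unfamiliar with the rank and alpha settings above, the following sketch works through the standard LoRA arithmetic: the low-rank update is scaled by alpha / rank, and each adapted linear layer gains rank × (d_in + d_out) trainable parameters. The 3584 × 3584 projection shape is an illustrative assumption, not a figure taken from this model card; check the base model config for the real dimensions.

```python
# Standard LoRA arithmetic for the rank and alpha values listed above.
LORA_RANK = 16
LORA_ALPHA = 32


def lora_scaling(alpha: int, rank: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added to one linear layer.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # Assumed, illustrative projection shape (not stated in the model card).
    d_in, d_out = 3584, 3584
    print("scaling:", lora_scaling(LORA_ALPHA, LORA_RANK))
    print("params per adapted matrix:", lora_param_count(d_in, d_out, LORA_RANK))
```

With rank 16 and alpha 32 the scaling factor is 2, so the adapter's contribution is doubled relative to an unscaled low-rank update.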

Dataset

Built from 289 annotated Indian brand advertisements collected from Meta Ad Library (260 used for training, 29 held out for evaluation). Annotations include:

  • Element type: logo | headline | cta | product | background | other
  • Bounding boxes: [x1, y1, x2, y2] format
  • Dominant colors (hex)
  • Aspect ratio
  • Platform guess

Dataset (annotation JSONs only, no images): builditwithgk/adresyze-ad-layouts
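
A minimal validation sketch for one annotation record follows. It assumes the annotation JSONs use the same keys as the model output shown in this card (`elements`, `type`, `bbox`, `dominant_colors`); that layout is an assumption, and `validate_annotation` is a hypothetical helper, not part of the dataset tooling.

```python
import re

# Element types listed in the annotation description above.
ELEMENT_TYPES = {"logo", "headline", "cta", "product", "background", "other"}
HEX_COLOR = re.compile(r"^#[0-9a-fA-F]{6}$")


def validate_annotation(record: dict) -> list:
    """Return a list of problems found in one record (empty if it looks valid)."""
    problems = []
    for i, element in enumerate(record.get("elements", [])):
        if element.get("type") not in ELEMENT_TYPES:
            problems.append(f"element {i}: unknown type {element.get('type')!r}")
        bbox = element.get("bbox")
        is_four_numbers = (
            isinstance(bbox, list)
            and len(bbox) == 4
            and all(isinstance(v, (int, float)) for v in bbox)
        )
        if not is_four_numbers:
            problems.append(f"element {i}: bbox must be [x1, y1, x2, y2]")
        elif not (bbox[0] < bbox[2] and bbox[1] < bbox[3]):
            problems.append(f"element {i}: bbox is not top-left to bottom-right")
    for color in record.get("dominant_colors", []):
        if not (isinstance(color, str) and HEX_COLOR.match(color)):
            problems.append(f"invalid hex color {color!r}")
    return problems
```

Running a check like this over every file before training is a cheap way to catch swapped corners or misspelled element types.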

Model Output

For any input ad image, the model returns:

{
  "elements": [
    {"type": "headline", "bbox": [50, 398, 507, 445], "priority": 1, "must_preserve": true},
    {"type": "logo",     "bbox": [156, 305, 183, 341], "priority": 1, "must_preserve": true},
    {"type": "cta",      "bbox": [112, 109, 530, 368], "priority": 1, "must_preserve": true},
    {"type": "background","bbox": [0, 0, 588, 588],   "priority": 1, "must_preserve": true}
  ],
  "dominant_colors": ["#7B2FBE", "#ffffff"],
  "aspect_ratio": "1:1",
  "platform_guess": "instagram"
}

bbox format: [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner.
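
Downstream code often needs a box's width and height, or coordinates that do not depend on image resolution. The helpers below are a small sketch of both conversions. They assume the coordinates are pixels in the same space as the `background` box (588 × 588 in the example above), which the card does not state explicitly; the function names are hypothetical.

```python
def bbox_size(bbox):
    """Return (width, height) of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bbox
    return x2 - x1, y2 - y1


def normalize_bbox(bbox, image_width, image_height):
    """Convert pixel coordinates to fractions of the image size (0.0 to 1.0)."""
    x1, y1, x2, y2 = bbox
    return [
        x1 / image_width,
        y1 / image_height,
        x2 / image_width,
        y2 / image_height,
    ]


if __name__ == "__main__":
    headline = [50, 398, 507, 445]  # headline box from the example output
    print(bbox_size(headline))
    print(normalize_bbox(headline, 588, 588))
```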

Usage

from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor
from peft import PeftModel
from PIL import Image
import torch, json

# Load base model + LoRA
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)
model = PeftModel.from_pretrained(model, "builditwithgk/adresyze-lora")
model.eval()

processor = Qwen2_5_VLProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Load your ad image
image = Image.open("your_ad.jpg").convert("RGB")

PROMPT = """Analyze this advertisement image. Return ONLY a JSON object:
{
  "elements": [{"type": "logo|product|cta|headline|background|other",
    "bbox": [x1, y1, x2, y2], "priority": 1, "must_preserve": true}],
  "dominant_colors": ["#hex1", "#hex2"],
  "aspect_ratio": "1:1|9:16|4:5|1.91:1|other",
  "platform_guess": "instagram|facebook|linkedin|google_display|other"
}
Return ONLY the JSON. No explanation."""

messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text",  "text": PROMPT}
]}]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Move inputs to wherever device_map="auto" placed the model
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512, do_sample=False)

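# Decode only the newly generated tokens, not the echoed prompt
generated = output[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]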
response = response.replace("```json", "").replace("```", "")
if "assistant" in response:
    response = response.split("assistant")[-1]

result = json.loads(response[response.find("{"):response.rfind("}")+1])
print(result)
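
The final `json.loads` call above raises an exception if the model returns malformed or truncated JSON (for example when `max_new_tokens` cuts the output short). A more defensive variant might look like the sketch below; `extract_layout_json` is a hypothetical helper name, and returning `None` on failure is one possible design choice rather than the behavior of the AdResyze pipeline.

```python
import json

FENCE = "`" * 3  # Markdown code fence the model sometimes wraps around its JSON


def extract_layout_json(response: str):
    """Return the parsed JSON object from a model response, or None on failure."""
    cleaned = response.replace(FENCE + "json", "").replace(FENCE, "")
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end <= start:
        return None  # no complete {...} span found
    try:
        parsed = json.loads(cleaned[start:end + 1])
    except json.JSONDecodeError:
        return None  # braces present but contents are not valid JSON
    return parsed if isinstance(parsed, dict) else None
```

A caller can then retry generation or skip the image when the result is `None`, instead of crashing mid-batch.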

Requirements

pip install transformers==4.56.1 peft accelerate qwen-vl-utils pillow torch

Limitations

  • Trained on 260 samples, so it works best on standard digital ad formats
  • Bounding boxes are approximate: suitable for layout understanding, not pixel-perfect segmentation
  • Optimized for Indian brand ad conventions
  • Base model license applies: Qwen2.5-VL is Apache 2.0
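
To quantify how approximate a predicted box is on your own ads, intersection over union (IoU) against a hand-drawn reference box is a common measure. The sketch below is a generic IoU implementation for [x1, y1, x2, y2] boxes; it is not the evaluation method used for this model, which the card does not describe.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes (0.0 to 1.0)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```

An IoU of 1.0 means the boxes coincide exactly; values near 0 mean they barely overlap.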

Full Pipeline

This model is the AI core of AdResyze, a complete ad resizing pipeline:

Ad Image → Qwen2.5-VL + LoRA → Layout JSON → Pillow Executor → 5 Platform Formats

The executor scales the entire composition intelligently per platform, with 3 background fill styles (solid color, blur stretch, mirror edge) and SSIM-based logo safety checking.
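
As a rough illustration of scaling a whole composition onto a new canvas, the sketch below uses a simple "fit inside and center" strategy and maps element boxes into the target space. This is one plausible approach, not the AdResyze executor's actual algorithm. The canvas sizes in `PLATFORM_SIZES` are assumed values that do not come from this card, and the function names are hypothetical.

```python
# Assumed target canvas sizes in pixels; the model card does not list them.
PLATFORM_SIZES = {
    "instagram_story": (1080, 1920),
    "instagram_feed": (1080, 1080),
    "facebook_feed": (1200, 628),
    "linkedin_banner": (1584, 396),
    "google_display": (300, 250),
}


def fit_composition(src_w, src_h, dst_w, dst_h):
    """Scale the source to fit inside the target canvas and center it.

    Returns (scale, offset_x, offset_y). The uncovered border is the area a
    background fill (solid color, blur, mirror) would have to paint.
    """
    scale = min(dst_w / src_w, dst_h / src_h)
    offset_x = (dst_w - src_w * scale) / 2
    offset_y = (dst_h - src_h * scale) / 2
    return scale, offset_x, offset_y


def map_bbox(bbox, scale, offset_x, offset_y):
    """Map an [x1, y1, x2, y2] box from source space into the target canvas."""
    x1, y1, x2, y2 = bbox
    return [
        round(x1 * scale + offset_x),
        round(y1 * scale + offset_y),
        round(x2 * scale + offset_x),
        round(y2 * scale + offset_y),
    ]


if __name__ == "__main__":
    for name, (w, h) in PLATFORM_SIZES.items():
        print(name, fit_composition(588, 588, w, h))
```

Mapping the logo box through the same transform gives the region a later safety check could compare against the original.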

Pipeline stack: Modal (inference) → n8n (orchestration) → FastAPI/Pillow (executor) → Supabase (storage + DB)

Citation

If you use this model or dataset in your work, please cite:

@misc{adresyze2026,
  author = {builditwithgk},
  title  = {AdResyze: Fine-tuned Qwen2.5-VL for Advertisement Layout Understanding},
  year   = {2026},
  url    = {https://huggingface.co/builditwithgk/adresyze-lora}
}

License

Apache 2.0, same as the base model. Commercial use permitted.
