Qwen-Image-2512 LoRA for Professional Advertisement Generation

80 Hours. One RTX 3090. Zero Cloud Compute.

| Metric | Value |
|---|---|
| Training Time | 80 hours on a single consumer GPU |
| Base Model | Qwen-Image-2512 (20.5B parameters) |
| Dataset | 12,917 curated images + 25,834 VLM-enriched captions |
| GPU Hardware | NVIDIA RTX 3090 — 24 GB VRAM, consumer-grade ($1,500) |
| Adapter Size | 1.7 GB (428M trainable params, 4.02% of transformer) |
| Resolution | 2-stage curriculum: 512px → 768px |
| Techniques | DoRA, rsLoRA, Min-SNR-gamma, 4-bit QLoRA, embedding pre-caching |

Rayyan Ahmed — GitHub | LinkedIn | ModelScope | HuggingFace

Part of the MarketMind AI Marketing Platform


Competition Entry: 2026 Qwen-Image LoRA Training Competition — Track 1: AI for Production (branding, e-commerce, industrial design). Hosted by Tongyi Lab + ModelScope.


TL;DR

A 1.7 GB LoRA adapter that transforms the 20.5 billion parameter Qwen-Image-2512 into a professional advertisement and product photography engine. Trained on 12,917 curated Pexels images across 40 advertising categories using a two-stage resolution curriculum (512px → 768px) on a single NVIDIA RTX 3090 — a consumer GPU with just 24 GB of VRAM. No cloud compute, no A100, no H100. Just engineering.

The entire pipeline — dataset curation, VLM caption enrichment, multi-stage training, evaluation, and deployment — was designed from scratch to operate within the hard constraints of consumer hardware, accumulating 80+ hours of continuous GPU training.


Showcase Gallery

All images generated at 768px with the LoRA adapter using True CFG guidance (scale=4.0, 100 inference steps).

Product Photography Across Industries


Luxury Fragrance
Crystal perfume bottle, dramatic Rembrandt lighting

Technology
Flagship smartphone, clean studio lighting

Food & Beverage
Artisan cappuccino, warm morning tones

Automotive
Luxury SUV, golden hour mountain road

Fashion & Jewelry
Luxury timepiece, moody dramatic lighting

Sports & Fitness
Performance running shoe, dynamic floating shot

Skincare & Beauty
Glass serum bottle, marble surface, natural light

Real Estate
Penthouse interior, city skyline view

Seed Diversity — Same Prompt, Three Different Seeds

Demonstrating the model's ability to generate diverse, high-quality variations from the same prompt.

Seed 42 Seed 2026 Seed 777
Luxury Perfume
Technology
Fitness
Real Estate

Ad Overlays — Production-Ready Output


Tech Ad

Luxury Ad

Food Ad

Auto Ad

LoRA vs Base Model Comparison

Generation Parameters
  • Resolution: 768×768 (showcase), 100 inference steps
  • True CFG Scale: 4.0 with negative prompt
  • Seeds: 42, 2026, 777 (3 seeds × 8 prompts = 24 LoRA showcase images)
  • Precision: Full bf16 with pre-merged DoRA weights
  • Hardware: Single RTX 3090 (24 GB VRAM) + 64 GB RAM
  • LoRA Scale: 0.65

The Engineering Challenge

Fitting 20.5 Billion Parameters on a $1,500 Consumer GPU

Qwen-Image-2512 has a 20.5 billion parameter S3-DiT transformer. Loading it in bf16 requires 41 GB of VRAM. My RTX 3090 has 24 GB.

This isn't a "just reduce batch size" problem. Every architectural decision β€” from training strategy to inference pipeline β€” was dictated by this hard constraint. Here's how I solved it:

| Challenge | Solution | Impact |
|---|---|---|
| 41 GB model vs 24 GB VRAM | 4-bit NF4 quantization during training | Fits in ~10 GB VRAM |
| Text encoder takes 6 GB | Pre-cache all 25,834 embeddings to disk, then unload encoder | Freed 6 GB for larger LoRA rank |
| Need bf16 for DoRA inference | Custom shard-by-shard LoRA merge + disk offload pipeline | Full quality on consumer hardware |
| 80 hours of training | Graceful SIGTERM/SIGINT checkpointing + resume system | Survived power events & crashes |
| Windows pagefile limits | Custom Python mmap for safetensors + commit-aware device mapping | Zero-copy model loading |
| Mixed resolution dataset | Two-stage curriculum (512px → 768px) | 2.25× faster than 768px-only training |
| CPU RAM pressure during inference | Commit-aware device mapping via ctypes MEMORYSTATUSEX | Dynamic GPU/CPU/disk layer placement |
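The checkpoint-and-resume strategy boils down to one small pattern: never die mid-step. A minimal sketch of graceful-stop handling (the `save_checkpoint` helper is hypothetical, not the repository's actual API):

```python
import signal

class GracefulStopper:
    """Flip a flag on SIGTERM/SIGINT so the training loop can finish the
    current step, write a checkpoint, and exit cleanly instead of dying
    mid-optimizer-step."""

    def __init__(self):
        self.stop_requested = False
        signal.signal(signal.SIGTERM, self._handle)
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        self.stop_requested = True  # checked once per step, never mid-step

stopper = GracefulStopper()
for step in range(11_500):
    # ... forward / backward / optimizer.step() ...
    if stopper.stop_requested:
        # save_checkpoint(step)  # hypothetical helper; resume later with
        break                    # --resume-from-checkpoint
```

Checking the flag at step boundaries guarantees optimizer state and LoRA weights are always consistent when the checkpoint is written.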

80 Hours of Continuous GPU Time

This was not a quick weekend project. The training ran for 80 continuous hours on a single GPU:

Stage 1: 512px Resolution — Learning Composition
  Steps:     10,000 (~3.4 epochs over 12,917 images)
  Speed:     ~22.5 seconds/step
  Duration:  ~63 hours
  Final loss: 0.96

Stage 2: 768px Resolution — Sharpening Detail
  Steps:     1,500 (fine-tuning from step 10,000)
  Speed:     ~36 seconds/step
  Duration:  ~17 hours
  Final loss: 0.17

Total:       11,500 steps | ~80 hours | Single RTX 3090

For context: this same training would take ~20 hours on an A100 or ~12 hours on an H100. I didn't have access to either. Every optimization mattered.

The VRAM Budget

Component                         VRAM Usage
─────────────────────────────────────────────
Transformer (4-bit NF4)           ~10.0 GB
LoRA adapters (720 layers)         ~1.5 GB
VAE (bf16)                         ~0.5 GB
Optimizer states (AdamW 8-bit)     ~3.0 GB
Activations (grad checkpoint)      ~4.0 GB
Overhead & buffers                 ~3.0 GB
─────────────────────────────────────────────
TOTAL                             ~22.0 GB / 24 GB available
Text encoder: UNLOADED (embeddings pre-cached to disk, saving 6 GB)
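The pre-caching trick that frees those 6 GB generalizes to any frozen text encoder. A toy sketch of the pattern (the real pipeline encodes with Qwen2.5-VL and stores tensors; the stand-in `encode_fn` and pickle storage here are purely illustrative):

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

def caption_key(caption: str) -> str:
    """Stable cache key derived from the caption text."""
    return hashlib.sha256(caption.encode("utf-8")).hexdigest()

def precache_embeddings(captions, encode_fn, cache_dir: Path):
    """Encode every caption once and persist the result to disk.

    After this runs, the (large) text encoder can be unloaded and
    training reads embeddings straight from the cache."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    for cap in captions:
        path = cache_dir / f"{caption_key(cap)}.pkl"
        if not path.exists():  # encode each caption exactly once
            path.write_bytes(pickle.dumps(encode_fn(cap)))

def load_embedding(caption: str, cache_dir: Path):
    """Fetch a cached embedding without touching the encoder."""
    return pickle.loads((cache_dir / f"{caption_key(caption)}.pkl").read_bytes())

# Demo with a toy encoder standing in for the 6 GB text encoder
cache = Path(tempfile.mkdtemp())
precache_embeddings(["studio lighting", "golden hour"], lambda s: [len(s)], cache)
print(load_embedding("studio lighting", cache))  # [15]
```

Because captions are fixed for the whole run, the cache is written once and read for all 11,500 steps.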

8 Technical Innovations

| # | Innovation | What & Why |
|---|---|---|
| 1 | DoRA + rsLoRA | Weight-Decomposed LoRA closes the gap with full fine-tuning; rsLoRA stabilizes high-rank training at r=64 |
| 2 | 12 target modules per block | Most LoRAs target 4 attention projections. This adapter targets all 12 linear layers (attention + MLP, both image and text streams) for deeper style adaptation |
| 3 | Two-stage resolution curriculum | 10K steps at 512px learns composition; 1.5K steps at 768px sharpens detail. 2.25× more efficient than full-resolution training |
| 4 | Min-SNR-gamma=5.0 | Flow matching loss weighting that prevents wasting gradient updates on trivially noisy timesteps |
| 5 | VLM dual-caption enrichment | Qwen3-VL-8B generated 25,834 captions: detailed paragraphs + concise tags. Stochastic 70/30 mixing during training for robust text alignment |
| 6 | Adaptive rank pattern | r=64 for attention modules, r=32 for MLP modules. MLPs need less capacity — saving VRAM where it matters least |
| 7 | Text embedding pre-caching | Encode all captions once, persist to disk, unload the 6 GB text encoder. Train indefinitely on cached embeddings |
| 8 | Custom 12,917-image dataset | Built from scratch via the Pexels API — filtered, deduplicated, VLM-enriched, and published as an open dataset |

The DoRA Discovery

After 80 hours of training, I loaded the adapter at 4-bit quantization. The output was garbage. I tried 8-bit — still garbage. After two days of debugging, I tried full bf16 and the output was perfect.

DoRA breaks under quantization. Its magnitude-direction weight decomposition compounds rounding errors from both components, corrupting the learned adaptation. Standard LoRA tolerates quantization fine; DoRA does not.

This is not documented anywhere in PEFT, the DoRA paper, or the HuggingFace ecosystem. I'm documenting it here so others don't waste two days.

| Inference Method | Quality | Explanation |
|---|---|---|
| bf16 + PeftModel | Professional, sharp | Full DoRA precision preserved |
| 8-bit quantized | Garbled | Magnitude-direction decomposition corrupted |
| 4-bit quantized | Garbled | Same issue, worse |
| merge_and_unload (any quant) | Garbled | PEFT itself warns of "rounding errors" |

Evaluation Results

Quantitative Metrics (n=20 per model, identical prompts & seeds)

| Model | Aesthetic Score | CLIP Score | Aesthetic Std |
|---|---|---|---|
| Base Qwen-Image-2512 (8-bit + True CFG) | 5.50 ± 1.56 | 0.1849 ± 0.061 | 1.56 |
| + LoRA Adapter (bf16, this model) | 5.56 ± 1.06 | 0.1188 ± 0.038 | 1.06 |

Key findings:

  • Aesthetic score improved (+1.1%) — the LoRA generates more visually appealing ad imagery
  • Consistency dramatically improved — standard deviation dropped from 1.56 to 1.06 (a 32% reduction). The adapter produces reliably good output rather than high-variance results
  • CLIP score trade-off — expected and intentional. The LoRA was trained to produce advertising-style imagery, which prioritizes aesthetic impact and mood over literal text-to-image correspondence. A perfume ad doesn't literally depict "dramatic Rembrandt lighting" — it creates Rembrandt lighting
  • Lower CLIP variance (0.038 vs 0.061) confirms more consistent style adherence across diverse advertising categories

A note on metrics for domain-specific models: CLIP score measures general text-image similarity using ViT-L-14. For specialized models like advertisement generators, aesthetic quality and output consistency are more meaningful production metrics than raw text alignment. The showcase gallery above demonstrates actual output quality across 8 industries and 3 seeds.


Dataset: Pexels Advertisement Photography

Open dataset: RayyanAhmed9477/pexels-advertisement-photography (12,917 images, 25,834 captions)

Built entirely from scratch for this project:

Pexels API (40 targeted advertising search queries)
    │
    ├── Automated Scraping Pipeline
    │     12,917 high-resolution images collected
    │     40 categories: product photography, luxury goods, food styling,
    │     automotive, fashion, technology, real estate, cosmetics, sports,
    │     jewelry, furniture, beverages, electronics, watches, perfume...
    │
    ├── Quality Filtering
    │     Resolution ≥ 512px (shortest side)
    │     Aesthetic scoring + perceptual deduplication (imagehash)
    │     Avg resolution: 842×886 px (median: 768×1024)
    │
    └── VLM Caption Enrichment (Qwen3-VL-8B, ~6.5 hours)
          Dual captions per image:
            caption_long:  150-250 word paragraph (layout, style, lighting, composition)
            caption_short: Concise tags ("product photography, smartphone, studio lighting")
          → 25,834 total captions generated
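The stochastic 70/30 long/short caption mixing amounts to one branch at dataloading time. A sketch (field names follow the dataset's caption_long/caption_short columns; the exact sampling code in the training script may differ):

```python
import random

def pick_caption(caption_long, caption_short, p_long=0.7, rng=random):
    """Return the long VLM caption ~70% of the time and the short tag
    caption otherwise, so the model learns from both detailed and terse
    prompt styles."""
    return caption_long if rng.random() < p_long else caption_short

# Over many draws the split converges to roughly 70/30
rng = random.Random(0)
draws = [pick_caption("long", "short", rng=rng) for _ in range(10_000)]
print(round(draws.count("long") / len(draws), 2))
```

Passing an explicit `random.Random` instance keeps the mixing reproducible across resumed training runs.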

Why I built my own dataset (I tried an existing one first):

| | ADImgNet (first attempt) | Pexels (this model) |
|---|---|---|
| Images | 3,201 (filtered from 9,003) | 12,917 |
| Max resolution | 300px | 1024px+ |
| Avg resolution | ~280px | 842×886px |
| Categories | Academic labels | 40 real commercial ad categories |
| Captions | Original only | Dual VLM-enriched (25,834) |
| Publicly available | No | Yes — open dataset |

Model Architecture

Qwen-Image-2512 (20.5B params, S3-DiT, Flow Matching)
    │
    ├── Text Encoder: Qwen2.5-VL
    │     Processes prompt → hidden states [batch, seq_len, 3584]
    │     Pre-cached during training → unloaded to save 6 GB VRAM
    │
    ├── Transformer: 60 S3-DiT blocks (10.6B params)
    │     Each block → 12 LoRA-adapted linear layers:
    │
    │     Image Stream Attention [rank=64, DoRA]:
    │       to_q, to_k, to_v, to_out.0
    │
    │     Text Stream Attention [rank=64, DoRA]:
    │       add_q_proj, add_k_proj, add_v_proj, to_add_out
    │
    │     Image MLP [rank=32, DoRA]:
    │       img_mlp.net.0.proj, img_mlp.net.2
    │
    │     Text MLP [rank=32, DoRA]:
    │       txt_mlp.net.0.proj, txt_mlp.net.2
    │
    │     Total: 720 adapted layers | 428M trainable params (4.02%)
    │
    └── VAE: AutoencoderKLQwenImage
          Decodes latents → images

LoRA Configuration

peft_type: LORA
r: 64                      # Primary rank (attention modules)
lora_alpha: 128             # Scaling factor (effective scale = alpha/sqrt(r) via rsLoRA)
lora_dropout: 0.05
use_dora: true              # Weight-Decomposed LoRA (requires bf16 inference!)
use_rslora: true            # Rank-Stabilized scaling
rank_pattern:               # Adaptive rank per module type
  img_mlp: 32
  txt_mlp: 32
alpha_pattern:
  img_mlp: 64
  txt_mlp: 64
target_modules:             # 12 modules × 60 blocks = 720 adapted layers
  - to_q, to_k, to_v, to_out.0
  - add_q_proj, add_k_proj, add_v_proj, to_add_out
  - img_mlp.net.0.proj, img_mlp.net.2
  - txt_mlp.net.0.proj, txt_mlp.net.2
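For PEFT users, the YAML above corresponds roughly to the following LoraConfig. This is a sketch: the field values mirror the config shown, but exact rank_pattern/alpha_pattern key matching depends on PEFT's module-name regex semantics, and DoRA + rsLoRA together requires a recent PEFT (the stack below lists 0.18.0).

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                  # primary rank (attention modules)
    lora_alpha=128,
    lora_dropout=0.05,
    use_dora=True,         # weight-decomposed LoRA (bf16-only at inference!)
    use_rslora=True,       # scale = alpha / sqrt(r) instead of alpha / r
    rank_pattern={"img_mlp": 32, "txt_mlp": 32},   # MLPs get lower rank
    alpha_pattern={"img_mlp": 64, "txt_mlp": 64},
    target_modules=[
        "to_q", "to_k", "to_v", "to_out.0",
        "add_q_proj", "add_k_proj", "add_v_proj", "to_add_out",
        "img_mlp.net.0.proj", "img_mlp.net.2",
        "txt_mlp.net.0.proj", "txt_mlp.net.2",
    ],
)
```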

Quick Start

Installation

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install diffusers transformers peft accelerate

IMPORTANT: bf16 Precision Required

DoRA + Quantization = Degraded Output. This model uses DoRA (Weight-Decomposed LoRA) which decomposes weight updates into magnitude and direction components. Quantization corrupts these components. You must run inference at bf16 precision.

Inference (CPU Offload — Any GPU, ~48 GB System RAM)

import torch
from diffusers import DiffusionPipeline
from peft import PeftModel

model_id = "Qwen/Qwen-Image-2512"
lora_id = "RayyanAhmed9477/qwen-image-2512-lora-advertisement"

# Load at bf16 (REQUIRED for DoRA)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.transformer = PeftModel.from_pretrained(pipe.transformer, lora_id)
pipe.enable_model_cpu_offload()  # Needs ~48 GB RAM, ~8 GB VRAM

# Generate β€” must use true_cfg_scale (guidance_scale is IGNORED for this model)
image = pipe(
    prompt="Professional product advertisement photograph of a premium smartphone, "
           "hero shot on clean gradient background, dramatic studio lighting, "
           "commercial photography, 8K quality",
    negative_prompt="",          # Required for true CFG to activate
    true_cfg_scale=4.0,
    num_inference_steps=30,
    height=768, width=768,
    generator=torch.Generator(device="cpu").manual_seed(42),
).images[0]

image.save("ad_output.png")

Inference (Full GPU — A100/H100 with 50+ GB VRAM)

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image-2512", torch_dtype=torch.bfloat16)
pipe.transformer = PeftModel.from_pretrained(
    pipe.transformer, "RayyanAhmed9477/qwen-image-2512-lora-advertisement"
)
pipe.to("cuda")  # Requires ~50 GB VRAM

image = pipe(
    prompt="High-end luxury perfume advertisement, crystal bottle on black surface, "
           "dramatic lighting, premium commercial photography",
    negative_prompt="",
    true_cfg_scale=4.0,
    num_inference_steps=30,
    height=768, width=768,
    generator=torch.Generator(device="cpu").manual_seed(42),
).images[0]

Prompt Engineering Guide

The model responds best to structured prompts:

[Photography type] of [product/subject], [setting/background],
[lighting description], [photography style], [quality keywords]
| Industry | Example Prompt |
|---|---|
| Technology | "Professional product advertisement of a flagship smartphone, clean gradient background, three-point studio lighting, commercial photography, ultra sharp" |
| Luxury | "High-end luxury fragrance advertisement, crystal perfume bottle on glossy black surface, dramatic Rembrandt lighting, premium commercial photography" |
| Food | "Premium artisan coffee advertisement, cappuccino with latte art in white ceramic cup, warm golden morning lighting, food photography" |
| Automotive | "Luxury automotive advertisement, sleek black SUV on mountain road at golden hour, cinematic lighting, professional automotive photography" |
| Fashion | "Premium luxury watch advertisement, timepiece on dark velvet, dramatic side lighting highlighting metallic details, editorial jewelry photography" |
| Real Estate | "Luxury penthouse interior advertisement, modern living room with floor-to-ceiling windows, warm ambient lighting, architectural photography" |
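If you are generating prompts programmatically, the template can be wrapped in a tiny helper (`build_ad_prompt` is a hypothetical convenience for illustration, not part of any shipped API):

```python
def build_ad_prompt(photo_type, subject, setting, lighting, style,
                    quality="8K quality, ultra sharp"):
    """Assemble a prompt following the structure above:
    [type] of [subject], [setting], [lighting], [style], [quality]."""
    return f"{photo_type} of {subject}, {setting}, {lighting}, {style}, {quality}"

prompt = build_ad_prompt(
    "Professional product advertisement photograph",
    "a premium smartphone",
    "clean gradient background",
    "three-point studio lighting",
    "commercial photography",
)
print(prompt)
```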

Training Configuration

Stage 1: Composition Learning (512px) — 63 Hours

Resolution:             512x512
Steps:                  10,000 (~3.4 epochs)
Speed:                  ~22.5 seconds/step
Duration:               ~63 hours
Batch Size:             1 (gradient accumulation = 4, effective batch = 4)
Optimizer:              AdamW 8-bit (lr=1e-4, cosine schedule, 300 warmup steps)
Loss:                   Min-SNR-gamma=5.0 weighted flow matching MSE
Noise Offset:           0.05 (improves contrast and dark areas)
Mixed Precision:        bf16 with gradient checkpointing
Final Loss:             0.96
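The Min-SNR-gamma=5.0 weighting above follows the Min-SNR formulation of Hang et al.; the sketch below shows the standard epsilon-prediction form, where each timestep's loss weight is capped at gamma (the training script's exact flow-matching variant may differ slightly):

```python
def min_snr_weight(snr, gamma=5.0):
    """Min-SNR loss weight: cap each timestep's influence at gamma.

    Near-clean timesteps have huge SNR; without the cap they would
    dominate gradient updates on nearly trivial denoising targets."""
    return min(snr, gamma) / snr

print(min_snr_weight(0.5))   # 1.0  (noisy timestep, full weight)
print(min_snr_weight(50.0))  # 0.1  (near-clean timestep, down-weighted)
```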

Stage 2: Detail Sharpening (768px) — 17 Hours

Resolution:             768x768 (resume from checkpoint-10000)
Steps:                  1,500 (total: 11,500)
Speed:                  ~36 seconds/step
Duration:               ~17 hours
Learning Rate:          3e-5 (lowered to preserve Stage 1 features)
Warmup:                 50 steps
Final Loss:             0.17

Why curriculum learning works: Low resolution teaches what makes a good advertisement (composition, color harmony, product framing). High resolution then teaches how to render it sharply (edges, textures, fine detail). The lower learning rate in Stage 2 prevents catastrophic forgetting of the compositional understanding learned in Stage 1.

Training Curves

Individual training plots

Training Loss

Steady descent across both stages. The jump at step 10,000 reflects the resolution change from 512px to 768px — the model quickly adapts.

Learning Rate Schedule

Stage 1: Linear warmup to 1e-4, cosine decay over 10,000 steps. Stage 2: Warm restart to 3e-5, cosine decay over 1,500 steps.

Gradient Norm

Gradient clipping at 1.0 maintains stability across both resolution stages.

VRAM Usage

Consistent ~22 GB throughout, well within the RTX 3090's 24 GB limit.


Hardware & Reproducibility

Training Hardware

| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 3090 — 24 GB VRAM, consumer-grade ($1,500) |
| RAM | 64 GB DDR4 |
| OS | Windows 11 Pro |
| Stage 1 Duration | ~63 hours (10,000 steps × 22.5 s/step) |
| Stage 2 Duration | ~17 hours (1,500 steps × 36 s/step) |
| Total Training | ~80 hours on a single consumer GPU |
| Dataset Preparation | ~8 hours (API scraping + filtering + VLM captioning) |
| Evaluation & Showcase | ~4 hours (40 eval images + 24 showcase images) |
| Total Project GPU Time | ~92 hours |

Inference Requirements

| Setup | VRAM | RAM | Speed (768px, 30 steps) | Quality |
|---|---|---|---|---|
| bf16 + CPU offload | ~8 GB | ~48 GB | ~120 s | Full quality |
| bf16 full GPU (A100) | ~50 GB | ~16 GB | ~20 s | Full quality |
| 4-bit/8-bit (any) | ~12 GB | ~20 GB | ~35 s | Degraded (DoRA issue) |

Reproduce Training

git clone https://github.com/Rayyan9477/MarketMind
cd MarketMind/image_model_finetuning
pip install -r requirements.txt

# Stage 1: ~63 hours on RTX 3090
python train_lora.py --config configs/training_config.yaml

# Stage 2: ~17 hours on RTX 3090
python train_lora.py --config configs/training_config_hires.yaml \
    --resume-from-checkpoint outputs/checkpoints/checkpoint-10000

Lessons Learned

Building this model taught me more about engineering under constraints than any course or tutorial:

  1. Dataset quality > quantity > architecture. Switching from 300px ADImgNet images to 842px Pexels images was the single highest-impact change. No amount of LoRA rank or training steps can overcome a low-resolution, poorly-curated dataset.

  2. Resolution curriculum is free performance. Training at 512px (2.25× cheaper per step) teaches composition first; fine-tuning at 768px adds sharpness. You get both for less than the cost of training at 768px alone.

  3. DoRA is powerful but has hidden deployment costs. The quality improvement is real. The bf16-only requirement for inference is painful. For production systems where quantized serving matters, standard LoRA may be more practical.

  4. VLM caption enrichment was the highest-ROI per-image investment. Running Qwen3-VL-8B over every training image gave the model far richer training signal. Stochastic 70/30 mixing of long/short captions improves both text alignment and unconditional generation quality.

  5. Pre-caching embeddings is non-negotiable on consumer GPUs. The 6 GB savings from unloading the text encoder was the difference between "impossible on RTX 3090" and "fits with headroom to spare."

  6. Build monitoring and graceful resume from day one. Over 80 hours on Windows, things go wrong. My SIGTERM-based graceful pause + checkpoint resume system saved the project multiple times.

  7. Windows memory management is its own engineering domain. CUDA allocations consume Windows commit charge (RAM + pagefile). I built commit-aware device mapping that queries available system commit via ctypes before placing model layers across GPU, CPU, and disk. This is rarely documented but critical for large model work on consumer hardware.
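For lesson 7, the commit query itself reduces to a single Win32 call via ctypes. A sketch of the binding (the layer-placement policy built on top of it is the project's own and not shown here):

```python
import ctypes
import sys

class MEMORYSTATUSEX(ctypes.Structure):
    """Mirror of the Win32 MEMORYSTATUSEX struct from winbase.h.

    c_uint32/c_uint64 are used instead of c_ulong so the layout matches
    Win32's fixed-width DWORD/DWORDLONG on every platform."""
    _fields_ = [
        ("dwLength", ctypes.c_uint32),
        ("dwMemoryLoad", ctypes.c_uint32),
        ("ullTotalPhys", ctypes.c_uint64),
        ("ullAvailPhys", ctypes.c_uint64),
        ("ullTotalPageFile", ctypes.c_uint64),
        ("ullAvailPageFile", ctypes.c_uint64),
        ("ullTotalVirtual", ctypes.c_uint64),
        ("ullAvailVirtual", ctypes.c_uint64),
        ("ullAvailExtendedVirtual", ctypes.c_uint64),
    ]

def available_commit_bytes():
    """Remaining Windows commit (RAM + pagefile headroom).

    ullAvailPageFile is the commit limit minus current commit charge,
    which is what CUDA host allocations actually consume on Windows."""
    if sys.platform != "win32":
        raise OSError("GlobalMemoryStatusEx is Windows-only")
    status = MEMORYSTATUSEX()
    status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
    ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status))
    return status.ullAvailPageFile
```

A placement loop can then call `available_commit_bytes()` before offloading each shard to CPU and fall back to disk when headroom runs low.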


File Structure

qwen-image-2512-lora-advertisement/
├── adapter_config.json              # LoRA config (DoRA, rsLoRA, 12 target modules)
├── adapter_model.safetensors        # LoRA weights (1.7 GB)
├── training_config.yaml             # Stage 1 config (512px, 10K steps)
├── training_config_hires.yaml       # Stage 2 config (768px, 1.5K steps)
├── evaluation_results_lora.json     # LoRA evaluation metrics (n=20)
├── evaluation_results_base.json     # Base model evaluation metrics (n=20)
├── README.md                        # This file
├── showcase/                        # Generated showcase gallery
│   ├── v2_seed_*_*.png              # LoRA outputs (8 categories × 3 seeds = 24 images)
│   ├── lora_*.png                   # Additional LoRA samples
│   ├── base_*.png                   # Base model outputs for comparison
│   ├── *_ad.png                     # Ad overlay composites
│   └── comparison_grid.png          # Side-by-side comparison
├── plots/                           # Training visualization
│   ├── training_overview.png        # Combined 2×2 overview
│   ├── training_loss.png
│   ├── learning_rate.png
│   ├── grad_norm.png
│   └── vram_usage.png
└── samples/                         # Additional evaluation samples

Troubleshooting

Images look garbled or blurry

You're running at quantized precision. This LoRA uses DoRA which requires bf16:

  • Ensure torch_dtype=torch.bfloat16 in from_pretrained()
  • Do NOT use BitsAndBytesConfig or load_in_4bit/load_in_8bit
  • Do NOT call merge_and_unload() — use PeftModel wrapping instead
Images don't match the prompt

Qwen-Image-2512 ignores guidance_scale. You must use true_cfg_scale:

pipe(prompt="...", negative_prompt="", true_cfg_scale=4.0)
Out of memory

Use CPU offload (needs ~48 GB system RAM but only ~8 GB VRAM):

pipe.enable_model_cpu_offload()  # Instead of pipe.to("cuda")
QwenImagePipeline not found

Update diffusers: pip install --upgrade "diffusers>=0.33.0" (the quotes stop the shell from treating >= as a redirect)


Citation

@misc{ahmed2026qwenimagelora,
  title   = {Qwen-Image-2512 LoRA for Professional Advertisement Generation:
             80 Hours of DoRA + rsLoRA Fine-Tuning on Consumer Hardware},
  author  = {Rayyan Ahmed},
  year    = {2026},
  url     = {https://huggingface.co/RayyanAhmed9477/qwen-image-2512-lora-advertisement},
  note    = {80 hours training on a single NVIDIA RTX 3090. 12,917 Pexels images,
             two-stage resolution curriculum (512px to 768px), 11,500 steps,
             DoRA + rsLoRA with Min-SNR-gamma flow matching loss.
             Part of the MarketMind AI Marketing Platform.}
}

Acknowledgments


Built entirely on a single NVIDIA RTX 3090 | 80 hours of training | 12,917 images | 25,834 VLM captions

Stack: PEFT 0.18.0 | Diffusers 0.37.0 | DoRA + rsLoRA | Flow Matching | Min-SNR-gamma=5.0

Created by Rayyan Ahmed as part of the MarketMind AI Marketing Platform
