Qwen-Image-2512 LoRA for Professional Advertisement Generation

80 Hours. One RTX 3090. Zero Cloud Compute.

| Metric | Value |
|---|---|
| Training Time | 80 hours on a single consumer GPU |
| Base Model | Qwen-Image-2512 (20.5B parameters) |
| Dataset | 12,917 curated images + 25,834 VLM-enriched captions |
| GPU Hardware | NVIDIA RTX 3090 — 24 GB VRAM, consumer-grade ($1,500) |
| Adapter Size | 1.7 GB (428M trainable params, 4.02% of transformer) |
| Resolution | 2-stage curriculum: 512px → 768px |
| Techniques | DoRA, rsLoRA, Min-SNR-gamma, 4-bit QLoRA, embedding pre-caching |

Rayyan Ahmed — GitHub | LinkedIn | ModelScope | HuggingFace

Part of the MarketMind AI Marketing Platform


Competition Entry: 2026 Qwen-Image LoRA Training Competition — Track 1: AI for Production (branding, e-commerce, industrial design). Hosted by Tongyi Lab + ModelScope.


TL;DR

A 1.7 GB LoRA adapter that transforms the 20.5 billion parameter Qwen-Image-2512 into a professional advertisement and product photography engine. Trained on 12,917 curated Pexels images across 40 advertising categories using a two-stage resolution curriculum (512px → 768px) on a single NVIDIA RTX 3090 — a consumer GPU with just 24 GB of VRAM. No cloud compute, no A100, no H100. Just engineering.

The entire pipeline — dataset curation, VLM caption enrichment, multi-stage training, evaluation, and deployment — was designed from scratch to operate within the hard constraints of consumer hardware, accumulating 80+ hours of continuous GPU training.


Showcase Gallery

All images generated at 768px with the LoRA adapter using True CFG guidance (scale=4.0, 100 inference steps).

Product Photography Across Industries


Luxury Fragrance
Crystal perfume bottle, dramatic Rembrandt lighting

Technology
Flagship smartphone, clean studio lighting

Food & Beverage
Artisan cappuccino, warm morning tones

Automotive
Luxury SUV, golden hour mountain road

Fashion & Jewelry
Luxury timepiece, moody dramatic lighting

Sports & Fitness
Performance running shoe, dynamic floating shot

Skincare & Beauty
Glass serum bottle, marble surface, natural light

Real Estate
Penthouse interior, city skyline view

Seed Diversity — Same Prompt, Three Different Seeds

Demonstrating the model's ability to generate diverse, high-quality variations from the same prompt.

Seed 42 Seed 2026 Seed 777
Luxury Perfume
Technology
Fitness
Real Estate

Ad Overlays — Production-Ready Output


Tech Ad

Luxury Ad

Food Ad

Auto Ad

LoRA vs Base Model Comparison

Generation Parameters
  • Resolution: 768×768 (showcase), 100 inference steps
  • True CFG Scale: 4.0 with negative prompt
  • Seeds: 42, 2026, 777 (3 seeds × 8 prompts = 24 LoRA showcase images)
  • Precision: Full bf16 with pre-merged DoRA weights
  • Hardware: Single RTX 3090 (24 GB VRAM) + 64 GB RAM
  • LoRA Scale: 0.65

The Engineering Challenge

Fitting 20.5 Billion Parameters on a $1,500 Consumer GPU

Qwen-Image-2512 has a 20.5 billion parameter S3-DiT transformer. Loading it in bf16 requires 41 GB of VRAM. My RTX 3090 has 24 GB.

This isn't a "just reduce batch size" problem. Every architectural decision β€” from training strategy to inference pipeline β€” was dictated by this hard constraint. Here's how I solved it:

| Challenge | Solution | Impact |
|---|---|---|
| 41 GB model vs 24 GB VRAM | 4-bit NF4 quantization during training | Fits in ~10 GB VRAM |
| Text encoder takes 6 GB | Pre-cache all 25,834 embeddings to disk, then unload encoder | Freed 6 GB for larger LoRA rank |
| Need bf16 for DoRA inference | Custom shard-by-shard LoRA merge + disk offload pipeline | Full quality on consumer hardware |
| 80 hours of training | Graceful SIGTERM/SIGINT checkpointing + resume system | Survived power events & crashes |
| Windows pagefile limits | Custom Python mmap for safetensors + commit-aware device mapping | Zero-copy model loading |
| Mixed resolution dataset | Two-stage curriculum (512px → 768px) | 2.25× faster than 768px-only training |
| CPU RAM pressure during inference | Commit-aware device mapping via ctypes MEMORYSTATUSEX | Dynamic GPU/CPU/disk layer placement |
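The checkpoint-and-resume strategy boils down to one small pattern: never die mid-step. A minimal sketch of graceful-stop handling (the `save_checkpoint` helper is hypothetical, not the repository's actual API):

```python
import signal

class GracefulStopper:
    """Flip a flag on SIGTERM/SIGINT so the training loop can finish the
    current step, write a checkpoint, and exit cleanly instead of dying
    mid-optimizer-step."""

    def __init__(self):
        self.stop_requested = False
        signal.signal(signal.SIGTERM, self._handle)
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        self.stop_requested = True  # checked once per step, never mid-step

stopper = GracefulStopper()
for step in range(11_500):
    # ... forward / backward / optimizer.step() ...
    if stopper.stop_requested:
        # save_checkpoint(step)  # hypothetical helper; resume later with
        break                    # --resume-from-checkpoint
```

Checking the flag at step boundaries guarantees optimizer state and LoRA weights are always consistent when the checkpoint is written.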

80 Hours of Continuous GPU Time

This was not a quick weekend project. The training ran for 80 continuous hours on a single GPU:

Stage 1: 512px Resolution — Learning Composition
  Steps:     10,000 (~3.4 epochs over 12,917 images)
  Speed:     ~22.5 seconds/step
  Duration:  ~63 hours
  Final loss: 0.96

Stage 2: 768px Resolution — Sharpening Detail
  Steps:     1,500 (fine-tuning from step 10,000)
  Speed:     ~36 seconds/step
  Duration:  ~17 hours
  Final loss: 0.17

Total:       11,500 steps | ~80 hours | Single RTX 3090

For context: this same training would take ~20 hours on an A100 or ~12 hours on an H100. I didn't have access to either. Every optimization mattered.

The VRAM Budget

Component                         VRAM Usage
─────────────────────────────────────────────
Transformer (4-bit NF4)           ~10.0 GB
LoRA adapters (720 layers)         ~1.5 GB
VAE (bf16)                         ~0.5 GB
Optimizer states (AdamW 8-bit)     ~3.0 GB
Activations (grad checkpoint)      ~4.0 GB
Overhead & buffers                 ~3.0 GB
─────────────────────────────────────────────
TOTAL                             ~22.0 GB / 24 GB available
Text encoder: UNLOADED (embeddings pre-cached to disk, saving 6 GB)
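The pre-caching trick that frees those 6 GB generalizes to any frozen text encoder. A toy sketch of the pattern (the real pipeline encodes with Qwen2.5-VL and stores tensors; the stand-in `encode_fn` and pickle storage here are purely illustrative):

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

def caption_key(caption: str) -> str:
    """Stable cache key derived from the caption text."""
    return hashlib.sha256(caption.encode("utf-8")).hexdigest()

def precache_embeddings(captions, encode_fn, cache_dir: Path):
    """Encode every caption once and persist the result to disk.

    After this runs, the (large) text encoder can be unloaded and
    training reads embeddings straight from the cache."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    for cap in captions:
        path = cache_dir / f"{caption_key(cap)}.pkl"
        if not path.exists():  # encode each caption exactly once
            path.write_bytes(pickle.dumps(encode_fn(cap)))

def load_embedding(caption: str, cache_dir: Path):
    """Fetch a cached embedding without touching the encoder."""
    return pickle.loads((cache_dir / f"{caption_key(caption)}.pkl").read_bytes())

# Demo with a toy encoder standing in for the 6 GB text encoder
cache = Path(tempfile.mkdtemp())
precache_embeddings(["studio lighting", "golden hour"], lambda s: [len(s)], cache)
print(load_embedding("studio lighting", cache))  # [15]
```

Because captions are fixed for the whole run, the cache is written once and read for all 11,500 steps.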

8 Technical Innovations

| # | Innovation | What & Why |
|---|---|---|
| 1 | DoRA + rsLoRA | Weight-Decomposed LoRA closes the gap with full fine-tuning; rsLoRA stabilizes high-rank training at r=64 |
| 2 | 12 target modules per block | Most LoRAs target 4 attention projections. This adapter targets all 12 linear layers (attention + MLP, both image and text streams) for deeper style adaptation |
| 3 | Two-stage resolution curriculum | 10K steps at 512px learns composition; 1.5K steps at 768px sharpens detail. 2.25× more efficient than full-resolution training |
| 4 | Min-SNR-gamma=5.0 | Flow matching loss weighting that prevents wasting gradient updates on trivially noisy timesteps |
| 5 | VLM dual-caption enrichment | Qwen3-VL-8B generated 25,834 captions: detailed paragraphs + concise tags. Stochastic 70/30 mixing during training for robust text alignment |
| 6 | Adaptive rank pattern | r=64 for attention modules, r=32 for MLP modules. MLPs need less capacity — saving VRAM where it matters least |
| 7 | Text embedding pre-caching | Encode all captions once, persist to disk, unload the 6 GB text encoder. Train indefinitely on cached embeddings |
| 8 | Custom 12,917-image dataset | Built from scratch via the Pexels API — filtered, deduplicated, VLM-enriched, and published as an open dataset |

The DoRA Discovery

After 80 hours of training, I loaded the adapter at 4-bit quantization. The output was garbage. I tried 8-bit — still garbage. After two days of debugging, I tried full bf16 and the output was perfect.

DoRA breaks under quantization. Its magnitude-direction weight decomposition compounds rounding errors from both components, corrupting the learned adaptation. Standard LoRA tolerates quantization fine; DoRA does not.

This is not documented anywhere in PEFT, the DoRA paper, or the HuggingFace ecosystem. I'm documenting it here so others don't waste two days.

| Inference Method | Quality | Explanation |
|---|---|---|
| bf16 + PeftModel | Professional, sharp | Full DoRA precision preserved |
| 8-bit quantized | Garbled | Magnitude-direction decomposition corrupted |
| 4-bit quantized | Garbled | Same issue, worse |
| merge_and_unload (any quant) | Garbled | PEFT itself warns of "rounding errors" |

Evaluation Results

Quantitative Metrics (n=20 per model, identical prompts & seeds)

| Model | Aesthetic Score | CLIP Score | Aesthetic Std |
|---|---|---|---|
| Base Qwen-Image-2512 (8-bit + True CFG) | 5.50 ± 1.56 | 0.1849 ± 0.061 | 1.56 |
| + LoRA Adapter (bf16, this model) | 5.56 ± 1.06 | 0.1188 ± 0.038 | 1.06 |

Key findings:

  • Aesthetic score improved (+1.1%) — the LoRA generates more visually appealing ad imagery
  • Consistency dramatically improved — standard deviation dropped from 1.56 to 1.06 (a 32% reduction). The adapter produces reliably good output rather than high-variance results
  • CLIP score trade-off — expected and intentional. The LoRA was trained to produce advertising-style imagery, which prioritizes aesthetic impact and mood over literal text-to-image correspondence. A perfume ad doesn't literally depict "dramatic Rembrandt lighting" — it creates Rembrandt lighting
  • Lower CLIP variance (0.038 vs 0.061) confirms more consistent style adherence across diverse advertising categories

A note on metrics for domain-specific models: CLIP score measures general text-image similarity using ViT-L-14. For specialized models like advertisement generators, aesthetic quality and output consistency are more meaningful production metrics than raw text alignment. The showcase gallery above demonstrates actual output quality across 8 industries and 3 seeds.


Dataset: Pexels Advertisement Photography

Open dataset: RayyanAhmed9477/pexels-advertisement-photography (12,917 images, 25,834 captions)

Built entirely from scratch for this project:

Pexels API (40 targeted advertising search queries)
    │
    ├── Automated Scraping Pipeline
    │     12,917 high-resolution images collected
    │     40 categories: product photography, luxury goods, food styling,
    │     automotive, fashion, technology, real estate, cosmetics, sports,
    │     jewelry, furniture, beverages, electronics, watches, perfume...
    │
    ├── Quality Filtering
    │     Resolution ≥ 512px (shortest side)
    │     Aesthetic scoring + perceptual deduplication (imagehash)
    │     Avg resolution: 842×886 px (median: 768×1024)
    │
    └── VLM Caption Enrichment (Qwen3-VL-8B, ~6.5 hours)
          Dual captions per image:
            caption_long:  150-250 word paragraph (layout, style, lighting, composition)
            caption_short: Concise tags ("product photography, smartphone, studio lighting")
          → 25,834 total captions generated
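The stochastic 70/30 long/short caption mixing amounts to one branch at dataloading time. A sketch (field names follow the dataset's caption_long/caption_short columns; the exact sampling code in the training script may differ):

```python
import random

def pick_caption(caption_long, caption_short, p_long=0.7, rng=random):
    """Return the long VLM caption ~70% of the time and the short tag
    caption otherwise, so the model learns from both detailed and terse
    prompt styles."""
    return caption_long if rng.random() < p_long else caption_short

# Over many draws the split converges to roughly 70/30
rng = random.Random(0)
draws = [pick_caption("long", "short", rng=rng) for _ in range(10_000)]
print(round(draws.count("long") / len(draws), 2))
```

Passing an explicit `random.Random` instance keeps the mixing reproducible across resumed training runs.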

Why I built my own dataset (I tried an existing one first):

| | ADImgNet (first attempt) | Pexels (this model) |
|---|---|---|
| Images | 3,201 (filtered from 9,003) | 12,917 |
| Max resolution | 300px | 1024px+ |
| Avg resolution | ~280px | 842×886px |
| Categories | Academic labels | 40 real commercial ad categories |
| Captions | Original only | Dual VLM-enriched (25,834) |
| Publicly available | No | Yes — open dataset |

Model Architecture

Qwen-Image-2512 (20.5B params, S3-DiT, Flow Matching)
    │
    ├── Text Encoder: Qwen2.5-VL
    │     Processes prompt → hidden states [batch, seq_len, 3584]
    │     Pre-cached during training → unloaded to save 6 GB VRAM
    │
    ├── Transformer: 60 S3-DiT blocks (10.6B params)
    │     Each block → 12 LoRA-adapted linear layers:
    │
    │     Image Stream Attention [rank=64, DoRA]:
    │       to_q, to_k, to_v, to_out.0
    │
    │     Text Stream Attention [rank=64, DoRA]:
    │       add_q_proj, add_k_proj, add_v_proj, to_add_out
    │
    │     Image MLP [rank=32, DoRA]:
    │       img_mlp.net.0.proj, img_mlp.net.2
    │
    │     Text MLP [rank=32, DoRA]:
    │       txt_mlp.net.0.proj, txt_mlp.net.2
    │
    │     Total: 720 adapted layers | 428M trainable params (4.02%)
    │
    └── VAE: AutoencoderKLQwenImage
          Decodes latents → images

LoRA Configuration

peft_type: LORA
r: 64                      # Primary rank (attention modules)
lora_alpha: 128             # Scaling factor (effective scale = alpha/sqrt(r) via rsLoRA)
lora_dropout: 0.05
use_dora: true              # Weight-Decomposed LoRA (requires bf16 inference!)
use_rslora: true            # Rank-Stabilized scaling
rank_pattern:               # Adaptive rank per module type
  img_mlp: 32
  txt_mlp: 32
alpha_pattern:
  img_mlp: 64
  txt_mlp: 64
target_modules:             # 12 modules × 60 blocks = 720 adapted layers
  - to_q, to_k, to_v, to_out.0
  - add_q_proj, add_k_proj, add_v_proj, to_add_out
  - img_mlp.net.0.proj, img_mlp.net.2
  - txt_mlp.net.0.proj, txt_mlp.net.2
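For PEFT users, the YAML above corresponds roughly to the following LoraConfig. This is a sketch: the field values mirror the config shown, but exact rank_pattern/alpha_pattern key matching depends on PEFT's module-name regex semantics, and DoRA + rsLoRA together requires a recent PEFT (the stack below lists 0.18.0).

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                  # primary rank (attention modules)
    lora_alpha=128,
    lora_dropout=0.05,
    use_dora=True,         # weight-decomposed LoRA (bf16-only at inference!)
    use_rslora=True,       # scale = alpha / sqrt(r) instead of alpha / r
    rank_pattern={"img_mlp": 32, "txt_mlp": 32},   # MLPs get lower rank
    alpha_pattern={"img_mlp": 64, "txt_mlp": 64},
    target_modules=[
        "to_q", "to_k", "to_v", "to_out.0",
        "add_q_proj", "add_k_proj", "add_v_proj", "to_add_out",
        "img_mlp.net.0.proj", "img_mlp.net.2",
        "txt_mlp.net.0.proj", "txt_mlp.net.2",
    ],
)
```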

Quick Start

Installation

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install diffusers transformers peft accelerate

IMPORTANT: bf16 Precision Required

DoRA + Quantization = Degraded Output. This model uses DoRA (Weight-Decomposed LoRA) which decomposes weight updates into magnitude and direction components. Quantization corrupts these components. You must run inference at bf16 precision.

Inference (CPU Offload — Any GPU, ~48 GB System RAM)

import torch
from diffusers import DiffusionPipeline
from peft import PeftModel

model_id = "Qwen/Qwen-Image-2512"
lora_id = "RayyanAhmed9477/qwen-image-2512-lora-advertisement"

# Load at bf16 (REQUIRED for DoRA)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.transformer = PeftModel.from_pretrained(pipe.transformer, lora_id)
pipe.enable_model_cpu_offload()  # Needs ~48 GB RAM, ~8 GB VRAM

# Generate β€” must use true_cfg_scale (guidance_scale is IGNORED for this model)
image = pipe(
    prompt="Professional product advertisement photograph of a premium smartphone, "
           "hero shot on clean gradient background, dramatic studio lighting, "
           "commercial photography, 8K quality",
    negative_prompt="",          # Required for true CFG to activate
    true_cfg_scale=4.0,
    num_inference_steps=30,
    height=768, width=768,
    generator=torch.Generator(device="cpu").manual_seed(42),
).images[0]

image.save("ad_output.png")

Inference (Full GPU — A100/H100 with 50+ GB VRAM)

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image-2512", torch_dtype=torch.bfloat16)
pipe.transformer = PeftModel.from_pretrained(
    pipe.transformer, "RayyanAhmed9477/qwen-image-2512-lora-advertisement"
)
pipe.to("cuda")  # Requires ~50 GB VRAM

image = pipe(
    prompt="High-end luxury perfume advertisement, crystal bottle on black surface, "
           "dramatic lighting, premium commercial photography",
    negative_prompt="",
    true_cfg_scale=4.0,
    num_inference_steps=30,
    height=768, width=768,
    generator=torch.Generator(device="cpu").manual_seed(42),
).images[0]

Prompt Engineering Guide

The model responds best to structured prompts:

[Photography type] of [product/subject], [setting/background],
[lighting description], [photography style], [quality keywords]
| Industry | Example Prompt |
|---|---|
| Technology | "Professional product advertisement of a flagship smartphone, clean gradient background, three-point studio lighting, commercial photography, ultra sharp" |
| Luxury | "High-end luxury fragrance advertisement, crystal perfume bottle on glossy black surface, dramatic Rembrandt lighting, premium commercial photography" |
| Food | "Premium artisan coffee advertisement, cappuccino with latte art in white ceramic cup, warm golden morning lighting, food photography" |
| Automotive | "Luxury automotive advertisement, sleek black SUV on mountain road at golden hour, cinematic lighting, professional automotive photography" |
| Fashion | "Premium luxury watch advertisement, timepiece on dark velvet, dramatic side lighting highlighting metallic details, editorial jewelry photography" |
| Real Estate | "Luxury penthouse interior advertisement, modern living room with floor-to-ceiling windows, warm ambient lighting, architectural photography" |
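If you are generating prompts programmatically, the template can be wrapped in a tiny helper (`build_ad_prompt` is a hypothetical convenience for illustration, not part of any shipped API):

```python
def build_ad_prompt(photo_type, subject, setting, lighting, style,
                    quality="8K quality, ultra sharp"):
    """Assemble a prompt following the structure above:
    [type] of [subject], [setting], [lighting], [style], [quality]."""
    return f"{photo_type} of {subject}, {setting}, {lighting}, {style}, {quality}"

prompt = build_ad_prompt(
    "Professional product advertisement photograph",
    "a premium smartphone",
    "clean gradient background",
    "three-point studio lighting",
    "commercial photography",
)
print(prompt)
```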

Training Configuration

Stage 1: Composition Learning (512px) — 63 Hours

Resolution:             512x512
Steps:                  10,000 (~3.4 epochs)
Speed:                  ~22.5 seconds/step
Duration:               ~63 hours
Batch Size:             1 (gradient accumulation = 4, effective batch = 4)
Optimizer:              AdamW 8-bit (lr=1e-4, cosine schedule, 300 warmup steps)
Loss:                   Min-SNR-gamma=5.0 weighted flow matching MSE
Noise Offset:           0.05 (improves contrast and dark areas)
Mixed Precision:        bf16 with gradient checkpointing
Final Loss:             0.96
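The Min-SNR-gamma=5.0 weighting above follows the Min-SNR formulation of Hang et al.; the sketch below shows the standard epsilon-prediction form, where each timestep's loss weight is capped at gamma (the training script's exact flow-matching variant may differ slightly):

```python
def min_snr_weight(snr, gamma=5.0):
    """Min-SNR loss weight: cap each timestep's influence at gamma.

    Near-clean timesteps have huge SNR; without the cap they would
    dominate gradient updates on nearly trivial denoising targets."""
    return min(snr, gamma) / snr

print(min_snr_weight(0.5))   # 1.0  (noisy timestep, full weight)
print(min_snr_weight(50.0))  # 0.1  (near-clean timestep, down-weighted)
```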

Stage 2: Detail Sharpening (768px) — 17 Hours

Resolution:             768x768 (resume from checkpoint-10000)
Steps:                  1,500 (total: 11,500)
Speed:                  ~36 seconds/step
Duration:               ~17 hours
Learning Rate:          3e-5 (lowered to preserve Stage 1 features)
Warmup:                 50 steps
Final Loss:             0.17

Why curriculum learning works: Low resolution teaches what makes a good advertisement (composition, color harmony, product framing). High resolution then teaches how to render it sharply (edges, textures, fine detail). The lower learning rate in Stage 2 prevents catastrophic forgetting of the compositional understanding learned in Stage 1.

Training Curves

Individual training plots

Training Loss

Steady descent across both stages. The jump at step 10,000 reflects the resolution change from 512px to 768px — the model quickly adapts.

Learning Rate Schedule

Stage 1: Linear warmup to 1e-4, cosine decay over 10,000 steps. Stage 2: Warm restart to 3e-5, cosine decay over 1,500 steps.

Gradient Norm

Gradient clipping at 1.0 maintains stability across both resolution stages.

VRAM Usage

Consistent ~22 GB throughout, well within the RTX 3090's 24 GB limit.


Hardware & Reproducibility

Training Hardware

| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 3090 — 24 GB VRAM, consumer-grade ($1,500) |
| RAM | 64 GB DDR4 |
| OS | Windows 11 Pro |
| Stage 1 Duration | ~63 hours (10,000 steps × 22.5 s/step) |
| Stage 2 Duration | ~17 hours (1,500 steps × 36 s/step) |
| Total Training | ~80 hours on a single consumer GPU |
| Dataset Preparation | ~8 hours (API scraping + filtering + VLM captioning) |
| Evaluation & Showcase | ~4 hours (40 eval images + 24 showcase images) |
| Total Project GPU Time | ~92 hours |

Inference Requirements

| Setup | VRAM | RAM | Speed (768px, 30 steps) | Quality |
|---|---|---|---|---|
| bf16 + CPU offload | ~8 GB | ~48 GB | ~120 s | Full quality |
| bf16 full GPU (A100) | ~50 GB | ~16 GB | ~20 s | Full quality |
| 4-bit/8-bit (any) | ~12 GB | ~20 GB | ~35 s | Degraded (DoRA issue) |

Reproduce Training

git clone https://github.com/Rayyan9477/MarketMind
cd MarketMind/image_model_finetuning
pip install -r requirements.txt

# Stage 1: ~63 hours on RTX 3090
python train_lora.py --config configs/training_config.yaml

# Stage 2: ~17 hours on RTX 3090
python train_lora.py --config configs/training_config_hires.yaml \
    --resume-from-checkpoint outputs/checkpoints/checkpoint-10000

Lessons Learned

Building this model taught me more about engineering under constraints than any course or tutorial:

  1. Dataset quality > quantity > architecture. Switching from 300px ADImgNet images to 842px Pexels images was the single highest-impact change. No amount of LoRA rank or training steps can overcome a low-resolution, poorly-curated dataset.

  2. Resolution curriculum is free performance. Training at 512px (2.25× cheaper per step) teaches composition first; fine-tuning at 768px adds sharpness. You get both for less than the cost of training at 768px alone.

  3. DoRA is powerful but has hidden deployment costs. The quality improvement is real. The bf16-only requirement for inference is painful. For production systems where quantized serving matters, standard LoRA may be more practical.

  4. VLM caption enrichment was the highest-ROI per-image investment. Running Qwen3-VL-8B over every training image gave the model far richer training signal. Stochastic 70/30 mixing of long/short captions improves both text alignment and unconditional generation quality.

  5. Pre-caching embeddings is non-negotiable on consumer GPUs. The 6 GB savings from unloading the text encoder was the difference between "impossible on RTX 3090" and "fits with headroom to spare."

  6. Build monitoring and graceful resume from day one. Over 80 hours on Windows, things go wrong. My SIGTERM-based graceful pause + checkpoint resume system saved the project multiple times.

  7. Windows memory management is its own engineering domain. CUDA allocations consume Windows commit charge (RAM + pagefile). I built commit-aware device mapping that queries available system commit via ctypes before placing model layers across GPU, CPU, and disk. This is rarely documented but critical for large model work on consumer hardware.
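For lesson 7, the commit query itself reduces to a single Win32 call via ctypes. A sketch of the binding (the layer-placement policy built on top of it is the project's own and not shown here):

```python
import ctypes
import sys

class MEMORYSTATUSEX(ctypes.Structure):
    """Mirror of the Win32 MEMORYSTATUSEX struct from winbase.h.

    c_uint32/c_uint64 are used instead of c_ulong so the layout matches
    Win32's fixed-width DWORD/DWORDLONG on every platform."""
    _fields_ = [
        ("dwLength", ctypes.c_uint32),
        ("dwMemoryLoad", ctypes.c_uint32),
        ("ullTotalPhys", ctypes.c_uint64),
        ("ullAvailPhys", ctypes.c_uint64),
        ("ullTotalPageFile", ctypes.c_uint64),
        ("ullAvailPageFile", ctypes.c_uint64),
        ("ullTotalVirtual", ctypes.c_uint64),
        ("ullAvailVirtual", ctypes.c_uint64),
        ("ullAvailExtendedVirtual", ctypes.c_uint64),
    ]

def available_commit_bytes():
    """Remaining Windows commit (RAM + pagefile headroom).

    ullAvailPageFile is the commit limit minus current commit charge,
    which is what CUDA host allocations actually consume on Windows."""
    if sys.platform != "win32":
        raise OSError("GlobalMemoryStatusEx is Windows-only")
    status = MEMORYSTATUSEX()
    status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
    ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status))
    return status.ullAvailPageFile
```

A placement loop can then call `available_commit_bytes()` before offloading each shard to CPU and fall back to disk when headroom runs low.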


File Structure

qwen-image-2512-lora-advertisement/
├── adapter_config.json              # LoRA config (DoRA, rsLoRA, 12 target modules)
├── adapter_model.safetensors        # LoRA weights (1.7 GB)
├── training_config.yaml             # Stage 1 config (512px, 10K steps)
├── training_config_hires.yaml       # Stage 2 config (768px, 1.5K steps)
├── evaluation_results_lora.json     # LoRA evaluation metrics (n=20)
├── evaluation_results_base.json     # Base model evaluation metrics (n=20)
├── README.md                        # This file
├── showcase/                        # Generated showcase gallery
│   ├── v2_seed_*_*.png              # LoRA outputs (8 categories × 3 seeds = 24 images)
│   ├── lora_*.png                   # Additional LoRA samples
│   ├── base_*.png                   # Base model outputs for comparison
│   ├── *_ad.png                     # Ad overlay composites
│   └── comparison_grid.png          # Side-by-side comparison
├── plots/                           # Training visualization
│   ├── training_overview.png        # Combined 2×2 overview
│   ├── training_loss.png
│   ├── learning_rate.png
│   ├── grad_norm.png
│   └── vram_usage.png
└── samples/                         # Additional evaluation samples

Troubleshooting

Images look garbled or blurry

You're running at quantized precision. This LoRA uses DoRA which requires bf16:

  • Ensure torch_dtype=torch.bfloat16 in from_pretrained()
  • Do NOT use BitsAndBytesConfig or load_in_4bit/load_in_8bit
  • Do NOT call merge_and_unload() — use PeftModel wrapping instead
Images don't match the prompt

Qwen-Image-2512 ignores guidance_scale. You must use true_cfg_scale:

pipe(prompt="...", negative_prompt="", true_cfg_scale=4.0)
Out of memory

Use CPU offload (needs ~48 GB system RAM but only ~8 GB VRAM):

pipe.enable_model_cpu_offload()  # Instead of pipe.to("cuda")
QwenImagePipeline not found

Update diffusers: pip install --upgrade "diffusers>=0.33.0" (the quotes stop the shell from treating >= as a redirect)


Citation

@misc{ahmed2026qwenimagelora,
  title   = {Qwen-Image-2512 LoRA for Professional Advertisement Generation:
             80 Hours of DoRA + rsLoRA Fine-Tuning on Consumer Hardware},
  author  = {Rayyan Ahmed},
  year    = {2026},
  url     = {https://huggingface.co/RayyanAhmed9477/qwen-image-2512-lora-advertisement},
  note    = {80 hours training on a single NVIDIA RTX 3090. 12,917 Pexels images,
             two-stage resolution curriculum (512px to 768px), 11,500 steps,
             DoRA + rsLoRA with Min-SNR-gamma flow matching loss.
             Part of the MarketMind AI Marketing Platform.}
}

Acknowledgments


Built entirely on a single NVIDIA RTX 3090 | 80 hours of training | 12,917 images | 25,834 VLM captions

Stack: PEFT 0.18.0 | Diffusers 0.37.0 | DoRA + rsLoRA | Flow Matching | Min-SNR-gamma=5.0

Created by Rayyan Ahmed as part of the MarketMind AI Marketing Platform
