arxiv:2605.25191

Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference

Published on May 24

Submitted by Iason Skylitsis on May 26

University of Amsterdam


AI-generated summary

Visual Concept Fusion enables dual text and image conditioning in diffusion models through feature alignment and fusion strategies without requiring retraining.

Abstract

Text-to-image diffusion models like Stable Diffusion generate high-quality images from text, but lack a way to inject visual guidance (e.g. sketches, styles) at inference without retraining. Existing methods either require computationally expensive fine-tuning or rely on style transfer techniques that risk semantic misalignment with textual prompts. We introduce Visual Concept Fusion (VCF), the first method offering dual conditioning on both an image and text prompt at inference time without any concept-specific training. VCF enables visual concept injection into Stable Diffusion by aligning CLIP image features with the text embedding space. VCF consists of three components: (1) a lightweight aligner that maps image tokens to the text embedding manifold using InfoNCE and cross-attention reconstruction losses, (2) a fusion strategy that preserves both textual and visual semantics, and (3) an optional Prompt-Noise Optimization (PNO) module for test-time refinement. Our experiments demonstrate that VCF successfully transfers visual attributes including style, composition, and color palette from reference images while maintaining prompt adherence. Quantitative results show a trade-off between text alignment (CLIP score) and visual correspondence (LPIPS), with VCF outperforming baselines in reference fidelity.
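The abstract names InfoNCE as one of the aligner's training losses but gives no formulation. As a rough illustration only, the sketch below computes a standard one-directional InfoNCE loss over a batch of paired image and text feature vectors, where the pair at the same index is the positive and all other texts in the batch are negatives. The function names, the temperature value of 0.07, and the use of plain Python lists are assumptions made for this example; they are not taken from the paper, and the authors' actual loss may differ (for instance, it may be symmetric or operate on token sequences).

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(image_feats, text_feats, temperature=0.07):
    """Mean image-to-text InfoNCE loss.

    image_feats[i] and text_feats[i] form the positive pair; every other
    text in the batch acts as a negative. The temperature is an assumed
    default, not a value reported in the paper.
    """
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    total = 0.0
    for i, img in enumerate(imgs):
        # Temperature-scaled cosine similarity of image i to every text.
        logits = [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index as the target class.
        total += log_denom - logits[i]
    return total / len(imgs)


# Toy batch: aligned pairs should give a much lower loss than mismatched pairs.
feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(feats, feats)
mismatched_loss = info_nce(feats, feats[::-1])
print(aligned_loss < mismatched_loss)
```

Minimizing a loss of this shape pulls each mapped image feature toward its paired text embedding and away from the other texts in the batch, which is the general sense in which a contrastive objective can align image features with the text embedding space.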


Community

iasonsky

Paper submitter about 7 hours ago

avahal

43 minutes ago

the image aligner, trained with InfoNCE and cross-attention reconstruction losses to map image tokens into the text-embedding space, is the most interesting bit here. that design is what lets the inference-time fusion work without touching the diffusion model, avoiding retraining entirely. what happens if the reference image has occlusions or unusual perspective shifts? would that hurt prompt fidelity more than it harms visual alignment? btw, the arxivlens breakdown helped me parse the method details, and it might be worth adding a small ablation on fusion variants in future work: https://arxivlens.com/PaperView/Details/injecting-image-guidance-into-text-conditioned-diffusion-models-at-inference-975-b7274edd
