arXiv:2503.16397

Scale-wise Distillation of Diffusion Models

Published on Mar 20, 2025

Submitted by nikita on Mar 21, 2025


Authors:

Nikita Starodubcev, Denis Kuznedelev, Dmitry Baranchuk

Abstract

AI-generated summary: A scale-wise distillation framework for diffusion models reduces computational cost and inference time by incorporating next-scale prediction into distribution matching distillation.

We present SwD, a scale-wise distillation framework for diffusion models (DMs) that applies next-scale prediction ideas to diffusion-based few-step generators. SwD is inspired by recent insights relating diffusion processes to implicit spectral autoregression. We hypothesize that DMs can start generation at a lower data resolution and gradually upscale the samples at each denoising step without loss in performance, while significantly reducing computational cost. SwD naturally integrates this idea into existing diffusion distillation methods based on distribution matching. We also enrich the family of distribution matching approaches with a novel patch loss that enforces finer-grained similarity to the target distribution. When applied to state-of-the-art text-to-image diffusion models, SwD approaches the inference time of two full-resolution steps and significantly outperforms its counterparts under the same computation budget, as evidenced by automated metrics and human preference studies.
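The sampling idea described in the abstract can be sketched as a short loop. This is a minimal illustration and not the authors' implementation: `toy_generator` is a hypothetical placeholder for the distilled model, the scale and timestep schedules are made up, nearest-neighbour upscaling is an arbitrary choice, and re-noising the upscaled clean prediction with fresh noise via a flow-matching style interpolation is an assumption based on the authors' remark in the comments about stochastic multistep sampling.

```python
import random

def upscale_nearest(img, size):
    """Nearest-neighbour resize of a square 2D list to size x size."""
    n = len(img)
    return [[img[i * n // size][j * n // size] for j in range(size)]
            for i in range(size)]

def toy_generator(x_t, t):
    """Hypothetical stand-in for the distilled few-step model.

    A real generator would predict a clean sample from x_t at noise level t;
    this placeholder only rescales the input so that the loop is runnable.
    """
    return [[(1.0 - t) * v for v in row] for row in x_t]

def scale_wise_sample(scales, timesteps, rng):
    """Run one sampling pass, growing the resolution at every step."""
    assert len(scales) == len(timesteps)
    size = scales[0]
    # Start from pure Gaussian noise at the lowest resolution.
    x_t = [[rng.gauss(0.0, 1.0) for _ in range(size)] for _ in range(size)]
    for k, t in enumerate(timesteps):
        x0 = toy_generator(x_t, t)        # clean prediction at the current scale
        if k == len(scales) - 1:
            return x0                     # last step: output at the final scale
        next_size, t_next = scales[k + 1], timesteps[k + 1]
        x0_up = upscale_nearest(x0, next_size)   # upscale the clean prediction
        # Re-noise with fresh noise drawn at the new resolution (assumed
        # interpolation: x_t = (1 - t) * x0 + t * eps).
        x_t = [[(1.0 - t_next) * v + t_next * rng.gauss(0.0, 1.0) for v in row]
               for row in x0_up]

# Made-up schedules: four steps, side length growing from 4 to 16.
sample = scale_wise_sample([4, 6, 8, 16], [1.0, 0.75, 0.5, 0.25], random.Random(0))
print(len(sample), len(sample[0]))  # 16 16
```

Only the final step runs at the target resolution; every earlier step processes a smaller grid, which is where the compute savings would come from.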


Community

quickjkee

Paper author Paper submitter Mar 21, 2025

Scale-wise Distillation (SwD) is a novel framework for accelerating diffusion models (DMs) by progressively increasing spatial resolution during the generation process.
SwD achieves significant speedups (2.5× to 10×) compared to full-resolution models while maintaining or even improving image quality.
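A rough back-of-the-envelope calculation shows how a growing resolution schedule translates into compute savings. The schedule below is hypothetical and is not claimed to be the paper's configuration, and the cost model (per-step cost proportional to a power of the token count, ignoring fixed overheads such as the text encoder or the decoder) is an assumption.

```python
def equivalent_full_steps(scales, full_size, exponent=1.0):
    """Total cost of a schedule, in units of one full-resolution step.

    The cost of a step is modelled as (tokens / full_tokens) ** exponent,
    where the token count grows with the square of the side length.
    exponent=1.0 assumes cost linear in the number of tokens;
    exponent=2.0 would model a cost dominated by self-attention.
    """
    full_tokens = full_size ** 2
    return sum((s ** 2 / full_tokens) ** exponent for s in scales)

scales = [32, 48, 64, 80, 96, 128]   # hypothetical latent side lengths
cost = equivalent_full_steps(scales, full_size=128)
print(round(cost, 3))                # 2.406
```

Under these assumptions, six scale-wise steps cost about as much as 2.4 full-resolution steps, which is in the same range as the "two full-resolution steps" figure quoted in the abstract.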

jbaron34

Mar 21, 2025

Tried it out and noticed that it struggles with aliasing artifacts due to the upscaling. Have you tried any alternative interpolation methods on the upscale step?

dbaranchuk

Paper author Mar 22, 2025 (edited Mar 22, 2025)

Thank you so much for pointing this out! Indeed, there was a bug in the inference code related to the upscaling method. After fixing it, the aliasing artifacts became negligible (take a look at the images). We truly appreciate you highlighting this issue. Feel free to try out the demo and share your feedback :)


CodeExplode

24 days ago (edited 24 days ago)

This seems like it would be the obvious way to go for diffusion in general. Did the project reveal any reason why this approach is not used more widely? It would also seem to remove the need for tricks like timestep shifts.

dbaranchuk

Paper author 24 days ago

There are several recent works that do progressive sampling with general DMs, e.g., "Pyramidal Flow Matching for Efficient Video Generative Modeling", "Improving Progressive Generation with Decomposable Flow Matching", and "FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute".

The key problem with general DMs is how to handle the "jump" points where the resolution changes, since upscaling intermediate noisy samples corrupts their noise statistics. SwD avoids this by using stochastic multistep sampling, which is specific to few-step models. As a side note, timestep shifting is not critical here and is not specific to few-step models.
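The effect on noise statistics can be seen in a toy experiment. This sketch assumes nearest-neighbour 2x upscaling (other interpolation modes distort the noise differently, for example by reducing its variance) and measures the correlation between horizontally adjacent pixels; all function names are illustrative.

```python
import random

def upscale2x_nearest(img):
    """Duplicate every pixel into a 2x2 block."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def lag1_corr(img):
    """Pearson correlation between horizontally adjacent pixels."""
    pairs = [(row[j], row[j + 1]) for row in img for j in range(len(row) - 1)]
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs) / n
    var_b = sum((b - mean_b) ** 2 for _, b in pairs) / n
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(0)
noise = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(64)]
upscaled_noise = upscale2x_nearest(noise)   # noise stretched to 128x128
fresh_noise = [[rng.gauss(0.0, 1.0) for _ in range(128)] for _ in range(128)]

print(round(lag1_corr(noise), 2))           # near 0: white noise
print(round(lag1_corr(upscaled_noise), 2))  # near 0.5: neighbours are correlated
print(round(lag1_corr(fresh_noise), 2))     # near 0: white again
```

Upscaled noise is no longer white, so a model trained on i.i.d. Gaussian noise sees inputs outside its training distribution. Upscaling a clean prediction and then drawing fresh noise at the new resolution, as a stochastic multistep sampler can do, sidesteps the issue.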

Among general DMs, I think FlexiDiT offers the most natural solution: it simply changes the DiT patch size over the course of sampling.
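To see why changing the patch size has a similar effect to changing the resolution, recall that a DiT splits the latent into non-overlapping patches and processes one token per patch. The helper below is a generic illustration with made-up sizes, not FlexiDiT's API.

```python
def num_tokens(latent_h, latent_w, patch_size):
    """Number of tokens a DiT processes for a given latent size and patch size."""
    if latent_h % patch_size or latent_w % patch_size:
        raise ValueError("latent size must be divisible by the patch size")
    return (latent_h // patch_size) * (latent_w // patch_size)

# Hypothetical 128x128 latent: a coarse patch size early in sampling,
# a fine one for the final steps.
coarse = num_tokens(128, 128, patch_size=4)
fine = num_tokens(128, 128, patch_size=2)
print(coarse, fine)  # 1024 4096
```

Doubling the patch size cuts the token count by a factor of four, the same reduction as halving the side length of the latent, while the sample itself stays at full resolution and no upscaling of noisy inputs is needed.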