Papers
arxiv:2605.26108

Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

Published on May 25
· Submitted by
Yushi Huang
on May 26
Authors:
,
,
,
,

Abstract

RTDMD is a two-stage framework that combines distribution matching distillation with reward-guided reinforcement learning to improve few-step image generation alignment with human preferences.

AI-generated summary

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.

Community

Paper author Paper submitter

We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a
two-stage framework that unifies distribution-matching distillation with
reward-guided RL for few-step flow generators. Minimizing the KL divergence to
a reward-tilted teacher distribution decomposes naturally into a
distribution-matching term and a reward-maximization term — instantiated
as Ambient-Consistent DMD (AC-DMD) for the cold start and a hybrid policy
gradient
(SubGRPO + final-step reward back-propagation) for the RL stage.
With 4 NFE RTDMD reaches new SOTA on SD3-M / SD3.5-M / FLUX.2 4B; the
distilled FLUX.2 4B even beats the full FLUX.2 9B teacher (50 NFE) on most
rewards.

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2605.26108
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 2

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.26108 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.26108 in a Space README.md to link it from this page.

Collections including this paper 1