arxiv:2605.21195

RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

Published on May 20

Authors:

Abstract

Discrete autoregressive text-to-image models suffer from latent covariate shift during policy optimization, which RankE addresses through end-to-end co-evolution of policy and decoder components.

AI-generated summary

Discrete autoregressive (AR) text-to-image (T2I) models pair a VQ tokenizer with an AR policy, and current post-training pipelines optimize only the policy while keeping the VQ decoder frozen. Recent diffusion T2I work, exemplified by REPA-E, has shown that the VAE itself constitutes a key alignment bottleneck, yet no analogous investigation exists for discrete AR models. We show that policy-only optimization induces Latent Covariate Shift: as the policy evolves, the resulting token distribution diverges from the ground-truth distribution on which the decoder was trained, such that reward scores improve while decoded image quality degrades. To address this mismatch, we propose RankE, the first end-to-end post-training framework for discrete T2I generation. Rather than optimizing the policy against a fixed decoder, RankE co-evolves both components through alternating optimization: each module maximizes a ranking-based alignment objective while being regularized by a stability-preserving anchor suited to its parameter space. This co-evolution breaks the fidelity–alignment trade-off that plagues frozen-decoder approaches: on LlamaGen-XL (775M), standard RL improves CLIP but degrades FID, whereas RankE improves both simultaneously (FID 15.21, CLIP 33.76 on MS-COCO 30K). Consistent gains on Janus-Pro (1B) confirm that decoder co-evolution reliably converts reward optimization into pixel-space quality improvements.


Community

syjian

Paper author about 16 hours ago

Project Page: https://syjmelody.github.io/RankE/
GitHub: https://github.com/syjmelody/RankE

⚡ TL;DR

RankE is the first end-to-end post-training framework for discrete text-to-image generation that jointly optimizes the Generator and the Decoder. Instead of improving reward scores at the cost of image quality, RankE improves both alignment and fidelity at the same time.

🤔 Background: What is the problem?

Most discrete text-to-image models still follow a two-stage pipeline:

train a VQ-VAE / tokenizer to map images into discrete visual tokens;
train an autoregressive Generator to model those tokens.

This pipeline works well for pretraining, but post-training is usually incomplete: existing methods optimize the Generator only and keep the Decoder frozen.

That creates a mismatch. As the Generator is optimized to chase higher rewards, its output token distribution gradually drifts away from the real token distribution that the Decoder saw during tokenizer training. The result is a frustrating trade-off:

reward scores go up,
but decoded image quality can get worse.

The paper identifies this issue as Latent Covariate Shift.
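One rough way to see this drift is to compare how often each codebook entry is used by real-image tokens versus policy samples. The sketch below is only an illustration under stated assumptions: the toy token lists, the 4-entry codebook, and the helper names `token_histogram` and `kl_divergence` are all hypothetical, and a unigram histogram is a crude proxy that ignores token order. It is not the diagnostic used in the paper.

```python
import math
from collections import Counter

def token_histogram(sequences, codebook_size, eps=1e-6):
    """Smoothed unigram distribution over codebook indices."""
    counts = Counter(tok for seq in sequences for tok in seq)
    total = sum(counts.values()) + eps * codebook_size
    return [(counts.get(k, 0) + eps) / total for k in range(codebook_size)]

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy data (assumption): real VQ codebooks have thousands of entries.
real_tokens = [[0, 1, 2, 3], [0, 1, 2, 3]]    # stands in for tokenized real images
policy_before = [[0, 1, 2, 3], [3, 2, 1, 0]]  # same codebook usage as the reference
policy_after = [[0, 0, 0, 1], [0, 0, 1, 0]]   # usage collapsed onto a few tokens

ref = token_histogram(real_tokens, codebook_size=4)
shift_before = kl_divergence(token_histogram(policy_before, 4), ref)
shift_after = kl_divergence(token_histogram(policy_after, 4), ref)
print(f"shift before: {shift_before:.4f}, after: {shift_after:.4f}")
```

A growing value of this kind of statistic during post-training would suggest the frozen Decoder is being fed inputs unlike those it was trained on.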

🔍 Why existing solutions are not enough

Recent work such as REPA-E has shown that in continuous diffusion models, the autoencoder is not just a supporting module — it can be a real bottleneck for alignment and visual quality.

But discrete T2I is harder.
Because token sampling and vector quantization are discrete operations, gradients cannot flow cleanly through the entire generation process. That is why most existing RL or preference-optimization methods for discrete generation still update only the Generator while leaving the Decoder unchanged.
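The gradient problem can be seen in a toy nearest-neighbor quantizer. The following is a minimal sketch with a hypothetical 1-D codebook: the quantized output is piecewise constant in its input, so a finite-difference derivative is zero inside a cell and blows up at a cell boundary. Neither value is a useful training signal.

```python
def quantize(z, codebook):
    """Return the index of the codebook entry nearest to scalar z."""
    return min(range(len(codebook)), key=lambda k: abs(z - codebook[k]))

def finite_difference(f, z, h=1e-4):
    """Central-difference estimate of df/dz."""
    return (f(z + h) - f(z - h)) / (2 * h)

codebook = [-1.0, 0.0, 1.0]  # toy 1-D codebook (hypothetical values)

def decode(z):
    """Quantize z, then look up the selected codebook value."""
    return codebook[quantize(z, codebook)]

grad_inside = finite_difference(decode, 0.3)    # inside the cell of entry 0.0
grad_boundary = finite_difference(decode, 0.5)  # on the boundary between 0.0 and 1.0
print(grad_inside, grad_boundary)
```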

So the field already knows the decoder matters — but a practical end-to-end solution for discrete generation has been missing.

🚀 What RankE does

RankE addresses this directly by making the Generator and Decoder co-evolve.

Its core insight is simple:
if Generator optimization already behaves like a ranking process over latent token sequences, why not extend the same ranking principle to pixel-space Decoder optimization?
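To make "ranking" concrete, here is one generic form a ranking objective can take: a pairwise logistic (Bradley-Terry style) loss that is small when the preferred sample scores higher than the other one. This is an assumption chosen for illustration, and the function name is hypothetical. The comment above does not state RankE's exact loss, so treat this only as a picture of the general idea.

```python
import math

def pairwise_ranking_loss(score_preferred, score_other):
    """Logistic ranking loss: log(1 + exp(-(s_preferred - s_other)))."""
    margin = score_preferred - score_other
    return math.log1p(math.exp(-margin))

correct_order = pairwise_ranking_loss(2.0, 0.0)  # preferred sample already scores higher
wrong_order = pairwise_ranking_loss(0.0, 2.0)    # preferred sample scores lower
print(f"{correct_order:.4f} {wrong_order:.4f}")
```

The loss depends only on the score difference, so the same form can be applied to scores of token sequences or to scores of decoded images.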

RankE therefore uses alternating optimization:

Generator step: optimize the policy so that higher-reward latent token sequences receive stronger updates;
Decoder step: optimize the Decoder so it can better adapt to the Generator’s evolving token distribution, while also favoring higher-reward decoded images.
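The two alternating steps can be sketched on a toy problem. Everything below is an assumption made for illustration: the "Generator" is a categorical distribution over three tokens, the "Decoder" maps each token to a single scalar "pixel", the reward is the negative squared distance to a target, and the Decoder update is a plain reward-gradient step with a quadratic anchor that stands in for the ranking-based Decoder objective. It shows the control flow only, not the actual RankE losses.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def rank_weights(rewards):
    """Centered rank weights in [-1, 1]; needs at least two samples."""
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    weights = [0.0] * n
    for rank, i in enumerate(order):
        weights[i] = 2.0 * rank / (n - 1) - 1.0
    return weights

def reward(pixel, target=1.0):
    return -(pixel - target) ** 2

def train(steps=200, group=8, lr_g=0.5, lr_d=0.05, anchor=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]         # toy Generator parameters
    decoder_init = [-1.0, 0.0, 0.5]  # toy Decoder: token index -> "pixel"
    decoder = list(decoder_init)
    for _ in range(steps):
        # Generator step: rank-weighted policy-gradient update on the logits.
        probs = softmax(logits)
        tokens = rng.choices(range(3), weights=probs, k=group)
        weights = rank_weights([reward(decoder[t]) for t in tokens])
        for t, w in zip(tokens, weights):
            for k in range(3):
                grad_logp = (1.0 if k == t else 0.0) - probs[k]
                logits[k] += lr_g * w * grad_logp / group
        # Decoder step: adapt to the current policy's tokens, with an anchor
        # penalty pulling each entry back toward its initial value.
        probs = softmax(logits)
        tokens = rng.choices(range(3), weights=probs, k=group)
        for t in tokens:
            grad_reward = -2.0 * (decoder[t] - 1.0)
            grad_anchor = -2.0 * anchor * (decoder[t] - decoder_init[t])
            decoder[t] += lr_d * (grad_reward + grad_anchor) / group
    return softmax(logits), decoder

probs, decoder = train()
print([round(p, 3) for p in probs], [round(d, 3) for d in decoder])
```

In this toy, the Generator concentrates on the best-rewarded token while the Decoder entry for that token moves toward the target but stays tethered to its starting value by the anchor term.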

In other words, RankE does not just make the model “better at scoring well.” It aligns optimization across both latent space and pixel space.

This is the key difference from standard frozen-decoder RL.

🧠 Why this matters

In standard RL post-training for discrete T2I, the Generator keeps changing, but the Decoder stays fixed. Over time, the Decoder is forced to decode token patterns it was never really trained to handle.

RankE removes this bottleneck by continuously adapting the Decoder during post-training. This turns reward optimization into actual visual improvement, rather than reward hacking in latent space.

📈 Results

The gains are clear.

On LlamaGen-XL (775M) under CLIP-based optimization, evaluated on MS-COCO 30K:

Standard RL: improves CLIP, but hurts FID
RankE: improves both

Specifically:

CLIP: 32.45 → 33.76
FID: 17.76 → 15.21
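For scale, the two changes can be converted to relative terms. This small calculation reads the first number in each pair as the starting point and the second as the RankE result, which is an interpretation of the arrows above. Higher CLIP is better and lower FID is better.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0

clip_gain = relative_change(32.45, 33.76)
fid_change = relative_change(17.76, 15.21)
print(f"CLIP: {clip_gain:+.2f}%  FID: {fid_change:+.2f}%")
# prints "CLIP: +4.04%  FID: -14.36%"
```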

That is the main message of the paper:
RankE breaks the common fidelity–alignment trade-off in discrete text-to-image post-training.

The improvements are also consistent across:

different backbones (including Janus-Pro, 1B),
different reward functions,
and multiple evaluation settings.

✨ One-line takeaway

RankE is a more natural way to post-train discrete text-to-image models: instead of optimizing only the Generator, it lets the Generator and Decoder improve together.
