# RefDecoder
Reference-conditioned video VAE decoding for high-fidelity video reconstruction and generation.
## Overview
RefDecoder is a training and inference framework that adds reference-frame conditioning to video autoencoders. By injecting a selected reference frame into the decoder, RefDecoder preserves appearance and identity cues across the video, improving reconstruction and image-to-video generation quality compared to the original VAE decoders.
This repository hosts the released RefDecoder checkpoints for two backbones:
| Checkpoint | Backbone | File | Description |
|---|---|---|---|
| RefDecoder-Wan | Wan2.1 I2V VAE | `VAE/Wan2.1/wan2.1_ref.pt` | RefDecoder trained on top of the Wan2.1 image-to-video VAE decoder. |
| RefDecoder-VideoVAEPlus | VideoVAE+ (2+1D) | `VAE/VideoVAEPlus/videovaeplus_ref.pt` | RefDecoder trained on top of the VideoVAE+ autoencoder. |
## Download
Using `huggingface_hub`:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="RefDecoder/RefDecoder",
    local_dir="ckpt/RefDecoder",
)
```
Or with the CLI:
```bash
huggingface-cli download RefDecoder/RefDecoder --local-dir ckpt/RefDecoder
```
Expected layout after download (matching the code repo's defaults):
```
ckpt/
└── RefDecoder/
    └── VAE/
        ├── Wan2.1/
        │   └── wan2.1_ref.pt
        └── VideoVAEPlus/
            └── videovaeplus_ref.pt
```
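A quick way to confirm the download landed in the expected place is a small path check. This helper is a sketch, not part of the repo; the relative paths come from the layout above:

```python
from pathlib import Path

# Expected checkpoint files, relative to the download root (from the layout above).
EXPECTED = [
    "VAE/Wan2.1/wan2.1_ref.pt",
    "VAE/VideoVAEPlus/videovaeplus_ref.pt",
]

def missing_checkpoints(root: str) -> list[str]:
    """Return any expected checkpoint paths that are absent under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).is_file()]
```

After a complete download, `missing_checkpoints("ckpt/RefDecoder")` should return an empty list.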
## Usage
Clone the code repository and follow its setup instructions:
```bash
git clone https://github.com/RefDecoder/RefDecoder.git
cd RefDecoder
pip install -U uv && uv sync && source .venv/bin/activate
```
Point the corresponding inference config to the downloaded checkpoint:
- In `configs/inference/eval_wan.yaml`, set `model.params.ckpt_path` to `ckpt/RefDecoder/VAE/Wan2.1/wan2.1_ref.pt`.
- In `configs/inference/eval_videovaeplus.yaml`, set `model.params.ckpt_path` to `ckpt/RefDecoder/VAE/VideoVAEPlus/videovaeplus_ref.pt`.
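For the Wan2.1 config, the edited fragment would look roughly like this (only the `model.params.ckpt_path` key is shown; the surrounding keys in the actual config file are omitted here):

```yaml
# configs/inference/eval_wan.yaml (fragment; other keys unchanged)
model:
  params:
    ckpt_path: ckpt/RefDecoder/VAE/Wan2.1/wan2.1_ref.pt
```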
Wan2.1 reconstruction example:
```bash
bash scripts/run_inference.sh eval_wan /path/to/input_videos outputs/wan 17 480 832 cuda:0
```
VideoVAE+ reconstruction example:
```bash
bash scripts/run_inference.sh eval_videovaeplus /path/to/input_videos outputs/videovaeplus 16 216 216 cuda:0
```
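Judging from the two examples above, the positional arguments to `scripts/run_inference.sh` appear to be: config name, input directory, output directory, frame count, height, width, and device. A tiny helper makes that ordering explicit; note that both the helper and the argument interpretation are assumptions inferred from the examples, not taken from the script itself:

```python
# Hypothetical helper: assembles the run_inference.sh command line.
# The positional-argument order (config, input dir, output dir, frames,
# height, width, device) is inferred from the README examples.
def inference_cmd(config: str, input_dir: str, output_dir: str,
                  frames: int, height: int, width: int,
                  device: str = "cuda:0") -> str:
    return (
        f"bash scripts/run_inference.sh {config} {input_dir} "
        f"{output_dir} {frames} {height} {width} {device}"
    )

# Reproduces the Wan2.1 example above:
# inference_cmd("eval_wan", "/path/to/input_videos", "outputs/wan", 17, 480, 832)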
### Base-model requirements
- RefDecoder-Wan initializes its base VAE from `Wan-AI/Wan2.1-I2V-14B-480P-Diffusers` (subfolder `vae`). Make sure that model is accessible or already cached locally.
- RefDecoder-VideoVAEPlus requires the VideoVAE+ base checkpoint `sota-4-16z.ckpt` at `ckpt/VideoVAEPlus/sota-4-16z.ckpt`, or update the path in `src/models/VideoVAEPlus/videovaeplus_ref0conv.py`.
See the GitHub README for training, multi-GPU inference, and VBench image-to-video decoding workflows.
## Citation
If you find RefDecoder useful, please cite:
```bibtex
@misc{fan2026refdecoderenhancingvisualgeneration,
      title={RefDecoder: Enhancing Visual Generation with Conditional Video Decoding},
      author={Xiang Fan and Yuheng Wang and Bohan Fang and Zhongzheng Ren and Ranjay Krishna},
      year={2026},
      eprint={2605.15196},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.15196},
}
```
## License
Released under the Apache 2.0 License. See the LICENSE file in the code repository.