---
license: apache-2.0
library_name: pytorch
pipeline_tag: image-to-video
tags:
- video-generation
- image-to-video
- vae
- video-vae
- video-reconstruction
- refdecoder
- wan2.1
- videovaeplus
---

# RefDecoder

Reference-conditioned video VAE decoding for high-fidelity video reconstruction and generation.

[![arXiv](https://img.shields.io/badge/arXiv-Paper-b31b1b.svg)](https://arxiv.org/abs/2605.15196)
[![Project Page](https://img.shields.io/badge/Project-Page-blue)](https://refdecoder.github.io/)
[![GitHub](https://img.shields.io/badge/GitHub-Code-black?logo=github)](https://github.com/RefDecoder/RefDecoder)

## Overview

RefDecoder is a training and inference framework that adds reference-frame conditioning to video autoencoders. By injecting a selected reference frame into the decoder, RefDecoder preserves appearance and identity cues across the video, improving reconstruction and image-to-video generation quality compared to the original VAE decoders.

This repository hosts the released RefDecoder checkpoints for two backbones:

| Checkpoint | Backbone | File | Description |
| --- | --- | --- | --- |
| **RefDecoder-Wan** | Wan2.1 I2V VAE | `VAE/Wan2.1/wan2.1_ref.pt` | RefDecoder trained on top of the Wan2.1 image-to-video VAE decoder. |
| **RefDecoder-VideoVAEPlus** | VideoVAE+ (2+1D) | `VAE/VideoVAEPlus/videovaeplus_ref.pt` | RefDecoder trained on top of the VideoVAE+ autoencoder. |

## Download

Using `huggingface_hub`:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="RefDecoder/RefDecoder",
    local_dir="ckpt/RefDecoder",
)
```

Or with the CLI:

```bash
huggingface-cli download RefDecoder/RefDecoder --local-dir ckpt/RefDecoder
```

Expected layout after download (matching the code repo's defaults):

```text
ckpt/
└── RefDecoder/
    └── VAE/
        ├── Wan2.1/
        │   └── wan2.1_ref.pt
        └── VideoVAEPlus/
            └── videovaeplus_ref.pt
```
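As a quick sanity check after downloading, a minimal stdlib sketch (assuming the `ckpt/RefDecoder` layout above) that reports any missing checkpoint files:

```python
from pathlib import Path


def missing_checkpoints(root: str) -> list[Path]:
    """Return the expected RefDecoder checkpoint paths absent under root."""
    expected = [
        Path(root) / "VAE" / "Wan2.1" / "wan2.1_ref.pt",
        Path(root) / "VAE" / "VideoVAEPlus" / "videovaeplus_ref.pt",
    ]
    return [p for p in expected if not p.is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints("ckpt/RefDecoder")
    if missing:
        print("Missing files:", *missing, sep="\n  ")
    else:
        print("All RefDecoder checkpoints found.")
```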

## Usage

Clone the code repository and follow its setup instructions:

```bash
git clone https://github.com/RefDecoder/RefDecoder.git
cd RefDecoder
pip install -U uv && uv sync && source .venv/bin/activate
```

Point the corresponding inference config to the downloaded checkpoint:

- `configs/inference/eval_wan.yaml` β€” set `model.params.ckpt_path` to `ckpt/RefDecoder/VAE/Wan2.1/wan2.1_ref.pt`
- `configs/inference/eval_videovaeplus.yaml` β€” set `model.params.ckpt_path` to `ckpt/RefDecoder/VAE/VideoVAEPlus/videovaeplus_ref.pt`
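For the Wan config, the edit amounts to a change like the following (a sketch: the nesting is assumed from the `model.params.ckpt_path` dotted path, and any sibling keys in the actual config file should be left untouched):

```yaml
model:
  params:
    ckpt_path: ckpt/RefDecoder/VAE/Wan2.1/wan2.1_ref.pt
```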

Wan2.1 reconstruction example:

```bash
bash scripts/run_inference.sh eval_wan /path/to/input_videos outputs/wan 17 480 832 cuda:0
```

VideoVAE+ reconstruction example:

```bash
bash scripts/run_inference.sh eval_videovaeplus /path/to/input_videos outputs/videovaeplus 16 216 216 cuda:0
```

### Base-model requirements

- **RefDecoder-Wan** initializes its base VAE from `Wan-AI/Wan2.1-I2V-14B-480P-Diffusers` (subfolder `vae`). Make sure that model is accessible or already cached locally.
- **RefDecoder-VideoVAEPlus** requires the VideoVAE+ base checkpoint `sota-4-16z.ckpt` at `ckpt/VideoVAEPlus/sota-4-16z.ckpt`, or update the path in `src/models/VideoVAEPlus/videovaeplus_ref0conv.py`.

See the [GitHub README](https://github.com/RefDecoder/RefDecoder) for training, multi-GPU inference, and VBench image-to-video decoding workflows.

## Citation

If you find RefDecoder useful, please cite:

```bibtex
@misc{fan2026refdecoderenhancingvisualgeneration,
  title={RefDecoder: Enhancing Visual Generation with Conditional Video Decoding},
  author={Xiang Fan and Yuheng Wang and Bohan Fang and Zhongzheng Ren and Ranjay Krishna},
  year={2026},
  eprint={2605.15196},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.15196},
}
```

## License

Released under the Apache 2.0 License. See the [LICENSE](https://github.com/RefDecoder/RefDecoder/blob/main/LICENSE) file in the code repository.