---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- text-to-video
- video-generation
- diffusion
- autoregressive
- consistency-model
- grpo
- wan2.1
- raven
base_model:
- Wan-AI/Wan2.1-T2V-1.3B
pipeline_tag: text-to-video
---
# RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO
[Yanzuo Lu](https://yanzuo.lu/) · [Ronglai Zuo](https://2000zrl.github.io/) · [Jiankang Deng](https://jiankangdeng.github.io/) — Imperial College London
Project page: https://yanzuo.lu/raven
## Overview
RAVEN is a causal autoregressive text-to-video generation model built on Wan2.1-T2V-1.3B. It is designed for real-time streaming video generation by extrapolating future video chunks from previously generated content.
The release contains the RAVEN checkpoint plus three interchangeable CM-GRPO variants:
| File | Description |
| --- | --- |
| `raven_model.pt` | Full RAVEN backbone for causal autoregressive text-to-video generation. |
| `cmgrpo_raven_lora.safetensors` | CM-GRPO LoRA adapter only. Load `raven_model.pt` as the base weight and this file through the LoRA path. |
| `cmgrpo_raven_full.pt` | RAVEN base and CM-GRPO LoRA adapter packed into one PEFT-wrapped state dict. Load this file through the LoRA path without a separate base weight. |
| `cmgrpo_raven_merge.pt` | Full CM-GRPO backbone with the adapter already merged into RAVEN. Load this file as the base weight, with no LoRA block. |
RAVEN trains a causal video generator with a training-time test framework that repacks each self-rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This aligns the model's training attention pattern with inference-time autoregressive extrapolation and allows downstream chunk losses to supervise the historical representations used for future predictions.
We also release CM-GRPO weights. CM-GRPO formulates a consistency-model sampling step as a conditional Gaussian transition and applies online Group Relative Policy Optimization directly to this kernel.
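For intuition only (this is not the released training code, and all names are illustrative), the two ingredients can be sketched in plain Python: a group-relative advantage that normalizes rewards across rollouts of the same prompt, and the log-density of a sampling step viewed as a conditional Gaussian transition:

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize rewards within a group of rollouts
    that share the same prompt (zero mean, unit variance)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8
    return [(r - mean) / std for r in rewards]

def gaussian_log_prob(x, mu, sigma):
    """Log density of a scalar sample under N(mu, sigma^2); treating a
    consistency-model sampling step as a conditional Gaussian transition
    yields one such term per step."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# Hypothetical rewards for 4 rollouts of the same prompt
advs = group_relative_advantages([0.2, 0.5, 0.1, 0.6])

# A policy-gradient surrogate weights each transition's log-prob by its advantage
loss = -sum(a * gaussian_log_prob(0.1, 0.0, 1.0) for a in advs)
```

The group-relative normalization removes the need for a learned value baseline, which is what makes GRPO attractive for reward-driven fine-tuning of samplers.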
## Model details
- Base architecture: Wan2.1-T2V-1.3B DiT
- Task: text-to-video generation
- Generation mode: causal autoregressive video extrapolation
- Resolution used in released configs: 480 x 832
- Frames: 81
- FPS: 16
- Sampling steps: 4
- Sampler: consistency sampler
- Schedule: linear interpolation schedule, `v_lerp` prediction type
- Classifier-free guidance: not used; the `guidance_scale=3.0` value in the configs is a placeholder for interface compatibility
- Causal chunking: `chunk_size=3`, `independent_first_chunk=3`, `sink=0`, `window_size=null`
- VAE stride: `[4, 8, 8]`
- Latent channels: 16
- DiT config: dim 1536, 30 layers, 12 heads, FFN dim 8960, text length 512
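As a sanity check on the numbers above, the latent tensor shape follows from the resolution and VAE stride (assuming Wan2.1's usual temporal handling, where the first frame is kept and the remaining frames are strided):

```python
# Config values from the list above
frames, height, width = 81, 480, 832
vae_stride = (4, 8, 8)   # (temporal, spatial, spatial)
latent_channels = 16

# First frame is kept; the remaining 80 frames are strided temporally by 4
latent_frames = (frames - 1) // vae_stride[0] + 1
latent_h = height // vae_stride[1]
latent_w = width // vae_stride[2]

latent_shape = (latent_channels, latent_frames, latent_h, latent_w)
print(latent_shape)  # (16, 21, 60, 104)

# With chunk_size=3 (in latent frames), the 21 latent frames form 7 causal chunks
num_chunks = latent_frames // 3
```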
## Usage
This repository hosts only the released model weights. Please use the RAVEN codebase for inference and evaluation:
```bash
git clone https://github.com/YanzuoLu/RAVEN.git
cd RAVEN
```
Set up the environment:
```bash
conda env create -f tools/environment.yaml
conda activate raven
bash tools/prepare_venv.sh
source venv/bin/activate
```
Download this model repository:
```bash
hf download mvp-lab/RAVEN --local-dir /path/to/RAVEN-weights
```
Then point the relevant config files to the downloaded checkpoints. RAVEN itself (`raven_model.pt`) is a single full backbone:
```jsonc
"backbone": {
  "weight": "/path/to/RAVEN-weights/raven_model.pt"
}
```
CM-GRPO can be loaded in any of three equivalent forms:
Adapter only (`cmgrpo_raven_lora.safetensors`):
```jsonc
"backbone": {
  "weight": "/path/to/RAVEN-weights/raven_model.pt",
  "lora": {
    "enabled": true,
    "weight": "/path/to/RAVEN-weights/cmgrpo_raven_lora.safetensors"
  }
}
```
Base + LoRA bundle (`cmgrpo_raven_full.pt`):
```jsonc
"backbone": {
  "lora": {
    "enabled": true,
    "weight": "/path/to/RAVEN-weights/cmgrpo_raven_full.pt"
  }
}
```
Merged backbone (`cmgrpo_raven_merge.pt`):
```jsonc
"backbone": {
  "weight": "/path/to/RAVEN-weights/cmgrpo_raven_merge.pt"
}
```
The released CM-GRPO configs use the base + LoRA bundle form by default.
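The merged checkpoint corresponds to folding the adapter into the base weights ahead of time. As a generic illustration of what "merged" means (this is not RAVEN's actual loader; shapes and scaling are toy assumptions), a LoRA merge computes `W' = W + (alpha/r) * B @ A`:

```python
def merge_lora(base, lora_A, lora_B, alpha):
    """Fold a rank-r adapter into a dense weight: W' = W + (alpha/r) * B @ A.
    Plain-Python matrices (lists of rows) keep the sketch dependency-free."""
    r = len(lora_A)  # lora_A is r x in, lora_B is out x r
    scale = alpha / r
    merged = []
    for i, row in enumerate(base):
        new_row = []
        for j, w in enumerate(row):
            delta = sum(lora_B[i][k] * lora_A[k][j] for k in range(r))
            new_row.append(w + scale * delta)
        merged.append(new_row)
    return merged

# Toy example: a 4x4 zero base weight with a rank-2 all-ones adapter
W = [[0.0] * 4 for _ in range(4)]
A = [[1.0] * 4 for _ in range(2)]   # down-projection, r x in
B = [[1.0] * 2 for _ in range(4)]   # up-projection, out x r
W_merged = merge_lora(W, A, B, alpha=2.0)  # every entry becomes 2.0
```

After merging, the checkpoint loads exactly like the plain RAVEN backbone, which is why the merged form needs no LoRA block in the config.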
Reference configs:
```bash
configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/raven_baseline_prompts.jsonc
configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/cmgrpo_baseline_prompts.jsonc
configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/raven.jsonc
configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/cmgrpo.jsonc
```
Run qualitative generation:
```bash
bash tools/multi_run.sh configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/raven_baseline_prompts.jsonc
bash tools/multi_run.sh configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/cmgrpo_baseline_prompts.jsonc
```
Run VBench prompt-suite sampling:
```bash
bash tools/multi_run.sh configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/raven.jsonc
bash tools/multi_run.sh configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/cmgrpo.jsonc
```
## Requirements
The released configs depend on the RAVEN codebase and the upstream Wan2.1-T2V-1.3B components, including:
- Wan2.1-T2V-1.3B diffusion backbone / DiT config
- Wan2.1 VAE
- UMT5-XXL tokenizer and text encoder
- Python 3.10
- CUDA 12.8
- PyTorch 2.11 + cu128
- flash-attention 2/3 and magi-attention as built by `tools/prepare_venv.sh`
See the code repository README for full setup and evaluation instructions.
## License
This model is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). See the `LICENSE` file in the code repository for details.
The upstream Wan2.1 components are subject to their own licenses and terms. Users are responsible for complying with all applicable licenses for the base model, code, data, and dependencies.
## Citation
If you find this work useful, please cite RAVEN:
```bibtex
@article{lu2026raven,
  title   = {RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO},
  author  = {Lu, Yanzuo and Zuo, Ronglai and Deng, Jiankang},
  journal = {arXiv preprint arXiv:2605.15190},
  year    = {2026}
}
```