| --- |
| license: cc-by-nc-4.0 |
| library_name: pytorch |
| tags: |
| - text-to-video |
| - video-generation |
| - diffusion |
| - autoregressive |
| - consistency-model |
| - grpo |
| - wan2.1 |
| - raven |
| base_model: |
| - Wan-AI/Wan2.1-T2V-1.3B |
| pipeline_tag: text-to-video |
| --- |
| |
| # RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO |
|
|
| [Yanzuo Lu](https://yanzuo.lu/) · [Ronglai Zuo](https://2000zrl.github.io/) · [Jiankang Deng](https://jiankangdeng.github.io/) — Imperial College London |
|
|
| Project page: https://yanzuo.lu/raven |
|
|
| ## Overview |
|
|
| RAVEN is a causal autoregressive text-to-video generation model built on Wan2.1-T2V-1.3B. It is designed for real-time streaming video generation by extrapolating future video chunks from previously generated content. |
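The streaming extrapolation loop can be sketched as follows. This is illustrative pseudocode-style Python, not the RAVEN API: `model` and its call signature are assumptions, standing in for one full chunk-denoising pass conditioned on the prompt and all previously generated chunks.

```python
def generate_stream(model, prompt, num_chunks):
    # Illustrative autoregressive extrapolation loop (not the actual RAVEN
    # interface): each new chunk is denoised conditioned on the prompt and
    # the history of already-generated chunks, enabling streaming playback.
    history = []
    for _ in range(num_chunks):
        chunk = model(prompt, history)  # denoise the next chunk given history
        history.append(chunk)
        yield chunk
```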
|
|
| The release contains the RAVEN checkpoint plus three interchangeable CM-GRPO variants: |
|
|
| | File | Description | |
| | --- | --- | |
| | `raven_model.pt` | Full RAVEN backbone for causal autoregressive text-to-video generation. | |
| | `cmgrpo_raven_lora.safetensors` | CM-GRPO LoRA adapter only. Load `raven_model.pt` as the base weight and this file through the LoRA path. | |
| | `cmgrpo_raven_full.pt` | RAVEN base and CM-GRPO LoRA adapter packed into one PEFT-wrapped state dict. Load this file through the LoRA path without a separate base weight. | |
| | `cmgrpo_raven_merge.pt` | Full CM-GRPO backbone with the adapter already merged into RAVEN. Load this file as the base weight, with no LoRA block. | |
|
|
| RAVEN trains the causal video generator with a training-time framework that mirrors test-time extrapolation: each self-rollout is repacked into an interleaved sequence of clean historical endpoints and noisy denoising states. This aligns the training-time attention pattern with inference-time autoregressive extrapolation and lets downstream chunk losses supervise the historical representations used for future predictions. |
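The repacking step can be sketched as below. This is a minimal illustration of the interleaving idea, not the actual RAVEN implementation; the token layout (noisy state for chunk *i* preceded by the clean endpoints of chunks < *i*) is an assumption.

```python
def repack_rollout(clean_endpoints, noisy_states):
    # Illustrative sketch (not the RAVEN code): interleave the noisy denoising
    # state of each chunk with the clean endpoints of earlier chunks, so that
    # causal attention during training sees the same history layout as
    # inference-time autoregressive extrapolation.
    sequence = [noisy_states[0]]  # first chunk has no history
    for i in range(1, len(noisy_states)):
        sequence.append(clean_endpoints[i - 1])  # clean endpoint of chunk i-1
        sequence.append(noisy_states[i])         # noisy state of chunk i
    return sequence
```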
|
|
| We also release CM-GRPO weights. CM-GRPO formulates a consistency-model sampling step as a conditional Gaussian transition and applies online Group Relative Policy Optimization directly to this kernel. |
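The two ingredients of CM-GRPO can be sketched in a few lines. This is a generic illustration of the math, not the released implementation: symbols, shapes, and the diagonal-Gaussian form are assumptions.

```python
import math

# Illustrative CM-GRPO sketch (not the released code): one consistency-model
# sampling step x_t -> x_s is treated as a conditional Gaussian
# x_s ~ N(mu, sigma^2 I), which gives an explicit log-probability that
# online GRPO can optimize with group-relative advantages.

def gaussian_log_prob(x, mu, sigma):
    # log density of a diagonal Gaussian, summed over dimensions
    return sum(-0.5 * ((xi - mi) / sigma) ** 2
               - math.log(sigma * math.sqrt(2 * math.pi))
               for xi, mi in zip(x, mu))

def group_relative_advantages(rewards):
    # GRPO: normalize each rollout's reward against its group (same prompt)
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```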
|
|
| ## Model details |
|
|
| - Base architecture: Wan2.1-T2V-1.3B DiT |
| - Task: text-to-video generation |
| - Generation mode: causal autoregressive video extrapolation |
| - Resolution used in released configs: 480 x 832 |
| - Frames: 81 |
| - FPS: 16 |
| - Sampling steps: 4 |
| - Sampler: consistency sampler |
| - Schedule: linear interpolation schedule, `v_lerp` prediction type |
| - Classifier-free guidance: not used; the `guidance_scale=3.0` value in the configs is a placeholder for interface compatibility |
| - Causal chunking: `chunk_size=3`, `independent_first_chunk=3`, `sink=0`, `window_size=null` |
| - VAE stride: `[4, 8, 8]` |
| - Latent channels: 16 |
| - DiT config: dim 1536, 30 layers, 12 heads, FFN dim 8960, text length 512 |
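The numbers above imply the latent shapes below. This is a small sanity-check sketch using the standard causal-VAE stride arithmetic; the formulas are assumed rather than taken from the RAVEN code.

```python
# Assumed causal-VAE stride arithmetic (sanity check, not the RAVEN code)
frames, height, width = 81, 480, 832
t_stride, h_stride, w_stride = 4, 8, 8  # VAE stride [4, 8, 8]

# Causal temporal stride: first frame kept, the rest downsampled by 4
latent_frames = (frames - 1) // t_stride + 1  # 21 latent frames
latent_h = height // h_stride                 # 60
latent_w = width // w_stride                  # 104

# Causal chunking: 3 latent frames per chunk, first chunk independent
chunk_size = 3
num_chunks = latent_frames // chunk_size      # 7 chunks
```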
|
|
| ## Usage |
|
|
| This repository only hosts the released model weights. Please use the RAVEN codebase for inference and evaluation: |
|
|
| ```bash |
| git clone https://github.com/YanzuoLu/RAVEN.git |
| cd RAVEN |
| ``` |
|
|
| Set up the environment: |
|
|
| ```bash |
| conda env create -f tools/environment.yaml |
| conda activate raven |
| bash tools/prepare_venv.sh |
| source venv/bin/activate |
| ``` |
|
|
| Download this model repository: |
|
|
| ```bash |
| hf download mvp-lab/RAVEN --local-dir /path/to/RAVEN-weights |
| ``` |
|
|
| Then point the relevant config files to the downloaded checkpoints. RAVEN itself (`raven_model.pt`) is a single full backbone: |
|
|
| ```jsonc |
| "backbone": { |
| "weight": "/path/to/RAVEN-weights/raven_model.pt" |
| } |
| ``` |
|
|
| CM-GRPO can be loaded in any of three equivalent forms: |
|
|
| Adapter only (`cmgrpo_raven_lora.safetensors`): |
|
|
| ```jsonc |
| "backbone": { |
| "weight": "/path/to/RAVEN-weights/raven_model.pt", |
| "lora": { |
| "enabled": true, |
| "weight": "/path/to/RAVEN-weights/cmgrpo_raven_lora.safetensors" |
| } |
| } |
| ``` |
|
|
| Base + LoRA bundle (`cmgrpo_raven_full.pt`): |
|
|
| ```jsonc |
| "backbone": { |
| "lora": { |
| "enabled": true, |
| "weight": "/path/to/RAVEN-weights/cmgrpo_raven_full.pt" |
| } |
| } |
| ``` |
|
|
| Merged backbone (`cmgrpo_raven_merge.pt`): |
|
|
| ```jsonc |
| "backbone": { |
| "weight": "/path/to/RAVEN-weights/cmgrpo_raven_merge.pt" |
| } |
| ``` |
|
|
| The released CM-GRPO configs use the base + LoRA bundle form by default. |
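The three forms are interchangeable because a LoRA adapter can always be folded into the base weights. A minimal sketch of that merge is below; this is the generic LoRA math (merged = base + (alpha / rank) · B @ A), not the codebase's actual loader.

```python
def merge_lora(base, lora_a, lora_b, alpha, rank):
    # Generic LoRA merge sketch (illustrative, not the RAVEN loader):
    # merged[i][j] = base[i][j] + (alpha / rank) * (B @ A)[i][j]
    scale = alpha / rank
    rows, cols, inner = len(lora_b), len(lora_a[0]), len(lora_a)
    return [[base[i][j] + scale * sum(lora_b[i][k] * lora_a[k][j]
                                      for k in range(inner))
             for j in range(cols)] for i in range(rows)]
```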
|
|
| Reference configs: |
|
|
| ```bash |
| configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/raven_baseline_prompts.jsonc |
| configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/cmgrpo_baseline_prompts.jsonc |
| configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/raven.jsonc |
| configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/cmgrpo.jsonc |
| ``` |
|
|
| Run qualitative generation: |
|
|
| ```bash |
| bash tools/multi_run.sh configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/raven_baseline_prompts.jsonc |
| bash tools/multi_run.sh configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/cmgrpo_baseline_prompts.jsonc |
| ``` |
|
|
| Run VBench prompt-suite sampling: |
|
|
| ```bash |
| bash tools/multi_run.sh configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/raven.jsonc |
| bash tools/multi_run.sh configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/cmgrpo.jsonc |
| ``` |
|
|
| ## Requirements |
|
|
| The released configs depend on the RAVEN codebase and the upstream Wan2.1-T2V-1.3B components, including: |
|
|
| - Wan2.1-T2V-1.3B diffusion backbone / DiT config |
| - Wan2.1 VAE |
| - UMT5-XXL tokenizer and text encoder |
| - Python 3.10 |
| - CUDA 12.8 |
| - PyTorch 2.11 + cu128 |
| - flash-attention 2/3 and magi-attention as built by `tools/prepare_venv.sh` |
|
|
| See the code repository README for full setup and evaluation instructions. |
|
|
| ## License |
|
|
| This model is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). See the `LICENSE` file in the code repository for details. |
|
|
| The upstream Wan2.1 components are subject to their own licenses and terms. Users are responsible for complying with all applicable licenses for the base model, code, data, and dependencies. |
|
|
| ## Citation |
|
|
| If you find this work useful, please cite RAVEN: |
|
|
| ```bibtex |
| @article{lu2026raven, |
| title = {RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO}, |
| author = {Lu, Yanzuo and Zuo, Ronglai and Deng, Jiankang}, |
| year = 2026, |
| journal = {arXiv preprint arXiv:2605.15190} |
| } |
| ``` |
|
|