---
license: cc-by-nc-4.0
library_name: pytorch
tags:
  - text-to-video
  - video-generation
  - diffusion
  - autoregressive
  - consistency-model
  - grpo
  - wan2.1
  - raven
base_model:
  - Wan-AI/Wan2.1-T2V-1.3B
pipeline_tag: text-to-video
---

# RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

[Yanzuo Lu](https://yanzuo.lu/) · [Ronglai Zuo](https://2000zrl.github.io/) · [Jiankang Deng](https://jiankangdeng.github.io/) — Imperial College London

Project page: https://yanzuo.lu/raven  

## Overview

RAVEN is a causal autoregressive text-to-video generation model built on Wan2.1-T2V-1.3B. It is designed for real-time streaming video generation by extrapolating future video chunks from previously generated content.

The release contains the RAVEN checkpoint plus three interchangeable CM-GRPO variants:

| File | Description |
| --- | --- |
| `raven_model.pt` | Full RAVEN backbone for causal autoregressive text-to-video generation. |
| `cmgrpo_raven_lora.safetensors` | CM-GRPO LoRA adapter only. Load `raven_model.pt` as the base weight and this file through the LoRA path. |
| `cmgrpo_raven_full.pt` | RAVEN base and CM-GRPO LoRA adapter packed into one PEFT-wrapped state dict. Load this file through the LoRA path without a separate base weight. |
| `cmgrpo_raven_merge.pt` | Full CM-GRPO backbone with the adapter already merged into RAVEN. Load this file as the base weight, with no LoRA block. |

RAVEN trains a causal video generator with a training-time testing framework that repacks each self-rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This aligns the attention pattern seen during training with inference-time autoregressive extrapolation, and lets downstream chunk losses supervise the historical representations used for future predictions.
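
The chunked attention structure can be illustrated with a block-causal mask (a sketch under assumptions, not the repository's code): with `chunk_size=3`, tokens attend to everything in their own chunk and in all earlier chunks, which is the pattern the interleaved training sequence is meant to match.

```python
# Illustrative sketch, not the released implementation: a block-causal
# attention mask for chunked autoregressive generation with chunk_size=3.
# Tokens attend within their own chunk and to all earlier chunks.

def block_causal_mask(num_tokens: int, chunk_size: int) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j."""
    mask = []
    for i in range(num_tokens):
        chunk_i = i // chunk_size
        row = [(j // chunk_size) <= chunk_i for j in range(num_tokens)]
        mask.append(row)
    return mask

mask = block_causal_mask(num_tokens=6, chunk_size=3)
# Token 0 (first chunk) sees only chunk 0; token 5 (second chunk) sees all 6.
```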

We also release CM-GRPO weights. CM-GRPO formulates a consistency-model sampling step as a conditional Gaussian transition and applies online Group Relative Policy Optimization directly to this kernel.
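
The two ingredients of that objective can be sketched in a few lines. This is a hedged illustration of generic GRPO mechanics, not the released training code: rewards are normalized within a group of rollouts for the same prompt, and the policy ratio comes from Gaussian log-probabilities of each sampling transition.

```python
import math

# Sketch of GRPO-style group-relative advantages: normalize rewards within
# a group of rollouts sharing the same prompt (assumed convention, not the
# exact released implementation).
def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # epsilon guards against zero variance
    return [(r - mean) / std for r in rewards]

# Log-density of the conditional Gaussian transition treated as the policy.
def gaussian_log_prob(x: float, mu: float, sigma: float) -> float:
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

adv = group_relative_advantages([1.0, 0.5, 0.0, 0.5])
# Advantages are zero-mean within the group; the best rollout gets adv > 0.
```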

## Model details

- Base architecture: Wan2.1-T2V-1.3B DiT
- Task: text-to-video generation
- Generation mode: causal autoregressive video extrapolation
- Resolution used in released configs: 480 x 832
- Frames: 81
- FPS: 16
- Sampling steps: 4
- Sampler: consistency sampler
- Schedule: linear interpolation schedule, `v_lerp` prediction type
- Classifier-free guidance: not used; the `guidance_scale=3.0` value in the configs is a placeholder for interface compatibility
- Causal chunking: `chunk_size=3`, `independent_first_chunk=3`, `sink=0`, `window_size=null`
- VAE stride: `[4, 8, 8]`
- Latent channels: 16
- DiT config: dim 1536, 30 layers, 12 heads, FFN dim 8960, text length 512
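
A quick back-of-the-envelope check of the latent tensor shape implied by the numbers above. The `(frames - 1) / stride_t + 1` formula assumes a causal VAE that encodes the first frame on its own, as Wan-style VAEs are commonly described; treat it as an assumption rather than a statement about this codebase.

```python
# Derive the latent shape from the released config values:
# 81 frames at 480 x 832, VAE stride [4, 8, 8], 16 latent channels.
frames, height, width = 81, 480, 832
stride_t, stride_h, stride_w = 4, 8, 8
latent_channels = 16

latent_t = (frames - 1) // stride_t + 1  # causal-VAE assumption: first frame alone
latent_h = height // stride_h
latent_w = width // stride_w

latent_shape = (latent_channels, latent_t, latent_h, latent_w)
# -> (16, 21, 60, 104)
```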

## Usage

This repository hosts only the released model weights. Please use the RAVEN codebase for inference and evaluation:

```bash
git clone https://github.com/YanzuoLu/RAVEN.git
cd RAVEN
```

Set up the environment:

```bash
conda env create -f tools/environment.yaml
conda activate raven
bash tools/prepare_venv.sh
source venv/bin/activate
```

Download this model repository:

```bash
hf download mvp-lab/RAVEN --local-dir /path/to/RAVEN-weights
```

Then point the relevant config files to the downloaded checkpoints. RAVEN itself (`raven_model.pt`) is a single full backbone:

```jsonc
"backbone": {
    "weight": "/path/to/RAVEN-weights/raven_model.pt"
}
```

CM-GRPO can be loaded in any of three equivalent forms:

Adapter only (`cmgrpo_raven_lora.safetensors`):

```jsonc
"backbone": {
    "weight": "/path/to/RAVEN-weights/raven_model.pt",
    "lora": {
        "enabled": true,
        "weight": "/path/to/RAVEN-weights/cmgrpo_raven_lora.safetensors"
    }
}
```

Base + LoRA bundle (`cmgrpo_raven_full.pt`):

```jsonc
"backbone": {
    "lora": {
        "enabled": true,
        "weight": "/path/to/RAVEN-weights/cmgrpo_raven_full.pt"
    }
}
```

Merged backbone (`cmgrpo_raven_merge.pt`):

```jsonc
"backbone": {
    "weight": "/path/to/RAVEN-weights/cmgrpo_raven_merge.pt"
}
```

The released CM-GRPO configs use the base + LoRA bundle form by default.
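
The merged checkpoint corresponds to folding the adapter into the base weights. A minimal sketch of that fold, `W' = W + (alpha / r) * B @ A`, assuming standard LoRA conventions; the actual key layout inside the released `.pt` files may differ.

```python
# Sketch of a LoRA merge under standard conventions (assumption, not the
# exact PEFT layout of the released files): W' = W + (alpha / r) * B @ A.
def lora_merge(W, A, B, alpha: float, r: int):
    """W: d_out x d_in, B: d_out x r, A: r x d_in, all as nested lists."""
    scale = alpha / r
    d_out, d_in = len(W), len(W[0])
    merged = [row[:] for row in W]  # copy so the base weights stay intact
    for i in range(d_out):
        for j in range(d_in):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            merged[i][j] += scale * delta
    return merged

# Tiny rank-1 example on a 2x2 identity base weight.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]
A = [[0.0, 2.0]]
merged = lora_merge(W, A, B, alpha=2.0, r=1)
# merged == [[1.0, 4.0], [0.0, 1.0]]
```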

Reference configs:

```bash
configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/raven_baseline_prompts.jsonc
configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/cmgrpo_baseline_prompts.jsonc
configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/raven.jsonc
configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/cmgrpo.jsonc
```

Run qualitative generation:

```bash
bash tools/multi_run.sh configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/raven_baseline_prompts.jsonc
bash tools/multi_run.sh configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/cmgrpo_baseline_prompts.jsonc
```

Run VBench prompt-suite sampling:

```bash
bash tools/multi_run.sh configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/raven.jsonc
bash tools/multi_run.sh configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/cmgrpo.jsonc
```

## Requirements

The released configs depend on the RAVEN codebase and the upstream Wan2.1-T2V-1.3B components, including:

- Wan2.1-T2V-1.3B diffusion backbone / DiT config
- Wan2.1 VAE
- UMT5-XXL tokenizer and text encoder
- Python 3.10
- CUDA 12.8
- PyTorch 2.11 + cu128
- flash-attention 2/3 and magi-attention as built by `tools/prepare_venv.sh`

See the code repository README for full setup and evaluation instructions.

## License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). See the `LICENSE` file in the code repository for details.

The upstream Wan2.1 components are subject to their own licenses and terms. Users are responsible for complying with all applicable licenses for the base model, code, data, and dependencies.

## Citation

If you find this work useful, please cite RAVEN:

```bibtex
@article{lu2026raven,
  title = {RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO},
  author = {Lu, Yanzuo and Zuo, Ronglai and Deng, Jiankang},
  year = 2026,
  journal = {arXiv preprint arXiv:2605.15190}
}
```