---
license: cc-by-nc-4.0
library_name: pytorch
tags:
  - text-to-video
  - video-generation
  - diffusion
  - autoregressive
  - consistency-model
  - grpo
  - wan2.1
  - raven
base_model:
  - Wan-AI/Wan2.1-T2V-1.3B
pipeline_tag: text-to-video
---

# RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

Yanzuo Lu · Ronglai Zuo · Jiankang Deng — Imperial College London

Project page: https://yanzuo.lu/raven

## Overview

RAVEN is a causal autoregressive text-to-video generation model built on Wan2.1-T2V-1.3B. It is designed for real-time streaming video generation by extrapolating future video chunks from previously generated content.

The release contains the RAVEN checkpoint plus three interchangeable CM-GRPO variants:

| File | Description |
| --- | --- |
| `raven_model.pt` | Full RAVEN backbone for causal autoregressive text-to-video generation. |
| `cmgrpo_raven_lora.safetensors` | CM-GRPO LoRA adapter only. Load `raven_model.pt` as the base weight and this file through the LoRA path. |
| `cmgrpo_raven_full.pt` | RAVEN base and CM-GRPO LoRA adapter packed into one PEFT-wrapped state dict. Load this file through the LoRA path without a separate base weight. |
| `cmgrpo_raven_merge.pt` | Full CM-GRPO backbone with the adapter already merged into RAVEN. Load this file as the base weight, with no LoRA block. |

RAVEN trains a causal video generator with a training-time framework that mirrors test-time rollout: each self-rollout is repacked into an interleaved sequence of clean historical chunk endpoints and noisy denoising states. This aligns the model's training attention pattern with inference-time autoregressive extrapolation and lets the losses on downstream chunks supervise the historical representations used for future predictions.
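The repacking described above can be sketched as follows. This is an illustrative sketch only, not the RAVEN implementation; the function name and the tuple layout are hypothetical.

```python
# Illustrative sketch of the training-time repacking described above: a
# self-rollout of N chunks is flattened into a sequence that interleaves
# clean historical chunk endpoints with noisy denoising states, so that
# training attention matches autoregressive inference. All names here are
# hypothetical; see the RAVEN codebase for the actual implementation.

def repack_rollout(clean_endpoints, noisy_states):
    """Interleave clean history with the noisy states that follow it.

    clean_endpoints: per-chunk clean latents [c0, c1, ..., c_{N-1}]
    noisy_states:    per-chunk noisy latents [x0, x1, ..., x_{N-1}]
    Returns a flat sequence [x0, c0, x1, c1, x2, ...]: under causal
    attention, the noisy state of chunk i+1 can attend to the clean
    endpoints of all chunks generated before it.
    """
    sequence = []
    for i, noisy in enumerate(noisy_states):
        sequence.append(("noisy", i, noisy))  # denoising target for chunk i
        if i < len(clean_endpoints) - 1:
            # clean endpoint of chunk i, serving as history for chunk i+1
            sequence.append(("clean", i, clean_endpoints[i]))
    return sequence
```

Because downstream noisy chunks attend to earlier clean endpoints inside one training sequence, their losses backpropagate into those historical representations.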

We also release CM-GRPO weights. CM-GRPO formulates a consistency-model sampling step as a conditional Gaussian transition and applies online Group Relative Policy Optimization directly to this kernel.
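The two ingredients named above, group-relative advantages and a Gaussian transition log-density, can be sketched in a few lines. This is a generic GRPO-style sketch under standard definitions, not code from the CM-GRPO release.

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each rollout's reward against the
    mean and standard deviation of its own sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # epsilon guards against a zero-variance group
    return [(r - mean) / std for r in rewards]

def gaussian_log_prob(x, mean, sigma):
    """Log-density of the conditional Gaussian transition N(x; mean, sigma^2),
    the kernel to which the policy gradient is applied."""
    return -0.5 * (((x - mean) / sigma) ** 2 + math.log(2 * math.pi * sigma ** 2))
```

In a GRPO update, each rollout's advantage weights the log-probability of the Gaussian transitions that produced it, so rollouts rewarded above their group mean are reinforced and those below it are suppressed.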

## Model details

  • Base architecture: Wan2.1-T2V-1.3B DiT
  • Task: text-to-video generation
  • Generation mode: causal autoregressive video extrapolation
  • Resolution used in released configs: 480 x 832
  • Frames: 81
  • FPS: 16
  • Sampling steps: 4
  • Sampler: consistency sampler
  • Schedule: linear interpolation schedule, v_lerp prediction type
  • Classifier-free guidance: not used; the guidance_scale=3.0 value in the configs is a placeholder for interface compatibility
  • Causal chunking: chunk_size=3, independent_first_chunk=3, sink=0, window_size=null
  • VAE stride: [4, 8, 8]
  • Latent channels: 16
  • DiT config: dim 1536, 30 layers, 12 heads, FFN dim 8960, text length 512
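The numbers above determine the latent tensor shape. A minimal sketch, assuming the common video-VAE convention of `(frames - 1) // t_stride + 1` temporal latents (the helper name is illustrative):

```python
def latent_shape(frames, height, width, vae_stride=(4, 8, 8), latent_channels=16):
    """Latent shape implied by the config above. Assumes the usual
    video-VAE convention for the temporal axis; spatial axes divide
    by the stride exactly."""
    t = (frames - 1) // vae_stride[0] + 1
    h = height // vae_stride[1]
    w = width // vae_stride[2]
    return (latent_channels, t, h, w)

# 81 frames at 480 x 832 -> 16 channels x 21 temporal latents of 60 x 104
shape = latent_shape(81, 480, 832)
```

With `chunk_size=3`, the 21 temporal latents split into 7 causal chunks.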

## Usage

This repository hosts only the released model weights. Please use the RAVEN codebase for inference and evaluation:

```bash
git clone https://github.com/YanzuoLu/RAVEN.git
cd RAVEN
```

Set up the environment:

```bash
conda env create -f tools/environment.yaml
conda activate raven
bash tools/prepare_venv.sh
source venv/bin/activate
```

Download this model repository:

```bash
hf download mvp-lab/RAVEN --local-dir /path/to/RAVEN-weights
```

Then point the relevant config files to the downloaded checkpoints. RAVEN itself (`raven_model.pt`) is a single full backbone:

```jsonc
"backbone": {
    "weight": "/path/to/RAVEN-weights/raven_model.pt"
}
```

CM-GRPO can be loaded in any of three equivalent forms:

Adapter only (`cmgrpo_raven_lora.safetensors`):

```jsonc
"backbone": {
    "weight": "/path/to/RAVEN-weights/raven_model.pt",
    "lora": {
        "enabled": true,
        "weight": "/path/to/RAVEN-weights/cmgrpo_raven_lora.safetensors"
    }
}
```

Base + LoRA bundle (`cmgrpo_raven_full.pt`):

```jsonc
"backbone": {
    "lora": {
        "enabled": true,
        "weight": "/path/to/RAVEN-weights/cmgrpo_raven_full.pt"
    }
}
```

Merged backbone (`cmgrpo_raven_merge.pt`):

```jsonc
"backbone": {
    "weight": "/path/to/RAVEN-weights/cmgrpo_raven_merge.pt"
}
```

The released CM-GRPO configs use the base + LoRA bundle form by default.
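The logic a loader needs to distinguish the three forms can be sketched as follows. This is a hypothetical helper for illustration; the actual loading logic lives in the RAVEN codebase.

```python
def resolve_checkpoints(backbone_cfg):
    """Return (base_weight, lora_weight) for the three config forms above.

    Adapter only:      both "weight" and an enabled "lora.weight" are set.
    Base + LoRA bundle: only an enabled "lora.weight" is set.
    Merged backbone:    only "weight" is set.
    Hypothetical helper, not part of the RAVEN codebase.
    """
    lora = backbone_cfg.get("lora", {})
    lora_weight = lora.get("weight") if lora.get("enabled") else None
    base_weight = backbone_cfg.get("weight")
    if base_weight is None and lora_weight is None:
        raise ValueError("config must set backbone.weight and/or an enabled lora.weight")
    return base_weight, lora_weight
```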

Reference configs:

```text
configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/raven_baseline_prompts.jsonc
configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/cmgrpo_baseline_prompts.jsonc
configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/raven.jsonc
configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/cmgrpo.jsonc
```

Run qualitative generation:

```bash
bash tools/multi_run.sh configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/raven_baseline_prompts.jsonc
bash tools/multi_run.sh configs/trials/generate_t2v/causal_wan2.1_1.3B_t2v/cmgrpo_baseline_prompts.jsonc
```

Run VBench prompt-suite sampling:

```bash
bash tools/multi_run.sh configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/raven.jsonc
bash tools/multi_run.sh configs/trials/vbench_t2v/causal_wan2.1_1.3B_t2v/cmgrpo.jsonc
```

## Requirements

The released configs depend on the RAVEN codebase and the upstream Wan2.1-T2V-1.3B components, including:

  • Wan2.1-T2V-1.3B diffusion backbone / DiT config
  • Wan2.1 VAE
  • UMT5-XXL tokenizer and text encoder
  • Python 3.10
  • CUDA 12.8
  • PyTorch 2.11 + cu128
  • flash-attention 2/3 and magi-attention as built by tools/prepare_venv.sh

See the code repository README for full setup and evaluation instructions.

## License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). See the LICENSE file in the code repository for details.

The upstream Wan2.1 components are subject to their own licenses and terms. Users are responsible for complying with all applicable licenses for the base model, code, data, and dependencies.

## Citation

If you find this work useful, please cite RAVEN:

```bibtex
@article{lu2026raven,
  title = {RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO},
  author = {Lu, Yanzuo and Zuo, Ronglai and Deng, Jiankang},
  year = 2026,
  journal = {arXiv preprint arXiv:2605.15190}
}
```