Wan2.2-TI2V-5B-VedioQuant
Wan2.2 + TurboQuant cache compression = 10x less VRAM for video generation
This model packages Wan2.2-TI2V-5B (the current best open-source video model family) with VedioQuant cache compression pre-configured. Just load and use β no extra setup needed.
What's Different From Base Wan2.2?
| Base Wan2.2-5B | This Model (+ VedioQuant) | |
|---|---|---|
| Model weights | Unchanged | Identical (same weights) |
| Video quality | Baseline | 0.983 cosine similarity (< 2% loss) |
| 720P/81f 30-layer cache | 25.96 GB | 2.44 GB (10.6Γ smaller) |
| VRAM saved | β | 23.5 GB |
| Extra code needed | N/A | None (auto-enabled) |
The model weights are identical to the original Wan2.2. VedioQuant only compresses the inference cache (not the weights) using TurboQuant's random rotation + quantization technique, originally developed for LLM KV Cache compression.
Quick Start
import torch
from pipeline_vedioquant import VedioQuantPipeline
# Load β VedioQuant is auto-enabled
pipe = VedioQuantPipeline.from_pretrained(
"viberobin/Wan2.2-TI2V-5B-VedioQuant",
torch_dtype=torch.float16,
)
pipe.to("cuda") # or "mps" for Mac
# Generate video β same API as WanPipeline
video = pipe(
prompt="a cat sitting on a sofa, cinematic lighting",
num_frames=17,
height=480,
width=832,
).frames[0]
# Check compression stats
print(pipe.get_vedioquant_stats())
# β {'steps': 50, 'cache_hits': 32, 'hit_rate': '64%',
# 'bits': 3, 'compression_ratio': '10.7x'}
Configuration
# Adjust compression (default: 3-bit, recommended)
pipe.enable_vedioquant(bits=3, threshold=0.05)
# More aggressive compression (16x, slightly lower quality)
pipe.enable_vedioquant(bits=2)
# More conservative (8x, near-lossless)
pipe.enable_vedioquant(bits=4)
# Disable entirely (same as base Wan2.2)
pipe.disable_vedioquant()
Benchmark (Real Measurement on Wan2.2-TI2V-5B)
Tested with real Wan2.2-5B attention features (hidden_dim=3072, 30 layers):
| Config | Compression | Cosine Sim | Min Cosine | Compress | Decompress |
|---|---|---|---|---|---|
| 2-bit | 16.0Γ | 0.9396 | 0.9363 | 53ms | 16ms |
| 3-bit | 10.6Γ | 0.9827 | 0.9811 | 14ms | 14ms |
| 4-bit | 8.0Γ | 0.9953 | 0.9946 | 14ms | 12ms |
Gaussianization Verification
| Property | Before Rotation | After Rotation | Theory (Gaussian) |
|---|---|---|---|
| Kurtosis | 6.0 | 0.1 | 0.0 |
| Std dev | β | 0.018040 | 0.018042 |
Wan2.2 features are even more Gaussian-friendly than Wan2.1 (kurtosis 6 vs 15), resulting in excellent compression quality.
Memory Savings (720P, 81 frames, 30-layer full cache)
| fp32 | 3-bit VedioQuant | Saved | |
|---|---|---|---|
| Wan2.2-5B | 25.96 GB | 2.44 GB | 23.52 GB |
| Wan2.1-1.3B | 12.98 GB | 1.22 GB | 11.76 GB |
The 5B model has hidden_dim=3072 (vs 1536 for 1.3B), making cache compression even more impactful β saving 23.5 GB of VRAM.
How It Works
VedioQuant applies TurboQuant (Google Research, arXiv:2504.19874) to compress the feature cache used by TeaCache (arXiv:2411.19108) during video diffusion inference:
- Random orthogonal rotation transforms feature coordinates into a Gaussian distribution
- Precomputed Lloyd-Max codebook quantizes each coordinate from 32 bits to 3 bits
- Cache memory drops by 10.6Γ with only 1.7% quality loss (cosine similarity 0.983)
This is a cross-domain transfer: TurboQuant was designed for LLM KV Cache, but we discovered that video model attention features actually compress better than LLM features (kurtosis 6~15 vs 900).
Citation
@misc{vedioquant2025,
title={VedioQuant: Extreme Cache Compression for Video Diffusion Model Inference},
author={Peng Han},
year={2025},
url={https://github.com/robin-ph/vedioquant}
}
References
- TurboQuant: arXiv:2504.19874 (Google Research)
- TeaCache: arXiv:2411.19108 (CVPR 2025)
- Wan2.2: GitHub (Alibaba Cloud)
- VedioQuant: GitHub
- Downloads last month
- -
Model tree for viberobin/Wan2.2-TI2V-5B-VedioQuant
Base model
Wan-AI/Wan2.2-TI2V-5B-Diffusers