GitHub | Habr article | Project Page | Technical Report (soon)

KVAE 2.0: Video tokenizers

KVAE 2.0 and the previous KVAE 1.0 are families of video and image tokenizers with spatial compression ratios of 8 and 16; the video models additionally have a temporal compression ratio of 4.

KVAE-3D-2.0-t4s16

The KVAE-3D-2.0-t4s16 model has a temporal compression ratio of 4 and a spatial compression ratio of 16x16.
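As a rough illustration of what these ratios mean for tensor sizes, the sketch below estimates the latent grid produced for a given input video. It is an approximation under stated assumptions: it ignores the extra first frame that causal video VAEs often keep, and the exact shape depends on the implementation.

```python
def latent_shape(frames, height, width, t_ratio=4, s_ratio=16):
    """Approximate latent grid size for a video tokenizer with the given
    temporal and spatial compression ratios (implementation details such
    as handling of the first frame are ignored)."""
    return frames // t_ratio, height // s_ratio, width // s_ratio

# A 32-frame 1280x720 clip maps to roughly an 8x45x80 latent grid:
print(latent_shape(32, 720, 1280))  # -> (8, 45, 80)
```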

Evaluation of reconstruction

For testing, the open datasets MCL-JCV (videos at 1280x720 resolution) and BVI-DVC were used. Wan-2.2 and HunyuanVideo-1.5 were considered as alternatives for the 4x16x16 format. For the HunyuanVideo model, tiling (with default parameters) was used because of its full-attention block. Below are the results of a comparison on the PSNR, SSIM, and LPIPS metrics (with AlexNet features).

Reconstruction comparison of KVAE 2.0, Hunyuan 1.5 and Wan 2.2
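For reference, PSNR is the simplest of the three metrics to reproduce. The sketch below is a minimal NumPy implementation for a single frame pair; it is illustrative only and not the evaluation code used for the table above (SSIM and LPIPS require dedicated libraries).

```python
import numpy as np

def psnr(ref, rec, max_val=255.0):
    """Peak signal-to-noise ratio between a reference frame and its
    reconstruction, both given as arrays in [0, max_val]."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.zeros((4, 4), dtype=np.uint8)
rec = np.full((4, 4), 16, dtype=np.uint8)
print(psnr(ref, rec))
```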

Evaluation of latent-space quality for the generative model

The purpose of the tokenizer is to create a latent space for the generative model, so its superiority can only be established by evaluating the quality of the generations. To do this, we directly compared models in a side-by-side (SBS) study with several users. Each user was shown pairs of images generated for the same prompt and rated each pair on three criteria: prompt adherence, visual quality, and semantic quality. A sufficiently large number of annotated pairs makes it possible to establish a better/worse relation between a pair of models. Fairness of the comparison is ensured by fixing the generative model's training dataset, its architecture, and the training strategy (optimizer parameters, number of steps, batch size, and other hyperparameters). Below are the results of two SBS studies with KVAE-2.0 4x16x16:
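The per-criterion votes from such a study can be aggregated into simple win rates. The helper below is a hypothetical sketch of that aggregation, not the analysis code used here; real SBS analyses typically also add significance tests.

```python
from collections import Counter

def sbs_win_rates(votes):
    """votes: list of (criterion, winner) pairs, winner in {"A", "B", "tie"}.
    Returns, per criterion, the share of non-tie votes won by model A."""
    wins, totals = Counter(), Counter()
    for criterion, winner in votes:
        if winner == "tie":
            continue  # ties are excluded from the denominator
        totals[criterion] += 1
        if winner == "A":
            wins[criterion] += 1
    return {c: wins[c] / totals[c] for c in totals}

votes = [("visual", "A"), ("visual", "B"), ("visual", "A"),
         ("semantic", "tie"), ("semantic", "A")]
print(sbs_win_rates(votes))
```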

Inference instructions

Installation

Clone the repo:

git clone https://github.com/kandinskylab/kvae.git
cd kvae

Create an environment with torch==2.8.0 and CUDA 12.8:

conda create -n kvae_inference python=3.11
conda activate kvae_inference
pip install -r requirements.txt

KVAE inference

To run the image model on a dataset and compute metrics, use the script:

PYTHONPATH=. python scripts/inference_2d_kvae.py --dataset_folder ./assets/images/ --model KVAE_1.0 

To run video models:

PYTHONPATH=. python scripts/inference_3d_kvae.py --dataset_folder ./assets/test1/ --model KVAE_2.0-t4s8

If you want to save the reconstructions, set the --saving_folder parameter to the output folder, e.g. ./your_path/. Note that this affects the running time, especially for the video model, even though saving works asynchronously with the rest of the components.
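The asynchronous-saving pattern mentioned above can be sketched with a queue and a background worker thread. This is a minimal illustration of the pattern, not the repo's actual code; `start_async_saver` and `save_fn` are hypothetical names.

```python
import queue
import threading

def start_async_saver(save_fn):
    """Run save_fn(item) on a background thread so the main
    reconstruction loop is not blocked by disk I/O."""
    q = queue.Queue()

    def worker():
        while True:
            item = q.get()
            if item is None:  # sentinel: stop the worker
                break
            save_fn(item)
            q.task_done()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return q, t

# Usage: enqueue reconstructions as they are produced, then signal shutdown.
saved = []
q, t = start_async_saver(saved.append)
for frame in range(3):
    q.put(frame)
q.put(None)
t.join()
print(saved)  # -> [0, 1, 2]
```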

A more detailed example of working with the models is provided in inference_examples.ipynb.

To use the mediapy library, you will need to install ffmpeg:

conda install -c conda-forge ffmpeg
pip install -q mediapy

Model Zoo

The KVAE 1.0 collection featured two models, for tokenizing videos and images, with a spatial compression ratio of 8. The KVAE 2.0 collection features two models, both for video tokenization, with spatial compression ratios of 8 and 16, respectively. Below are links to all KVAE models:

| Model | Data type | Time compression | Spatial compression | Checkpoint |
|---|---|---|---|---|
| KVAE-3D-2.0-t4s8 | video | 4 | 8 | 🤗 HF |
| KVAE-3D-2.0-t4s16 | video | 4 | 16 | 🤗 HF |
| KVAE-3D-1.0 | video | 4 | 8 | 🤗 HF |
| KVAE-2D-1.0 | image | - | 8 | 🤗 HF |
