
VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization

Website | arXiv | GitHub | BibTeX

Official implementation and pre-trained models for:
VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization, arXiv 2026
Andrei Atanov*, Jesse Allardice*, Roman Bachmann, Oğuzhan Fatih Kar, R Devon Hjelm, David Griffiths, Peter Fu, Afshin Dehghan, Amir Zamir

Installation

For install instructions, please see https://github.com/apple/ml-videoflextok.

Usage

To load the VideoFlexTok model directly from HuggingFace Hub, call:

from videoflextok.wrappers import VideoFlexTokFromHub
model = VideoFlexTokFromHub.from_pretrained('EPFL-VILAB/videoflextok_d18_d18_k600').eval()

The model can also be loaded by manually downloading the model.safetensors checkpoint from this repository and loading it with our helper functions:

from hydra.utils import instantiate
from videoflextok.utils.checkpoint import load_safetensors

ckpt, config = load_safetensors('/path/to/model.safetensors')
model = instantiate(config).eval()
model.load_state_dict(ckpt)

After loading a VideoFlexTok model, videos can be encoded using:

from videoflextok.utils.demo import read_mp4
# Load example video into a float tensor of shape (3, 17, 128, 128), normalized to [-1,1]
video_tensor = read_mp4("./data/video_examples/red_ball.mp4", num_frames=17, **model.video_preprocess_args)  # (C, T, H, W)

# Encode into a list of discrete token sequences, where each sequence is of shape [1, 5, 128]
tokens_list = model.tokenize(video_tensor[None])
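As noted above, read_mp4 returns frames normalized to [-1,1]. The conventional mapping from an 8-bit pixel value to this range is p / 127.5 - 1. A minimal sketch of that normalization (illustrative only; the repository's actual preprocessing may differ):

```python
def normalize_pixel(p):
    """Map an 8-bit pixel value in [0, 255] to a float in [-1, 1]."""
    return p / 127.5 - 1.0

print(normalize_pixel(0), normalize_pixel(255))  # -1.0 1.0
```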

The list of token sequences can be truncated in a nested fashion:

k_keep = 64 # For example, only keep the first 64 out of 128 tokens for each timestep
tokens_list = [t[..., :k_keep] for t in tokens_list]
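Because the token sequences are ordered coarse-to-fine, a shorter sequence is always a prefix of a longer one. The truncation above can be illustrated with a pure-Python sketch (the token values here are dummies, not real model outputs):

```python
# Dummy stand-in for one timestep's 128-token sequence
# (real tokens come from model.tokenize).
full_sequence = list(range(128))

# Truncating to k tokens keeps the first (coarsest) k entries.
coarse = full_sequence[:16]
finer = full_sequence[:64]

# The coarse sequence is a prefix of every finer one.
assert finer[:len(coarse)] == coarse
print(len(coarse), len(finer))  # 16 64
```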

To decode the tokens with VideoFlexTok's rectified flow decoder, call:

# tokens_list is a list of [1, 5, l] discrete token sequences, with l <= 128
# reconst is a list of RGB video tensors of shape [1, 3, 17, 128, 128], normalized to [-1,1]
reconst = model.detokenize(
    tokens_list,
    timesteps=30, # Number of denoising steps
    guidance_scale=20., # Classifier-free guidance scale (15-30 typically works well)
    perform_norm_guidance=True, # See https://arxiv.org/abs/2410.02416
)
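The decoded videos are float tensors in [-1,1], so they need to be mapped back to 8-bit RGB before saving or displaying. A hypothetical per-value helper (not part of this repository) doing the standard inverse conversion:

```python
def to_uint8(x):
    """Map a value from [-1, 1] to an integer pixel value in [0, 255]."""
    x = max(-1.0, min(1.0, x))  # clamp to the valid range first
    return int(round((x + 1.0) * 127.5))

print(to_uint8(-1.0), to_uint8(0.0), to_uint8(1.0))  # 0 128 255
```

In practice the same mapping would be applied tensor-wide, e.g. with torch's clamp/mul/add/byte operations, before handing the frames to a video writer.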

Citation

If you find this repository helpful, please consider citing our work:

@article{videoflextok,
    title={{VideoFlexTok}: Flexible-Length Coarse-to-Fine Video Tokenization},
    author={Andrei Atanov and Jesse Allardice and Roman Bachmann and O{\u{g}}uzhan Fatih Kar and Peter Fu and David Griffiths and Devon Hjelm and Afshin Dehghan and Amir Zamir},
    journal={arXiv 2604.12887},
    year={2026},
}

License

The model weights in this repository are released under the Apache License 2.0.
