---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
pipeline_tag: text-to-video
tags:
- text-to-video
- multi-shot
- NVFP4
- video-generation
- diffusion
- long-video
- longlive2
- wan2.2
---


# LongLive2.0 5B NVFP4 Denoising Step 2

[![Paper](https://img.shields.io/badge/ArXiv-Paper-brown)](https://arxiv.org/abs/2605.18739)
[![Code](https://img.shields.io/badge/GitHub-Code-blue)](https://github.com/NVlabs/LongLive)
[![Video](https://img.shields.io/badge/YouTube-Video-red)](https://www.youtube.com/watch?v=7oQALy32fiU)
[![Models](https://img.shields.io/badge/Model-BF16-yellow)](https://huggingface.co/Efficient-Large-Model/LongLive-2.0-5B)
[![Models](https://img.shields.io/badge/Model-NVFP4-orange)](https://huggingface.co/Efficient-Large-Model/LongLive-2.0-5B-NVFP4-S4)
[![Demo](https://img.shields.io/badge/Demo-Page-brightgreen)](https://nvlabs.github.io/LongLive/LongLive2/)
[![Docs](https://img.shields.io/badge/Full-Documentation-green)](https://nvlabs.github.io/LongLive/LongLive2/docs/)

This repository hosts the LongLive2.0 5B NVFP4 denoising step 2 checkpoint for inference with the LongLive2.0 release code:

https://github.com/NVlabs/LongLive

LongLive2.0 inference loads the Wan2.2-TI2V-5B generator, applies the few-step DMD adapter when a separate LoRA checkpoint is provided, and runs the generator with NVFP4 weight quantization plus optional FP4 KV-cache quantization.

## Installation

The NVFP4 path uses a stricter environment than the default BF16 release path. We recommend keeping it in a separate conda environment.

```bash
git clone https://github.com/wileewang/LongLive2.0.git
cd LongLive2.0

conda create -n longlive2_nvfp4 python=3.12 -y
conda activate longlive2_nvfp4

pip install -r requirements.txt
pip install --upgrade --index-url https://download.pytorch.org/whl/cu128 \
  torch==2.10.0 torchvision==0.25.0
```

Build the NVFP4 / FP4 extensions:

```bash
cd fouroversix
pip install ninja packaging psutil "setuptools>=77.0.3"

# B200 / GB200 / GB300
export CUDA_ARCHS=100
# RTX 50/60 series, if needed
# export CUDA_ARCHS=120

pip install --no-build-isolation -e .
cd ..

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v2.8.3
pip install -U pip setuptools wheel ninja packaging
pip install --no-build-isolation -e .
cd ..

cd utils/kernel
python setup.py build_ext --inplace
cd ../..
```

Quick environment check:

```bash
python -c "import torch, torchvision; print(torch.__version__, torch.version.cuda); print(torchvision.__version__)"
python -c "import flash_attn; print(flash_attn.__version__)"
python -c "import fouroversix; from utils.quant import LongLiveQuantizationConfig, quantize_to_fp4"
python -c "from utils.kernel.kv_dequant import dequantize_kv_cache_fp4"
```

The released LongLive2.0 checkpoint is sufficient for standard inference. You only need to download the original Wan2.2-TI2V-5B components if you want to run training, initialize from the original Wan weights, or use code paths that explicitly load the base Wan model files:

```bash
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B \
  --local-dir wan_models/Wan2.2-TI2V-5B
```

Download this checkpoint repository:

```bash
huggingface-cli download Perflow-Shuai/LongLive-2.0-5B-NVFP4-2Step \
  --local-dir checkpoints/longlive2_5b_nvfp4_2step
```

## Configure Inference

Edit `configs/nvfp4/inference_nvfp4.yaml`.
For the released 2-step NVFP4 checkpoint, keep `inference.sampling_steps: 2`:

```yaml
checkpoints:
  generator_ckpt: checkpoints/longlive2_5b_nvfp4_2step/path/to/generator.pt
  lora_ckpt: null
merge_lora: false

data:
  data_path: /path/to/inference_prompts
  image_or_video_shape:
    - 1
    - 384
    - 48
    - 44
    - 80
output_folder: videos/longlive2_nvfp4_2step
num_samples: 1
num_output_frames: 384

inference:
  sampling_steps: 2
  sink_size: 8
  guidance_scale: 1.0
  multi_shot_sink: true
  multi_shot_rope_offset: 8
  kv_quant: true
  kv_quant_scale_rule: mse
  kv_quant_backend: cuda
  streaming_vae: false
  async_vae: false
  vae_type: wan

model_quant: true
model_quant_use_transformer_engine: false
model_quant_scale_rule: mse
model_quant_activation_scale_rule: mse
model_quant_weight_scale_rule: mse
model_quant_gradient_scale_rule: mse
```

Replace the checkpoint filename above with the actual file in this repository. If this repository contains a separate DMD LoRA checkpoint instead of a merged generator, set `checkpoints.lora_ckpt` to that LoRA file and set `merge_lora: true`, then add the LoRA adapter config:

```yaml
adapter:
  type: lora
  rank: 128
  alpha: 128
  dropout: 0.0
  dtype: bfloat16
  apply_to_critic: true
  verbose: true
```

If `checkpoints.lora_ckpt` is `null`, remove the `adapter` section.

Do not set `model_quant_use_transformer_engine: true` when loading a FourOverSix materialized NVFP4 checkpoint. FourOverSix checkpoints store `quantized_weight_*` buffers and should be loaded through the FourOverSix path.

## Prompt Folder

`data.data_path` can be either:

- a `.txt` file, where each line is one single-shot prompt; or
- a directory of multi-shot prompt folders.

Example multi-shot prompt folder:

```text
inference_prompts/
  robot_lab_demo/
    0.json
    1.json
    2.json
    shot_durations.txt
```

Each JSON file contains:

```json
{
  "caption": "A compact silver robot with one blue optic explores a clean robotics lab."
}
```

`shot_durations.txt` is optional.
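Before launching a long run, it can help to confirm that a multi-shot prompt folder matches the layout described above. The following is a minimal sketch, not part of the LongLive2.0 code: `load_shot_captions` is a hypothetical helper, and it assumes that shot files are named by integer index (`0.json`, `1.json`, ...) and are meant to be read in numeric order, which the example layout suggests but the release code may handle differently.

```python
import json
from pathlib import Path


def load_shot_captions(shot_dir):
    """Return the captions of one multi-shot folder, ordered by shot index.

    Hypothetical helper for checking a prompt folder by hand; it is not
    the loader used by inference.py.
    """
    shot_dir = Path(shot_dir)
    # Assumption: shot files are named <index>.json and ordered numerically,
    # so 10.json sorts after 2.json. Other files are ignored.
    shot_files = sorted(
        (p for p in shot_dir.glob("*.json") if p.stem.isdigit()),
        key=lambda p: int(p.stem),
    )
    if not shot_files:
        raise ValueError(f"no numbered shot files found in {shot_dir}")

    captions = []
    for path in shot_files:
        with path.open(encoding="utf-8") as f:
            shot = json.load(f)
        # Each shot file is expected to be an object with a "caption" string.
        caption = shot.get("caption") if isinstance(shot, dict) else None
        if not isinstance(caption, str) or not caption.strip():
            raise ValueError(f"{path} has no non-empty 'caption' string")
        captions.append(caption)
    return captions
```

Calling `load_shot_captions("inference_prompts/robot_lab_demo")` on the example folder would return its three captions in shot order, or raise `ValueError` if a file is missing its `caption` field.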
If `shot_durations.txt` is provided, each number is the number of temporal chunks assigned to the corresponding caption, for example:

```text
2 2 4
```

## Run

Single node, 4 GPUs:

```bash
torchrun --standalone --nnodes=1 --nproc_per_node=4 inference.py \
  --config_path configs/nvfp4/inference_nvfp4.yaml
```

Single GPU:

```bash
python inference.py --config_path configs/nvfp4/inference_nvfp4.yaml
```

Or use the helper script, which reads `NUM_GPUS` / `num_gpus` when provided:

```bash
scripts/inference_nvfp4.sh configs/nvfp4/inference_nvfp4.yaml
```

Outputs are written to `output_folder`.

## Notes

- This model card is for the **2-step** NVFP4 checkpoint. Use `inference.sampling_steps: 2`.
- `model_quant` enables NVFP4 generator inference.
- `inference.kv_quant` enables FP4 KV-cache storage and requires the `utils/kernel` extension.
- `inference.multi_shot_sink` enables the multi-shot attention sink.
- `inference.multi_shot_rope_offset` controls the multi-shot RoPE offset.
- `inference.streaming_vae`, `inference.async_vae`, `inference.vae_type`, and `inference.vae_device` control streaming or asynchronous VAE decode.

## License/Terms of Use

GOVERNING TERMS: This trial service is governed by the [NVIDIA API Trial Terms of Service](https://assets.ngc.nvidia.com/products/api-catalog/legal/NVIDIA%20API%20Trial%20Terms%20of%20Service.pdf). Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).

## Citation

```bibtex
@article{longlive_2,
  title={LongLive2.0: An NVFP4 Parallel Infrastructure for Long Video Generation},
  author={Chen, Yukang and Wang, Luozhou and Huang, Wei and Yang, Shuai and Zhang, Bohan and Xiao, Yicheng and Chu, Ruihang and Mao, Weian and Hu, Qixin and Liu, Shaoteng and Zhao, Yuyang and Mao, Huizi and Chen, Ying-Cong and Xie, Enze and Qi, Xiaojuan and Han, Song},
  journal={arXiv preprint arXiv},
  year={2026}
}
```