---
license: apache-2.0
pipeline_tag: text-to-video
tags:
  - text-to-video
  - video-generation
  - diffusion
  - long-video
  - longlive2
  - wan2.2
  - nvfp4
  - 2-step
---

# LongLive2.0 5B NVFP4 2-Step Checkpoint

This repository hosts the LongLive2.0 5B NVFP4 2-step checkpoint for inference with the LongLive2.0 release code:

https://github.com/wileewang/LongLive2.0

LongLive2.0 inference loads the Wan2.2-TI2V-5B generator, applies the few-step DMD adapter when a separate LoRA checkpoint is provided, and runs the generator with NVFP4 weight quantization plus optional FP4 KV-cache quantization.

## Installation

The NVFP4 path uses a stricter environment than the default BF16 release path. We recommend keeping it in a separate conda environment.

```bash
git clone https://github.com/wileewang/LongLive2.0.git
cd LongLive2.0

conda create -n longlive2_nvfp4 python=3.12 -y
conda activate longlive2_nvfp4

pip install -r requirements.txt
pip install --upgrade --index-url https://download.pytorch.org/whl/cu128 \
  torch==2.10.0 torchvision==0.25.0
```

Build the NVFP4 / FP4 extensions:

```bash
cd fouroversix
pip install ninja packaging psutil "setuptools>=77.0.3"

# B200 / GB200 / GB300
export CUDA_ARCHS=100
# RTX 50/60 series, if needed
# export CUDA_ARCHS=120

pip install --no-build-isolation -e .
cd ..

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v2.8.3
pip install -U pip setuptools wheel ninja packaging
pip install --no-build-isolation -e .
cd ..

cd utils/kernel
python setup.py build_ext --inplace
cd ../..
```

Quick environment check:

```bash
python -c "import torch, torchvision; print(torch.__version__, torch.version.cuda); print(torchvision.__version__)"
python -c "import flash_attn; print(flash_attn.__version__)"
python -c "import fouroversix; from utils.quant import LongLiveQuantizationConfig, quantize_to_fp4"
python -c "from utils.kernel.kv_dequant import dequantize_kv_cache_fp4"
```

The released LongLive2.0 checkpoint is sufficient for standard inference. You only need to download the original Wan2.2-TI2V-5B components if you want to run training, initialize from the original Wan weights, or use code paths that explicitly load the base Wan model files:

```bash
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B \
  --local-dir wan_models/Wan2.2-TI2V-5B
```

Download this checkpoint repository:

```bash
huggingface-cli download Perflow-Shuai/LongLive-2.0-5B-NVFP4-2Step \
  --local-dir checkpoints/longlive2_5b_nvfp4_2step
```

## Configure Inference

Edit `configs/nvfp4/inference_nvfp4.yaml`. For the released 2-step NVFP4 checkpoint, keep `inference.sampling_steps: 2`:

```yaml
checkpoints:
  generator_ckpt: checkpoints/longlive2_5b_nvfp4_2step/path/to/generator.pt
  lora_ckpt: null
  merge_lora: false

data:
  data_path: /path/to/inference_prompts
  image_or_video_shape:
    - 1
    - 384
    - 48
    - 44
    - 80
  output_folder: videos/longlive2_nvfp4_2step
  num_samples: 1
  num_output_frames: 384

inference:
  sampling_steps: 2
  sink_size: 8
  guidance_scale: 1.0
  multi_shot_sink: true
  multi_shot_rope_offset: 8
  kv_quant: true
  kv_quant_scale_rule: mse
  kv_quant_backend: cuda
  streaming_vae: false
  async_vae: false
  vae_type: wan

model_quant: true
model_quant_use_transformer_engine: false
model_quant_scale_rule: mse
model_quant_activation_scale_rule: mse
model_quant_weight_scale_rule: mse
model_quant_gradient_scale_rule: mse
```

Replace the checkpoint filename above with the actual file in this repository.

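
To find the filename to put in `checkpoints.generator_ckpt`, a small script along these lines can list candidate weight files in the download directory. This is a minimal sketch, not part of the release code: the helper name `find_checkpoint_files` is hypothetical, and the suffix list (`.pt`, `.pth`, `.safetensors`) is an assumption about how the weights are stored.

```python
from pathlib import Path


def find_checkpoint_files(root, suffixes=(".pt", ".pth", ".safetensors")):
    """Return weight-like files under `root` as paths relative to `root`.

    Hypothetical helper: the suffix list is an assumption, adjust it to
    match the files actually present in the repository.
    """
    root = Path(root)
    if not root.is_dir():
        # Nothing downloaded yet (or wrong path): report no candidates.
        return []
    return sorted(
        p.relative_to(root)
        for p in root.rglob("*")
        if p.is_file() and p.suffix in suffixes
    )


if __name__ == "__main__":
    # Directory used by the download command in this README.
    for rel_path in find_checkpoint_files("checkpoints/longlive2_5b_nvfp4_2step"):
        print(rel_path)
```

Each printed path is relative to the download directory, so prefix it with `checkpoints/longlive2_5b_nvfp4_2step/` when writing it into the config.
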
If this repository contains a separate DMD LoRA checkpoint instead of a merged generator, set `checkpoints.lora_ckpt` to that LoRA file and set `merge_lora: true`, then add the LoRA adapter config:

```yaml
adapter:
  type: lora
  rank: 128
  alpha: 128
  dropout: 0.0
  dtype: bfloat16
  apply_to_critic: true
  verbose: true
```

If `checkpoints.lora_ckpt` is `null`, remove the `adapter` section.

Do not set `model_quant_use_transformer_engine: true` when loading a FourOverSix materialized NVFP4 checkpoint. FourOverSix checkpoints store `quantized_weight_*` buffers and should be loaded through the FourOverSix path.

## Prompt Folder

`data.data_path` can be either:

- a `.txt` file, where each line is one single-shot prompt; or
- a directory of multi-shot prompt folders.

Example multi-shot prompt folder:

```text
inference_prompts/
  robot_lab_demo/
    0.json
    1.json
    2.json
    shot_durations.txt
```

Each JSON file contains:

```json
{
  "caption": "A compact silver robot with one blue optic explores a clean robotics lab."
}
```

`shot_durations.txt` is optional. If provided, each number is the number of temporal chunks assigned to the corresponding caption, for example:

```text
2 2 4
```

## Run

Single node, 4 GPUs:

```bash
torchrun --standalone --nnodes=1 --nproc_per_node=4 inference.py \
  --config_path configs/nvfp4/inference_nvfp4.yaml
```

Single GPU:

```bash
python inference.py --config_path configs/nvfp4/inference_nvfp4.yaml
```

Or use the helper script, which reads `NUM_GPUS` / `num_gpus` when provided:

```bash
scripts/inference_nvfp4.sh configs/nvfp4/inference_nvfp4.yaml
```

Outputs are written to `output_folder`.

## Notes

- This model card is for the **2-step** NVFP4 checkpoint. Use `inference.sampling_steps: 2`.
- `model_quant` enables NVFP4 generator inference.
- `inference.kv_quant` enables FP4 KV-cache storage and requires the `utils/kernel` extension.
- `inference.multi_shot_sink` enables the multi-shot attention sink.
- `inference.multi_shot_rope_offset` controls the multi-shot RoPE offset.
- `inference.streaming_vae`, `inference.async_vae`, `inference.vae_type`, and `inference.vae_device` control streaming or asynchronous VAE decode.
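
The configuration rules stated in this card (2 sampling steps, no Transformer Engine path for FourOverSix checkpoints, and a consistent LoRA / `adapter` setup) can be checked before launching a run. The sketch below is illustrative and not part of the release code: `check_nvfp4_config` is a hypothetical helper that takes an already-parsed config dictionary, and it assumes `merge_lora` sits under `checkpoints` and `model_quant_use_transformer_engine` sits at the top level of the config.

```python
def check_nvfp4_config(cfg):
    """Return a list of problems found in a parsed inference config dict.

    Hypothetical helper. The key nesting is an assumption about the layout
    of configs/nvfp4/inference_nvfp4.yaml; adjust it if your file differs.
    """
    problems = []
    checkpoints = cfg.get("checkpoints", {})
    inference = cfg.get("inference", {})

    # This card describes the 2-step checkpoint.
    if inference.get("sampling_steps") != 2:
        problems.append("inference.sampling_steps should be 2")

    # FourOverSix checkpoints should be loaded through the FourOverSix path.
    if cfg.get("model_quant_use_transformer_engine", False):
        problems.append("model_quant_use_transformer_engine should be false")

    # A separate LoRA checkpoint needs merge_lora and an adapter section;
    # without one, the adapter section should be removed.
    has_lora = checkpoints.get("lora_ckpt") is not None
    if has_lora and not checkpoints.get("merge_lora", False):
        problems.append("checkpoints.merge_lora should be true when lora_ckpt is set")
    if has_lora and "adapter" not in cfg:
        problems.append("adapter section is required when lora_ckpt is set")
    if not has_lora and "adapter" in cfg:
        problems.append("remove the adapter section when lora_ckpt is null")

    return problems


example = {
    "checkpoints": {
        "generator_ckpt": "checkpoints/longlive2_5b_nvfp4_2step/path/to/generator.pt",
        "lora_ckpt": None,
        "merge_lora": False,
    },
    "inference": {"sampling_steps": 4},
    "model_quant": True,
    "model_quant_use_transformer_engine": False,
}

for problem in check_nvfp4_config(example):
    print(problem)
```

The example dictionary deliberately uses `sampling_steps: 4`, so the only problem reported is the sampling-step mismatch. Parsing the YAML file into a dictionary is left to the caller.
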