Efficient-Large-Model/LongLive-2.0-5B-NVFP4-S2

+---
+license: other
+license_name: nvidia-open-model-license
+license_link: >-
+  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
+pipeline_tag: text-to-video
+tags:
+  - text-to-video
+  - multi-shot
+  - NVFP4
+  - video-generation
+  - diffusion
+  - long-video
+  - longlive2
+  - wan2.2
+---
+<p align="center">
+  <img src="logo.png" alt="LongLive2.0 logo" width="100%">
+</p>
+# LongLive2.0 5B NVFP4 Denoising Step 2
+This repository hosts the LongLive2.0 5B NVFP4 2-step denoising checkpoint for
+inference with the LongLive2.0 release code:
+https://github.com/NVlabs/LongLive
+LongLive2.0 inference loads the Wan2.2-TI2V-5B generator, applies the
+few-step DMD (Distribution Matching Distillation) adapter when a separate LoRA
+checkpoint is provided, and runs the generator with NVFP4 weight quantization
+plus optional FP4 KV-cache quantization.
+## Installation
+The NVFP4 path needs a stricter environment than the default BF16 release path:
+a pinned PyTorch build from the CUDA 12.8 wheel index plus locally compiled FP4
+extensions. We recommend keeping it in a separate conda environment.
+```bash
+git clone https://github.com/wileewang/LongLive2.0.git
+cd LongLive2.0
+conda create -n longlive2_nvfp4 python=3.12 -y
+conda activate longlive2_nvfp4
+pip install -r requirements.txt
+pip install --upgrade --index-url https://download.pytorch.org/whl/cu128 \
+  torch==2.10.0 torchvision==0.25.0
+```
+Build the NVFP4 / FP4 extensions:
+```bash
+cd fouroversix
+pip install ninja packaging psutil "setuptools>=77.0.3"
+# B200 / GB200 / GB300
+export CUDA_ARCHS=100
+# RTX 50/60 series, if needed
+# export CUDA_ARCHS=120
+pip install --no-build-isolation -e .
+cd ..
+git clone https://github.com/Dao-AILab/flash-attention.git
+cd flash-attention
+git checkout v2.8.3
+pip install -U pip setuptools wheel ninja packaging
+pip install --no-build-isolation -e .
+cd ..
+cd utils/kernel
+python setup.py build_ext --inplace
+cd ../..
+```
+Quick environment check:
+```bash
+python -c "import torch, torchvision; print(torch.__version__, torch.version.cuda); print(torchvision.__version__)"
+python -c "import flash_attn; print(flash_attn.__version__)"
+python -c "import fouroversix; from utils.quant import LongLiveQuantizationConfig, quantize_to_fp4"
+python -c "from utils.kernel.kv_dequant import dequantize_kv_cache_fp4"
+```
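The `CUDA_ARCHS` value exported in the build step depends on the GPU generation. As a minimal sketch, the value could be derived from the device's compute capability. This assumes the mapping implied by the build comments (major version 10 for B200-class parts, major version 12 for RTX 50-series cards). Both `cuda_archs_for` and `detect_cuda_archs` are hypothetical helpers, not part of the LongLive2.0 or FourOverSix code.

```python
def cuda_archs_for(major: int, minor: int = 0) -> str:
    """Map a CUDA compute capability to the CUDA_ARCHS value used in the build step.

    Assumption: only the two architectures named in the build instructions
    (major 10 -> "100", major 12 -> "120") are covered.
    """
    archs = {10: "100", 12: "120"}
    if major not in archs:
        raise ValueError(
            f"compute capability {major}.{minor} has no CUDA_ARCHS value in this guide"
        )
    return archs[major]


def detect_cuda_archs() -> str:
    """Query the first visible GPU through PyTorch and return its CUDA_ARCHS value."""
    import torch  # imported lazily so the mapping above works without PyTorch

    if not torch.cuda.is_available():
        raise RuntimeError("no CUDA device is visible to PyTorch")
    major, minor = torch.cuda.get_device_capability(0)
    return cuda_archs_for(major, minor)
```

For example, `cuda_archs_for(10, 0)` returns `"100"`, and any capability outside the two listed majors raises `ValueError` rather than guessing a value.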
+The released LongLive2.0 checkpoint is sufficient for standard inference. You
+only need to download the original Wan2.2-TI2V-5B components if you want to run
+training, initialize from the original Wan weights, or use code paths that
+explicitly load the base Wan model files:
+```bash
+huggingface-cli download Wan-AI/Wan2.2-TI2V-5B \
+  --local-dir wan_models/Wan2.2-TI2V-5B
+```
+Download this checkpoint repository:
+```bash
+huggingface-cli download Perflow-Shuai/LongLive-2.0-5B-NVFP4-2Step \
+  --local-dir checkpoints/longlive2_5b_nvfp4_2step
+```
+## Configure Inference
+Edit `configs/nvfp4/inference_nvfp4.yaml`.
+For the released 2-step NVFP4 checkpoint, keep
+`inference.sampling_steps: 2`:
+```yaml
+checkpoints:
+  generator_ckpt: checkpoints/longlive2_5b_nvfp4_2step/path/to/generator.pt
+  lora_ckpt: null
+merge_lora: false
+data:
+  data_path: /path/to/inference_prompts
+  image_or_video_shape:
+  - 1
+  - 384
+  - 48
+  - 44
+  - 80
+output_folder: videos/longlive2_nvfp4_2step
+num_samples: 1
+num_output_frames: 384
+inference:
+  sampling_steps: 2
+  sink_size: 8
+  guidance_scale: 1.0
+  multi_shot_sink: true
+  multi_shot_rope_offset: 8
+  kv_quant: true
+  kv_quant_scale_rule: mse
+  kv_quant_backend: cuda
+  streaming_vae: false
+  async_vae: false
+  vae_type: wan
+model_quant: true
+model_quant_use_transformer_engine: false
+model_quant_scale_rule: mse
+model_quant_activation_scale_rule: mse
+model_quant_weight_scale_rule: mse
+model_quant_gradient_scale_rule: mse
+```
+Replace the placeholder `path/to/generator.pt` above with the actual checkpoint
+file in this repository.
+If this repository contains a separate DMD LoRA checkpoint instead of a merged
+generator, set `checkpoints.lora_ckpt` to that LoRA file and set
+`merge_lora: true`, then add the LoRA adapter config:
+```yaml
+adapter:
+  type: lora
+  rank: 128
+  alpha: 128
+  dropout: 0.0
+  dtype: bfloat16
+  apply_to_critic: true
+  verbose: true
+```
+If `checkpoints.lora_ckpt` is `null`, remove the `adapter` section.
+Do not set `model_quant_use_transformer_engine: true` when loading a FourOverSix
+materialized NVFP4 checkpoint. FourOverSix checkpoints store
+`quantized_weight_*` buffers and should be loaded through the FourOverSix path.
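The configuration rules above can be checked before launching a run. The following is a minimal sketch that operates on the parsed config as a plain dictionary (for instance the result of `yaml.safe_load` from PyYAML). `check_nvfp4_config` is a hypothetical helper; whether the release code performs equivalent checks itself is not stated here.

```python
def check_nvfp4_config(cfg: dict) -> list[str]:
    """Return a list of problems found in a parsed inference_nvfp4.yaml mapping."""
    problems = []

    # The released checkpoint is meant to run with 2 sampling steps.
    steps = (cfg.get("inference") or {}).get("sampling_steps")
    if steps != 2:
        problems.append(f"inference.sampling_steps should be 2, got {steps!r}")

    # lora_ckpt and the adapter section should be present or absent together.
    lora_ckpt = (cfg.get("checkpoints") or {}).get("lora_ckpt")
    has_adapter = "adapter" in cfg
    if lora_ckpt is None and has_adapter:
        problems.append("remove the adapter section when checkpoints.lora_ckpt is null")
    if lora_ckpt is not None:
        if not has_adapter:
            problems.append("add the adapter section when checkpoints.lora_ckpt is set")
        if not cfg.get("merge_lora", False):
            problems.append("set merge_lora: true when checkpoints.lora_ckpt is set")

    # FourOverSix checkpoints should not be loaded through Transformer Engine.
    if cfg.get("model_quant_use_transformer_engine", False):
        problems.append(
            "model_quant_use_transformer_engine must be false for FourOverSix checkpoints"
        )

    return problems


# A trimmed version of the merged-generator config shown earlier.
example = {
    "checkpoints": {
        "generator_ckpt": "checkpoints/longlive2_5b_nvfp4_2step/path/to/generator.pt",
        "lora_ckpt": None,
    },
    "merge_lora": False,
    "inference": {"sampling_steps": 2},
    "model_quant": True,
    "model_quant_use_transformer_engine": False,
}
print(check_nvfp4_config(example))  # prints []
```

Returning a list of messages rather than raising on the first problem lets all mismatches be reported in one pass.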
+## Prompt Folder
+`data.data_path` can be either:
+- a `.txt` file, where each line is one single-shot prompt; or
+- a directory of multi-shot prompt folders.
+Example multi-shot prompt folder:
+```text
+inference_prompts/
+  robot_lab_demo/
+    0.json
+    1.json
+    2.json
+    shot_durations.txt
+```
+Each JSON file contains:
+```json
+{
+  "caption": "A compact silver robot with one blue optic explores a clean robotics lab."
+}
+```
+`shot_durations.txt` is optional. If provided, each number is the number of
+temporal chunks assigned to the corresponding caption, for example:
+```text
+2 2 4
+```
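The folder layout described above can also be generated programmatically. This is a minimal sketch, assuming one entry in `shot_durations.txt` per caption and a single space-separated line as in the example. `write_multi_shot_prompt` is a hypothetical helper, not a function from the release code.

```python
import json
from pathlib import Path


def write_multi_shot_prompt(root, name, captions, shot_durations=None):
    """Create <root>/<name>/ with one JSON file per shot and optional shot_durations.txt."""
    if shot_durations is not None and len(shot_durations) != len(captions):
        raise ValueError("shot_durations needs one entry per caption")

    folder = Path(root) / name
    folder.mkdir(parents=True, exist_ok=True)

    # Shots are numbered from 0, matching 0.json, 1.json, 2.json in the layout above.
    for index, caption in enumerate(captions):
        with open(folder / f"{index}.json", "w", encoding="utf-8") as f:
            json.dump({"caption": caption}, f, ensure_ascii=False, indent=2)

    # One chunk count per caption, space-separated on a single line.
    if shot_durations is not None:
        line = " ".join(str(int(n)) for n in shot_durations)
        (folder / "shot_durations.txt").write_text(line + "\n", encoding="utf-8")

    return folder
```

Calling it with `root="inference_prompts"`, `name="robot_lab_demo"`, three captions, and `shot_durations=[2, 2, 4]` would reproduce the example layout; `data.data_path` would then point at `inference_prompts`.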
+## Run
+Single node, 4 GPUs:
+```bash
+torchrun --standalone --nnodes=1 --nproc_per_node=4 inference.py \
+  --config_path configs/nvfp4/inference_nvfp4.yaml
+```
+Single GPU:
+```bash
+python inference.py --config_path configs/nvfp4/inference_nvfp4.yaml
+```
+Or use the helper script, which reads `NUM_GPUS` / `num_gpus` when provided:
+```bash
+scripts/inference_nvfp4.sh configs/nvfp4/inference_nvfp4.yaml
+```
+Outputs are written to `output_folder`.
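The choice between the single-GPU and multi-GPU commands above can be sketched as a small launcher. This is an illustration only and not the logic of `scripts/inference_nvfp4.sh`. `build_inference_command` is a hypothetical helper, and defaulting `NUM_GPUS` to 1 is an assumption.

```python
import os
import shlex


def build_inference_command(config_path, num_gpus=1):
    """Return the argv list for single-GPU or single-node multi-GPU inference."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    script_args = ["inference.py", "--config_path", config_path]
    if num_gpus == 1:
        return ["python", *script_args]
    return [
        "torchrun",
        "--standalone",
        "--nnodes=1",
        f"--nproc_per_node={num_gpus}",
        *script_args,
    ]


# NUM_GPUS mirrors the variable name mentioned above; the default of 1 is an assumption.
num_gpus = int(os.environ.get("NUM_GPUS", "1"))
command = build_inference_command("configs/nvfp4/inference_nvfp4.yaml", num_gpus)
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root to start the run.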
+## Notes
+- This model card is for the **2-step** NVFP4 checkpoint. Use
+  `inference.sampling_steps: 2`.
+- `model_quant` enables NVFP4 generator inference.
+- `inference.kv_quant` enables FP4 KV-cache storage and requires the
+  `utils/kernel` extension.
+- `inference.multi_shot_sink` enables the multi-shot attention sink.
+- `inference.multi_shot_rope_offset` controls the multi-shot RoPE offset.
+- `inference.streaming_vae`, `inference.async_vae`, `inference.vae_type`, and
+  `inference.vae_device` control streaming or asynchronous VAE decode.
+## License/Terms of Use
+GOVERNING TERMS: This trial service is governed by the [NVIDIA API Trial Terms of Service](https://assets.ngc.nvidia.com/products/api-catalog/legal/NVIDIA%20API%20Trial%20Terms%20of%20Service.pdf). Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
+## Citation
+```bibtex
+@article{longlive_2,
+  title={LongLive2.0: An NVFP4 Parallel Infrastructure for Long Video Generation},
+  author={Chen, Yukang and Wang, Luozhou and Huang, Wei and Yang, Shuai and Zhang, Bohan and Xiao, Yicheng and Chu, Ruihang and Mao, Weian and Hu, Qixin and Liu, Shaoteng and Zhao, Yuyang and Mao, Huizi and Chen, Ying-Cong and Xie, Enze and Qi, Xiaojuan and Han, Song},
+  journal={arXiv preprint arXiv},
+  year={2026}
+}
+```