---
license: apache-2.0
pipeline_tag: text-to-video
tags:
- text-to-video
- video-generation
- diffusion
- long-video
- longlive2
- wan2.2
- nvfp4
- 2-step
---
# LongLive2.0 5B NVFP4 2-Step Checkpoint
This repository hosts the LongLive2.0 5B NVFP4 2-step checkpoint for inference
with the LongLive2.0 release code:
https://github.com/wileewang/LongLive2.0

LongLive2.0 inference loads the Wan2.2-TI2V-5B generator, applies the
few-step DMD adapter when a separate LoRA checkpoint is provided, and runs the
generator with NVFP4 weight quantization plus optional FP4 KV-cache
quantization.
## Installation
The NVFP4 path uses a stricter environment than the default BF16 release path.
We recommend keeping it in a separate conda environment.
```bash
git clone https://github.com/wileewang/LongLive2.0.git
cd LongLive2.0
conda create -n longlive2_nvfp4 python=3.12 -y
conda activate longlive2_nvfp4
pip install -r requirements.txt
pip install --upgrade --index-url https://download.pytorch.org/whl/cu128 \
  torch==2.10.0 torchvision==0.25.0
```
Build the NVFP4 / FP4 extensions:
```bash
cd fouroversix
pip install ninja packaging psutil "setuptools>=77.0.3"
# B200 / GB200 / GB300
export CUDA_ARCHS=100
# RTX 50/60 series, if needed
# export CUDA_ARCHS=120
pip install --no-build-isolation -e .
cd ..
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v2.8.3
pip install -U pip setuptools wheel ninja packaging
pip install --no-build-isolation -e .
cd ..
cd utils/kernel
python setup.py build_ext --inplace
cd ../..
```
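The build above picks the target architecture through `CUDA_ARCHS`. As a rough sketch, that value could be derived from the GPU's compute capability instead of being set by hand. The helper name `cuda_archs_for` is hypothetical and not part of the release code, and the mapping only covers the two cases named in the build comments (major version 10 to `100`, major version 12 to `120`).

```python
def cuda_archs_for(capability):
    """Map a (major, minor) compute capability to a CUDA_ARCHS value.

    Hypothetical helper: only the two architectures mentioned in the
    build instructions are covered.
    """
    major, _minor = capability
    mapping = {
        10: "100",  # data-center Blackwell (the B200 / GB200 / GB300 case above)
        12: "120",  # consumer Blackwell (the RTX case above)
    }
    if major not in mapping:
        raise ValueError(
            f"compute capability {capability} is not covered by this sketch"
        )
    return mapping[major]
```

With PyTorch installed, `cuda_archs_for(torch.cuda.get_device_capability(0))` would give the value to export before running `pip install --no-build-isolation -e .` in `fouroversix`.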
Quick environment check:
```bash
python -c "import torch, torchvision; print(torch.__version__, torch.version.cuda); print(torchvision.__version__)"
python -c "import flash_attn; print(flash_attn.__version__)"
python -c "import fouroversix; from utils.quant import LongLiveQuantizationConfig, quantize_to_fp4"
python -c "from utils.kernel.kv_dequant import dequantize_kv_cache_fp4"
```
The released LongLive2.0 checkpoint is sufficient for standard inference. You
only need to download the original Wan2.2-TI2V-5B components if you want to run
training, initialize from the original Wan weights, or use code paths that
explicitly load the base Wan model files:
```bash
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B \
  --local-dir wan_models/Wan2.2-TI2V-5B
```
Download this checkpoint repository:
```bash
huggingface-cli download Perflow-Shuai/LongLive-2.0-5B-NVFP4-2Step \
  --local-dir checkpoints/longlive2_5b_nvfp4_2step
```
## Configure Inference
Edit `configs/nvfp4/inference_nvfp4.yaml`.
For the released 2-step NVFP4 checkpoint, keep
`inference.sampling_steps: 2`:
```yaml
checkpoints:
  generator_ckpt: checkpoints/longlive2_5b_nvfp4_2step/path/to/generator.pt
  lora_ckpt: null
  merge_lora: false
data:
  data_path: /path/to/inference_prompts
  image_or_video_shape:
    - 1
    - 384
    - 48
    - 44
    - 80
  output_folder: videos/longlive2_nvfp4_2step
  num_samples: 1
  num_output_frames: 384
inference:
  sampling_steps: 2
  sink_size: 8
  guidance_scale: 1.0
  multi_shot_sink: true
  multi_shot_rope_offset: 8
  kv_quant: true
  kv_quant_scale_rule: mse
  kv_quant_backend: cuda
  streaming_vae: false
  async_vae: false
  vae_type: wan
  model_quant: true
  model_quant_use_transformer_engine: false
  model_quant_scale_rule: mse
  model_quant_activation_scale_rule: mse
  model_quant_weight_scale_rule: mse
  model_quant_gradient_scale_rule: mse
```
Replace the placeholder `path/to/generator.pt` above with the actual generator
file in this repository.

If this repository contains a separate DMD LoRA checkpoint instead of a merged
generator, set `checkpoints.lora_ckpt` to that LoRA file, set
`merge_lora: true`, and add the LoRA adapter config:
```yaml
adapter:
  type: lora
  rank: 128
  alpha: 128
  dropout: 0.0
  dtype: bfloat16
  apply_to_critic: true
  verbose: true
```
If `checkpoints.lora_ckpt` is `null`, remove the `adapter` section.
Do not set `model_quant_use_transformer_engine: true` when loading a FourOverSix
materialized NVFP4 checkpoint. FourOverSix checkpoints store
`quantized_weight_*` buffers and should be loaded through the FourOverSix path.
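The configuration rules in this section can be checked before launching a run. The following is a minimal sketch, not part of the release code: the function names `find_key` and `check_nvfp4_config` are hypothetical, keys are looked up recursively so the check does not depend on the exact nesting of the YAML file, and the Transformer Engine check assumes a FourOverSix materialized checkpoint is being loaded.

```python
def find_key(node, key):
    """Return (found, value) for the first occurrence of `key` in nested dicts."""
    if isinstance(node, dict):
        if key in node:
            return True, node[key]
        for child in node.values():
            found, value = find_key(child, key)
            if found:
                return True, value
    return False, None


def check_nvfp4_config(cfg):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []

    found, steps = find_key(cfg, "sampling_steps")
    if not found or steps != 2:
        problems.append("sampling_steps should be 2 for the 2-step checkpoint")

    # Assumption: the checkpoint is a FourOverSix materialized NVFP4 checkpoint.
    _, use_te = find_key(cfg, "model_quant_use_transformer_engine")
    if use_te is True:
        problems.append("model_quant_use_transformer_engine should not be true")

    _, lora_ckpt = find_key(cfg, "lora_ckpt")
    _, merge_lora = find_key(cfg, "merge_lora")
    has_adapter, _ = find_key(cfg, "adapter")

    if lora_ckpt is None:
        if has_adapter:
            problems.append("remove the adapter section when lora_ckpt is null")
    else:
        if merge_lora is not True:
            problems.append("set merge_lora: true when lora_ckpt is provided")
        if not has_adapter:
            problems.append("add the adapter section when lora_ckpt is provided")

    return problems
```

Assuming PyYAML is installed, the dictionary could come from `yaml.safe_load` applied to `configs/nvfp4/inference_nvfp4.yaml`; printing the returned list then shows anything that contradicts the guidance above.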
## Prompt Folder
`data.data_path` can be either:
- a `.txt` file, where each line is one single-shot prompt; or
- a directory of multi-shot prompt folders.
Example multi-shot prompt folder:
```text
inference_prompts/
  robot_lab_demo/
    0.json
    1.json
    2.json
    shot_durations.txt
```
Each JSON file contains:
```json
{
  "caption": "A compact silver robot with one blue optic explores a clean robotics lab."
}
```
`shot_durations.txt` is optional. If provided, each number is the number of
temporal chunks assigned to the corresponding caption, for example:
```text
2 2 4
```
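A multi-shot prompt folder in the layout described above can be generated programmatically. The sketch below is illustrative only: `write_multi_shot_prompt` and `read_multi_shot_prompt` are hypothetical helpers, and it assumes that shots are ordered by their numeric file names and that `shot_durations.txt` holds whitespace-separated integers on one line, as in the example. How the release code actually parses these files is not shown here.

```python
import json
from pathlib import Path


def write_multi_shot_prompt(folder, captions, shot_durations=None):
    """Write 0.json, 1.json, ... and an optional shot_durations.txt."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for index, caption in enumerate(captions):
        payload = json.dumps({"caption": caption}, indent=2)
        (folder / f"{index}.json").write_text(payload + "\n", encoding="utf-8")
    if shot_durations is not None:
        if len(shot_durations) != len(captions):
            raise ValueError("need exactly one duration per caption")
        line = " ".join(str(n) for n in shot_durations)
        (folder / "shot_durations.txt").write_text(line + "\n", encoding="utf-8")
    return folder


def read_multi_shot_prompt(folder):
    """Read captions in numeric order, plus durations if the file exists."""
    folder = Path(folder)
    shot_files = sorted(folder.glob("*.json"), key=lambda p: int(p.stem))
    captions = [
        json.loads(p.read_text(encoding="utf-8"))["caption"] for p in shot_files
    ]
    durations = None
    durations_file = folder / "shot_durations.txt"
    if durations_file.exists():
        text = durations_file.read_text(encoding="utf-8")
        durations = [int(token) for token in text.split()]
    return captions, durations
```

For example, `write_multi_shot_prompt("inference_prompts/robot_lab_demo", captions, [2, 2, 4])` with three captions would produce the folder shown earlier, and `data.data_path` would then point at `inference_prompts`.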
## Run
Single node, 4 GPUs:
```bash
torchrun --standalone --nnodes=1 --nproc_per_node=4 inference.py \
  --config_path configs/nvfp4/inference_nvfp4.yaml
```
Single GPU:
```bash
python inference.py --config_path configs/nvfp4/inference_nvfp4.yaml
```
Or use the helper script, which reads `NUM_GPUS` / `num_gpus` when provided:
```bash
scripts/inference_nvfp4.sh configs/nvfp4/inference_nvfp4.yaml
```
Outputs are written to `output_folder`.
## Notes
- This model card is for the **2-step** NVFP4 checkpoint. Use
`inference.sampling_steps: 2`.
- `model_quant` enables NVFP4 generator inference.
- `inference.kv_quant` enables FP4 KV-cache storage and requires the
`utils/kernel` extension.
- `inference.multi_shot_sink` enables the multi-shot attention sink.
- `inference.multi_shot_rope_offset` controls the multi-shot RoPE offset.
- `inference.streaming_vae`, `inference.async_vae`, `inference.vae_type`, and
`inference.vae_device` control streaming or asynchronous VAE decode.