---
license: apache-2.0
pipeline_tag: text-to-video
tags:
  - text-to-video
  - video-generation
  - diffusion
  - long-video
  - longlive2
  - wan2.2
  - nvfp4
  - 2-step
---

# LongLive2.0 5B NVFP4 2-Step Checkpoint

This repository hosts the LongLive2.0 5B NVFP4 2-step checkpoint for inference
with the LongLive2.0 release code:

https://github.com/wileewang/LongLive2.0

LongLive2.0 inference loads the Wan2.2-TI2V-5B generator, applies the
few-step DMD adapter when a separate LoRA checkpoint is provided, and runs the
generator with NVFP4 weight quantization plus optional FP4 KV-cache
quantization.

## Installation

The NVFP4 path has stricter requirements than the default BF16 release path: it
pins Python 3.12 and PyTorch 2.10.0 (CUDA 12.8 wheels) and needs several
compiled extensions. We recommend keeping it in a separate conda environment.

```bash
git clone https://github.com/wileewang/LongLive2.0.git
cd LongLive2.0

conda create -n longlive2_nvfp4 python=3.12 -y
conda activate longlive2_nvfp4

pip install -r requirements.txt
pip install --upgrade --index-url https://download.pytorch.org/whl/cu128 \
  torch==2.10.0 torchvision==0.25.0
```

Build the NVFP4 / FP4 extensions:

```bash
cd fouroversix
pip install ninja packaging psutil "setuptools>=77.0.3"

# B200 / GB200 / GB300
export CUDA_ARCHS=100

# RTX 50 series, if needed
# export CUDA_ARCHS=120

pip install --no-build-isolation -e .
cd ..

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v2.8.3
pip install -U pip setuptools wheel ninja packaging
pip install --no-build-isolation -e .
cd ..

cd utils/kernel
python setup.py build_ext --inplace
cd ../..
```
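
If you are unsure which `CUDA_ARCHS` value applies to your machine, the choice above can be sketched as a lookup on the GPU's compute capability. This is a minimal sketch, not part of the release code: `suggest_cuda_archs` and `detect_and_suggest` are hypothetical helper names, and keying the lookup on the major compute capability (10 for B200-class parts, 12 for RTX 50-class parts) is an assumption that covers only the two values listed in the build steps.

```python
from typing import Optional

# Hypothetical lookup: major compute capability -> CUDA_ARCHS value.
# Only the two values named in the build steps are covered (assumption).
KNOWN_ARCHS = {
    10: "100",  # data-center Blackwell, e.g. B200 (compute capability 10.0)
    12: "120",  # consumer Blackwell, e.g. RTX 50 series (compute capability 12.0)
}


def suggest_cuda_archs(major: int, minor: int) -> Optional[str]:
    """Return a CUDA_ARCHS string for a compute capability, or None if unmapped."""
    # `minor` is accepted for clarity but ignored by this simplified lookup.
    return KNOWN_ARCHS.get(major)


def detect_and_suggest() -> Optional[str]:
    """Query GPU 0 through PyTorch; return None if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    return suggest_cuda_archs(major, minor)


if __name__ == "__main__":
    arch = detect_and_suggest()
    if arch:
        print(f"export CUDA_ARCHS={arch}")
    else:
        print("No mapped GPU detected; set CUDA_ARCHS manually.")
```

Run it after installing PyTorch and before building `fouroversix`. Treat the printed value as a suggestion and check it against your hardware documentation.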

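Quick environment check (run from the repository root so that the repo-local
`utils` package resolves; each command should finish without an import error):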

```bash
python -c "import torch, torchvision; print(torch.__version__, torch.version.cuda); print(torchvision.__version__)"
python -c "import flash_attn; print(flash_attn.__version__)"
python -c "import fouroversix; from utils.quant import LongLiveQuantizationConfig, quantize_to_fp4"
python -c "from utils.kernel.kv_dequant import dequantize_kv_cache_fp4"
```
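
The same check can be gathered into one script that reports every missing module at once instead of stopping at the first failure. This is a minimal sketch: `missing_modules` is a hypothetical helper, and the module list is copied from the commands above. It only locates modules with `importlib.util.find_spec`, so a compiled extension that is present but fails at load time would not be caught; the `python -c` commands remain the stronger test.

```python
import importlib.util

# Module names taken from the check commands; run from the LongLive2.0 repo
# root so that the repo-local `utils` package is on the import path.
REQUIRED_MODULES = [
    "torch",
    "torchvision",
    "flash_attn",
    "fouroversix",
    "utils.quant",
    "utils.kernel.kv_dequant",
]


def missing_modules(names):
    """Return the subset of `names` that cannot be located on the import path."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is absent or malformed.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All modules found.")
```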

The released LongLive2.0 checkpoint is sufficient for standard inference. You
only need to download the original Wan2.2-TI2V-5B components if you want to run
training, initialize from the original Wan weights, or use code paths that
explicitly load the base Wan model files:

```bash
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B \
  --local-dir wan_models/Wan2.2-TI2V-5B
```

Download this checkpoint repository:

```bash
huggingface-cli download Perflow-Shuai/LongLive-2.0-5B-NVFP4-2Step \
  --local-dir checkpoints/longlive2_5b_nvfp4_2step
```
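
The inference config needs the path of the actual generator file inside the downloaded directory. One way to find it is to list the checkpoint-like files after the download completes. This is a minimal sketch: `find_checkpoint_files` is a hypothetical helper, and the set of file extensions is an assumption, since this card does not state the checkpoint filename.

```python
from pathlib import Path

# Assumed weight-file extensions; the actual file in the repository may differ.
CHECKPOINT_SUFFIXES = {".pt", ".pth", ".safetensors"}


def find_checkpoint_files(local_dir):
    """Return checkpoint-like files under `local_dir` as sorted relative paths."""
    root = Path(local_dir)
    if not root.is_dir():
        return []
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.suffix in CHECKPOINT_SUFFIXES
    )


if __name__ == "__main__":
    for rel_path in find_checkpoint_files("checkpoints/longlive2_5b_nvfp4_2step"):
        print(rel_path)
```

Each printed path is relative to the download directory, so it can be appended to `checkpoints/longlive2_5b_nvfp4_2step/` when filling in `checkpoints.generator_ckpt`.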

## Configure Inference

Edit `configs/nvfp4/inference_nvfp4.yaml`.

For the released 2-step NVFP4 checkpoint, keep
`inference.sampling_steps: 2`:

```yaml
checkpoints:
  generator_ckpt: checkpoints/longlive2_5b_nvfp4_2step/path/to/generator.pt
  lora_ckpt: null

merge_lora: false

data:
  data_path: /path/to/inference_prompts
  image_or_video_shape:
  - 1
  - 384
  - 48
  - 44
  - 80

output_folder: videos/longlive2_nvfp4_2step
num_samples: 1
num_output_frames: 384

inference:
  sampling_steps: 2
  sink_size: 8
  guidance_scale: 1.0
  multi_shot_sink: true
  multi_shot_rope_offset: 8
  kv_quant: true
  kv_quant_scale_rule: mse
  kv_quant_backend: cuda
  streaming_vae: false
  async_vae: false
  vae_type: wan

model_quant: true
model_quant_use_transformer_engine: false
model_quant_scale_rule: mse
model_quant_activation_scale_rule: mse
model_quant_weight_scale_rule: mse
model_quant_gradient_scale_rule: mse
```
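
To relate `data.image_or_video_shape` to an output resolution, the arithmetic can be sketched as below. Two things here are assumptions rather than statements from this card: that the shape is laid out as `[batch, latent_frames, channels, latent_height, latent_width]`, and that the Wan2.2-TI2V-5B VAE downsamples each spatial side by a factor of 16. `describe_latent_shape` is a hypothetical helper.

```python
# Assumed spatial downsampling factor of the Wan2.2-TI2V-5B VAE.
SPATIAL_FACTOR = 16


def describe_latent_shape(shape, spatial_factor=SPATIAL_FACTOR):
    """Split an assumed [B, T, C, H, W] latent shape and derive the pixel size."""
    batch, frames, channels, lat_h, lat_w = shape
    return {
        "batch": batch,
        "latent_frames": frames,
        "channels": channels,
        "pixel_height": lat_h * spatial_factor,
        "pixel_width": lat_w * spatial_factor,
    }


info = describe_latent_shape([1, 384, 48, 44, 80])
print(info["pixel_width"], "x", info["pixel_height"])  # 1280 x 704
```

Under these assumptions the example config corresponds to 1280x704 output, and the second entry (384) matches `num_output_frames` in the same config.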

Replace the checkpoint filename above with the actual file in this repository.
If this repository contains a separate DMD LoRA checkpoint instead of a merged
generator, set `checkpoints.lora_ckpt` to that LoRA file and set
`merge_lora: true`, then add the LoRA adapter config:

```yaml
adapter:
  type: lora
  rank: 128
  alpha: 128
  dropout: 0.0
  dtype: bfloat16
  apply_to_critic: true
  verbose: true
```

If `checkpoints.lora_ckpt` is `null`, remove the `adapter` section.

Do not set `model_quant_use_transformer_engine: true` when loading a FourOverSix
materialized NVFP4 checkpoint. FourOverSix checkpoints store
`quantized_weight_*` buffers and should be loaded through the FourOverSix path.
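
The rules in this section can be collected into a small pre-flight check that runs on the parsed config before launching inference. This is a minimal sketch, not part of the release code: `check_inference_config` is a hypothetical helper that takes the config as a plain dictionary (for example, the result of parsing the YAML file with a loader of your choice) and only encodes the constraints stated above.

```python
def check_inference_config(cfg):
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    if cfg.get("inference", {}).get("sampling_steps") != 2:
        problems.append("inference.sampling_steps should be 2 for the 2-step checkpoint")

    lora_ckpt = cfg.get("checkpoints", {}).get("lora_ckpt")
    has_adapter = "adapter" in cfg
    if lora_ckpt is None and has_adapter:
        problems.append("remove the adapter section when checkpoints.lora_ckpt is null")
    if lora_ckpt is not None and not has_adapter:
        problems.append("add an adapter section when checkpoints.lora_ckpt is set")
    if lora_ckpt is not None and not cfg.get("merge_lora", False):
        problems.append("set merge_lora: true when checkpoints.lora_ckpt is set")

    if cfg.get("model_quant_use_transformer_engine", False):
        problems.append(
            "model_quant_use_transformer_engine must stay false for FourOverSix checkpoints"
        )
    return problems


# Trimmed-down version of the example config, as a dictionary.
example_cfg = {
    "checkpoints": {"generator_ckpt": "path/to/generator.pt", "lora_ckpt": None},
    "merge_lora": False,
    "inference": {"sampling_steps": 2},
    "model_quant": True,
    "model_quant_use_transformer_engine": False,
}
print(check_inference_config(example_cfg))  # []
```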

## Prompt Folder

`data.data_path` can be either:

- a `.txt` file, where each line is one single-shot prompt; or
- a directory of multi-shot prompt folders.

Example multi-shot prompt folder:

```text
inference_prompts/
  robot_lab_demo/
    0.json
    1.json
    2.json
    shot_durations.txt
```

Each JSON file holds the prompt for one shot in a `caption` field:

```json
{
  "caption": "A compact silver robot with one blue optic explores a clean robotics lab."
}
```

`shot_durations.txt` is optional. If provided, it lists one number per caption:
the number of temporal chunks assigned to that caption. For the three captions
in the folder above, for example:

```text
2 2 4
```
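
A multi-shot prompt folder with this layout can be generated from a list of captions. This is a minimal sketch, not part of the release code: `write_prompt_folder` is a hypothetical helper. The layout (`0.json`, `1.json`, ..., each with a `caption` field, plus an optional `shot_durations.txt`) follows the example above; writing the durations as space-separated integers on one line is an assumption based on the `2 2 4` sample.

```python
import json
from pathlib import Path


def write_prompt_folder(root, name, captions, durations=None):
    """Create root/name with 0.json, 1.json, ... and an optional shot_durations.txt."""
    if durations is not None and len(durations) != len(captions):
        raise ValueError("durations must have one entry per caption")
    folder = Path(root) / name
    folder.mkdir(parents=True, exist_ok=True)
    for index, caption in enumerate(captions):
        # One JSON file per shot, numbered in shot order.
        (folder / f"{index}.json").write_text(
            json.dumps({"caption": caption}, indent=2) + "\n", encoding="utf-8"
        )
    if durations is not None:
        # Space-separated integers on one line, mirroring the "2 2 4" sample.
        (folder / "shot_durations.txt").write_text(
            " ".join(str(int(d)) for d in durations) + "\n", encoding="utf-8"
        )
    return folder
```

For example, `write_prompt_folder("inference_prompts", "robot_lab_demo", captions, [2, 2, 4])` with three captions of your own produces the folder shown above; point `data.data_path` at `inference_prompts`.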

## Run

Single node, 4 GPUs:

```bash
torchrun --standalone --nnodes=1 --nproc_per_node=4 inference.py \
  --config_path configs/nvfp4/inference_nvfp4.yaml
```

Single GPU:

```bash
python inference.py --config_path configs/nvfp4/inference_nvfp4.yaml
```

Or use the helper script, which reads `NUM_GPUS` / `num_gpus` when provided:

```bash
scripts/inference_nvfp4.sh configs/nvfp4/inference_nvfp4.yaml
```
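
When inference is launched from another Python program, the choice between the two commands above can be sketched as a function of the GPU count. This is a minimal sketch: `build_launch_command` is a hypothetical helper that only reproduces the single-node commands shown in this section and does not describe what the helper script does internally.

```python
import shlex


def build_launch_command(num_gpus, config_path="configs/nvfp4/inference_nvfp4.yaml"):
    """Return the argument list for single-GPU or single-node multi-GPU inference."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus == 1:
        return ["python", "inference.py", "--config_path", config_path]
    return [
        "torchrun",
        "--standalone",
        "--nnodes=1",
        f"--nproc_per_node={num_gpus}",
        "inference.py",
        "--config_path",
        config_path,
    ]


# Print the command as a shell-ready string.
print(shlex.join(build_launch_command(4)))
```

The returned list can be passed directly to `subprocess.run` from the repository root.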

Outputs are written to the directory set by `output_folder`
(`videos/longlive2_nvfp4_2step` in the example config above).

## Notes

- This model card is for the **2-step** NVFP4 checkpoint. Use
  `inference.sampling_steps: 2`.
- `model_quant` enables NVFP4 generator inference.
- `inference.kv_quant` enables FP4 KV-cache storage and requires the
  `utils/kernel` extension.
- `inference.multi_shot_sink` enables the multi-shot attention sink.
- `inference.multi_shot_rope_offset` controls the multi-shot RoPE offset.
- `inference.streaming_vae`, `inference.async_vae`, `inference.vae_type`, and
  `inference.vae_device` control streaming or asynchronous VAE decode. The
  example config above does not set `inference.vae_device`.