	---
	license: apache-2.0
	pipeline_tag: text-to-video
	tags:
	- text-to-video
	- video-generation
	- diffusion
	- long-video
	- longlive2
	- wan2.2
	- nvfp4
	- 2-step
	---

	# LongLive2.0 5B NVFP4 2-Step Checkpoint

	This repository hosts the LongLive2.0 5B NVFP4 2-step checkpoint for inference
	with the LongLive2.0 release code:

	https://github.com/wileewang/LongLive2.0

	LongLive2.0 inference loads the Wan2.2-TI2V-5B generator, applies the
	few-step DMD adapter when a separate LoRA checkpoint is provided, and runs the
	generator with NVFP4 weight quantization plus optional FP4 KV-cache
	quantization.

	## Installation

	The NVFP4 path needs a stricter environment than the default BF16 release path:
	a pinned PyTorch build and several compiled extensions. We recommend keeping it
	in a separate conda environment.

	```bash
	git clone https://github.com/wileewang/LongLive2.0.git
	cd LongLive2.0

	conda create -n longlive2_nvfp4 python=3.12 -y
	conda activate longlive2_nvfp4

	pip install -r requirements.txt
	pip install --upgrade --index-url https://download.pytorch.org/whl/cu128 \
	    torch==2.10.0 torchvision==0.25.0
	```

	Build the NVFP4 / FP4 extensions:

	```bash
	cd fouroversix
	pip install ninja packaging psutil "setuptools>=77.0.3"

	# B200 / GB200 / GB300
	export CUDA_ARCHS=100

	# RTX 50/60 series, if needed
	# export CUDA_ARCHS=120

	pip install --no-build-isolation -e .
	cd ..

	git clone https://github.com/Dao-AILab/flash-attention.git
	cd flash-attention
	git checkout v2.8.3
	pip install -U pip setuptools wheel ninja packaging
	pip install --no-build-isolation -e .
	cd ..

	cd utils/kernel
	python setup.py build_ext --inplace
	cd ../..
	```

	Quick environment check:

	```bash
	python -c "import torch, torchvision; print(torch.__version__, torch.version.cuda); print(torchvision.__version__)"
	python -c "import flash_attn; print(flash_attn.__version__)"
	python -c "import fouroversix; from utils.quant import LongLiveQuantizationConfig, quantize_to_fp4"
	python -c "from utils.kernel.kv_dequant import dequantize_kv_cache_fp4"
	```
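
The import checks above can also be folded into one script that reports every missing module at once instead of stopping at the first failure. The following is a minimal sketch, not part of the release code: `check_modules` is a hypothetical helper, and it relies on `importlib.util.find_spec`, which only confirms that a module can be located (for dotted names it imports the parent package), not that a compiled extension actually loads. Run it from the repository root so that `utils` resolves.

```python
import importlib.util

# Module names taken from the import checks above.
REQUIRED_MODULES = [
    "torch",
    "torchvision",
    "flash_attn",
    "fouroversix",
    "utils.quant",
    "utils.kernel.kv_dequant",
]


def check_modules(names):
    """Return {module_name: bool} telling whether each module can be located."""
    status = {}
    for name in names:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises when a parent package is missing or malformed.
            status[name] = False
    return status


if __name__ == "__main__":
    for name, ok in check_modules(REQUIRED_MODULES).items():
        print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```

A module reported as found can still fail at import time, for example when an extension was built for a different CUDA architecture, so the `python -c` checks remain the stronger test.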

	The released LongLive2.0 checkpoint is sufficient for standard inference. You
	only need to download the original Wan2.2-TI2V-5B components if you want to run
	training, initialize from the original Wan weights, or use code paths that
	explicitly load the base Wan model files:

	```bash
	huggingface-cli download Wan-AI/Wan2.2-TI2V-5B \
	    --local-dir wan_models/Wan2.2-TI2V-5B
	```

	Download this checkpoint repository:

	```bash
	huggingface-cli download Perflow-Shuai/LongLive-2.0-5B-NVFP4-2Step \
	    --local-dir checkpoints/longlive2_5b_nvfp4_2step
	```

	## Configure Inference

	Edit `configs/nvfp4/inference_nvfp4.yaml`.

	For the released 2-step NVFP4 checkpoint, keep
	`inference.sampling_steps: 2`:

	```yaml
	checkpoints:
	  generator_ckpt: checkpoints/longlive2_5b_nvfp4_2step/path/to/generator.pt
	  lora_ckpt: null

	merge_lora: false

	data:
	  data_path: /path/to/inference_prompts
	  image_or_video_shape:
	    - 1
	    - 384
	    - 48
	    - 44
	    - 80

	output_folder: videos/longlive2_nvfp4_2step
	num_samples: 1
	num_output_frames: 384

	inference:
	  sampling_steps: 2
	  sink_size: 8
	  guidance_scale: 1.0
	  multi_shot_sink: true
	  multi_shot_rope_offset: 8
	  kv_quant: true
	  kv_quant_scale_rule: mse
	  kv_quant_backend: cuda
	  streaming_vae: false
	  async_vae: false
	  vae_type: wan

	model_quant: true
	model_quant_use_transformer_engine: false
	model_quant_scale_rule: mse
	model_quant_activation_scale_rule: mse
	model_quant_weight_scale_rule: mse
	model_quant_gradient_scale_rule: mse
	```

	Replace the `path/to/generator.pt` placeholder in `checkpoints.generator_ckpt`
	with the actual checkpoint file in this repository. If this repository contains
	a separate DMD LoRA checkpoint instead of a merged generator, set
	`checkpoints.lora_ckpt` to that LoRA file, set `merge_lora: true`, and add the
	LoRA adapter config:

	```yaml
	adapter:
	  type: lora
	  rank: 128
	  alpha: 128
	  dropout: 0.0
	  dtype: bfloat16
	  apply_to_critic: true
	  verbose: true
	```

	If `checkpoints.lora_ckpt` is `null`, remove the `adapter` section.

	Do not set `model_quant_use_transformer_engine: true` when loading a FourOverSix
	materialized NVFP4 checkpoint. FourOverSix checkpoints store
	`quantized_weight_*` buffers and should be loaded through the FourOverSix path.
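
The rules in this section (two sampling steps, an `adapter` section only alongside a LoRA checkpoint, no Transformer Engine path for FourOverSix checkpoints) can be checked before launching a long run. The following is a minimal sketch, not part of the release code: `check_nvfp4_config` is a hypothetical helper that takes the config as an already-parsed `dict`. The nesting follows the dotted names used in this card (`inference.sampling_steps`, `checkpoints.lora_ckpt`); treating `merge_lora`, `adapter`, and `model_quant_use_transformer_engine` as top-level keys is an assumption.

```python
def check_nvfp4_config(cfg):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = []

    inference = cfg.get("inference", {})
    if inference.get("sampling_steps") != 2:
        problems.append("inference.sampling_steps should be 2 for the 2-step checkpoint")

    lora_ckpt = cfg.get("checkpoints", {}).get("lora_ckpt")
    if lora_ckpt is None and "adapter" in cfg:
        problems.append("adapter section present but checkpoints.lora_ckpt is null")
    if lora_ckpt is not None:
        if "adapter" not in cfg:
            problems.append("checkpoints.lora_ckpt is set but the adapter section is missing")
        if not cfg.get("merge_lora", False):
            problems.append("checkpoints.lora_ckpt is set but merge_lora is not true")

    # Assumed top-level key; FourOverSix checkpoints must not use this path.
    if cfg.get("model_quant_use_transformer_engine", False):
        problems.append("model_quant_use_transformer_engine must stay false for FourOverSix checkpoints")

    return problems


example = {
    "checkpoints": {"generator_ckpt": "generator.pt", "lora_ckpt": None},
    "merge_lora": False,
    "inference": {"sampling_steps": 2},
    "model_quant": True,
    "model_quant_use_transformer_engine": False,
}
print(check_nvfp4_config(example))  # prints []
```

Returning a list rather than raising on the first issue lets all problems be fixed in one pass over the YAML file.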

	## Prompt Folder

	`data.data_path` can be either:

	- a `.txt` file, where each line is one single-shot prompt; or
	- a directory of multi-shot prompt folders.

	Example multi-shot prompt folder:

	```text
	inference_prompts/
	  robot_lab_demo/
	    0.json
	    1.json
	    2.json
	    shot_durations.txt
	```

	Each JSON file contains:

	```json
	{
	  "caption": "A compact silver robot with one blue optic explores a clean robotics lab."
	}
	```

	`shot_durations.txt` is optional. If provided, each number is the number of
	temporal chunks assigned to the corresponding caption, for example:

	```text
	2 2 4
	```
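
To sanity-check a prompt folder before inference, the layout described above can be read back with a few lines of Python. The following is a minimal sketch, not the loader used by the release code: `load_multi_shot_prompt` is a hypothetical helper, and it assumes that `shot_durations.txt` sits inside the shot folder, that its numbers are whitespace-separated integers, and that there should be exactly one number per caption. How the release code behaves when the file is absent is not stated here, so the sketch simply returns `None` for the durations in that case.

```python
import json
from pathlib import Path


def load_multi_shot_prompt(folder):
    """Read captions from 0.json, 1.json, ... and optional per-shot chunk counts."""
    folder = Path(folder)

    # Sort numerically so that 10.json follows 9.json rather than 1.json.
    shot_files = sorted(
        (p for p in folder.glob("*.json") if p.stem.isdigit()),
        key=lambda p: int(p.stem),
    )
    captions = [
        json.loads(p.read_text(encoding="utf-8"))["caption"] for p in shot_files
    ]

    durations = None  # the file is optional; None means "not provided"
    durations_file = folder / "shot_durations.txt"
    if durations_file.exists():
        durations = [int(tok) for tok in durations_file.read_text().split()]
        if len(durations) != len(captions):
            raise ValueError(
                f"{len(durations)} durations for {len(captions)} captions in {folder}"
            )

    return captions, durations
```

Calling `load_multi_shot_prompt("inference_prompts/robot_lab_demo")` on the example folder would return the three captions together with the chunk counts read from `shot_durations.txt`.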

	## Run

	Single node, 4 GPUs:

	```bash
	torchrun --standalone --nnodes=1 --nproc_per_node=4 inference.py \
	    --config_path configs/nvfp4/inference_nvfp4.yaml
	```

	Single GPU:

	```bash
	python inference.py --config_path configs/nvfp4/inference_nvfp4.yaml
	```

	Or use the helper script, which reads `NUM_GPUS` / `num_gpus` when provided:

	```bash
	scripts/inference_nvfp4.sh configs/nvfp4/inference_nvfp4.yaml
	```

	Outputs are written to `output_folder`.
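
The single-GPU and multi-GPU commands above differ only in the launcher, so a wrapper script can pick between them from a GPU count. The following is a minimal sketch under that reading: `build_launch_command` is a hypothetical helper, it covers only the single-node case shown above, and it does not reproduce whatever `scripts/inference_nvfp4.sh` does internally.

```python
import shlex


def build_launch_command(config_path, num_gpus=1):
    """Return the argv list for single-GPU or single-node multi-GPU inference."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    script_args = ["inference.py", "--config_path", config_path]
    if num_gpus == 1:
        # Plain Python launch, as in the single-GPU command above.
        return ["python", *script_args]
    return [
        "torchrun",
        "--standalone",
        "--nnodes=1",
        f"--nproc_per_node={num_gpus}",
        *script_args,
    ]


if __name__ == "__main__":
    cmd = build_launch_command("configs/nvfp4/inference_nvfp4.yaml", num_gpus=4)
    print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root.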

	## Notes

	- This model card is for the 2-step NVFP4 checkpoint. Use
	`inference.sampling_steps: 2`.
	- `model_quant` enables NVFP4 generator inference.
	- `inference.kv_quant` enables FP4 KV-cache storage and requires the
	`utils/kernel` extension.
	- `inference.multi_shot_sink` enables the multi-shot attention sink.
	- `inference.multi_shot_rope_offset` controls the multi-shot RoPE offset.
	- `inference.streaming_vae`, `inference.async_vae`, `inference.vae_type`, and
	`inference.vae_device` control streaming or asynchronous VAE decode.