How should this model, FLUX.2-klein-4b-nvfp4, be used?
I tried replacing the corresponding model file in the FLUX.2 klein 4B transformer folder with this NVFP4 model file, but it still failed to run.
Did you use a supported GPU? "NVFP4 (NVIDIA Floating Point 4) is a specialized 4-bit floating-point data format introduced with NVIDIA Blackwell GPUs"
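A quick way to check this is to read the GPU's CUDA compute capability: Blackwell parts report major version 10 (B100/B200-class) or 12 (RTX 50-series), while anything older lacks NVFP4 tensor cores. A minimal sketch, assuming the `major >= 10` threshold from NVIDIA's compute-capability table:

```python
def supports_nvfp4(major: int, minor: int) -> bool:
    """NVFP4 needs Blackwell tensor cores: compute capability 10.x
    (B100/B200-class) or 12.x (RTX 50-series)."""
    return major >= 10

def main() -> None:
    import torch  # local import so the helper above works without torch installed

    if not torch.cuda.is_available():
        print("no CUDA device visible")
        return
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    print(f"{name}: sm_{major}{minor}, NVFP4 capable: {supports_nvfp4(major, minor)}")

if __name__ == "__main__":
    main()
```

On an RTX 4090 (sm_89) this would report not capable; on a 5080 or a DGX Spark it should report capable.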
I'm running the 9B model on my DGX-Spark.
Images are gorgeous, not too slow.
I am using qwen 4b for my text encoder
It really helped me solve the problem. Please continue writing such useful, problem-solving posts.
import torch
from diffusers import Flux2KleinPipeline
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

dtype = torch.bfloat16
device = "cuda"

# Load the stock bf16 pipeline first.
pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B",
    torch_dtype=dtype,
)

# Fetch the NVFP4 checkpoint and overwrite the transformer's weights.
nvfp4_path = hf_hub_download(
    repo_id="black-forest-labs/FLUX.2-klein-4b-nvfp4",
    filename="flux-2-klein-4b-nvfp4.safetensors",
)
state_dict = load_file(nvfp4_path)
pipe.transformer.load_state_dict(state_dict, strict=False)
pipe.to(device)
This works for me, but it provides no speedup, just lower memory consumption.
Did you use a supported GPU? "NVFP4 (NVIDIA Floating Point 4) is a specialized 4-bit floating-point data format introduced with NVIDIA Blackwell GPUs"
I encountered the same error when using the FP8 version.
OK, no, this is wrong, my bad: this doesn't load the weights correctly.
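The reason it "succeeds" without loading anything is that `strict=False` silently ignores every mismatched key, so the transformer keeps its original bf16 values. You can see this by inspecting the return value of `load_state_dict`. A self-contained toy demo (the quantized key names `weight_packed` / `weight_scale` are illustrative, not FLUX.2's real ones):

```python
import torch
from torch import nn

# Toy stand-in for pipe.transformer: a bf16-style module whose parameter
# names do not match the quantized checkpoint's names.
model = nn.Linear(4, 4, bias=False)

# Toy stand-in for an NVFP4 state dict: packed 4-bit weights plus a
# per-block scale tensor (key names here are assumptions for illustration).
quantized_sd = {
    "weight_packed": torch.zeros(4, 2, dtype=torch.uint8),
    "weight_scale": torch.ones(4, 1),
}

result = model.load_state_dict(quantized_sd, strict=False)
print("missing:", result.missing_keys)        # params that kept their old values
print("unexpected:", result.unexpected_keys)  # checkpoint tensors that were dropped
```

If `missing_keys` covers essentially the whole transformer and `unexpected_keys` covers the whole checkpoint, nothing was actually swapped in; the quantized format needs a loader that understands its packed layout, not a plain `load_state_dict`.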
To anyone getting noisy/sand-grain images: you must update to PyTorch 2.9.1+cu130 (nightly) or newer and use the CUDA 13.0 runtime to get clean images and native Blackwell speed. I'm currently getting clean results on my 5080.
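A quick way to check whether your install meets that minimum is to parse `torch.__version__` and `torch.version.cuda`. A minimal sketch, assuming the thresholds above (torch >= 2.9.1 built against CUDA >= 13):

```python
from typing import Optional

def meets_suggested_minimum(torch_version: str, cuda_runtime: Optional[str]) -> bool:
    """True if torch >= 2.9.1 and the build's CUDA runtime is >= 13."""
    # "2.9.1+cu130" -> "2.9.1"; nightly "2.10.0.dev20250101+cu130" -> "2.10.0"
    numeric = torch_version.split("+")[0].split(".dev")[0]
    parts = tuple(int(p) for p in numeric.split(".")[:3])
    cuda_ok = cuda_runtime is not None and int(cuda_runtime.split(".")[0]) >= 13
    return parts >= (2, 9, 1) and cuda_ok

if __name__ == "__main__":
    import torch  # local import so the parser above is testable without torch

    print(torch.__version__, torch.version.cuda)
    print("meets suggested minimum:", meets_suggested_minimum(torch.__version__, torch.version.cuda))
```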

