How should this model, FLUX.2-klein-4b-nvfp4, be used?

#1
by Lmonan - opened

I tried swapping the transformer model file of FLUX.2 klein 4B for this NVFP4 model file, but it still fails to run.

With the ComfyUI template workflow it only produces noisy, sand-grain images. The original model (flux-2-klein-4b.safetensors) works fine.

Did you use a supported GPU? "NVFP4 (NVIDIA Floating Point 4) is a specialized 4-bit floating-point data format introduced with NVIDIA Blackwell GPUs."

Something like that. JFK You can't even run nvfp4 without getting a huge pile of error messages on an unsupported card.
And the qwen_3_4b_fp4_mixed.safetensors CLIP text encoder file works fine too.
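For anyone who wants to rule out the GPU question quickly, here is a minimal sketch that reads the compute capability PyTorch reports. It assumes that Blackwell-class cards report a major compute capability of 10 or higher (Hopper reports 9.0 and Ada 8.9); the helper names `looks_like_blackwell` and `check_current_gpu` are made up for this example, and the threshold is a heuristic, not an official check.

```python
def looks_like_blackwell(major: int, minor: int = 0) -> bool:
    """Heuristic: treat compute capability 10.x and above as Blackwell-class.

    Assumption: pre-Blackwell GPUs report a major version of 9 or lower.
    """
    return major >= 10


def check_current_gpu() -> str:
    """Describe the first CUDA device that PyTorch can see."""
    import torch  # imported lazily so the helper above works without PyTorch

    if not torch.cuda.is_available():
        return "No CUDA device visible to PyTorch."
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    if looks_like_blackwell(major, minor):
        verdict = "looks Blackwell-class"
    else:
        verdict = "is likely pre-Blackwell"
    return f"{name} (compute capability {major}.{minor}) {verdict}"
```

Calling `print(check_current_gpu())` in the same Python environment that ComfyUI uses shows which device and capability that environment actually sees.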

I'm running the 9B model on my DGX-Spark.
Images are gorgeous, not too slow.
I am using Qwen 4B as my text encoder.

I'm running the 9B model on my DGX-Spark.
Images are gorgeous, not too slow.

It really helped me solve the problem. Please continue to write such useful and problem-solving posts. 😘

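import torch
from diffusers import Flux2KleinPipeline
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

dtype = torch.bfloat16
device = "cuda"

# Load the base pipeline with its original bf16 weights.
pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B",
    torch_dtype=dtype,
)

# Download the NVFP4 checkpoint.
nvfp4_path = hf_hub_download(
    repo_id="black-forest-labs/FLUX.2-klein-4b-nvfp4",
    filename="flux-2-klein-4b-nvfp4.safetensors",
)

# Try to load it into the transformer. strict=False silently skips keys that
# do not match, so report how many keys were missing or unexpected.
state_dict = load_file(nvfp4_path)
result = pipe.transformer.load_state_dict(state_dict, strict=False)
print(len(result.missing_keys), "missing,", len(result.unexpected_keys), "unexpected")

pipe.to(device)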

This works for me, but it provides no speedup, just lower memory consumption.

Did you use a supported GPU? "NVFP4 (NVIDIA Floating Point 4) is a specialized 4-bit floating-point data format introduced with NVIDIA Blackwell GPUs."

I encountered the same error when using the FP8 version.

OK, no, this is wrong, my bad: this doesn't load the weights correctly.
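One way to see whether a load like the snippet above actually did anything: `load_state_dict(..., strict=False)` does not raise on key names that fail to match, so it helps to compare the key names on both sides before trusting the result. Below is a minimal sketch of that comparison. The helper `compare_keys` and the key names in the demo are hypothetical and only illustrate the idea; they are not the real tensor names in either checkpoint.

```python
def compare_keys(model_keys, checkpoint_keys):
    """Split key names into matched, missing, and unexpected groups."""
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    return {
        "matched": sorted(model_keys & checkpoint_keys),
        "missing_from_checkpoint": sorted(model_keys - checkpoint_keys),
        "unexpected_in_checkpoint": sorted(checkpoint_keys - model_keys),
    }


# Hypothetical key names, for illustration only.
model_keys = ["blocks.0.attn.weight", "blocks.0.attn.bias"]
checkpoint_keys = ["blocks.0.attn.weight", "blocks.0.attn.weight_scale"]

report = compare_keys(model_keys, checkpoint_keys)
print(len(report["matched"]), "of", len(model_keys), "model tensors found in checkpoint")
```

With the real objects, the inputs would be `pipe.transformer.state_dict().keys()` and `load_file(nvfp4_path).keys()`. If almost nothing matches, the pipeline is most likely still running on the weights it was created with.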

Then how do you load NVFP4 weights on Blackwell from Python code, together with LoRA weights? @satwato

To anyone getting noisy/sand grain images: you must update to PyTorch 2.9.1+cu130 (Nightly) or newer and use CUDA 13.0 runtime to get clean images and native Blackwell speed. I'm currently getting clean results on my 5080.
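To check an environment against those version numbers, here is a small sketch that compares version strings with the minimums mentioned in this post (PyTorch 2.9.1 and CUDA 13.0). The function names `parse_version` and `meets_minimum` are made up for this example, and the thresholds are taken from the post above, not from any official compatibility table.

```python
import re


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple of ints."""
    match = re.match(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(torch_version, cuda_version, min_torch=(2, 9, 1), min_cuda=(13, 0)):
    """Check a PyTorch version and its CUDA runtime version against minimums.

    cuda_version may be None, which is what CPU-only PyTorch builds report.
    """
    if cuda_version is None:
        return False
    return (
        parse_version(torch_version) >= min_torch
        and parse_version(cuda_version) >= min_cuda
    )
```

In a live environment the inputs would come from PyTorch itself: `import torch` followed by `meets_minimum(torch.__version__, torch.version.cuda)`.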
