How should this model, FLUX.2-klein-4b-nvfp4, be used?

by Lmonan - opened Jan 21

Jan 21

I tried replacing the corresponding model file in the transformer of FLUX.2 4b with this nvfp4 model file, but it still failed to run.

EEducatio

Jan 23

It only produces noisy, sand-grain images with the ComfyUI template workflow. The original model (flux-2-klein-4b.safetensors) works fine.

HuggingJady

Jan 24

Did you use a supported GPU? "NVFP4 (NVIDIA Floating Point 4) is a specialized 4-bit floating-point data format introduced with NVIDIA Blackwell GPUs."
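
A quick way to check whether a card is Blackwell-class is to look at its CUDA compute capability. The snippet below is only a rough sketch: the threshold (`major >= 10`) is an assumption based on Blackwell parts reporting 10.x or 12.x, and the helper names are made up for illustration. It does not prove that a given ComfyUI or diffusers build actually uses NVFP4 kernels.

```python
from typing import Tuple

# Assumption: Blackwell-class GPUs report a compute capability major >= 10
# (10.x for data-center parts, 12.x for RTX 50-series and DGX Spark).
MIN_BLACKWELL_MAJOR = 10


def supports_native_nvfp4(capability: Tuple[int, int]) -> bool:
    """Return True if the (major, minor) capability looks Blackwell-class."""
    major, _minor = capability
    return major >= MIN_BLACKWELL_MAJOR


def describe_current_gpu() -> str:
    """Describe GPU 0, or explain why it cannot be inspected."""
    try:
        import torch  # imported lazily so the helper above works without it
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA device visible"
    cap = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    verdict = (
        "looks Blackwell-class"
        if supports_native_nvfp4(cap)
        else "is older than Blackwell"
    )
    return f"{name} (sm_{cap[0]}{cap[1]}) {verdict}"


if __name__ == "__main__":
    print(describe_current_gpu())
```

If this reports an older architecture, the NVFP4 file is unlikely to run natively, whatever the loader does.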

EEducatio

Jan 24

•

edited Jan 24

Something like that. You can't even run nvfp4 on an unsupported card without getting a huge pile of error messages.
And the qwen_3_4b_fp4_mixed.safetensors CLIP text encoder file works fine too.

skullaria

Jan 24

•

edited Jan 24

I'm running the 9B model on my DGX-Spark.
Images are gorgeous, not too slow.
I am using qwen 4b as my text encoder.

EEducatio

Jan 24

I'm running the 9B model on my DGX-Spark.
Images are gorgeous, not too slow.

It really helped me solve the problem. Please continue to write such useful, problem-solving posts. 😘

satwato

Jan 25

import torch
from diffusers import Flux2KleinPipeline
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

dtype = torch.bfloat16
device = "cuda"

# Load the base bf16 pipeline.
pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B",
    torch_dtype=dtype,
)

# Download the NVFP4 checkpoint from the Hub.
nvfp4_path = hf_hub_download(
    repo_id="black-forest-labs/FLUX.2-klein-4b-nvfp4",
    filename="flux-2-klein-4b-nvfp4.safetensors",
)

# Copy the checkpoint tensors into the transformer.
# Note: strict=False silently skips any keys that do not match the model.
state_dict = load_file(nvfp4_path)
pipe.transformer.load_state_dict(state_dict, strict=False)

pipe.to(device)

This works for me, but it provides no speedup, just lower memory consumption.

Lmonan

Jan 25

did you use the supported GPU? "NVFP4 (NVIDIA Floating Point 4) is a specialized 4-bit floating-point data format introduced with NVIDIA Blackwell GPUs"

I encountered the same error when using the FP8 version.

satwato

Jan 25

OK, no, this is wrong, my bad. This doesn't load the weights correctly.
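
For anyone wondering how a `strict=False` load can appear to work while loading nothing useful: PyTorch's `load_state_dict` ignores keys that don't line up and returns them as `missing_keys` and `unexpected_keys`. The sketch below does the same comparison on plain key lists so it can be run without the model. `compare_keys` is a hypothetical helper, and the key names in the usage example are invented, not taken from the real checkpoint.

```python
from typing import Dict, Iterable, List


def compare_keys(
    model_keys: Iterable[str], checkpoint_keys: Iterable[str]
) -> Dict[str, List[str]]:
    """Report which parameter names line up between a model and a checkpoint."""
    model_set, ckpt_set = set(model_keys), set(checkpoint_keys)
    return {
        # In the model but absent from the checkpoint: these keep their old values.
        "missing": sorted(model_set - ckpt_set),
        # In the checkpoint but unknown to the model: these are dropped.
        "unexpected": sorted(ckpt_set - model_set),
        # Present on both sides: only these are actually copied.
        "matched": sorted(model_set & ckpt_set),
    }


if __name__ == "__main__":
    # Invented key names, purely for illustration.
    model_keys = ["blocks.0.attn.weight", "blocks.0.attn.bias"]
    checkpoint_keys = ["blocks.0.attn.weight", "blocks.0.attn.weight_scale"]
    report = compare_keys(model_keys, checkpoint_keys)
    for kind, names in report.items():
        print(f"{kind}: {names}")
```

With the real pipeline, the inputs would be `pipe.transformer.state_dict().keys()` and `state_dict.keys()`. A long "missing" or "unexpected" list suggests the checkpoint layout does not match the bf16 transformer, so most weights were never replaced.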

kartavyabagga

Jan 29

•

edited Jan 29

Then how do you load NVFP4 weights on Blackwell from Python code, together with LoRA weights? @satwato

JahnStar

Feb 18

•

edited Feb 18

To anyone getting noisy/sand grain images: you must update to PyTorch 2.9.1+cu130 (Nightly) or newer and use CUDA 13.0 runtime to get clean images and native Blackwell speed. I'm currently getting clean results on my 5080.
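
To check an environment against the versions mentioned above, something like this might help. It is a rough sketch: the minimums (PyTorch 2.9.1, CUDA 13.0) are simply the numbers from this post, not an official requirement, and the function names are made up for illustration.

```python
import re
from typing import Optional, Tuple

# Minimums as reported in this thread; treat them as an assumption.
MIN_TORCH = (2, 9, 1)
MIN_CUDA = (13, 0)


def parse_torch_version(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from strings like '2.9.1+cu130'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized PyTorch version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def parse_cuda_version(version: Optional[str]) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from strings like '13.0'; None for CPU builds."""
    if version is None:
        return None
    major, minor = version.split(".")[:2]
    return (int(major), int(minor))


def meets_reported_minimum(torch_version: str, cuda_version: Optional[str]) -> bool:
    """True if both versions are at least the ones reported to give clean images."""
    cuda = parse_cuda_version(cuda_version)
    if cuda is None:
        return False
    return parse_torch_version(torch_version) >= MIN_TORCH and cuda >= MIN_CUDA


if __name__ == "__main__":
    try:
        import torch
        print(meets_reported_minimum(torch.__version__, torch.version.cuda))
    except ImportError:
        print("PyTorch is not installed")
```

`torch.version.cuda` is the CUDA version the PyTorch build was compiled against, and it is `None` on CPU-only builds.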
