This is the best model I've seen. It is almost real-time.

#8
by tonera - opened

This is the best model I've seen, and a perfect fit for SVDQuant. Quantization accuracy is virtually lossless. Here are the quantization results:

| Metric | Mean | Median (p50) | p90 |
|---|---|---|---|
| PSNR (dB) | 22.10 | 22.35 | 25.58 |
| SSIM | 0.876 | 0.877 | 0.933 |
| LPIPS | 0.0734 | 0.0714 | 0.114 |
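The post does not say how these metrics were computed. As a rough sketch of what the PSNR row and the summary columns mean, the snippet below uses the standard PSNR definition on flat pixel sequences and a nearest-rank p90. The function names `psnr` and `summarize` are hypothetical, and SSIM and LPIPS are left out because they need external libraries.

```python
import math
import statistics


def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixel values.
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def summarize(scores):
    """Mean, median (p50), and nearest-rank p90 of per-image scores."""
    ordered = sorted(scores)
    # Nearest-rank p90: smallest score with at least 90% of scores at or below it.
    rank = max(1, math.ceil(len(ordered) * 9 / 10))
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p90": ordered[rank - 1],
    }
```

In practice each score would come from comparing one image generated by the base model with the image generated by the quantized model from the same prompt and seed, and `summarize` would then be applied to the list of per-image scores.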

The quantized version peaks at only 15.21 GB of VRAM, and on an RTX 5090 text-to-image generation is almost real-time (under 1 s per 1024x1024 image at 8 steps).

Performance benchmarks (RTX 5090 32GB, 8 steps, guidance scale = 1.0)

Image editing (1024x1024)

| Setup | Mode | Peak VRAM | Throughput | Time to image | Throughput change vs. base | VRAM change vs. base |
|---|---|---|---|---|---|---|
| Base | CUDA | OOM | - | - | Baseline unavailable | Baseline unavailable |
| Base | MCO | 20.18 GB | 0.62 it/s | 12 s | Baseline | Baseline |
| Base | SCO | 2.55 GB | 0.48 it/s | 16 s | Baseline | Baseline |
| Base + TE | CUDA | 25.60 GB | 1.28 it/s | 6 s | N/A (base OOM) | N/A |
| Base + TE | MCO | 20.16 GB | 0.83 it/s | 9 s | +33.9% | -0.1% |
| Base + TE | SCO | 6.08 GB | 0.48 it/s | 16 s | -0.5% | +138.4% |
| Base + TR | CUDA | 24.51 GB | 3.79 it/s | 2 s | N/A (base OOM) | N/A |
| Base + TR | MCO | 17.39 GB | 1.08 it/s | 7 s | +75.0% | -13.8% |
| Base + TR | SCO | 4.35 GB | 2.70 it/s | 2 s | +461.6% | +70.6% |
| Base + TR + TE | CUDA | 14.00 GB | 3.81 it/s | 2 s | N/A (base OOM) | N/A |
| Base + TR + TE | MCO | 7.52 GB | 1.88 it/s | 4 s | +204.6% | -62.7% |
| Base + TR + TE | SCO | 7.69 GB | 2.68 it/s | 2 s | +457.4% | +201.6% |
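For readers who want to check the derived columns, here is a small sketch of the arithmetic. It assumes the "change vs. base" columns compare each row with the Base row of the same mode, and that "Time to image" is the step count divided by throughput (the posted values appear to be rounded down to whole seconds). The helper names are hypothetical.

```python
def time_to_image(steps, its_per_second):
    """Seconds needed for one image at a given number of denoising steps."""
    return steps / its_per_second


def pct_change(value, baseline):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Image editing, MCO mode: Base vs. Base + TR + TE (values from the table).
base_vram, base_speed = 20.18, 0.62
quant_vram, quant_speed = 7.52, 1.88

vram_delta = pct_change(quant_vram, base_vram)     # about -62.7 %
speed_delta = pct_change(quant_speed, base_speed)  # about +203 %
seconds = time_to_image(8, quant_speed)            # about 4.3 s
```

Recomputing from the rounded it/s figures gives roughly +203% rather than the posted +204.6%, which suggests the posted percentages were derived from unrounded measurements.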

Text-to-image (1024x1024)

| Setup | Mode | Peak VRAM | Throughput | Time to image | Throughput change vs. base | VRAM change vs. base |
|---|---|---|---|---|---|---|
| Base | CUDA | OOM | - | - | Baseline unavailable | Baseline unavailable |
| Base | MCO | 18.53 GB | 0.83 it/s | 9 s | Baseline | Baseline |
| Base | SCO | 2.55 GB | 0.62 it/s | 12 s | Baseline | Baseline |
| Base + TR + TE | CUDA | 15.21 GB | 8.91 it/s | <1 s | N/A (base OOM) | N/A |
| Base + TR + TE | MCO | 6.42 GB | 2.60 it/s | 3 s | +214.5% | -65.4% |
| Base + TR + TE | SCO | 7.72 GB | 3.00 it/s | 2 s | +383.0% | +202.7% |

  • Base = black-forest-labs/FLUX.2-klein-9b-kv
  • TE = tonera/Qwen3-text-Nunchaku
  • TR = tonera/FLUX.2-klein-9b-kv-Nunchaku/svdq-{precision}_r32-FLUX.2-klein-9b-kv-Nunchaku.safetensors
  • MCO = enable-model-cpu-offload
  • SCO = enable-sequential-cpu-offload
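The three modes in the tables could be selected roughly as follows. This is a sketch under the assumption that MCO and SCO refer to the diffusers pipeline methods `enable_model_cpu_offload()` and `enable_sequential_cpu_offload()`, and that CUDA means moving the whole pipeline to the GPU. The helper `apply_placement` is hypothetical, and loading the pipeline with the quantized TR and TE weights is not shown because the post does not describe it.

```python
def apply_placement(pipe, mode):
    """Configure device placement on a diffusers-style pipeline object.

    `pipe` is assumed to expose to(), enable_model_cpu_offload(), and
    enable_sequential_cpu_offload(), as diffusers pipelines do.
    """
    if mode == "CUDA":
        # Keep every component resident on the GPU: fastest, highest VRAM.
        pipe.to("cuda")
    elif mode == "MCO":
        # Move whole components to the GPU one at a time.
        pipe.enable_model_cpu_offload()
    elif mode == "SCO":
        # Move submodules to the GPU as they are needed: lowest VRAM.
        pipe.enable_sequential_cpu_offload()
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return pipe
```

Sequential offload usually trades speed for memory, which matches the Base rows above: SCO peaks at 2.55 GB but runs slower than MCO.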
