This is the best model I've seen. It is almost real-time.
#8
by tonera - opened
This is the best model I've seen, and a perfect fit for SVDQuant. Quantization is virtually lossless. Here are the quantization results:
| Metric | Mean | Median (p50) | p90 |
|---|---|---|---|
| PSNR | 22.10 | 22.35 | 25.58 |
| SSIM | 0.876 | 0.877 | 0.933 |
| LPIPS | 0.0734 | 0.0714 | 0.114 |
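
For anyone who wants to reproduce this kind of summary, here is a minimal sketch of how per-image scores could be reduced to the mean / p50 / p90 columns above. It is an assumption about the procedure, not the script used for this post: `psnr`, `percentile`, and `summarize` are hypothetical helper names, the percentile uses linear interpolation, and only PSNR is computed from scratch (SSIM and LPIPS need dedicated libraries).

```python
import math
import statistics


def psnr(reference, candidate, max_value=255.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same length")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have no error
    return 10.0 * math.log10(max_value ** 2 / mse)


def percentile(values, q):
    """Percentile with linear interpolation between ranks, q in [0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(scores):
    """Reduce per-image scores to the three columns used in the table."""
    return {
        "mean": statistics.fmean(scores),
        "p50": percentile(scores, 50),
        "p90": percentile(scores, 90),
    }


# Hypothetical per-image PSNR values, for illustration only.
example_scores = [20.1, 21.7, 22.4, 23.0, 25.9]
print(summarize(example_scores))
```

Note that p90 means different things per metric: for PSNR and SSIM (higher is better) it describes the best tenth of images, while for LPIPS (lower is better) it describes the worst tenth.
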
The quantized version peaks at only 15.21 GB of VRAM, and on an RTX 5090 text-to-image generation is almost real-time (under 1 s per image).
Performance benchmarks (RTX 5090 32GB, 8 steps, guidance scale = 1.0)
Image editing (1024x1024)
| Setup | Mode | Peak VRAM | Throughput | Time to image | Throughput change vs. base | VRAM change vs. base |
|---|---|---|---|---|---|---|
| Base | CUDA | OOM | - | - | Baseline unavailable | Baseline unavailable |
| Base | MCO | 20.18 GB | 0.62 it/s | 12 s | Baseline | Baseline |
| Base | SCO | 2.55 GB | 0.48 it/s | 16 s | Baseline | Baseline |
| Base + TE | CUDA | 25.60 GB | 1.28 it/s | 6 s | N/A (base OOM) | N/A |
| Base + TE | MCO | 20.16 GB | 0.83 it/s | 9 s | +33.9% | -0.1% |
| Base + TE | SCO | 6.08 GB | 0.48 it/s | 16 s | -0.5% | +138.4% |
| Base + TR | CUDA | 24.51 GB | 3.79 it/s | 2 s | N/A (base OOM) | N/A |
| Base + TR | MCO | 17.39 GB | 1.08 it/s | 7 s | +75.0% | -13.8% |
| Base + TR | SCO | 4.35 GB | 2.70 it/s | 2 s | +461.6% | +70.6% |
| Base + TR + TE | CUDA | 14.00 GB | 3.81 it/s | 2 s | N/A (base OOM) | N/A |
| Base + TR + TE | MCO | 7.52 GB | 1.88 it/s | 4 s | +204.6% | -62.7% |
| Base + TR + TE | SCO | 7.69 GB | 2.68 it/s | 2 s | +457.4% | +201.6% |
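
The "Time to image" values look consistent with dividing the 8 steps by the throughput and truncating to whole seconds. That is an inference from the numbers, not something stated in the post, and it ignores text encoding and VAE decode time. A small sketch under that assumption (the function names are hypothetical):

```python
import math

STEPS = 8  # step count from the benchmark header


def time_to_image_seconds(throughput_it_per_s, steps=STEPS):
    """Denoising time in seconds, assuming time = steps / throughput."""
    if throughput_it_per_s <= 0:
        raise ValueError("throughput must be positive")
    return steps / throughput_it_per_s


def format_whole_seconds(seconds):
    """Truncate to whole seconds; anything below one second prints as '<1 s'."""
    whole = math.floor(seconds)
    return "<1 s" if whole < 1 else f"{whole} s"


# Base with model CPU offload: 0.62 it/s
print(format_whole_seconds(time_to_image_seconds(0.62)))
# Base + TR + TE with model CPU offload: 1.88 it/s
print(format_whole_seconds(time_to_image_seconds(1.88)))
```
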
Text-to-image (1024x1024)
| Setup | Mode | Peak VRAM | Throughput | Time to image | Throughput change vs. base | VRAM change vs. base |
|---|---|---|---|---|---|---|
| Base | CUDA | OOM | - | - | Baseline unavailable | Baseline unavailable |
| Base | MCO | 18.53 GB | 0.83 it/s | 9 s | Baseline | Baseline |
| Base | SCO | 2.55 GB | 0.62 it/s | 12 s | Baseline | Baseline |
| Base + TR + TE | CUDA | 15.21 GB | 8.91 it/s | <1 s | N/A (base OOM) | N/A |
| Base + TR + TE | MCO | 6.42 GB | 2.60 it/s | 3 s | +214.5% | -65.4% |
| Base + TR + TE | SCO | 7.72 GB | 3.00 it/s | 2 s | +383.0% | +202.7% |

- Base = black-forest-labs/FLUX.2-klein-9b-kv
- TE = tonera/Qwen3-text-Nunchaku
- TR = tonera/FLUX.2-klein-9b-kv-Nunchaku/svdq-{precision}_r32-FLUX.2-klein-9b-kv-Nunchaku.safetensors
- MCO = enable-model-cpu-offload
- SCO = enable-sequential-cpu-offload
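
The two "change vs. base" columns appear to compare each setup against the Base row with the same offload mode. A minimal sketch of that calculation, assuming the usual relative-change formula (`percent_change` is a hypothetical helper, not taken from the benchmark script):

```python
def percent_change(value, baseline):
    """Relative change of value against baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline * 100.0


# Image editing, model CPU offload: Base vs. Base + TR + TE peak VRAM
print(f"{percent_change(7.52, 20.18):+.1f}%")  # -62.7%

# Image editing, sequential CPU offload: Base vs. Base + TE peak VRAM
print(f"{percent_change(6.08, 2.55):+.1f}%")  # +138.4%
```

A positive VRAM change means the setup uses more memory than the baseline, which is what the SCO rows show. Recomputing the throughput changes from the rounded it/s values gives slightly different percentages than the table, presumably because the table was computed from unrounded measurements.
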