GGUF VRAM Usage
As a 12 GB VRAM RTX 5070 user, I've been playing around with GGUF models for some time. But there's something I don't get: quite often I get worse performance and system utilisation from GGUFs than from much larger FP8 models...
Rapid AIO is a prime example. If I load the 23 GB+ FP8, I can generate a 720x720, 81-frame video in around 2 minutes. Yet if I load a much smaller GGUF model, it takes much longer to load and reach the KSampler's first step, and then often takes at least 2 minutes, if not more, for the same generation while using up to 97% VRAM...
And curiously, let's say a Q4 quant uses 97% VRAM and takes 4 minutes. If I then try a much smaller Q2 quant, it uses the same VRAM and takes just as long!
This seems a bit odd to me, that a 23 GB+ FP8 can run faster and better than a roughly 8 GB GGUF... And this has been true for almost every GGUF I've tried, so I'm not saying it's an issue with these particular files. I wonder if it's an issue with Comfy's integration of the GGUF nodes rather than with the GGUFs themselves...?
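For what it's worth, part of the explanation may be dequantization overhead rather than the nodes being broken: GGUF quants like Q4 store block-quantized weights that have to be expanded back to fp16/bf16 every time a layer runs, whereas fp8 weights can be used much more directly by 40/50-series hardware. A minimal sketch of what that per-layer expansion looks like, loosely modelled on the GGML Q4_0 block layout (32 weights sharing one scale) — the function names here are hypothetical, not the actual ggml/ComfyUI-GGUF API:

```python
# Simplified, illustrative Q4_0-style block quantization: 32 weights are
# stored as 4-bit codes plus one float scale. The *dequantize* step below
# is the extra work a GGUF quant pays at inference time, over and over,
# which an fp8 model running natively on the GPU largely avoids.

def quantize_q4_0(weights):
    """weights: list of 32 floats -> (scale, list of 4-bit codes 0..15)."""
    assert len(weights) == 32
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 7.0  # map roughly into the signed 4-bit range [-8, 7]
    codes = [max(0, min(15, round(w / scale) + 8)) for w in weights]
    return scale, codes

def dequantize_q4_0(scale, codes):
    """Expand 4-bit codes back to floats; runs on every use of the layer."""
    return [(c - 8) * scale for c in codes]
```

So a Q2 file being smaller on disk doesn't necessarily mean less work at run time: the weights still get expanded to the same compute dtype before the matmuls, which would be consistent with Q2 and Q4 showing similar VRAM and speed.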