I tested the V19 Q5-GGUF version many times, but it is actually slower than the original V19aio version?

#21
by jacklee95277 - opened

V19-Q5-GGUF takes 45 seconds on average, while the original V19aio only needs 35 seconds. It does reduce memory usage, but I expected generation to be faster with quantization.

Owner

From ChatGPT--

The short answer: in your setup the quantized Q5-GGUF is bottlenecked by dequantization and memory access, not raw math, so it ends up slower than the FP8-scaled version.

What’s going on

Although Q5 is smaller, it’s not necessarily faster. The main issues are:

Dequantization overhead

Q5-GGUF weights must be unpacked and dequantized on the fly.

This adds extra instructions per token.

FP8-scaled weights are often consumed directly by optimized GPU kernels.
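To make the dequantization point concrete, here is a minimal NumPy sketch. It is illustrative only, not llama.cpp's actual Q5 kernel: the block size, scale, and zero point are made-up values, and `dequantize_block` is a hypothetical helper showing the extra per-block pass a quantized format pays before every matmul.

```python
import numpy as np

def dequantize_block(q, scale, zero_point):
    """Per-block affine dequantization: w ~ scale * (q - zero_point).

    This whole pass is extra work that an FP8-scaled tensor, consumed
    directly by a fused GPU kernel, does not need.
    """
    return scale * (q.astype(np.float32) - zero_point)

rng = np.random.default_rng(0)
q = rng.integers(0, 32, size=4096, dtype=np.uint8)  # 5-bit values: 0..31
scale, zero_point = 0.01, 16                        # hypothetical block params

w = dequantize_block(q, scale, zero_point)          # runs before every matmul
print(f"dequantized {w.size} weights, range [{w.min():.2f}, {w.max():.2f}]")
```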

Kernel maturity & hardware support

FP8 is increasingly natively supported on modern GPUs (Hopper, Ada, ROCm, etc.).

GGUF Q5 often falls back to less-optimized kernels (or CPU-side work).

Result: FP8 runs closer to peak throughput.

Memory bandwidth vs compute

Q5 saves memory, but:

It increases memory reads + unpacking

Bandwidth + cache misses dominate

FP8 is larger but streams cleanly through fast kernels.
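A back-of-the-envelope model of the bandwidth argument, with entirely hypothetical sizes, bandwidth, and overhead factor: single-token decode time is roughly bytes streamed divided by bandwidth, so Q5's smaller footprint only wins if unpacking doesn't eat the savings.

```python
def decode_time_per_token_ms(weight_bytes, bandwidth_gbs, overhead_factor=1.0):
    """Rough model: time ~ bytes / bandwidth.

    overhead_factor > 1 stands in for the extra reads and instructions
    that on-the-fly dequantization adds (a made-up knob, not a measurement).
    """
    return weight_bytes * overhead_factor / (bandwidth_gbs * 1e9) * 1e3

GB = 1e9
# FP8 model: larger, but streams cleanly through fast kernels
fp8 = decode_time_per_token_ms(12 * GB, bandwidth_gbs=900)
# Q5 model: smaller, but pays an unpacking penalty per token
q5 = decode_time_per_token_ms(7.5 * GB, bandwidth_gbs=900, overhead_factor=1.8)

print(f"FP8: {fp8:.2f} ms/token, Q5: {q5:.2f} ms/token")
```

With these numbers the smaller Q5 file still comes out slower per token, which matches the 45s-vs-35s pattern above.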

KV cache still dominates

Quantizing weights doesn’t reduce:

KV cache size

Attention compute

So the expected speedup from Q5 never materializes.
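The KV-cache point can be checked with simple arithmetic. The shapes below are hypothetical (roughly 7B-class with full attention heads), and the helper is illustrative; the takeaway is that the cache size depends only on layers, heads, head dimension, context length, and the cache dtype, not on how the weights are quantized.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Size of the KV cache: 2x (keys and values) per layer, per position.

    bytes_per_elem=2 assumes an FP16/BF16 cache; weight quantization
    (Q5, FP8, ...) does not change this number.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=1)
print(f"{size / 2**30:.2f} GiB")  # -> 2.00 GiB, whether weights are Q5 or FP8
```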

Framework/runtime differences

GGUF paths (e.g., llama.cpp) prioritize portability, not max GPU speed.

FP8 paths (vLLM, TensorRT-LLM, ExLlamaV2, etc.) are aggressively optimized.

Why you see 45s vs 35s

Format        Bottleneck
------        ----------
Q5-GGUF       Dequantization + memory ops
FP8-scaled    Pure GPU math (fast kernels)

So the “issue” is quantization overhead outweighing its benefits in your environment.

When Q5 does make sense

CPU inference

Low-VRAM GPUs

Batch size = 1, latency not critical

When FP8 wins

GPU inference

Modern hardware

Throughput or latency sensitive workloads

Thank you, I got it.

Arunk25 changed discussion status to closed
