I tested the V19 Q5 GGUF version many times, but the speed is actually slower than the original V19aio version.
V19-Q5-GGUF takes 45 seconds on average, while the original V19aio only needs 35 seconds. It does reduce memory usage, but I thought quantization would also make generation faster.
From ChatGPT:
The short answer: the quantized Q5-GGUF is bottlenecked by dequantization and memory access, not raw math, so it ends up slower than FP8-scaled in your setup.
What’s going on
Although Q5 is smaller, it’s not necessarily faster. The main issues are:
Dequantization overhead
Q5-GGUF weights must be unpacked and dequantized on the fly.
This adds extra instructions per token.
FP8-scaled weights are often consumed directly by optimized GPU kernels.
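To make the dequantization overhead concrete, here is a simplified sketch of Q5-style block dequantization in NumPy. It is not the exact llama.cpp Q5_0 memory layout, and the scale and input values are made up; it only illustrates the extra shifts, masks, and multiplies that happen before any matmul work can start:

```python
import numpy as np

# Simplified Q5-style block: 32 weights, each stored as a 4-bit low
# nibble plus 1 high bit, with one scale factor per block.
def dequantize_block(scale, low_nibbles, high_bits):
    # low_nibbles: 16 bytes -> 32 four-bit values
    lo = np.concatenate([low_nibbles & 0x0F, low_nibbles >> 4])
    # high_bits: one uint32 bitmask -> one extra bit per weight
    hi = (high_bits >> np.arange(32, dtype=np.uint32)) & 1
    q = lo.astype(np.int32) | (hi.astype(np.int32) << 4)  # 5-bit value, 0..31
    return scale * (q - 16)  # center around zero, apply block scale

# Illustrative inputs -- every weight read pays for this unpacking
# on the fly, which FP8 kernels skip entirely.
scale = 0.05
low = np.arange(16, dtype=np.uint8) * 17
high = np.uint32(0xDEADBEEF)
w = dequantize_block(scale, low, high)
print(w.shape)  # (32,)
```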
Kernel maturity & hardware support
FP8 is increasingly supported natively on modern GPUs (Hopper, Ada, and recent AMD hardware via ROCm).
GGUF Q5 often falls back to less-optimized kernels (or CPU-side work).
Result: FP8 runs closer to peak throughput.
Memory bandwidth vs compute
Q5 saves memory, but:
It increases memory reads + unpacking
Bandwidth + cache misses dominate
FP8 is larger but streams cleanly through fast kernels.
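A rough roofline-style estimate shows why smaller weights don't automatically win: per-step time is bounded by bytes moved divided by effective bandwidth, and unpacking can cut the effective rate enough to erase the size saving. All numbers below (parameter count, bandwidth, the 50% overhead factor) are illustrative assumptions, not measurements:

```python
# Lower bound on time to stream all weights once per step.
def min_transfer_time_s(param_count, bytes_per_weight, bandwidth_gbs):
    return param_count * bytes_per_weight / (bandwidth_gbs * 1e9)

params = 12e9  # hypothetical 12B-parameter model
bw = 900       # hypothetical 900 GB/s effective memory bandwidth

fp8 = min_transfer_time_s(params, 1.0, bw)        # 1 byte per weight
# 5 bits per weight, but assume unpacking halves effective bandwidth:
q5 = min_transfer_time_s(params, 5 / 8, bw * 0.5)

print(f"FP8: {fp8*1000:.1f} ms, Q5 with overhead: {q5*1000:.1f} ms")
```

Under these toy numbers, Q5 moves fewer bytes yet still comes out slower, which matches the 45s-vs-35s pattern.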
KV cache still dominates
Quantizing weights doesn’t reduce:
KV cache size
Attention compute
So the expected speedup from Q5 never materializes.
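A quick back-of-the-envelope calculation makes the KV-cache point concrete: its size depends only on layer count, KV heads, head dimension, context length, and cache precision, so it is identical whether the weights are Q5 or FP8. The model figures below are illustrative, not V19's actual architecture:

```python
# K and V each store (seq_len, n_kv_heads, head_dim) per layer.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# e.g. a 32-layer model, 8 KV heads of dim 128, 4k context, fp16 cache:
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
print(f"{size / 2**20:.0f} MiB")  # 512 MiB, regardless of weight format
```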
Framework/runtime differences
GGUF paths (e.g., llama.cpp) prioritize portability, not max GPU speed.
FP8 paths (vLLM, TensorRT-LLM, ExLlamaV2, etc.) are aggressively optimized.
Why you see 45s vs 35s
Format     | Bottleneck
Q5-GGUF    | Dequantization + memory ops
FP8-scaled | Pure GPU math (fast kernels)
So the “issue” is quantization overhead outweighing its benefits in your environment.
When Q5 does make sense
CPU inference
Low-VRAM GPUs
Batch size = 1, latency not critical
When FP8 wins
GPU inference
Modern hardware
Throughput or latency sensitive workloads
Thank you, I got it.