Confusion regarding Q8_0 GGUF

#20

by balukumar - opened Jan 13

Jan 13

I've seen people on Reddit say that Q8_0 GGUF is better quality than FP8 and closer to FP16.

But what I don't understand is this: Phr00t's original models are already in FP8, right? And there is no FP16/BF16 version of his AIO available.

So I'm guessing you directly quantized the FP8 AIO itself, right?

So are these GGUFs 'double quantized', so to speak?

And in this case, which is more accurate - FP8 or Q8_0 GGUF made from FP8?

Arunk25

Owner Jan 13

When I quantize, the first step is to convert the model to BF16 or FP16, and then I quantize from there.
During conversion the file size doubles to around 40GB. That converted file is used to create these quants.
Technically, I have no idea how the quality of these different precisions compares.
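
For anyone who wants to poke at the 'double quantized' question, here is a rough sketch of the two steps described above. It is not the conversion script used for these files. It assumes the FP8 checkpoint is E4M3 (emulated here by a simplified, hypothetical helper `to_fp8_e4m3`), and it assumes Q8_0 stores each block of 32 weights as one fp16 scale plus 32 int8 values, which is my understanding of the ggml layout. The weights are random numbers, not real model data.

```python
import math
import random
import struct

BLOCK = 32  # assumption: Q8_0 groups weights into blocks of 32 values


def to_fp8_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value (simplified emulation, no NaN handling)."""
    if x == 0.0:
        return 0.0
    x = max(-448.0, min(448.0, x))   # saturate at the largest E4M3 magnitude
    exp = math.frexp(x)[1] - 1       # exponent of x written as 1.mmm * 2**exp
    exp = max(exp, -6)               # below 2**-6 the format is subnormal
    step = 2.0 ** (exp - 3)          # 3 mantissa bits -> 8 steps per power of two
    return round(x / step) * step


def is_exact_in_bf16(x: float) -> bool:
    """True if x is a float32 whose low 16 bits are zero, i.e. BF16 can hold it exactly."""
    packed = struct.pack("<f", x)
    as_f32 = struct.unpack("<f", packed)[0]
    bits = struct.unpack("<I", packed)[0]
    return as_f32 == x and bits & 0xFFFF == 0


def is_exact_in_fp16(x: float) -> bool:
    """True if x survives a round trip through IEEE half precision unchanged."""
    return struct.unpack("<e", struct.pack("<e", x))[0] == x


def q8_0_roundtrip(block):
    """Quantize one block Q8_0-style (fp16 scale + int8 values), then dequantize it."""
    amax = max(abs(v) for v in block)
    d = amax / 127.0
    inv = 1.0 / d if d else 0.0
    d16 = struct.unpack("<e", struct.pack("<e", d))[0]  # the scale is kept as fp16
    # Python's round() is half-to-even; ggml rounds half away from zero (minor difference).
    q = [round(v * inv) for v in block]
    return [qi * d16 for qi in q]


random.seed(0)
# Stand-in for one block of an FP8 checkpoint: random values snapped to the E4M3 grid.
fp8_weights = [to_fp8_e4m3(random.gauss(0.0, 0.05)) for _ in range(BLOCK)]

# Step 1: upcasting FP8 to BF16 or FP16 only adds storage, it should not change values.
upcast_is_lossless = all(
    is_exact_in_bf16(w) and is_exact_in_fp16(w) for w in fp8_weights
)

# Step 2: Q8_0 rounds the (unchanged) values a second time.
restored = q8_0_roundtrip(fp8_weights)
max_err = max(abs(a - b) for a, b in zip(fp8_weights, restored))
scale = max(abs(w) for w in fp8_weights) / 127.0

print("FP8 -> BF16/FP16 lossless:", upcast_is_lossless)
print("max Q8_0 round-trip error:", max_err, "(block scale:", scale, ")")
```

If the assumptions hold, the upcast itself loses nothing, which is why the file just doubles in size. Any extra error comes from the Q8_0 step, and since the GGUF is derived from the FP8 values it cannot recover detail that FP8 already discarded. This says nothing about how the two formats compare in actual output quality.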

balukumar changed discussion status to closed Jan 13
