Confusion regarding Q8_0 GGUF
#20
by balukumar - opened
I've heard people on Reddit say that Q8_0 GGUF is better quality than FP8 and closer to FP16.
But what I don't understand is this: Phr00t's original models are themselves in FP8, right? And there is no FP16/BF16 version of his AIO available.
So I'm guessing you directly quantized the FP8 AIO itself, right?
So are these GGUFs 'double quantized', so to speak?
And in this case, which is more accurate - FP8 or Q8_0 GGUF made from FP8?
When I quantize, the first step is to convert the model to BF16 or FP16, and then I quantize from there.
During conversion the file size doubles to around 40 GB. That file is used to create these quants.
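The doubling is what you would expect from bytes per weight alone: FP8 stores 1 byte per weight and FP16/BF16 stores 2. Here is a rough sketch of that arithmetic. The 20 GB starting size is an assumption inferred from the "doubles to around 40 GB" figure, and `estimated_size_gb` is just a helper name made up for this illustration. It ignores file metadata and any tensors kept at a different precision.

```python
# Bytes per weight for each storage format.
BYTES_PER_WEIGHT = {
    "fp8": 1.0,
    "fp16/bf16": 2.0,
    # Q8_0 stores blocks of 32 int8 values plus one fp16 scale:
    # 34 bytes per 32 weights, i.e. 8.5 bits per weight.
    "q8_0": 34 / 32,
}


def estimated_size_gb(fp8_size_gb: float, target: str) -> float:
    """Scale an FP8 checkpoint size to another format (weights only)."""
    return fp8_size_gb * BYTES_PER_WEIGHT[target] / BYTES_PER_WEIGHT["fp8"]


fp8_size_gb = 20.0  # assumption: half of the ~40 GB 16-bit file

print(estimated_size_gb(fp8_size_gb, "fp16/bf16"))  # 40.0
print(estimated_size_gb(fp8_size_gb, "q8_0"))       # 21.25
```

So a Q8_0 file made this way should come out slightly larger than the FP8 original, because of the per-block scales.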
On the technical side, I have no idea how the quality of these different precisions compares.
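For anyone who wants to check this themselves, below is a minimal pure-Python sketch of a Q8_0-style round trip: each block of 32 values shares one fp16 scale (largest absolute value divided by 127) and every value is rounded to an int8. It is a simplified illustration, not the llama.cpp code, and the function names and the random test weights are made up for the example. As far as I know, upcasting FP8 to FP16/BF16 is exact, since every FP8 value fits in 16 bits, so any extra error in these GGUFs would come from the Q8_0 step rather than from the conversion.

```python
import random
import struct

BLOCK = 32  # number of weights that share one scale in Q8_0


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def q8_0_roundtrip(block):
    """Quantize one block to int8 with a shared fp16 scale, then dequantize."""
    assert len(block) == BLOCK
    amax = max(abs(v) for v in block)
    scale = to_fp16(amax / 127.0)
    if scale == 0.0:
        return [0.0] * BLOCK
    # Clamp in case rounding the scale to fp16 pushes a value just past 127.
    quants = [max(-127, min(127, round(v / scale))) for v in block]
    return [q * scale for q in quants]


# Hypothetical weights, only to exercise the round trip.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK)]
restored = q8_0_roundtrip(weights)

max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("largest weight:", max(abs(w) for w in weights))
print("max round-trip error:", max_err)
```

The error per weight is bounded by about half the block scale. To compare against the FP8 original, you would run the same measurement on values that have already been rounded to FP8 and see how many of them survive unchanged.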
balukumar changed discussion status to closed