Higher pplx on your quants than I'd expect
I realize this is hardly a scientific test and I'm probably doing something wrong, but on a lark I ran llama-perplexity against wiki.test.raw, and both your static and imatrix versions of this model show notably higher perplexity than others' quants of it. This is on llama.cpp b7898, Windows, CUDA 13.1.
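(For anyone following along: the bracketed running numbers below are just llama-perplexity's cumulative estimate, i.e. exp of the negative mean token log-probability over the chunks evaluated so far. A toy sketch of that formula, not llama.cpp's actual code:)

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-mean(log p)) over the evaluated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns every token probability 1/8 has PPL 8:
# lower PPL = the model was less "surprised" by the test text.
print(perplexity([math.log(1 / 8)] * 100))  # -> 8.0
```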
For example, the Q5_K_S of this: [1]8.2818,[2]8.2639,[3]7.0423,[4]6.5539,[5]6.6486,[6]6.9491,[7]7.0944,[8]12.2663,[9]12.5848,[10]12.5828,[11]12.2138,[12]12.0499,[13]12.4009,[14]11.7351,[15]11.5003,[16]11.1609,[17]10.9760,[18]11.2433,[19]10.7738,[20]10.5000,[21]10.3878,[22]10.0016,[23]9.5681,[24]9.1953,[25]8.8487,[26]8.6210
and your GLM-4.7-Flash-impotent-heresy.i1-Q6_K.gguf gives [1]6.9998,[2]7.6182,[3]6.4279,[4]5.9849,[5]6.1682,[6]6.5411,[7]6.7561,[8]15.4324,[9]15.4160,[10]15.0679,[11]14.3920,[12]13.9926,[13]14.2341,[14]13.3099,[15]12.9098,[16]12.4451,[17]12.1471,[18]12.3620,[19]11.7931,[20]11.4475,[21]11.2669,[22]10.8008,[23]10.2903,[24]9.8522,[25]9.4511
and then MuXodious's MXFP4_MoE gives [1]5.4790,[2]7.0421,[3]5.8185,[4]5.3139,[5]5.6912,[6]6.1956,[7]6.4572,[8]7.1018,[9]7.7571,[10]8.1186,[11]8.1953,[12]8.3655,[13]8.8789,[14]8.5828,[15]8.5727,[16]8.5016,[17]8.4823,[18]8.8303,[19]8.5728,[20]8.4681,[21]8.4543,[22]8.2320,[23]7.9478,[24]7.6934,[25]7.4565
I don't know what's up, or why, or whether I'm fudging something up on my end. I just noticed this while randomly poking around at different quants after getting annoyed at the model occasionally flipping tokens and making odd spelling mistakes.
Edit: I tried your Huihui-GLM-4.7-Flash-abliterated.i1-Q4_K_S.gguf too, just in case it was a model thing: [1]6.1722,[2]7.2776,[3]6.6808,[4]6.5140,[5]6.7200,[6]7.1071,[7]7.3376,[8]17.0577,[9]16.9040,[10]16.4071,[11]15.6431,[12]15.1589,[13]15.3671,[14]14.3584,[15]13.8488,[16]13.3572,[17]13.0506,[18]13.2732,[19]12.6529,[20]12.2539,[21]12.0608,[22]11.5537,[23]10.9929,[24]10.5392,[25]10.1121
I checked each file's sha256 against the one listed on Hugging Face to rule out download corruption.
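(The check itself is just a streamed hash comparison; a minimal sketch, with the filename and expected digest as placeholders you'd fill in from the Hugging Face file page:)

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so multi-GB GGUFs don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# expected = "..."  # the sha256 shown on the model's Hugging Face file listing
# assert sha256_of("GLM-4.7-Flash-Derestricted.i1-Q4_K_S.gguf") == expected
```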
GLM-4.7-Flash-Derestricted.i1-Q4_K_S.gguf: [1]5.9525,[2]7.0543,[3]6.1259,[4]5.9533,[5]6.1851,[6]6.5485,[7]6.8416,[8]16.5892,[9]16.4296,[10]15.9747,[11]15.2181,[12]14.7275,[13]14.9382,[14]13.9389,[15]13.4611,[16]12.9577,[17]12.6675,[18]12.8878,[19]12.3111,[20]11.9607,[21]11.7548,[22]11.2345,[23]10.6846,[24]10.2352,[25]9.8176
--- edit
Downloaded Derestricted and converted it myself to bf16, ran imatrix calibration on unsloth's v5_rc, and quantized to MXFP4_MoE: [1]5.5674,[2]7.1342,[3]5.8800,[4]5.3808,[5]5.7540,[6]6.2475,[7]6.5177,[8]13.5293,[9]13.7487,[10]13.5794,[11]13.0768,[12]12.8555,[13]13.2148,[14]12.4242,[15]12.1100,[16]11.7541,[17]11.5087,[18]11.7751,[19]11.2555,[20]10.9649,[21]10.8080,[22]10.4110,[23]9.9479,[24]9.5485,[25]9.1775
But the pplx when tested against the calibration data I used for the imatrix:
Final estimate: PPL = 6.8167 +/- 0.08330 (bf16)
Final estimate: PPL = 6.2592 +/- 0.07116 (MXFP4_MoE)
So, shrug? Wikitext might just be a bad test file for models nowadays.
Thanks a lot for your testing. I recommend not looking too closely at perplexity measurements; they are inherently unreliable for judging quant quality. Instead I recommend measuring KL divergence, top-token probability, and same-token probability, all of which can easily be measured using llama.cpp. Besides that, MXFP4 quants should only be used for GPT-OSS-based models, as they are quite bad for most other architectures; even for GPT-OSS-based models they are somewhat controversial, as many users saw a big quality degradation when using them over higher non-MXFP4 quants. Unfortunately, my detailed quant quality measurements do not include MXFP4, as they were made before its introduction into llama.cpp, so I can't give an objective opinion about their quality.
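(To illustrate what those metrics measure, here is a toy sketch over two next-token distributions. The numbers are made up and this is not llama.cpp's implementation; in practice you'd compare the quant against saved base-model logits via llama-perplexity's KL-divergence mode:)

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats between two next-token probability distributions.
    0 means the quant's distribution matches the reference exactly."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def same_top_token(p, q):
    """Would greedy decoding pick the same token under both models?"""
    return max(range(len(p)), key=p.__getitem__) == max(range(len(q)), key=q.__getitem__)

base  = [0.70, 0.20, 0.10]   # reference (e.g. bf16) probs at one position
quant = [0.55, 0.30, 0.15]   # quantized model's probs at the same position

print(kl_divergence(base, quant))   # small positive number
print(same_top_token(base, quant))  # True: both still pick token 0
```

Averaged over many positions of a test corpus, these track how much the quant actually changes the model's behavior, which perplexity alone can mask.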