Steps for running perplexity tests
Hi!
I was wondering if you had step-by-step documentation that we could follow to run the exact same perplexity test that you use in your charts?
Thank you!!
Heya @TimothyRoo
I've covered it in a number of places scattered across the web, in varying degrees of out-of-datedness, hah...
- Most recent discussion specific to methodology here: https://huggingface.co/ubergarm/MiniMax-M2.5-GGUF/discussions/3#698f7ebf2aa648f3b77a1262
- Some older stuff in this quant cookers guide: https://github.com/ikawrakow/ik_llama.cpp/discussions/434
- For an info-dense, high-level overview, I have a recent talk here: https://blog.aifoundry.org/p/adventures-in-model-quantization
I'm getting some more numbers now to figure out the best way to go on attn/shexp/dense for a smol-IQ2_KS, which is clocking in at just over 200GiB and should be faster on CPU/RAM than the KT quants.
Holler if you find anything interesting or have any questions!
Cheers!
Hello @ubergarm ,
Thank you for the perplexity chart you posted, it is very helpful. Would it be out of line to ask you to put a few unsloth numbers on your chart as well? I'm currently running Q3_K_XL, I assume I'm getting a few fractions of a % better than your IQ3_K, but would be nice to confirm! I don't have confidence that I could run these tests myself and trust the numbers.
Heya, sorry, the GLM-5 quants are pretty big to keep around on disk right now, and with A40B they're the slowest to test perplexity. I'm more focused on some of the new Qwen models at the moment. As my model card says, my quants in general offer lower "better" perplexity for a given memory footprint. I have some comparisons with UD quants in a few of my other perplexity charts that you may have seen. I don't have the exact GiB / BPW of the two quants you are comparing, but you can look inside the second gguf file at the actual tensor sizes used for attn/shexp/first N dense layers etc. to see the difference in recipes.
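If you want to peek inside a GGUF split without loading the whole model, the fixed-size file header is easy to read with just the stdlib; a minimal sketch (listing the per-tensor quant types and sizes needs a full reader, e.g. the `gguf` Python package that ships with llama.cpp):

```python
import struct

def gguf_header(path):
    """Read the fixed-size GGUF header: magic, version, counts.

    GGUF layout (little-endian): 4-byte magic "GGUF", uint32 version,
    uint64 tensor_count, uint64 metadata_kv_count. The per-tensor names,
    shapes, and quantization types follow the metadata section and are
    not parsed here.
    """
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return version, n_tensors, n_kv
```

Comparing tensor counts is only a quick sanity check; for the per-layer recipe differences you'd still want a full GGUF reader.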
> I don't have confidence that I could run these tests myself and trust the numbers.
Since you already have ik_llama.cpp running and the quants downloaded, you're almost there. Essentially, follow the instructions linked above:
```shell
$ wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
$ gunzip wiki.test.raw.gz
$ sha1sum wiki.test.raw
6f1fe2054a940eebfc76b284b09680763b37f5ea  wiki.test.raw
```
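If the hash doesn't match, you'd be measuring against a different dataset, so it's worth checking programmatically; the same digest `sha1sum` prints can be computed in a few lines of Python (the assertion on `wiki.test.raw` is left commented out since the file's presence is an assumption):

```python
import hashlib

def sha1_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-1 and return the hex digest,
    identical to what `sha1sum` prints."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()

# expected digest for wiki.test.raw from the listing above:
# assert sha1_of("wiki.test.raw") == "6f1fe2054a940eebfc76b284b09680763b37f5ea"
```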
```shell
# set the model to the first gguf split you are testing
model=/mnt/raid/hf/MiniMax-M2.5-GGUF/IQ5_K/MiniMax-M2.5-IQ5_K-00001-of-00005.gguf

numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-perplexity \
    -m "$model" \
    -f wiki.test.raw \
    --seed 1337 \
    --ctx-size 512 \
    -ub 4096 -b 4096 \
    --numa numactl \
    --threads 96 \
    --threads-batch 128 \
    --validate-quants \
    --no-mmap

# copy paste the final line e.g. `Final estimate: PPL over 580 chunks for n_ctx=512 = X.XXXX +/- 0.02841`
```
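For reference, that final number is just the exponential of the mean per-token negative log-likelihood, and the `+/-` is the standard error pushed through the exponential; a small sketch of the statistics with made-up NLL values (not the tool's exact implementation):

```python
import math

# hypothetical per-token negative log-likelihoods (nats)
nlls = [1.9, 2.1, 2.0, 1.8, 2.2]

mean_nll = sum(nlls) / len(nlls)
ppl = math.exp(mean_nll)  # the "Final estimate: PPL" value

# sample standard error of the mean NLL, propagated through exp()
# to first order to get the "+/-" on the perplexity itself
var = sum((x - mean_nll) ** 2 for x in nlls) / (len(nlls) - 1)
sem = math.sqrt(var / len(nlls))
ppl_err = ppl * sem

print(f"PPL = {ppl:.4f} +/- {ppl_err:.4f}")
```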
If you give me your llama-server command I can help massage it into the llama-perplexity command. Let me know if you want any help as you give it a try!
@ubergarm
Alright, well, I only have a mainline llama.cpp command thus far, as I wasn't able to make ik go faster in almost any situation (Claude 4.6 compiled it, so blame it!). I do still have it around; you'll have to translate mainline llama to ik...
`C:\Users\nonyaz\CascadeProjects\ik_llama.cpp\build\bin`
````powershell
& "C:\Users\nonyaz\CascadeProjects\llama.cpp\build\bin\llama-server.exe" `
  -m "C:\Users\nonyaz\Desktop\unsloth\GLM-5-GGUF\GLM-5-UD-Q3_K_XL-00001-of-00008.gguf" `
  -ngl 99 `
  -ot "blk\.[0-9]\.ffn_(gate|up|down)_exps\.=CUDA0,blk\.1[0-9]\.ffn_(gate|up|down)_exps\.=CUDA0,blk\.2[0-2]\.ffn_(gate|up|down)_exps\.=CUDA0,blk\.2[3-9]\.ffn_(gate|up|down)_exps\.=CUDA1,blk\.3[0-2]\.ffn_(gate|up|down)_exps\.=CUDA1,exps=CPU" `
  -ctk q8_0 -ctv q8_0 -fa on --jinja `
  -b 2048 -ub 2048 -mg 0 `
  --temp 1.0 --top-p 0.97 --repeat-penalty 1.0 `
  -c 47000 --host 127.0.0.1 --port 8012 `
  --alias "GLM-5-UD-Q3_K_XL-unsloth" -np 1```
````
Hrm, okay, Windows. You might have to fix the backticks; I'll do my best to change your llama-server command into a llama-perplexity command, but the backticks at the end of yours look strange, and starting with `&`, is that some PowerShell thing?
anyway, you can probably figure it out from here:
```powershell
& "C:\Users\nonyaz\CascadeProjects\llama.cpp\build\bin\llama-perplexity.exe" `
  -m "C:\Users\nonyaz\Desktop\unsloth\GLM-5-GGUF\GLM-5-UD-Q3_K_XL-00001-of-00008.gguf" `
  -f wiki.test.raw `
  -ngl 99 `
  -ot "blk\.[0-9]\.ffn_(gate|up|down)_exps\.=CUDA0,blk\.1[0-9]\.ffn_(gate|up|down)_exps\.=CUDA0,blk\.2[0-2]\.ffn_(gate|up|down)_exps\.=CUDA0,blk\.2[3-9]\.ffn_(gate|up|down)_exps\.=CUDA1,blk\.3[0-2]\.ffn_(gate|up|down)_exps\.=CUDA1,exps=CPU" `
  -b 2048 -ub 2048 -mg 0 `
  -c 512
```
You get the idea though. You'll need ik_llama.cpp to test my IQ3_K, and you'll need to use the same exact build/sha1sum if you're trying to test across two different quants...
> Hrm, okay windows [...] starting with `&` is that some powershell thing?
@ubergarm
Yes, Windows :P (the trailing three ticks were a user error while trying to make a code block). RE: `&`? IDK, it's what the AI uses in Windsurf, so I mimic it.
> anyway, you can probably figure it out from here:
bold statement...
but, numbers are coming out! See you in 7 hours with results...
> You'll need ik_llama.cpp to test my IQ3_K, and you'll need to use the same exact build/sha1sum if you're trying to test across two different quants...
Noted.
Edit: Turns out that was a wildly overstated time estimate; here are the results:
Wow, you are patient! Don't forget you won't be able to compare the final number with the IQ3_K, because one measurement will be with llama.cpp and the other with ik_llama.cpp.
You might be able to get it to go faster by increasing `-ub 4096 -b 4096` and possibly offloading some more layers, since the context is very short here at 512 (as it must be for equivalent comparisons in this case).
Nice job getting some numbers though, those look about right! I have some per-chunk perplexity available in my logs here: https://huggingface.co/ubergarm/GLM-5-GGUF/blob/main/logs/perplexity-GLM-5-BF16-ctx512.log
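If you end up comparing several runs, pulling the final number out of each log mechanically avoids copy-paste mistakes; a small sketch assuming the `Final estimate:` line format shown earlier in this thread:

```python
import re

FINAL_RE = re.compile(
    r"Final estimate: PPL over (\d+) chunks for n_ctx=(\d+) = "
    r"([\d.]+) \+/- ([\d.]+)"
)

def final_ppl(log_text):
    """Return (chunks, n_ctx, ppl, err) from a llama-perplexity log."""
    m = FINAL_RE.search(log_text)
    if m is None:
        raise ValueError("no 'Final estimate' line found")
    chunks, n_ctx, ppl, err = m.groups()
    return int(chunks), int(n_ctx), float(ppl), float(err)
```

Remember the caveat above: numbers parsed from llama.cpp and ik_llama.cpp logs aren't directly comparable to each other.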

