Performance is poor
Hello, thanks for your awesome work!
I'm trying this model, along with the original 30B variant, and I find that the 30B variant produces significantly better results. The REAP variant's output is just like how the 30B variant performed before your Jan 21 update: it produces many meaningless tokens.
> Jan 21 update: llama.cpp fixed a bug that caused looping and poor outputs. We updated the GGUFs - please re-download the model for much better outputs.
Am I missing something? Thanks in advance!
Also noticed this when I downloaded the Q4_K_XL Quant. It just kept generating tokens.
Where are you guys using this?
Do you mean under what circumstances it would be used? I use this model for code completion. Here is my launch script: https://github.com/sainnhe/dotfiles/blob/924163f2089ae7d9bdf27201d5c3062dafa6ce91/scripts/llama-server.sh
And also code editor config: https://github.com/sainnhe/dotfiles/blob/924163f2089ae7d9bdf27201d5c3062dafa6ce91/.vim/features/full.vim#L224-L268
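For anyone who doesn't want to dig through the linked dotfiles, a minimal launch in the same spirit might look like the sketch below. The binary path, model filename, port, context size, and GPU-layer count are all placeholders of mine, not values taken from the linked script - adapt them to your hardware:

```shell
# Minimal llama-server launch for local code completion.
# All paths and values below are placeholders - adjust for your setup.
./llama-server \
  -m ./GLM-4.7-Flash-REAP-23B-A3B-UD-Q4_K_XL.gguf \
  --port 8080 \
  --ctx-size 32768 \
  -ngl 99 \
  --jinja
```

The editor then just points its completion backend at `http://localhost:8080`.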
I personally tested the model in LM Studio, as their latest update supposedly fixed GLM 4.7's issue, but maybe I'm wrong.
@danielhanchen I think the most important thing to confirm is whether the llama.cpp fix has been applied to this REAP variant. By "the llama.cpp fix" I mean the statement at the top of https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF#jan-21-update-llamacpp-fixed-a-bug-that-caused-looping-and-poor-outputs-we-updated-the-ggufs---please-re-download-the-model-for-much-better-outputs
Because in my own test, this REAP variant behaves exactly like how the original 30B variant performed BEFORE the fix was applied.
@dan9070 Could you run this REAP variant and the original 30B variant side by side and compare their performance? In my own test, the 30B variant performs significantly better.
If it helps, I'm re-uploading and redoing them as we speak - hope they're better!
Just re-downloaded, no issues noticed. I'm getting > 40 tokens/sec on the Q4_K_XL file with an RTX 5060 Ti w/ 16GB. I'm not seeing any different behavior when using tools compared to the non-REAP version. Doing well on Goose. And no looping.
I'm still having the same issues with tooling in Zed when using the current llama.cpp build 7850 (same as with non-REAP GLM); the main symptom is trailing </arg_value> tags and thought-process text left in the file when using Zed's edit_file tool, and rules can't fix it completely. It's still usable - I just had it do a Snake game in Python.
Params: `-fa on -fitt 500 --fit-ctx 65536 --n-cpu-moe 11 --threads 4 --temp 0.3 --top-p 1.0 --min-p 0.01 --jinja` (yes, temp 0.3 instead of the recommended 0.7; I don't need it to be creative for coding).
Of note, the REAP version is definitely lobotomized on general knowledge! But that doesn't seem to be an issue when creating and running scripts or coding. If it holds up with HTML, JavaScript, and Rust, I may have just found my first daily driver for 2026.
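Assembled into a full command, the params above might look like the sketch below. The binary path and model filename are my placeholders (the post doesn't give them); the flags themselves are copied verbatim from the post and assume a recent llama.cpp build:

```shell
# Sketch only: binary path and model filename are placeholders.
# The flags are the ones quoted in the post above, unmodified.
./llama-server \
  -m ./GLM-4.7-Flash-REAP-23B-A3B-UD-Q4_K_XL.gguf \
  -fa on -fitt 500 --fit-ctx 65536 \
  --n-cpu-moe 11 --threads 4 \
  --temp 0.3 --top-p 1.0 --min-p 0.01 \
  --jinja
```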
Same. I redownloaded the model file and tried code completion on my machine, and it performs noticeably better.
I noticed that the file checksum has changed. Is it a new quantized version?
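Since the checksum changed after the re-upload, one way to confirm you have the new file is to compare its SHA-256 against the value listed on the model's Hugging Face file page (shown in the LFS details). A self-contained demo of the mechanics, using a tiny stand-in file rather than the real multi-GB GGUF:

```shell
# Demo of checksum comparison; 'model.gguf' here is a tiny stand-in file.
# For the real model, run sha256sum on the downloaded GGUF and compare
# against the sha256 published on its Hugging Face file page.
printf 'stand-in gguf bytes' > model.gguf
local_sum=$(sha256sum model.gguf | awk '{print $1}')
expected_sum=$(sha256sum model.gguf | awk '{print $1}')  # paste the published value here
if [ "$local_sum" = "$expected_sum" ]; then
  echo "checksum OK - file matches"
else
  echo "checksum MISMATCH - re-download"
fi
```

A mismatch usually just means you still have the pre-fix upload cached.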
Anyway, thanks for your great work!
@dugrema Oh wait so the re-uploads helped a bit?
@danielhanchen I didn't try the original REAP version enough to tell. I can still get looping if I try hard enough on both the REAP and non-REAP GLM 4.7 Flash UD Q4_K_XL, but looping is rare enough with large prompts, tool usage, and the recommended parameters.
I tried working in Python and Javascript (React + vite) to build mini-games in both REAP and non-REAP and at this time I can't tell the difference between one or the other when it comes to coding results and tool usage. Pretty much the same behaviors. It's my first time trying a REAP version seriously, I am not disappointed with the Q4.
Much better from my testing. Less repetitions. Thank you.
So far it's coherent and not glitching for me.
/pr/Neural/LLM/llama.cpp-clort-cli/build/bin/llama-cli -m ./GLM-4.7-Flash-REAP-23B-A3B-IQ4_XS.gguf -t 5 --ctx-size 8192 --jinja
[ Prompt: 7.4 t/s | Generation: 3.5 t/s ]
This is getting pulled on a system without even 8GB free.
I'm running on a 16GB thinkpad tp495, with browser running while listening to podcasts.
btop shows llama-cli using 1.9GB RAM, and radeontop says:
1773M / 1976M VRAM 89.71%
6517M / 6936M GTT 93.96%
How does llama.cpp do this magic? IDK, but a happy camper I be. Looking at CPU% during inference, it's hovering around 10%, so Vulkan is being used:
0.93G / 1.20G Memory Clock 77.75%
0.42G / 1.20G Shader Clock 34.97%
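To capture this kind of snapshot while inference runs, radeontop can dump its readings instead of drawing the interactive UI. The `-d` (dump target, `-` for stdout) and `-l` (limit the number of samples) flags are assumed to be present in your distro's radeontop build:

```shell
# Sketch: log 10 one-per-second GPU usage samples to stdout
# while llama-cli is running in another terminal.
radeontop -d - -l 10
```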
Depth of thought is closer to gpt-oss-20b than Qwen3 4B. Pretty cool stuff!
Closing this discussion since the issue seems to be fixed.