Performance is poor
Hello, thanks for your awesome work!
I'm trying this model, along with the original 30B variant, and I find that the 30B variant produces significantly better results. The REAP variant's output is just like how the 30B variant performed before your Jan 21 update: it produces many meaningless tokens.
> Jan 21 update: llama.cpp fixed a bug that caused looping and poor outputs. We updated the GGUFs - please re-download the model for much better outputs.
Am I missing something? Thanks in advance!
Also noticed this when I downloaded the Q4_K_XL Quant. It just kept generating tokens.
Where are you guys using this?
Do you mean under what circumstances it would be used? I use this model for code completion. Here is my launch script: https://github.com/sainnhe/dotfiles/blob/924163f2089ae7d9bdf27201d5c3062dafa6ce91/scripts/llama-server.sh
And also code editor config: https://github.com/sainnhe/dotfiles/blob/924163f2089ae7d9bdf27201d5c3062dafa6ce91/.vim/features/full.vim#L224-L268
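For anyone who doesn't want to dig through the linked dotfiles, a minimal launch in the same spirit might look like the sketch below. The binary path, model filename, port, context size, and GPU-layer count are all placeholders of mine, not values taken from the linked script - adapt them to your hardware:

```shell
# Minimal llama-server launch for local code completion.
# All paths and values below are placeholders - adjust for your setup.
./llama-server \
  -m ./GLM-4.7-Flash-REAP-23B-A3B-UD-Q4_K_XL.gguf \
  --port 8080 \
  --ctx-size 32768 \
  -ngl 99 \
  --jinja
```

The editor then just points its completion backend at `http://localhost:8080`.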
I personally tested the model in LM Studio, as their latest update supposedly fixed GLM 4.7's issue, but maybe I'm wrong.
@danielhanchen I think the most important thing to confirm is whether the llama.cpp fix has been applied to this REAP variant. By "the llama.cpp fix" I mean the statement at the top of https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF#jan-21-update-llamacpp-fixed-a-bug-that-caused-looping-and-poor-outputs-we-updated-the-ggufs---please-re-download-the-model-for-much-better-outputs
Because in my own test, this REAP variant behaves exactly like how the original 30B variant performed BEFORE the fix was applied.
@dan9070 Could you run this REAP variant and the original 30B variant side by side and compare their performance? In my own test, the 30B variant performs significantly better.
If it helps, I'm re-uploading and redoing them as we speak - hope they're better!
Just re-downloaded, no issues noticed. I'm getting > 40 tokens/sec on the Q4_K_XL file with an RTX 5060 Ti w/ 16GB. I'm not seeing any different behavior when using tools compared to the non-REAP version. Doing well on Goose. And no looping.
I'm still having the same issues with tooling in Zed when using the current llama.cpp build 7850 (same as with non-REAP GLM); the main symptom is trailing </arg_value> tags and thought-process text left in the file when using Zed's edit_file tool, and rules can't fix it completely. It's still usable - I just had it do a Snake game in Python.
Params: `-fa on -fitt 500 --fit-ctx 65536 --n-cpu-moe 11 --threads 4 --temp 0.3 --top-p 1.0 --min-p 0.01 --jinja` (yes, temp 0.3 instead of the recommended 0.7; I don't need it to be creative for coding).
Of note, the REAP version is definitely lobotomized on general knowledge! But that doesn't seem to be an issue when creating and running scripts or coding. If it holds up with HTML, JavaScript, and Rust, I may have just found my first daily driver for 2026.
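Assembled into a full command, the params above might look like the sketch below. The binary path and model filename are my placeholders (the post doesn't give them); the flags themselves are copied verbatim from the post and assume a recent llama.cpp build:

```shell
# Sketch only: binary path and model filename are placeholders.
# The flags are the ones quoted in the post above, unmodified.
./llama-server \
  -m ./GLM-4.7-Flash-REAP-23B-A3B-UD-Q4_K_XL.gguf \
  -fa on -fitt 500 --fit-ctx 65536 \
  --n-cpu-moe 11 --threads 4 \
  --temp 0.3 --top-p 1.0 --min-p 0.01 \
  --jinja
```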
Same. I redownloaded the model file and tried code completion on my machine, and it performs noticeably better.
I noticed that the file checksum has changed. Is it a new quantized version?
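Since the checksum changed after the re-upload, one way to confirm you have the new file is to compare its SHA-256 against the value listed on the model's Hugging Face file page (shown in the LFS details). A self-contained demo of the mechanics, using a tiny stand-in file rather than the real multi-GB GGUF:

```shell
# Demo of checksum comparison; 'model.gguf' here is a tiny stand-in file.
# For the real model, run sha256sum on the downloaded GGUF and compare
# against the sha256 published on its Hugging Face file page.
printf 'stand-in gguf bytes' > model.gguf
local_sum=$(sha256sum model.gguf | awk '{print $1}')
expected_sum=$(sha256sum model.gguf | awk '{print $1}')  # paste the published value here
if [ "$local_sum" = "$expected_sum" ]; then
  echo "checksum OK - file matches"
else
  echo "checksum MISMATCH - re-download"
fi
```

A mismatch usually just means you still have the pre-fix upload cached.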
Anyway, thanks for your great work!
@dugrema Oh wait so the re-uploads helped a bit?
@danielhanchen I didn't try the original REAP version enough to tell. I can still get looping if I try hard enough on both the REAP and non-REAP GLM 4.7 Flash UD Q4_K_XL, but looping is rare enough with large prompts, tool usage, and the recommended parameters.
I tried working in Python and Javascript (React + vite) to build mini-games in both REAP and non-REAP and at this time I can't tell the difference between one or the other when it comes to coding results and tool usage. Pretty much the same behaviors. It's my first time trying a REAP version seriously, I am not disappointed with the Q4.
Much better from my testing. Less repetitions. Thank you.
So far it's coherent and not glitching for me.
/pr/Neural/LLM/llama.cpp-clort-cli/build/bin/llama-cli -m ./GLM-4.7-Flash-REAP-23B-A3B-IQ4_XS.gguf -t 5 --ctx-size 8192 --jinja
[ Prompt: 7.4 t/s | Generation: 3.5 t/s ]
This is getting pulled on a system without even 8GB free.
I'm running on a 16GB thinkpad tp495, with browser running while listening to podcasts.
btop shows llama-cli using 1.9GB RAM, and radeontop says:
1773M / 1976M VRAM 89.71%
6517M / 6936M GTT 93.96%
How does llama.cpp do this magic? IDK, but a happy camper I be. Looking at CPU% during inference, it's hovering around 10%, so Vulkan is being used:
0.93G / 1.20G Memory Clock 77.75%
0.42G / 1.20G Shader Clock 34.97%
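To capture this kind of snapshot while inference runs, radeontop can dump its readings instead of drawing the interactive UI. The `-d` (dump target, `-` for stdout) and `-l` (limit the number of samples) flags are assumed to be present in your distro's radeontop build:

```shell
# Sketch: log 10 one-per-second GPU usage samples to stdout
# while llama-cli is running in another terminal.
radeontop -d - -l 10
```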
Depth of thought is closer to gpt-oss-20b than Qwen3 4B. Pretty cool stuff!
Closing this discussion since the issue seems to be fixed.