Apr 8 - New GGUF Updates

#20
by danielhanchen - opened

Please re-download. We just updated them again in response to:

  1. kv-cache : support attention rotation for heterogeneous iSWA https://github.com/ggml-org/llama.cpp/pull/21513
  2. CUDA: check for buffer overlap before fusing - CRITICAL: fixes <unused24> tokens https://github.com/ggml-org/llama.cpp/pull/21566
  3. vocab : add byte token handling to BPE detokenizer for Gemma4 https://github.com/ggml-org/llama.cpp/pull/21488
  4. convert : set "add bos" == True for Gemma 4 https://github.com/ggml-org/llama.cpp/pull/21500
  5. common : add gemma 4 specialized parser https://github.com/ggml-org/llama.cpp/pull/21418
  6. llama-model: read final_logit_softcapping for Gemma 4 https://github.com/ggml-org/llama.cpp/pull/21390
  7. llama: add custom newline split for Gemma 4 https://github.com/ggml-org/llama.cpp/pull/21406

Some quants, like Q6_K_XL, are now also faster: we changed some BF16 layers to Q8_0, since BF16 upcasting is reserved for Q8_K_XL.
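For context on the size trade-off behind that change: in ggml's block-quantization scheme, Q8_0 stores each 32-value block as 32 int8 values plus one f16 scale (34 bytes), roughly half the footprint of BF16's 2 bytes per weight. A quick illustrative sketch (the helper function is hypothetical, not part of any library):

```python
# Rough bytes-per-weight comparison for the formats mentioned above.
# Block layout follows ggml's Q8_0 scheme: 32 int8 values + one f16 scale.

def bytes_per_weight(block_bytes: int, block_size: int = 32) -> float:
    """Average storage cost per weight for a block-quantized format."""
    return block_bytes / block_size

bf16 = 2.0                   # plain 16-bit float, no blocking
q8_0 = bytes_per_weight(34)  # (32 * 1 byte) + 2-byte f16 scale = 34 bytes/block

print(f"BF16: {bf16} B/weight, Q8_0: {q8_0} B/weight")
```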

Do NOT use CUDA 13.2 to run any quant (this is not an Unsloth issue; see here). You can use our llama.cpp precompiled binary, which uses CUDA 13, or Unsloth Studio, which does not use 13.2.

danielhanchen pinned discussion

Will you guys be updating the 31b as well?

danielhanchen changed discussion title from GGUF update - `<unused24> token` fixes to Apr 8 - New GGUF Update
Unsloth AI org

Will you guys be updating the 31b as well?

Yes, it's uploading.

danielhanchen changed discussion title from Apr 8 - New GGUF Update to Apr 8 - New GGUF Updates

I do not understand why you would make new quants because of these PRs. All of them are inference fixes; they do not affect the convert.py file. Number 4 has a check in place that adds the BOS for current GGUFs.

There was really no need to generate new ggufs.

Unsloth AI org

I do not understand why you would make new quants because of these PRs. All of them are inference fixes; they do not affect the convert.py file. Number 4 has a check in place that adds the BOS for current GGUFs.

There was really no need to generate new ggufs.

Hello, the fixes do in fact affect the GGUF uploads, including the imatrix file, which needs to be recomputed. People have said the new quants are already much better.

Could you at least add model.yamls for the thinking toggles in LM Studio ;)

Latest model downloaded, latest llama.cpp downloaded (b8708); sadly the Vulkan + AMD GPU experience is still the same: when fa is on, llama.cpp crashes (input is a 24 kB / 8k-token HTML/JS file).

> /read /home/martin/tmp/al3.html
Loaded text from '/home/martin/tmp/al3.html'
> Why do i have an issue with ctx variable when the audio context is initialized?
/double free or corruption (out)
Aborted (core dumped)

Attention rotation is not working as expected. Am I doing something wrong?

.\llama-server.exe -m .\models\gguf\gemma-4-26B-A4B-it-UD-IQ4_XS.gguf -ctv q4_0 -ctk q4_0 -fa on

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 12281 MiB):
Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes, VRAM: 12281 MiB
load_backend: loaded CUDA backend from C:\LLM\ggml-cuda.dll
load_backend: loaded RPC backend from C:\LLM\ggml-rpc.dll
load_backend: loaded CPU backend from C:\LLM\ggml-cpu-zen4.dll
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b8708-ae65fbdf3

llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache: CUDA0 KV buffer size = 22.50 MiB
llama_kv_cache: size = 22.50 MiB ( 4096 cells, 5 layers, 4/1 seqs), K (q4_0): 11.25 MiB, V (q4_0): 11.25 MiB
llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0
llama_kv_cache_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache: CUDA0 KV buffer size = 800.00 MiB
llama_kv_cache: size = 800.00 MiB ( 4096 cells, 25 layers, 4/1 seqs), K (f16): 400.00 MiB, V (f16): 400.00 MiB
llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0
sched_reserve: reserving ...

@danielhanchen thanks for providing the updated quants! It seems the recipes for IQ4_NL and IQ4_XS are identical: both appear to use IQ4_NL for ffn_down_exps.weight and IQ3_S for blk.7.ffn_gate_up_exps.weight. The SHA256 sums are different, though. Is there another difference somewhere else that is not visible in the quant layers?
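File hashes can differ even when layer recipes match (metadata also gets hashed), so a more telling check is to diff per-tensor quant types directly. A minimal sketch, assuming the `gguf` Python package from llama.cpp's gguf-py is installed; the file paths and the `diff_recipes` helper are hypothetical:

```python
# Sketch: compare per-tensor quantization types between two GGUF files.
# `diff_recipes` is an illustrative helper, not part of any library.

def diff_recipes(a: dict, b: dict) -> dict:
    """Return {tensor_name: (type_a, type_b)} for tensors whose types differ."""
    names = set(a) | set(b)
    return {n: (a.get(n), b.get(n)) for n in sorted(names) if a.get(n) != b.get(n)}

def tensor_types(path: str) -> dict:
    # Deferred import so the helper above stays usable without gguf installed.
    from gguf import GGUFReader  # pip install gguf
    reader = GGUFReader(path)
    return {t.name: t.tensor_type.name for t in reader.tensors}

# Usage (paths are placeholders):
# diff = diff_recipes(tensor_types("model-IQ4_NL.gguf"),
#                     tensor_types("model-IQ4_XS.gguf"))
# print(diff or "recipes match; difference must be in metadata")
```

If the diff comes back empty, the hash difference likely lives in metadata (e.g. the recomputed imatrix) rather than in the layer recipe itself.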

Can we expect a similar update to the Q8 quants soon?


danielhanchen unpinned discussion

Sign up or log in to comment