Apr 8 - New GGUF Updates

#20
by danielhanchen - opened

Please re-download. We just updated them again in response to:

  1. kv-cache : support attention rotation for heterogeneous iSWA https://github.com/ggml-org/llama.cpp/pull/21513
  2. CUDA: check for buffer overlap before fusing - CRITICAL: fixes <unused24> tokens https://github.com/ggml-org/llama.cpp/pull/21566
  3. vocab : add byte token handling to BPE detokenizer for Gemma4 https://github.com/ggml-org/llama.cpp/pull/21488
  4. convert : set "add bos" == True for Gemma 4 https://github.com/ggml-org/llama.cpp/pull/21500
  5. common : add gemma 4 specialized parser https://github.com/ggml-org/llama.cpp/pull/21418
  6. llama-model: read final_logit_softcapping for Gemma 4 https://github.com/ggml-org/llama.cpp/pull/21390
  7. llama: add custom newline split for Gemma 4 https://github.com/ggml-org/llama.cpp/pull/21406

Some quants, like Q6_K_XL, are now also faster: we changed some BF16 layers to Q8_0, since BF16 upcasting is reserved for Q8_K_XL.
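For context on the size trade-off behind that change: in ggml's block-quantization scheme, Q8_0 stores each 32-value block as 32 int8 values plus one f16 scale (34 bytes), roughly half the footprint of BF16's 2 bytes per weight. A quick illustrative sketch (the helper function is hypothetical, not part of any library):

```python
# Rough bytes-per-weight comparison for the formats mentioned above.
# Block layout follows ggml's Q8_0 scheme: 32 int8 values + one f16 scale.

def bytes_per_weight(block_bytes: int, block_size: int = 32) -> float:
    """Average storage cost per weight for a block-quantized format."""
    return block_bytes / block_size

bf16 = 2.0                   # plain 16-bit float, no blocking
q8_0 = bytes_per_weight(34)  # (32 * 1 byte) + 2-byte f16 scale = 34 bytes/block

print(f"BF16: {bf16} B/weight, Q8_0: {q8_0} B/weight")
```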

Do NOT use CUDA 13.2 to run any quant (this is not an Unsloth issue; see here). You can use our llama.cpp precompiled binary, which uses CUDA 13, or Unsloth Studio, which does not use 13.2.

danielhanchen pinned discussion

Will you guys be updating the 31b as well?

danielhanchen changed discussion title from GGUF update - `<unused24> token` fixes to Apr 8 - New GGUF Update
Unsloth AI org

Will you guys be updating the 31b as well?

Yes, it's uploading.

danielhanchen changed discussion title from Apr 8 - New GGUF Update to Apr 8 - New GGUF Updates

I do not understand why you would make new quants because of these PRs. All of them are inference fixes; they do not affect the convert.py file. Number 4 has a check in place that adds the BOS for current GGUFs.

There was really no need to generate new ggufs.

Unsloth AI org

I do not understand why you would make new quants because of these PRs. All of them are inference fixes; they do not affect the convert.py file. Number 4 has a check in place that adds the BOS for current GGUFs.

There was really no need to generate new ggufs.

Hello, the fixes do in fact affect the GGUF uploads, including the imatrix file, which needs to be recomputed. People have said the new quants are already much better.

Could you at least add model.yamls for the thinking toggles in LM Studio ;)

Latest model downloaded, latest llama.cpp downloaded (b8708); sadly the Vulkan + AMD GPU experience is still the same: when fa is on, llama.cpp crashes (input is a 24 kB / 8k-token HTML/JS file).

> /read /home/martin/tmp/al3.html
Loaded text from '/home/martin/tmp/al3.html'
> Why do i have an issue with ctx variable when the audio context is initialized?
/double free or corruption (out)
Aborted (core dumped)

Attention rotation is not working as expected. Am I doing something wrong?

.\llama-server.exe -m .\models\gguf\gemma-4-26B-A4B-it-UD-IQ4_XS.gguf -ctv q4_0 -ctk q4_0 -fa on

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 12281 MiB):
Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes, VRAM: 12281 MiB
load_backend: loaded CUDA backend from C:\LLM\ggml-cuda.dll
load_backend: loaded RPC backend from C:\LLM\ggml-rpc.dll
load_backend: loaded CPU backend from C:\LLM\ggml-cpu-zen4.dll
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b8708-ae65fbdf3

llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache: CUDA0 KV buffer size = 22.50 MiB
llama_kv_cache: size = 22.50 MiB ( 4096 cells, 5 layers, 4/1 seqs), K (q4_0): 11.25 MiB, V (q4_0): 11.25 MiB
llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0
llama_kv_cache_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache: CUDA0 KV buffer size = 800.00 MiB
llama_kv_cache: size = 800.00 MiB ( 4096 cells, 25 layers, 4/1 seqs), K (f16): 400.00 MiB, V (f16): 400.00 MiB
llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0
sched_reserve: reserving ...

@danielhanchen thanks for providing the updated quants! It seems the recipes for IQ4_NL and IQ4_XS are identical: both appear to use IQ4_NL for ffn_down_exps.weight and IQ3_S for blk.7.ffn_gate_up_exps.weight. The SHA256 sums are different, though. Is there another difference somewhere else that is not visible in the quant layers?
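File hashes can differ even when layer recipes match (metadata also gets hashed), so a more telling check is to diff per-tensor quant types directly. A minimal sketch, assuming the `gguf` Python package from llama.cpp's gguf-py is installed; the file paths and the `diff_recipes` helper are hypothetical:

```python
# Sketch: compare per-tensor quantization types between two GGUF files.
# `diff_recipes` is an illustrative helper, not part of any library.

def diff_recipes(a: dict, b: dict) -> dict:
    """Return {tensor_name: (type_a, type_b)} for tensors whose types differ."""
    names = set(a) | set(b)
    return {n: (a.get(n), b.get(n)) for n in sorted(names) if a.get(n) != b.get(n)}

def tensor_types(path: str) -> dict:
    # Deferred import so the helper above stays usable without gguf installed.
    from gguf import GGUFReader  # pip install gguf
    reader = GGUFReader(path)
    return {t.name: t.tensor_type.name for t in reader.tensors}

# Usage (paths are placeholders):
# diff = diff_recipes(tensor_types("model-IQ4_NL.gguf"),
#                     tensor_types("model-IQ4_XS.gguf"))
# print(diff or "recipes match; difference must be in metadata")
```

If the diff comes back empty, the hash difference likely lives in metadata (e.g. the recomputed imatrix) rather than in the layer recipe itself.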

Can we expect a similar update to the Q8 quants soon?


danielhanchen unpinned discussion

Sign up or log in to comment