New updates with many llama.cpp fixes

#4
by danielhanchen - opened

Please re-download. We just updated them again in response to:

  1. kv-cache : support attention rotation for heterogeneous iSWA https://github.com/ggml-org/llama.cpp/pull/21513
  2. CUDA: check for buffer overlap before fusing (CRITICAL: fixes <unused24> tokens) https://github.com/ggml-org/llama.cpp/pull/21566
  3. vocab : add byte token handling to BPE detokenizer for Gemma4 https://github.com/ggml-org/llama.cpp/pull/21488
  4. convert : set "add bos" == True for Gemma 4 https://github.com/ggml-org/llama.cpp/pull/21500
  5. common : add gemma 4 specialized parser https://github.com/ggml-org/llama.cpp/pull/21418
  6. llama-model: read final_logit_softcapping for Gemma 4 https://github.com/ggml-org/llama.cpp/pull/21390
  7. llama: add custom newline split for Gemma 4 https://github.com/ggml-org/llama.cpp/pull/21406

Do NOT use CUDA 13.2 to run any quant (this is not an Unsloth issue); see here. You can use our llama.cpp precompiled binary, which uses CUDA 13, or Unsloth Studio, which does not use 13.2.

