Error Loading in KoboldCPP

#1
by sheliak - opened

Getting an error about missing layers when attempting to load a split Q3. I believe this is related to the way the files are split. When I attempted to merge them with llama.cpp, I got a missing split information error. The issue does not happen with the single-file Q2 quant.

sheliak changed discussion title from Does not work with KoboldCPP to Error Loading in KoboldCPP

This error means that you forgot to concatenate the files after downloading the GGUF fragments. Either concatenate the downloaded files using

cat GLM-4.5-Air-Derestricted.i1-IQ3_M.gguf.part1of2 GLM-4.5-Air-Derestricted.i1-IQ3_M.gguf.part2of2 > GLM-4.5-Air-Derestricted.i1-IQ3_M.gguf

(on Windows, open Git Bash from Git for Windows in the same folder to enter this command) or download the already concatenated GGUF from https://hf.tst.eu/model#GLM-4.5-Air-Derestricted-GGUF
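The same concatenation works for any plainly byte-split files. A minimal self-contained sketch with dummy data (file names and contents are illustrative, not the real quant files):

```shell
# Create two dummy "parts" and concatenate them. The real partNofM
# files are plain byte-splits, so cat restores the original GGUF exactly.
printf 'first-half-' > model.gguf.part1of2
printf 'second-half' > model.gguf.part2of2
cat model.gguf.part1of2 model.gguf.part2of2 > model.gguf
cat model.gguf
```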

There are many reasons why we don't use the llama-split format, the most important being that it doesn't support zero-copy operations. With llama-split you need to copy all the data when splitting or merging the files, which is a massive waste of resources both on our end and on our users' side (if they want to merge them). Many of our quantization servers use hard disks and are usually disk-bottlenecked, so splitting every quant using llama-split would almost halve our quant throughput.

In addition, using llama-split would break our download page, where users can already get the already-concatenated file: it simply concatenates the download streams. Once HuggingFace lifts the 50 GB upload limit when they get rid of the legacy LFS download path, we will be in a much better position than quanters using the llama-split format, as we could work with HuggingFace to have them concatenate all our split quants server-side without having to reupload petabytes of files.

There is also no technical reason why you couldn't load the non-concatenated files. No idea why anyone would want to, as you can zero-copy concatenate them within a fraction of a second, but if you really want to, there are tools like concatfs that let you mount them as a virtually concatenated file.

It's also worth mentioning that back when mradermacher started, our way of splitting GGUFs was the standard used by TheBloke and everyone active at the time, as llama-split did not even exist yet. Back then all users were used to our way of concatenating quants, and because we continued to split that way, they still are; switching now would cause a lot of confusion.
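To illustrate why plain byte-splits make merging trivial, here is a small sketch (dummy payload and illustrative file names, not the real quants) showing that a file split at a fixed byte offset is restored bit-for-bit by simple concatenation:

```shell
# Split a dummy file at a fixed byte offset, then verify that plain
# concatenation restores it exactly. No format-specific metadata is
# needed, which is what allows zero-copy splitting and merging.
printf 'GGUF-dummy-payload-0123456789' > original.gguf
split -b 15 original.gguf part_             # produces part_aa, part_ab
cat part_aa part_ab > rejoined.gguf
cmp -s original.gguf rejoined.gguf && echo "identical"
```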
