Request for Q8 or Q6

#3
by silverfangx - opened

Thank you so much for making Q4_K_XL. If you could make Q8 or Q6, I would really appreciate it.

I have the files ready, but I have tried four times to upload them, and Hugging Face always raises an error:

Bad request for commit endpoint:
Your push was rejected because an LFS pointer pointed to a file that does not exist. For instance, this can happen if you used git push --no-verify to push your changes. Offending file: - UD-Q8_K_XL/GLM-4.7-PRISM-UD-Q8_K_XL-00001-of-00008.gguf

And the faulty file is never the same; it seems completely random. I will try to upload them by hand, one by one, but this is driving me crazy. (Could it be that I didn't pay for Hugging Face PRO?)

Have you tried manually re-tracking the file? I don't know if this will help or not:
git add --renormalize UD-Q8_K_XL/GLM-4.7-PRISM-UD-Q8_K_XL-00001-of-00008.gguf
git commit -m "Fix LFS pointer"
git push

I don't use git; I use the huggingface_hub Python library:

from huggingface_hub import HfApi
api = HfApi()
api.upload_folder(
    repo_id="AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF",
    folder_path="quantization/GLM-4.7-PRISM/UD-Q8_K_XK",
    path_in_repo="UD-Q8_K_XK",
    token="abcdefgh",
)

with the Python libraries hf-xet 1.2.0 and huggingface_hub 1.2.3.

I "solved" this by uploading the files one by one and restarting whenever it fails... But it should be Hugging Face's role to do this, not mine. Imagine if TCP required human verification for every packet.
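The "upload one by one and restart on failure" workaround can be automated. Here is a minimal sketch: the helper name `upload_with_retry` and the retry/delay values are my own invention; only `HfApi.upload_file` in the commented usage is the real huggingface_hub API.

```python
import time

def upload_with_retry(upload_fn, retries=5, delay=30.0):
    """Call upload_fn(); on failure, wait and retry, up to `retries` attempts."""
    for attempt in range(1, retries + 1):
        try:
            return upload_fn()
        except Exception as err:
            if attempt == retries:
                raise  # out of attempts: re-raise the last error
            print(f"attempt {attempt} failed ({err}); retrying in {delay}s")
            time.sleep(delay)

# Usage sketch (network call, not run here):
# from huggingface_hub import HfApi
# api = HfApi()
# upload_with_retry(lambda: api.upload_file(
#     path_or_fileobj="UD-Q8_K_XL/GLM-4.7-PRISM-UD-Q8_K_XL-00001-of-00008.gguf",
#     path_in_repo="UD-Q8_K_XL/GLM-4.7-PRISM-UD-Q8_K_XL-00001-of-00008.gguf",
#     repo_id="AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF",
# ))
```

Wrapping each shard separately means a transient rejection only costs a retry of that one file, not the whole folder.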

Let me know if it works for you. By the way, what are your hardware and generation speed, please? I have a 9950X3D and its maximum capacity is 192 GB of RAM (6000 MT/s), which is why I initially quantized the Q4 version; my generation speed is 5 t/s. The Q4 version barely fits in my RAM if I offload some layers onto my 5090.

My hardware is a potato, haha. I just use cloud compute. When I ran the Q8_K_XL on an AMD EPYC 64-core, I was getting [Prompt: 8.0 t/s | Generation: 2.0 t/s]. At that speed, it is unusable on the CPU. Maybe there are some optimizations that could improve the speed, but I'm planning to run the whole model on 8x RTX A6000s. Thanks again for uploading the Q8 and Q6 versions!

Ah yes, I see why yours is so slow. The AMD EPYC 64-core supports 4800 MT/s RAM, so your RAM is slower, and there is twice as much data to load per token (Q8 vs Q4): 6000/4800 * 8/4 = 2.5, so my generation speed is about 2.5x faster than yours. If you offload onto the GPUs, the CPU will have much less memory to load and it will be much faster.
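The arithmetic above can be checked quickly. The 6000 and 4800 MT/s figures come from the messages; treating generation speed as linear in memory bandwidth and inverse in bytes read per token is a back-of-the-envelope assumption:

```python
# Rough estimate: token generation is memory-bandwidth-bound, so speed
# scales with RAM speed and inversely with bytes read per token (assumed).
ram_ratio = 6000 / 4800   # 9950X3D RAM speed vs EPYC RAM speed (MT/s)
quant_ratio = 8 / 4       # bits per weight: Q8 vs Q4
speedup = ram_ratio * quant_ratio
print(speedup)            # 2.5
```

At 2.0 t/s measured on the EPYC, this predicts roughly 2.0 * 2.5 = 5 t/s on the 9950X3D with Q4, which matches the reported speed.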
