gguf
A GGUF version would help distribution as well. I tried with the GGUF-my-repo tool, but it throws an exception. I thought it would just work, since it's a Llama architecture.
Since this fine-tune is a classic one (no new added layers, etc.), it's 1:1 with the original except for the weights, tokenizer content, and embeddings. I need to look at how Unsloth did it, or maybe we can ask them politely to convert this fine-tune. I also have a new RAG-fine-tuned version of this model; I'll upload it when I have spare time. It's much better at following instructions.
I doubt they will add this to their catalog, but they can assist with converting.
I'm still waiting for something smaller; running a 30B dense model on consumer hardware is not an easy task. An MoE, sure.
I run this model in a workflow/agentic environment where humans are waiting for output, on a vLLM 4xA40 GPU node. So that's not consumer hardware.
Yes, pushing consumer-hardware limits with these models is quite an adventure, alas.
I have found that conversion to GGUF using llama.cpp's convert_hf_to_gguf.py fails due to an unrecognized 'no' in README.md:
language: [en, de, fr, es, it, pt, nl, pl, lv, et, lt, cs, sk, ro, bg, sl, hr, sv, da, fi, hu, uk, ru, zh, hi, ja, ko, el, no]
Probably should be:
language: [en, de, fr, es, it, pt, nl, pl, lv, et, lt, cs, sk, ro, bg, sl, hr, sv, da, fi, hu, uk, ru, zh, hi, ja, ko, el, nb]
Thank you! Easy fix!
Cool. That made it work, I guess. Now downloading a GGUF:
https://huggingface.co/KnutJaegersberg/tildeopen-30b-mu-instruct-Q8_0-GGUF
https://huggingface.co/KnutJaegersberg/tildeopen-30b-mu-instruct-Q4_K_M-GGUF
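For anyone repeating this, a rough sketch of the llama.cpp conversion and quantization flow that would produce quants like these (model paths are placeholders, not the actual commands used):

```shell
# Sketch, assuming a current llama.cpp checkout; paths are placeholders.
# 1) Convert the HF checkpoint to an f16 GGUF
python convert_hf_to_gguf.py ./tildeopen-30b-mu-instruct \
    --outfile tildeopen-30b-f16.gguf --outtype f16

# 2) Quantize to the uploaded formats
./llama-quantize tildeopen-30b-f16.gguf tildeopen-30b-Q8_0.gguf Q8_0
./llama-quantize tildeopen-30b-f16.gguf tildeopen-30b-Q4_K_M.gguf Q4_K_M
```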
I will perhaps check how the tokenizer behaves in this conversion (whether it uses the slow one or something else), since even if it's broken, it still predicts somewhat plausible but degraded output tokens.