GGUF quants and llama.cpp inference?

#7
by ljupco - opened

Does anyone know if there is any hope of bringing LongCat-Flash-Lite into the llama.cpp codebase, so that inference works under llama.cpp and we get GGUF quants? On my MacBook Pro M2 (with 96 GB RAM), mlx-lm inference is 5–10× slower than equivalent GGUF model quants running under llama.cpp's llama-server. I can only presume that flash attention, 8-bit KV caches, prompt caching, etc. account for the huge difference.
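For context, these are the llama-server features I mean. A sketch of the invocation I'd want to run if a GGUF existed (the model filename and context size below are placeholders, since no GGUF conversion of LongCat-Flash-Lite exists yet):

```shell
# Hypothetical invocation: the GGUF filename is a placeholder.
# --flash-attn enables flash attention; --cache-type-k / --cache-type-v q8_0
# store the KV cache in 8-bit; llama-server reuses cached prompt prefixes
# per slot automatically.
llama-server \
  -m longcat-flash-lite.Q4_K_M.gguf \
  -c 8192 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

That combination is what keeps memory use and prompt-reprocessing time down for me with other models of this size.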