GGUF quants and llama.cpp inference?

#7
by ljupco - opened

Does anyone know if there is any hope of bringing LongCat-Flash-Lite into the llama.cpp codebase, so that inference works under llama.cpp and we get GGUF quants? On my MacBook Pro M2 (with 96 GB RAM), mlx-lm inference is 5–10× slower than equivalent GGUF model quants running under llama.cpp's llama-server. I can only presume that flash attention, 8-bit KV caches, prompt caching, etc. account for the huge difference.
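For context, these are the llama-server features I mean. A sketch of the invocation I'd want to run if a GGUF existed (the model filename and context size below are placeholders, since no GGUF conversion of LongCat-Flash-Lite exists yet):

```shell
# Hypothetical invocation: the GGUF filename is a placeholder.
# --flash-attn enables flash attention; --cache-type-k / --cache-type-v q8_0
# store the KV cache in 8-bit; llama-server reuses cached prompt prefixes
# per slot automatically.
llama-server \
  -m longcat-flash-lite.Q4_K_M.gguf \
  -c 8192 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

That combination is what keeps memory use and prompt-reprocessing time down for me with other models of this size.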