NEW: llama.cpp: Using `ngram-mod` to Get a 2x Speed Boost in Long Chats/Agents!

#20
by PussyHut - opened

https://github.com/ggml-org/llama.cpp/pull/19164

Basically, all you need to do is add

--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64

and now you get at least a 2x speed boost in agent, coding, and long-chat use!
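
For anyone unsure where those flags go, here is a minimal sketch of a full launch command. It assumes you start the model with `llama-server` from a build that already includes the PR linked above. The model path `./model.gguf` is a placeholder, not something from the post. The speculative-decoding flags are copied as-is from above, and I have not tuned or verified the values myself.

```shell
# Placeholder model path (an assumption); point this at your own GGUF file.
MODEL="./model.gguf"

# Speculative-decoding flags, copied verbatim from the post above.
SPEC_ARGS="--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64"

# Assemble the full command and print it so it can be checked before launching.
CMD="llama-server -m $MODEL $SPEC_ARGS"
echo "$CMD"

# To actually start the server, run the printed command, for example:
# llama-server -m "$MODEL" $SPEC_ARGS
```

Printing the command first is only a convenience. Any other options you already pass to your existing launch line should keep working next to these flags, but that is an assumption worth checking against your own setup.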


Works with all MoE models!
