Speed inference UD-IQ2_M
#2
by Ukro - opened
Can anybody give some info on tg/pp (token generation / prompt processing speeds) for the suggested setup of 1x24GB GPU and 256GB of RAM?
It would be great if the docs suggested some tg/pp numbers for the most commonly used GPUs.
It's going to be very slow. With ik_llama and the model fully in VRAM, this model gives ~500-1000 tps prefill and 35-40 tps generation, and that drops a lot as context grows. If any layers spill into RAM at all, even just a bit, generation drops to 10-12 tps. I haven't tested low-VRAM environments; my guess is ~1.5 tps max for generation. You can use dense 40GB models as a point of reference.
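To put those numbers in perspective, here is a quick back-of-the-envelope sketch (assuming the ~1.5 tps low-VRAM guess and the ~35 tps fully-in-VRAM figure above; the function name is just for illustration):

```python
# Rough wall-clock estimates for generating a response at a given speed.
def seconds_for_tokens(tokens: int, tps: float) -> float:
    """Time in seconds to generate `tokens` at `tps` tokens/second."""
    return tokens / tps

# At the guessed 1.5 tps, a 500-token answer takes ~5.6 minutes;
# fully in VRAM at 35 tps, the same answer takes ~14 seconds.
print(f"{seconds_for_tokens(500, 1.5) / 60:.1f} min")  # → 5.6 min
print(f"{seconds_for_tokens(500, 35):.0f} s")          # → 14 s
```

So even partial spill into RAM changes the experience from interactive to batch-style waiting.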