Speed inference UD-IQ2_M
#2
by Ukro - opened
Can anybody give some info on tg/pp (token generation / prompt processing speeds) for the suggested setup of 1x24GB GPU and 256GB of RAM?
It would be great if the docs suggested some tg/pp numbers for the most commonly used GPUs.
It's going to be very slow. With ik_llama and the model fully in VRAM, this model gives ~500-1000 tps prefill and 35-40 tps generation, and that drops a lot as context grows. If any layers spill into RAM at all, even just a bit, generation drops to 10-12 tps. I haven't tested low-VRAM environments; my guess is ~1.5 tps max for generation. You can use dense 40GB models as a point of reference.
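To put those numbers in perspective, here is a quick back-of-the-envelope sketch (assuming the ~1.5 tps low-VRAM guess and the ~35 tps fully-in-VRAM figure above; the function name is just for illustration):

```python
# Rough wall-clock estimates for generating a response at a given speed.
def seconds_for_tokens(tokens: int, tps: float) -> float:
    """Time in seconds to generate `tokens` at `tps` tokens/second."""
    return tokens / tps

# At the guessed 1.5 tps, a 500-token answer takes ~5.6 minutes;
# fully in VRAM at 35 tps, the same answer takes ~14 seconds.
print(f"{seconds_for_tokens(500, 1.5) / 60:.1f} min")  # → 5.6 min
print(f"{seconds_for_tokens(500, 35):.0f} s")          # → 14 s
```

So even partial spill into RAM changes the experience from interactive to batch-style waiting.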