Inference speed on 12GB VRAM
#15
by drakexp - opened
If I plan to use the Q6_K quant (16GB) with llama.cpp, does anyone have performance/speed numbers (tok/s)? I tried Qwen3.5 9B Q8 (9GB) with 65K context, but the inference speed is too low due to swapping between RAM and VRAM.
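For what it's worth, the swapping makes sense once you add up weights plus KV cache. Here's a rough back-of-the-envelope check in Python; the layer/head/dimension numbers are illustrative assumptions, not this model's actual config:

```python
# Rough VRAM estimate: model weights + KV cache must fit, or layers spill to system RAM.
# The n_layers / n_kv_heads / head_dim values below are assumed for illustration only.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    # K and V caches: one entry per layer per token, fp16 (2 bytes) by default
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1024**3

model_gib = 16.0  # Q6_K file size from the question
ctx = 65536       # 65K context
kv = kv_cache_gib(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=ctx)
print(f"weights ~{model_gib:.1f} GiB + KV cache ~{kv:.1f} GiB "
      f"= ~{model_gib + kv:.1f} GiB total, vs 12 GiB VRAM")
```

With those assumed dimensions the KV cache alone is ~12 GiB at 65K context, so a chunk of the model inevitably ends up in system RAM.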
Not Q6_K, but with IQ4_XS I get 600 pp / 25 tg on:
RTX 4070
Ryzen 7700
DDR5 64GB
If that is 25 tok/s with a 600-token context, I will stick with Qwen3.5 9B Q6_K, where I get ~40 tok/s with a 65K context.
I meant the 600 is prompt processing t/s, not context size.
Yes, the 40 tok/s is the prompt processing speed for me with a 65536-token context size.
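If anyone wants to reproduce these numbers, here's a minimal timing sketch using llama-cpp-python (assuming a CUDA-enabled install; the model path and prompt are placeholders). With verbose=True, llama.cpp should also print its own breakdown of prompt eval t/s vs generation t/s after the call:

```python
import time
from llama_cpp import Llama

# verbose=True lets llama.cpp print its timing log (prompt eval vs eval speeds).
# Model path, context size, and prompt below are placeholders for illustration.
llm = Llama(model_path="model-Q6_K.gguf", n_gpu_layers=-1, n_ctx=65536, verbose=True)

prompt = "Explain speculative decoding in two paragraphs. " * 40  # long-ish prompt

t0 = time.perf_counter()
out = llm(prompt, max_tokens=128)
dt = time.perf_counter() - t0

usage = out["usage"]
print(f"{usage['prompt_tokens']} prompt + {usage['completion_tokens']} generated tokens "
      f"in {dt:.1f}s (~{usage['total_tokens'] / dt:.0f} t/s end to end)")
```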

