Inference speed on 12GB VRAM

#15
by drakexp - opened

If I plan to use Q6_K with llama (16 GB), does anyone have performance/speed numbers (tok/s)? I tried Qwen3.5 9B Q8 (9 GB) with 65K context, but the inference speed is too low due to swapping between RAM and VRAM.
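A rough way to anticipate the swapping problem is a back-of-the-envelope VRAM budget: model weights plus KV cache plus runtime overhead. This sketch is not from the thread; the layer/head numbers below are hypothetical placeholders, not Qwen3.5's real architecture, and the formula assumes fp16 KV cache with no quantization or offload.

```python
# Rough VRAM-fit estimate: weights + KV cache + overhead vs. available VRAM.
# Architecture numbers below are illustrative placeholders, NOT real model specs.

def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache = 2 (K and V) * layers * kv_heads * head_dim * context * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1024**3

def fits_in_vram(model_gib, kv_gib, vram_gib=12.0, overhead_gib=1.0):
    """True if weights + KV cache + a fixed overhead budget fit in VRAM."""
    return model_gib + kv_gib + overhead_gib <= vram_gib

# Hypothetical 48-layer model with GQA (8 KV heads) at 64K context:
kv = kv_cache_gib(n_layers=48, n_kv_heads=8, head_dim=128, context_len=65536)
print(f"KV cache at 64K context: {kv:.1f} GiB")
print("16 GiB Q6_K model fits in 12 GiB VRAM:", fits_in_vram(16.0, kv))
```

With these assumed numbers the KV cache alone is already sizable, which is why long contexts force spillover to system RAM even when the weights would otherwise fit.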

Not Q6_K, but with IQ4_XS I get ~600 t/s prompt processing and ~25 t/s generation.
RTX 4070
Ryzen 7700
DDR5 64 GB


If that is 25 tok/s with 600 context, I will stick with Qwen3.5 9B Q6_K, where I get ~40 tok/s with 65K context.

I meant prompt processing t/s.

Yes, the 40 tok/s is the prompt-processing speed for me at 65536 context size.

Maybe I misunderstood.
[screenshot: 2026-04-07 125118]

This is with 262144 context size.

[screenshot]
For Qwen3.5
