Inference speed on 12GB VRAM

#15
by drakexp - opened

If I plan to use Q6_K with llama (16 GB), does anyone have performance/speed numbers (tok/s)? I tried Qwen3.5 9B Q8 (9 GB) with 65K context, but the inference speed is too low due to swapping between RAM and VRAM.
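A rough way to anticipate the swapping problem is a back-of-the-envelope VRAM budget: model weights plus KV cache plus runtime overhead. This sketch is not from the thread; the layer/head numbers below are hypothetical placeholders, not Qwen3.5's real architecture, and the formula assumes fp16 KV cache with no quantization or offload.

```python
# Rough VRAM-fit estimate: weights + KV cache + overhead vs. available VRAM.
# Architecture numbers below are illustrative placeholders, NOT real model specs.

def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache = 2 (K and V) * layers * kv_heads * head_dim * context * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1024**3

def fits_in_vram(model_gib, kv_gib, vram_gib=12.0, overhead_gib=1.0):
    """True if weights + KV cache + a fixed overhead budget fit in VRAM."""
    return model_gib + kv_gib + overhead_gib <= vram_gib

# Hypothetical 48-layer model with GQA (8 KV heads) at 64K context:
kv = kv_cache_gib(n_layers=48, n_kv_heads=8, head_dim=128, context_len=65536)
print(f"KV cache at 64K context: {kv:.1f} GiB")
print("16 GiB Q6_K model fits in 12 GiB VRAM:", fits_in_vram(16.0, kv))
```

With these assumed numbers the KV cache alone is already sizable, which is why long contexts force spillover to system RAM even when the weights would otherwise fit.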

Not Q6_K, but with IQ4_XS I get ~600 t/s prompt processing and ~25 t/s generation.
RTX 4070
Ryzen 7700
DDR5 64 GB


If that is 25 tok/s with 600 context, I will stick with Qwen3.5 9B Q6_K, where I get ~40 tok/s with 65K context.

I meant prompt processing t/s.

Yes, the 40 tok/s is the prompt-processing speed for me at 65536 context size.

Maybe I misunderstood.
[screenshot: 2026-04-07 125118]

This is with 262144 context size.

[screenshot]
For Qwen3.5
