Performance report: 44-62 t/s with 72GB VRAM and 512k context
#6 by SlavikF
System:
- NVIDIA RTX 4090D 48GB
- NVIDIA RTX 3090 24GB
- 256GB DDR5-4800 RAM
- Intel Xeon W5-3425
llama.cpp (Docker image):

```
llama.cpp:server-cuda12-b8477 --hf-repo unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-IQ3_S
```

Started llama.cpp with a 512k context and this config:

```
[local-nemotron3-120b]
ctx-size=524288
kv-unified=1
parallel=1
top-p=0.95
temp=1.0
```
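For anyone wanting to reproduce this, the full launch looks roughly like the sketch below. The registry path, cache mount, host, and port are illustrative assumptions, not the exact command used; the config keys above map directly to llama-server flags.

```
# Sketch of the launch; registry path, volume mount, host, and port
# are assumptions. Config keys above map 1:1 to llama-server flags.
docker run --rm --gpus all \
  -v ~/.cache/llama.cpp:/root/.cache/llama.cpp \
  -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server-cuda12-b8477 \
  --hf-repo unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-IQ3_S \
  --ctx-size 524288 \
  --kv-unified \
  --parallel 1 \
  --top-p 0.95 \
  --temp 1.0 \
  --host 0.0.0.0 --port 8080
```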
On a small context, I'm getting ~62 t/s.
Submitted a query with 280k+ tokens:
```
prompt eval time = 188 s / 282836 tokens ( 0.67 ms per token, 1499.10 tokens per second)
eval time        =  58 s /   2616 tokens (22.37 ms per token,   44.70 tokens per second)
total time       = 247 s / 285452 tokens
```
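The same numbers are available per request: the server's /completion endpoint returns a timings object with prompt_per_second and predicted_per_second. A quick way to check throughput on your own prompts, assuming the server is at localhost:8080 (per the sketch above) and jq is installed:

```
# Send a short completion request and print the per-request timings;
# prompt_per_second / predicted_per_second match the log lines above.
curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "n_predict": 128}' \
  | jq .timings
```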