
Performance report: 44-62 t/s with 72GB VRAM and 512k context

#6
by SlavikF - opened

System:

  • NVIDIA RTX 4090D 48GB
  • NVIDIA RTX 3090 24GB
  • 256GB DDR5-4800 RAM
  • Intel Xeon W5-3425

```
llama.cpp:server-cuda12-b8477 --hf-repo unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-IQ3_S
```

Started llama.cpp with a 512k context:

```ini
[local-nemotron3-120b]
ctx-size=524288
kv-unified=1
parallel=1
top-p=0.95
temp=1.0
```
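For anyone running the server binary directly rather than through a config stanza, the settings above map roughly onto `llama-server` flags. This is a sketch only; the flag names are taken from recent llama.cpp builds and have not been verified against build b8477:

```shell
# Hypothetical equivalent llama-server invocation (flag names assumed
# from current llama.cpp; check `llama-server --help` on your build).
llama-server \
  --hf-repo unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-IQ3_S \
  --ctx-size 524288 \
  --kv-unified \
  --parallel 1 \
  --top-p 0.95 \
  --temp 1.0
```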

On a small context, I'm getting ~62 t/s.

Submitted a query with a 280k+ token prompt:

```
prompt eval time =  188 s / 282836 tokens (    0.67 ms per token,  1499.10 tokens per second)
       eval time =   58 s /  2616 tokens (   22.37 ms per token,    44.70 tokens per second)
      total time =  247 s / 285452 tokens
```
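As a sanity check, the reported rates follow directly from the per-token timings in the log. The whole-second figures are rounded by llama.cpp, so rates recomputed from them differ slightly from the per-token values it prints:

```python
# Recompute throughput from the numbers in the log above.
prompt_tokens = 282_836
prompt_ms_per_token = 0.67   # as printed (rounded)
gen_tokens = 2_616
gen_ms_per_token = 22.37     # as printed

prompt_tps = 1000 / prompt_ms_per_token  # ~1492.5 t/s (log prints 1499.10,
                                         # from the unrounded ms value)
gen_tps = 1000 / gen_ms_per_token        # ~44.70 t/s, matching the log

# Total token count is the sum of prompt and generated tokens.
assert prompt_tokens + gen_tokens == 285_452

print(f"prompt: {prompt_tps:.0f} t/s, generation: {gen_tps:.2f} t/s")
```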
