
Performance report: 44-62 t/s with 72GB VRAM and 512k context

#6
by SlavikF - opened

System:

  • NVIDIA RTX 4090D 48GB
  • NVIDIA RTX 3090 24GB
  • 256GB DDR5-4800 RAM
  • Intel Xeon W5-3425

```
llama.cpp:server-cuda12-b8477 --hf-repo unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-IQ3_S
```

Started llama.cpp with a 512k context:

```ini
[local-nemotron3-120b]
ctx-size=524288
kv-unified=1
parallel=1
top-p=0.95
temp=1.0
```
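For anyone running the server binary directly rather than through a config stanza, the settings above map roughly onto `llama-server` flags. This is a sketch only; the flag names are taken from recent llama.cpp builds and have not been verified against build b8477:

```shell
# Hypothetical equivalent llama-server invocation (flag names assumed
# from current llama.cpp; check `llama-server --help` on your build).
llama-server \
  --hf-repo unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-IQ3_S \
  --ctx-size 524288 \
  --kv-unified \
  --parallel 1 \
  --top-p 0.95 \
  --temp 1.0
```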

On a small context, I'm getting ~62 t/s.

Submitted a query with a 280k+ token prompt:

```
prompt eval time =  188 s / 282836 tokens (    0.67 ms per token,  1499.10 tokens per second)
       eval time =   58 s /  2616 tokens (   22.37 ms per token,    44.70 tokens per second)
      total time =  247 s / 285452 tokens
```
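As a sanity check, the reported rates follow directly from the per-token timings in the log. The whole-second figures are rounded by llama.cpp, so rates recomputed from them differ slightly from the per-token values it prints:

```python
# Recompute throughput from the numbers in the log above.
prompt_tokens = 282_836
prompt_ms_per_token = 0.67   # as printed (rounded)
gen_tokens = 2_616
gen_ms_per_token = 22.37     # as printed

prompt_tps = 1000 / prompt_ms_per_token  # ~1492.5 t/s (log prints 1499.10,
                                         # from the unrounded ms value)
gen_tps = 1000 / gen_ms_per_token        # ~44.70 t/s, matching the log

# Total token count is the sum of prompt and generated tokens.
assert prompt_tokens + gen_tokens == 285_452

print(f"prompt: {prompt_tps:.0f} t/s, generation: {gen_tps:.2f} t/s")
```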
