Thank you for the IQ4_XS version.

#3
by eduensarceno - opened

Thank you for your work; I really appreciate it.
The mradermacher/Nemotron-Cascade-2-30B-A3B-i1-GGUF:IQ4_XS is currently the flagship model on my workstation.


Hardware

  • GPU: NVIDIA GeForce RTX™ 3050 (8 GiB)
  • CPU: 11th Gen Intel® Core™ i5-11400F × 12
  • RAM: 16 GiB
  • App: llama.cpp
  • Speed: 22 T/s

llama.cpp Tweaks for i5-11400F

# Comments moved above the command: a trailing "\ # comment" breaks the
# shell line continuation and truncates the command at that line.
#
# --fit-target 1024: the i5-11400F has no iGPU, so the fit target can stay small.
# --batch-size 128: the i5-11400F has a MAX_N_BATCH of 128.
# --mmap: required since the weights do not fit in 16 GiB of RAM.
# IQ4_XS is the sweet spot between quality and speed on this hardware.
llama-server \
    --fit on --fit-target 1024 \
    --flash-attn on \
    --kv-offload --cache-type-k q8_0 --cache-type-v q8_0 \
    --ctx-size 98304 --batch-size 128 --ubatch-size 512 \
    --mmap --backend-sampling \
    -hf mradermacher/Nemotron-Cascade-2-30B-A3B-i1-GGUF:IQ4_XS
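The q8_0 cache types above roughly halve KV-cache memory versus the default f16, which is what makes a 98304-token context feasible on 8 GiB of VRAM. A minimal sketch of the arithmetic, assuming llama.cpp's q8_0 block layout (32 int8 values plus one fp16 scale = 34 bytes per 32 elements); the layer and head counts below are illustrative placeholders, not the real Nemotron-Cascade-2-30B-A3B config:

```python
def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Total K+V cache size: two tensors (K and V) per layer, ctx tokens each."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

F16 = 2.0       # 2 bytes per element
Q8_0 = 34 / 32  # llama.cpp q8_0 block: 32 int8 values + one fp16 scale

ctx = 98304
# Assumed dimensions, for illustration only:
n_layers, n_kv_heads, head_dim = 48, 4, 128

f16_gib = kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, F16) / 2**30
q8_gib = kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, Q8_0) / 2**30
print(f"f16 KV cache : {f16_gib:.2f} GiB")
print(f"q8_0 KV cache: {q8_gib:.2f} GiB ({q8_gib / f16_gib:.0%} of f16)")
```

Whatever the true dimensions, the ratio is fixed at 34/64 ≈ 53%, so q8_0 saves just under half the KV-cache memory at any context length.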

NOTE: It is recommended to set the ZRAM size to at least 8 GiB to avoid heavy swapping to the NVMe drive.
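One way to get that 8 GiB of ZRAM is via util-linux's zramctl; a sketch, assuming a fresh system where zramctl assigns /dev/zram0 (requires root, and some distros prefer zram-generator instead):

```shell
# Create an 8 GiB zstd-compressed zram swap device.
sudo modprobe zram
sudo zramctl --find --size 8GiB --algorithm zstd
sudo mkswap /dev/zram0
sudo swapon --priority 100 /dev/zram0  # prefer zram over any NVMe swap
swapon --show                          # verify the device is active
```

The high swap priority ensures the kernel fills zram before touching slower NVMe-backed swap.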
