Thank you for the IQ4_XS version.
#3
by eduensarceno - opened
Thank you for your work; I really appreciate it.
The mradermacher/Nemotron-Cascade-2-30B-A3B-i1-GGUF:IQ4_XS is currently the flagship model on my workstation.
Hardware
- GPU: NVIDIA GeForce RTX™ 3050 (8 GiB)
- CPU: 11th Gen Intel® Core™ i5-11400F × 12 (16 GiB RAM)
- App: llama.cpp
- Speed: 22 T/s
llama.cpp Tweaks for i5-11400F
```sh
# Notes:
# - --fit: the i5-11400F has no iGPU, so the GPU budget is set explicitly.
# - --batch-size 128: the i5-11400F tops out at a MAX_N_BATCH of 128.
# - mmap: required, since the weights do not fit entirely in RAM.
# - IQ4_XS: the sweet spot between quality and speed.
llama-server \
  --fit on --fit-target 1024 \
  --flash-attn on \
  --kv-offload --cache-type-k q8_0 --cache-type-v q8_0 \
  --ctx-size 98304 --batch-size 128 --ubatch-size 512 \
  --mmap --backend-sampling \
  -hf mradermacher/Nemotron-Cascade-2-30B-A3B-i1-GGUF:IQ4_XS
```
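For a sense of what `--ctx-size 98304` with q8_0 K/V cache costs in memory, here is a rough back-of-the-envelope sketch. The layer/head numbers below are illustrative assumptions, not the model's actual architecture; read the real values from the GGUF metadata of your download:

```shell
# Hypothetical architecture values for illustration only:
CTX=98304; LAYERS=48; KV_HEADS=4; HEAD_DIM=128
# q8_0 packs 32 values into 34 bytes (32 data + 2 scale) => 34/32 bytes/element
BYTES_X32=34
# Elements per cache tensor: ctx * layers * kv_heads * head_dim
ELEMS=$(( CTX * LAYERS * KV_HEADS * HEAD_DIM ))
# K and V caches together, scaled to q8_0's bytes-per-element
KV_BYTES=$(( 2 * ELEMS * BYTES_X32 / 32 ))
echo "$(( KV_BYTES / 1024 / 1024 )) MiB"   # ~4.8 GiB with these assumed numbers
```

With numbers in this ballpark, the q8_0 cache roughly halves the footprint of a full 96k-token f16 cache, which is what makes a context this large workable on 8 GiB of VRAM plus 16 GiB of RAM.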
NOTE: It is recommended to set the ZRAM size to at least 8 GiB to prevent heavy swapping between the NVMe drive and RAM.
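One way to set that up, as a sketch assuming a distribution that ships systemd's zram-generator (the size field is in MiB):

```ini
# /etc/systemd/zram-generator.conf
[zram0]
zram-size = 8192
compression-algorithm = zstd
```

After writing the file, `systemctl daemon-reload && systemctl start systemd-zram-setup@zram0.service` (or a reboot) activates the device; `zramctl` shows it.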