Thank you for the IQ4_XS version.

#3
by eduensarceno - opened

Thank you for your work; I really appreciate it.
The mradermacher/Nemotron-Cascade-2-30B-A3B-i1-GGUF:IQ4_XS is currently the flagship model on my workstation.


Hardware

  • GPU: NVIDIA GeForce RTX™ 3050 (8 GiB)
  • CPU: 11th Gen Intel® Core™ i5-11400F × 12
  • RAM: 16 GiB
  • App: llama.cpp
  • Speed: 22 T/s

llama.cpp Tweaks for i5-11400F

# Comments moved above the command: a trailing "\ # comment" breaks the
# shell line continuation and truncates the command at that line.
#
# --fit-target 1024: the i5-11400F has no iGPU, so the fit target can stay small.
# --batch-size 128: the i5-11400F has a MAX_N_BATCH of 128.
# --mmap: required since the weights do not fit in 16 GiB of RAM.
# IQ4_XS is the sweet spot between quality and speed on this hardware.
llama-server \
    --fit on --fit-target 1024 \
    --flash-attn on \
    --kv-offload --cache-type-k q8_0 --cache-type-v q8_0 \
    --ctx-size 98304 --batch-size 128 --ubatch-size 512 \
    --mmap --backend-sampling \
    -hf mradermacher/Nemotron-Cascade-2-30B-A3B-i1-GGUF:IQ4_XS
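The q8_0 cache types above roughly halve KV-cache memory versus the default f16, which is what makes a 98304-token context feasible on 8 GiB of VRAM. A minimal sketch of the arithmetic, assuming llama.cpp's q8_0 block layout (32 int8 values plus one fp16 scale = 34 bytes per 32 elements); the layer and head counts below are illustrative placeholders, not the real Nemotron-Cascade-2-30B-A3B config:

```python
def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Total K+V cache size: two tensors (K and V) per layer, ctx tokens each."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

F16 = 2.0       # 2 bytes per element
Q8_0 = 34 / 32  # llama.cpp q8_0 block: 32 int8 values + one fp16 scale

ctx = 98304
# Assumed dimensions, for illustration only:
n_layers, n_kv_heads, head_dim = 48, 4, 128

f16_gib = kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, F16) / 2**30
q8_gib = kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, Q8_0) / 2**30
print(f"f16 KV cache : {f16_gib:.2f} GiB")
print(f"q8_0 KV cache: {q8_gib:.2f} GiB ({q8_gib / f16_gib:.0%} of f16)")
```

Whatever the true dimensions, the ratio is fixed at 34/64 ≈ 53%, so q8_0 saves just under half the KV-cache memory at any context length.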

NOTE: It is recommended to set the ZRAM size to at least 8 GiB to avoid heavy swapping to the NVMe drive.
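One way to get that 8 GiB of ZRAM is via util-linux's zramctl; a sketch, assuming a fresh system where zramctl assigns /dev/zram0 (requires root, and some distros prefer zram-generator instead):

```shell
# Create an 8 GiB zstd-compressed zram swap device.
sudo modprobe zram
sudo zramctl --find --size 8GiB --algorithm zstd
sudo mkswap /dev/zram0
sudo swapon --priority 100 /dev/zram0  # prefer zram over any NVMe swap
swapon --show                          # verify the device is active
```

The high swap priority ensures the kernel fills zram before touching slower NVMe-backed swap.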
