New flag: kv-unified. Performance report: 90 t/s on RTX 4090D 48GB
#18
by SlavikF - opened
Recently llama.cpp added a new flag:

```
--kv-unified    use single unified KV buffer shared across all sequences
```
kv-unified didn't affect speed for me, but it saved a few GB of VRAM. Now the max context fits into 48GB of VRAM.
Actual VRAM usage is ~44GB.
Running UD-Q8_K_XL on my system with an RTX 4090D 48GB:
- prompt processing: 4000 t/s
- token generation: 90 t/s
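For reference, outside of Docker the flag can be passed straight to llama-server. This is a hedged sketch, not my exact setup: the model path is a placeholder, and the context size matches the one used in the config below.

```shell
# Minimal direct invocation -- model path is a placeholder, adjust to your GGUF file.
./llama-server \
  -m /path/to/model.gguf \
  -c 202752 \
  --kv-unified \
  --host 0.0.0.0 --port 8080
```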
docker compose:
```yaml
services:
  llama-router:
    image: ghcr.io/ggml-org/llama.cpp:full-cuda12-b7842
    container_name: router
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ports:
      - "8080:8080"
    volumes:
      - /home/slavik/.cache/llama.cpp/router:/root/.cache/llama.cpp/router
      - ./models.ini:/app/models.ini
    entrypoint: ["./llama-server"]
    command: >
      --models-dir /root/.cache/llama.cpp/router
      --models-max 1
      --models-preset ./models.ini
      --host 0.0.0.0 --port 8080
```
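With the stack up, the router can be smoke-tested through the server's OpenAI-compatible endpoint. This is a sketch assuming the default chat completions route; the model name matches the preset defined in models.ini below.

```shell
# Quick smoke test against the router (assumes the compose stack is running).
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local-GLM4.7-30b", "messages": [{"role": "user", "content": "Hello"}]}'
```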
models.ini
```ini
version = 1

[local-GLM4.7-30b]
ctx-size=202752
temp=0.7
top-p=1.0
min-p=0.01
jinja=1
kv-unified=1
fit=off
```
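For a rough sense of why the KV cache dominates VRAM at a 202752-token context, here is a back-of-the-envelope estimator. The layer count, KV-head count, and head dimension below are hypothetical placeholders, not the actual architecture of this model; substitute the real values from the model's config.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx: int, bytes_per_elem: int = 2) -> int:
    """Rough KV cache size for a dense-attention model:
    2 (K and V) * layers * kv_heads * head_dim * ctx * bytes per element (fp16 = 2)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Hypothetical architecture numbers -- check the model card for real ones.
ctx = 202752  # ctx-size from models.ini above
size = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128, ctx=ctx)
print(f"~{size / 2**30:.1f} GiB KV cache at {ctx} tokens")
```

The estimate scales linearly with context length, which is why trimming per-sequence KV overhead (as unified KV does) frees enough VRAM to fit the maximum context.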
Mostly, everything works great. But sometimes it gets into a loop with very repetitive reasoning...
Oh very nice!
Unified KV is now enabled by default in llama.cpp if the number of slots is auto (-1), and since that's the default for the number of slots, unified KV should also be the default 🙂
It was not enabled by default for me
Interesting, perhaps there's some condition that affects whether it's enabled by default? Maybe GPU architecture or context size?