New flag: kv-unified. Performance report: 90 t/s on RTX 4090D 48GB
#18
by SlavikF - opened
Recently llama.cpp added a new flag:

```
--kv-unified    use single unified KV buffer shared across all sequences
```
kv-unified didn't affect speed for me, but it saved a few GB of VRAM. Now the max context fits into 48GB of VRAM.
Actual VRAM usage is ~44GB.
Running UD-Q8_K_XL on my system with an RTX 4090D 48GB:
- prompt processing: 4000 t/s
- token generation: 90 t/s
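For reference, outside of Docker the flag can be passed straight to llama-server. This is a hedged sketch, not my exact setup: the model path is a placeholder, and the context size matches the one used in the config below.

```shell
# Minimal direct invocation -- model path is a placeholder, adjust to your GGUF file.
./llama-server \
  -m /path/to/model.gguf \
  -c 202752 \
  --kv-unified \
  --host 0.0.0.0 --port 8080
```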
docker compose:
```yaml
services:
  llama-router:
    image: ghcr.io/ggml-org/llama.cpp:full-cuda12-b7842
    container_name: router
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ports:
      - "8080:8080"
    volumes:
      - /home/slavik/.cache/llama.cpp/router:/root/.cache/llama.cpp/router
      - ./models.ini:/app/models.ini
    entrypoint: ["./llama-server"]
    command: >
      --models-dir /root/.cache/llama.cpp/router
      --models-max 1
      --models-preset ./models.ini
      --host 0.0.0.0 --port 8080
```
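With the stack up, the router can be smoke-tested through the server's OpenAI-compatible endpoint. This is a sketch assuming the default chat completions route; the model name matches the preset defined in models.ini below.

```shell
# Quick smoke test against the router (assumes the compose stack is running).
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local-GLM4.7-30b", "messages": [{"role": "user", "content": "Hello"}]}'
```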
models.ini
```ini
version = 1

[local-GLM4.7-30b]
ctx-size=202752
temp=0.7
top-p=1.0
min-p=0.01
jinja=1
kv-unified=1
fit=off
```
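For a rough sense of why the KV cache dominates VRAM at a 202752-token context, here is a back-of-the-envelope estimator. The layer count, KV-head count, and head dimension below are hypothetical placeholders, not the actual architecture of this model; substitute the real values from the model's config.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx: int, bytes_per_elem: int = 2) -> int:
    """Rough KV cache size for a dense-attention model:
    2 (K and V) * layers * kv_heads * head_dim * ctx * bytes per element (fp16 = 2)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Hypothetical architecture numbers -- check the model card for real ones.
ctx = 202752  # ctx-size from models.ini above
size = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128, ctx=ctx)
print(f"~{size / 2**30:.1f} GiB KV cache at {ctx} tokens")
```

The estimate scales linearly with context length, which is why trimming per-sequence KV overhead (as unified KV does) frees enough VRAM to fit the maximum context.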
Mostly, everything works great. But sometimes it gets into a loop with very repetitive reasoning...
Oh very nice!
Unified KV is now enabled by default in llama.cpp if the number of slots is auto (-1), and since that's the default for the number of slots, unified KV should also be the default 🙂
It was not enabled by default for me
Interesting, perhaps there's some condition that affects whether it's enabled by default? Maybe GPU architecture or context size?