Memory error when loading 27B model on 20 GB VRAM GPU

#10
by selmee - opened

As the docs state, the 27B model runs in 17 GB, but when I try it on a 20 GB VRAM GPU, I get the error below. What could be the possible cause?

administrator@WIN-FCJKKK2BOGK:/mnt/d/vlm_script/llama$ ./llama.cpp/llama-server --model unsloth/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf --mmproj unsloth/Qwen3.5-27B-GGUF/mmproj-F16.gguf --alias "unsloth/Qwen3.5-27B" --temp 0.6 --main-gpu 5 --top-p 0.95 --ctx-size 8192 --top-k 20 --min-p 0.00 --port 8001 --chat-template-kwargs '{"enable_thinking": false}'
ggml_cuda_init: failed to initialize CUDA: out of memory
warning: llama.cpp was compiled without support for GPU offload. Setting the main GPU has no effect.
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 8157 (2943210c1) with GNU 13.3.0 for Linux x86_64
system info: n_threads = 64, n_threads_batch = 64, total_threads = 128

system_info: n_threads = 64 (n_threads_batch = 64) / 128 | CUDA : ARCHS = 860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

init: using 127 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model 'unsloth/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
/mnt/d/vlm_script/llama/llama.cpp/ggml/src/ggml-backend.cpp:1151: GGML_ASSERT(*cur_backend_id != -1) failed

You probably have some other applications using the GPU as well; it's best to check the nvidia-smi output.
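To see what is currently holding VRAM, something along these lines should work (this is just a sketch assuming the NVIDIA driver's nvidia-smi is on your PATH; the fallback echo is only there so the snippet degrades gracefully on a machine without a visible GPU):

```shell
# Show total vs. used GPU memory, then list the processes holding it.
# --query-compute-apps reports per-process VRAM usage in CSV form.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv 2>/dev/null \
  && nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv 2>/dev/null \
  || echo "nvidia-smi not found or no NVIDIA GPU visible"
```

If another process (a browser, another inference server, a stale llama-server) shows up with several GB of used_memory, stopping it should free enough room for the 27B quant plus its KV cache.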
