Memory error when loading 27B model on 20 GB VRAM GPU

#10
by selmee - opened

As the docs state, the 27B model runs in 17 GB, but when I try it on a 20 GB VRAM GPU, I get the error below. What could be the possible cause?

administrator@WIN-FCJKKK2BOGK:/mnt/d/vlm_script/llama$ ./llama.cpp/llama-server --model unsloth/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf --mmproj unsloth/Qwen3.5-27B-GGUF/mmproj-F16.gguf --alias "unsloth/Qwen3.5-27B" --temp 0.6 --main-gpu 5 --top-p 0.95 --ctx-size 8192 --top-k 20 --min-p 0.00 --port 8001 --chat-template-kwargs '{"enable_thinking": false}'
ggml_cuda_init: failed to initialize CUDA: out of memory
warning: llama.cpp was compiled without support for GPU offload. Setting the main GPU has no effect.
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 8157 (2943210c1) with GNU 13.3.0 for Linux x86_64
system info: n_threads = 64, n_threads_batch = 64, total_threads = 128

system_info: n_threads = 64 (n_threads_batch = 64) / 128 | CUDA : ARCHS = 860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

init: using 127 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model 'unsloth/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
/mnt/d/vlm_script/llama/llama.cpp/ggml/src/ggml-backend.cpp:1151: GGML_ASSERT(*cur_backend_id != -1) failed

You probably have some other applications using the GPU as well; it's best to check the nvidia-smi output.
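To see what is currently holding VRAM, something along these lines should work (this is just a sketch assuming the NVIDIA driver's nvidia-smi is on your PATH; the fallback echo is only there so the snippet degrades gracefully on a machine without a visible GPU):

```shell
# Show total vs. used GPU memory, then list the processes holding it.
# --query-compute-apps reports per-process VRAM usage in CSV form.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv 2>/dev/null \
  && nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv 2>/dev/null \
  || echo "nvidia-smi not found or no NVIDIA GPU visible"
```

If another process (a browser, another inference server, a stale llama-server) shows up with several GB of used_memory, stopping it should free enough room for the 27B quant plus its KV cache.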
