Not working

#1
by bgeneto - opened

It doesn't work for me — has anyone else had success with it? I'm using unsloth gemma-4-E4B-it-UD-Q4_K_XL.gguf and gemma-4-E4B-it-assistant.Q8_0.gguf. llama.cpp is invoked like this:

CUDA_VISIBLE_DEVICES=1 ./build/bin/llama-server \
    --model ~/models/gemma-4-E4B-it-UD-Q4_K_XL.gguf \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 64 \
    --ctx-size 65536 \
    --flash-attn on \
    --mtp-head ~/models/gemma-4-E4B-it-assistant.Q8_0.gguf \
    --spec-type mtp \
    --draft-block-size 2 --draft-max 8 --draft-min 0 \
    -ngl 99 -ngld 99 \
    -ctk q4_0 -ctv q4_0 -ctkd q4_0 -ctvd q4_0 \
    --batch-size 1024 \
    --ubatch-size 1024 \
    --parallel 1 \
    --threads 8 \
    --threads-batch 8 \
    --cache-ram -1 \
    --no-mmap \
    --jinja \
    --host 0.0.0.0 \
    --port 8002

But not a single request works; every one fails with this error:

/app/ggml/src/ggml.c:3665: GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2) failed
