Gibberish

#3
by labhund - opened

Fits in 16GB VRAM as advertised, on an NVIDIA RTX 5080.

Unfortunately it seems to output gibberish:

Reasoning output (it never made it out of reasoning to a final answer):

珊败cadvigragragwynrag败败珊珊珊wyngdubitudence败letcherletcher羽othi xposubit柔珊禹gdarrowwyn溅误溅羽I淹gd羽教会割 Cad教会败败败珊创立羽uka羽创立羽Ad珊ragvig kad_教会AdAd败ubit败rag误误örd珊rag羽羽羽udence禹迂羽uka羽vig败羽 Cad羽I溅raggdrag淹Ialg禹 cadrag教会珊 cad仆柔ubit败割 Cad禹IukaubitAppe充创立溅教会势创立 cad珊gd园地arrowrag Cad强身 cadAd割珊禹gdothiördragarrowAdarrowragoccaarrow

Qwopus3.5-27B-v3-TQ3_4S.gguf

Serve script (built from your repo):

#!/usr/bin/env bash

set -euo pipefail

MODEL="/data1/models/qwopus3.5-27b-v3-tq3_4s/Qwopus3.5-27B-v3-TQ3_4S.gguf"
PORT="${1:-8003}"
CTX_SIZE=1000
PARALLEL=4
BATCH_SIZE=2048
UBATCH_SIZE=512
GPU_LAYERS=99
LLAMA_SERVER="$HOME/repos/llama.cpp-tq3/build/bin/llama-server"

if [ ! -f "$MODEL" ]; then
    echo "ERROR: Model not found at $MODEL"
    exit 1
fi

if [ ! -x "$LLAMA_SERVER" ]; then
    echo "ERROR: llama-server (CUDA) not found at $LLAMA_SERVER"
    exit 1
fi

echo "Starting Qwopus3.5-27B-v3 TQ3_4S on GPU"
echo "Port: $PORT | Context: $CTX_SIZE | Parallel: $PARALLEL | GPU layers: $GPU_LAYERS"

exec "$LLAMA_SERVER" \
    --model "$MODEL" \
    --port "$PORT" \
    --host 0.0.0.0 \
    --n-gpu-layers "$GPU_LAYERS" \
    --ctx-size "$CTX_SIZE" \
    --batch-size "$BATCH_SIZE" \
    --ubatch-size "$UBATCH_SIZE" \
    --parallel "$PARALLEL" \
    --cont-batching \
    --flash-attn on \
    --jinja \
    --no-warmup

I changed it to the recommended settings in the repo. Still gibberish.


Script:

#!/usr/bin/env bash

set -euo pipefail

MODEL="/data1/models/qwopus3.5-27b-v3-tq3_4s/Qwopus3.5-27B-v3-TQ3_4S.gguf"
PORT="${1:-8003}"
CTX_SIZE=1000
PARALLEL=4
BATCH_SIZE=2048
UBATCH_SIZE=512
GPU_LAYERS=99
LLAMA_SERVER="$HOME/repos/llama.cpp-tq3/build/bin/llama-server"

if [ ! -f "$MODEL" ]; then
    echo "ERROR: Model not found at $MODEL"
    exit 1
fi

if [ ! -x "$LLAMA_SERVER" ]; then
    echo "ERROR: llama-server (CUDA) not found at $LLAMA_SERVER"
    exit 1
fi

echo "Starting Qwopus3.5-27B-v3 TQ3_4S on GPU"
echo "Port: $PORT | Context: $CTX_SIZE | Parallel: $PARALLEL | GPU layers: $GPU_LAYERS"

exec "$LLAMA_SERVER" \
    --model "$MODEL" \
    --port "$PORT" \
    --host 0.0.0.0 \
    --n-gpu-layers "$GPU_LAYERS" \
    --ctx-size "$CTX_SIZE" \
    --batch-size "$BATCH_SIZE" \
    --ubatch-size "$UBATCH_SIZE" \
    --parallel "$PARALLEL" \
    --cont-batching \
    --flash-attn on \
    --jinja \
    --reasoning on --reasoning-budget 0 --temp 0.6 --top-k 20 --min-p 0 --repeat-penalty 1.0 \
    --no-warmup
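For anyone following along: the same sampler settings can also be sent per request, since llama-server exposes an OpenAI-compatible endpoint that accepts llama.cpp's extra sampler fields in the request body. A minimal smoke-test sketch (the port matches the script above; the prompt and the commented-out curl call are just illustrations):

```shell
# Hypothetical per-request smoke test against the server started above.
# top_k, min_p, and repeat_penalty are llama.cpp-specific fields that
# llama-server accepts alongside the standard OpenAI ones.
PAYLOAD=$(cat <<'EOF'
{
  "messages": [{"role": "user", "content": "Say hello in one word."}],
  "temperature": 0.6,
  "top_k": 20,
  "min_p": 0,
  "repeat_penalty": 1.0,
  "max_tokens": 32
}
EOF
)

# Validate the JSON locally before sending it.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"

# curl -s http://localhost:8003/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$PAYLOAD"
```

If the reply is still gibberish with these settings, the problem is in the model or the kernels rather than in sampling.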

Will get back to you. On holiday atm


Can I check that you are using the latest code from the main branch?

╭─  llama.cpp-tq3 on  main [!✓]
╰─ ➜ git status
On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   ggml/src/ggml-cuda/ggml-cuda.cu

no changes added to commit (use "git add" and/or "git commit -a")

╭─  llama.cpp-tq3 on  main [!✓]
╰─ ➜ git log
commit 1cfa8910c6ff939f7caf684cf91a5e499129f2cf (HEAD -> main, tag: main-b8665-1cfa891, origin/main, origin/feature/tq3-decode-vdr16, origin/HEAD)
Author: charpdev <charpdev@users.noreply.github.com>
Date: Mon Apr 6 11:56:31 2026 +0100

perf: VDR=16 for TQ3_4S MMVQ on SM120 Blackwell

Process 16 weight elements (2 subgroups) per thread in decode kernel.

PP 27B: 353 tok/s (unchanged from base)
TG 27B: 15.3 → 25.0 tok/s (+63%)
PPL 9B: 7.8197 (identical)

╭─  llama.cpp-tq3 on  main [!✓]
╰─ ✗ git diff took 1m16s
diff --git i/ggml/src/ggml-cuda/ggml-cuda.cu w/ggml/src/ggml-cuda/ggml-cuda.cu
index 04ac27a80..62a08c010 100644
--- i/ggml/src/ggml-cuda/ggml-cuda.cu
+++ w/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -63,7 +63,7 @@
#include "ggml-cuda/fill.cuh"
#include "ggml-quants.h"
#include "ggml-cuda/tq3-native.cuh"
-static void ggml_cuda_op_turbo_wht(ggml_backend_cuda_context & ctx, ggml_tensor * dst);
+#include "ggml-cuda/turbo-wht.cuh"
#include "ggml-cuda/tq3-prefill.cuh"
#include "ggml.h"

╭─  llama.cpp-tq3 on  main [!✓]
╰─ ➜

^^ The change was made because otherwise the code would not compile.
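For reference, a self-contained sketch of the kind of check that confirms a working tree is actually on the latest origin/main. It is demonstrated on a scratch repo so it runs anywhere; in practice you would run the last three commands inside the llama.cpp-tq3 checkout:

```shell
# Build a throwaway "upstream" repo and clone it, so the check below
# is reproducible without touching a real checkout.
tmp=$(mktemp -d)
git -C "$tmp" init -q -b main upstream
git -C "$tmp/upstream" -c user.email=dev@example.com -c user.name=dev \
    commit -q --allow-empty -m "init"
git clone -q "$tmp/upstream" "$tmp/work"
cd "$tmp/work"

git fetch -q origin
git status --porcelain            # empty output => clean working tree
git rev-parse HEAD origin/main    # matching hashes => up to date with origin/main
git diff --stat origin/main       # any local patches (like the ggml-cuda.cu edit) show up here
```

In the transcript above, `git status` reports a modified ggml-cuda.cu, so the tree is on latest main plus one local patch.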

This has been fixed.

Nice! Sorry, but I also recently learned there was an issue with CUDA 13.2 that might have been contributing to the problem!

Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0

Mon Apr 13 17:18:47 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |

13.0.2

So it should work
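A quick sketch for pulling the toolkit release number out of the `nvcc --version` banner, using the exact line quoted above as sample input (on a live system, pipe `nvcc --version` in instead):

```shell
# Extract the CUDA release number from nvcc's banner line.
# The sample string is the output quoted earlier in this thread.
NVCC_BANNER="Cuda compilation tools, release 13.0, V13.0.88"
release=$(printf '%s\n' "$NVCC_BANNER" | sed -n 's/.*release \([0-9.]*\),.*/\1/p')
echo "CUDA toolkit release: $release"   # prints "CUDA toolkit release: 13.0"
```

Handy for confirming whether a box is on the 13.0 toolkit or the problematic 13.2.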

YTan2000 changed discussion status to closed
