Gibberish
Fits in 16GB VRAM as advertised.
Using an NVIDIA RTX 5080 GPU.
Unfortunately it seems to output gibberish:
Reasoning output (it never made it past reasoning to a final answer):
珊败cadvigragragwynrag败败珊珊珊wyngdubitudence败letcherletcher羽othi xposubit柔珊禹gdarrowwyn溅误溅羽I淹gd羽教会割 Cad教会败败败珊创立羽uka羽创立羽Ad珊ragvig kad_教会AdAd败ubit败rag误误örd珊rag羽羽羽udence禹迂羽uka羽vig败羽 Cad羽I溅raggdrag淹Ialg禹 cadrag教会珊 cad仆柔ubit败割 Cad禹IukaubitAppe充创立溅教会势创立 cad珊gd园地arrowrag Cad强身 cadAd割珊禹gdothiördragarrowAdarrowragoccaarrow
Model: Qwopus3.5 27B, v3, TQ3_4S.gguf
Serve script (built from your repo):
#!/usr/bin/env bash
set -euo pipefail
MODEL="/data1/models/qwopus3.5-27b-v3-tq3_4s/Qwopus3.5-27B-v3-TQ3_4S.gguf"
PORT="${1:-8003}"
CTX_SIZE=1000
PARALLEL=4
BATCH_SIZE=2048
UBATCH_SIZE=512
GPU_LAYERS=99
LLAMA_SERVER="$HOME/repos/llama.cpp-tq3/build/bin/llama-server"
if [ ! -f "$MODEL" ]; then
echo "ERROR: Model not found at $MODEL"
exit 1
fi
if [ ! -x "$LLAMA_SERVER" ]; then
echo "ERROR: llama-server (CUDA) not found at $LLAMA_SERVER"
exit 1
fi
echo "Starting Qwopus3.5-27B-v3 TQ3_4S on GPU"
echo "Port: $PORT | Context: $CTX_SIZE | Parallel: $PARALLEL | GPU layers: $GPU_LAYERS"
exec "$LLAMA_SERVER" \
    --model "$MODEL" \
    --port "$PORT" \
    --host 0.0.0.0 \
    --n-gpu-layers "$GPU_LAYERS" \
    --ctx-size "$CTX_SIZE" \
    --batch-size "$BATCH_SIZE" \
    --ubatch-size "$UBATCH_SIZE" \
    --parallel "$PARALLEL" \
    --cont-batching \
    --flash-attn on \
    --jinja \
    --no-warmup
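For reference, this is the smoke test I've been using to reproduce it (my own quick check, assuming llama-server's native /completion endpoint; temperature 0 so reruns are comparable):

```shell
# Deterministic smoke test against the running server. /completion is
# llama-server's native endpoint; adjust PORT if you passed a different one.
PORT=8003   # matches the serve script above
URL="http://127.0.0.1:${PORT}/completion"
DATA='{"prompt":"The capital of France is","n_predict":16,"temperature":0}'
curl -s "$URL" -H "Content-Type: application/json" -d "$DATA" || echo "server not reachable on port ${PORT}"
```

A coherent model should complete this with "Paris"; instead I get the token salad above.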
I changed it to the recommended settings in the repo. Still gibberish.
Script:
#!/usr/bin/env bash
set -euo pipefail
MODEL="/data1/models/qwopus3.5-27b-v3-tq3_4s/Qwopus3.5-27B-v3-TQ3_4S.gguf"
PORT="${1:-8003}"
CTX_SIZE=1000
PARALLEL=4
BATCH_SIZE=2048
UBATCH_SIZE=512
GPU_LAYERS=99
LLAMA_SERVER="$HOME/repos/llama.cpp-tq3/build/bin/llama-server"
if [ ! -f "$MODEL" ]; then
echo "ERROR: Model not found at $MODEL"
exit 1
fi
if [ ! -x "$LLAMA_SERVER" ]; then
echo "ERROR: llama-server (CUDA) not found at $LLAMA_SERVER"
exit 1
fi
echo "Starting Qwopus3.5-27B-v3 TQ3_4S on GPU"
echo "Port: $PORT | Context: $CTX_SIZE | Parallel: $PARALLEL | GPU layers: $GPU_LAYERS"
exec "$LLAMA_SERVER" \
    --model "$MODEL" \
    --port "$PORT" \
    --host 0.0.0.0 \
    --n-gpu-layers "$GPU_LAYERS" \
    --ctx-size "$CTX_SIZE" \
    --batch-size "$BATCH_SIZE" \
    --ubatch-size "$UBATCH_SIZE" \
    --parallel "$PARALLEL" \
    --cont-batching \
    --flash-attn on \
    --jinja \
    --reasoning on --reasoning-budget 0 \
    --temp 0.6 --top-k 20 --min-p 0 --repeat-penalty 1.0 \
    --no-warmup
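Also double-checked that the download itself isn't truncated (my own quick check, not from the repo docs); a valid GGUF file starts with the 4-byte ASCII magic "GGUF":

```shell
# A truncated or corrupt download can also produce token salad, so verify
# the GGUF magic bytes at the start of the file first.
MODEL="/data1/models/qwopus3.5-27b-v3-tq3_4s/Qwopus3.5-27B-v3-TQ3_4S.gguf"
MAGIC="$(head -c 4 "$MODEL" 2>/dev/null || true)"
if [ "$MAGIC" = "GGUF" ]; then
    echo "GGUF header OK"
else
    echo "bad or missing GGUF header: '${MAGIC}'"
fi
```

The header checks out here, so it doesn't look like a corrupt file.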
Will get back to you. On holiday atm
Can I check that you're using the latest code from the main branch?
╭─ llama.cpp-tq3 on main [!✓]
╰─ ➜ git status
On branch main
Your branch is up to date with 'origin/main'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: ggml/src/ggml-cuda/ggml-cuda.cu
no changes added to commit (use "git add" and/or "git commit -a")
╭─ llama.cpp-tq3 on main [!✓]
╰─ ➜ git log
commit 1cfa8910c6ff939f7caf684cf91a5e499129f2cf (HEAD -> main, tag: main-b8665-1cfa891, origin/main, origin/feature/tq3-decode-vdr16, origin/HEAD)
Author: charpdev <charpdev@users.noreply.github.com>
Date: Mon Apr 6 11:56:31 2026 +0100
perf: VDR=16 for TQ3_4S MMVQ on SM120 Blackwell
Process 16 weight elements (2 subgroups) per thread in decode kernel.
PP 27B: 353 tok/s (unchanged from base)
TG 27B: 15.3 → 25.0 tok/s (+63%)
PPL 9B: 7.8197 (identical)
╭─ llama.cpp-tq3 on main [!✓]
╰─ ✗ git diff took 1m16s
diff --git i/ggml/src/ggml-cuda/ggml-cuda.cu w/ggml/src/ggml-cuda/ggml-cuda.cu
index 04ac27a80..62a08c010 100644
--- i/ggml/src/ggml-cuda/ggml-cuda.cu
+++ w/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -63,7 +63,7 @@
#include "ggml-cuda/fill.cuh"
#include "ggml-quants.h"
#include "ggml-cuda/tq3-native.cuh"
-static void ggml_cuda_op_turbo_wht(ggml_backend_cuda_context & ctx, ggml_tensor * dst);
+#include "ggml-cuda/turbo-wht.cuh"
#include "ggml-cuda/tq3-prefill.cuh"
#include "ggml.h"
╭─ llama.cpp-tq3 on main [!✓]
╰─ ➜
^^ I made that change because otherwise the code would not compile.
This has been fixed.
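To pick up the fix cleanly, something like the following should work (a sketch, assuming the standard llama.cpp CMake CUDA build flags; drop the local workaround first, since the include is now upstream):

```shell
# Revert the local workaround, fast-forward to the fixed main, and rebuild.
REPO="$HOME/repos/llama.cpp-tq3"
if [ -d "$REPO/.git" ]; then
    cd "$REPO"
    git checkout -- ggml/src/ggml-cuda/ggml-cuda.cu   # drop the local edit
    git pull --ff-only                                # fast-forward to fixed main
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j
else
    echo "repo not found at $REPO"
fi
```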
Nice! Sorry, I also recently learned there was an issue with CUDA 13.2 that might have been contributing to the problem!
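If it helps, here's how I'd check which toolkit release the build picked up (assuming nvcc is on PATH; the 13.2 issue is per the comment above, I haven't confirmed it myself):

```shell
# Extract the toolkit release from `nvcc --version` output
# (e.g. "Cuda compilation tools, release 12.4, V12.4.131") and flag 13.2,
# the release mentioned above as potentially problematic.
CUDA_RELEASE="$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p')"
if [ "$CUDA_RELEASE" = "13.2" ]; then
    echo "CUDA 13.2 toolkit detected - possibly affected"
else
    echo "CUDA toolkit release: ${CUDA_RELEASE:-nvcc not found}"
fi
```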
