reduce repetition

#1
by mag1art - opened
llama-server -hf 12bitmisfit/Qwen3-Coder-30B-A3B-Instruct_Pruned_REAP-15B-A3B-GGUF:Q4_K_M -ngl 99 -c 32768 --cache-type-k q8_0 --cache-type-v q8_0 --temp 0.6 --host 0.0.0.0 --alias Qwen3-Coder-30B-A3B-Instruct_Q4 --jinja

Thank you, it works. But how do you reduce repetition, especially in long multi-turn conversations?

You could try adding --repeat-penalty 1.3 (note the value is a separate argument, not --repeat-penalty = 1.3).
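For intuition, here is a minimal Python sketch (not llama.cpp's actual code) of the classic repetition-penalty idea that --repeat-penalty is based on: the logits of recently generated tokens are rescaled so those tokens become less likely to be picked again.

```python
def apply_repeat_penalty(logits, recent_tokens, penalty=1.3):
    """Scale down the logits of tokens seen in the recent context.

    Positive logits are divided by the penalty, negative logits are
    multiplied by it, so penalized tokens always lose probability mass.
    """
    out = list(logits)
    for tok in set(recent_tokens):
        if out[tok] > 0:
            out[tok] /= penalty   # shrink positive logits
        else:
            out[tok] *= penalty   # push negative logits further down
    return out

# Tokens 0 and 1 appeared recently; token 2 is untouched.
logits = [2.0, -1.0, 0.5]
penalized = apply_repeat_penalty(logits, recent_tokens=[0, 1])
print(penalized)
```

A penalty of 1.0 disables the effect; values much above ~1.3 can start to degrade coherence, so it is worth tuning per model.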

If I were you, I'd look into using router mode on llama-server by providing it with an INI file; it makes managing different load settings for one model, or for multiple models, a bit easier.

A launch command could look like:

./llama-server --models-preset ./models.ini --port 8080 --models-max 1

while the ini file could look like:

version = 1

[*]
flash-attn = on
cache-type-k = q8_0
cache-type-v = q8_0 
metrics = on
jinja = on
spec-type = ngram-mod
spec-ngram-size-n = 24
draft-min = 0
draft-max = 64
repeat-penalty = 1.3
b = 2048
ub = 2048

[Qwen3-Coder-30B-A3B-Instruct_Q4_non_creative]
model = path/to/your/gguf_file.gguf
c = 32768
temp = 0.2
ngl = -1


[Qwen3-Coder-30B-A3B-Instruct_Q4_creative]
model = path/to/your/gguf_file.gguf
c = 32768
temp = 0.6
ngl = -1

[Qwen3-Coder-30B-A3B-Instruct_Q4_lq_extra_ctx]
model = path/to/your/gguf_file.gguf
c = 65536
temp = 0.6
ngl = -1
cache-type-k = q4_0
cache-type-v = q4_0 

[Qwen3-Coder-30B-A3B-Instruct_Q4_hq_extra_ctx]
model = path/to/your/gguf_file.gguf
c = 65536
temp = 0.6
cmoe = 1 # keeps MoE expert weights mostly in system RAM
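The layout above follows standard INI semantics: the [*] section holds shared defaults that every model section inherits and can override (as the _lq_extra_ctx preset does with the cache types). Assuming that is how llama-server merges the sections, you can sketch the behavior with Python's stdlib configparser by treating [*] as the default section. The INI text below is a trimmed, hypothetical version of the file above, not the full preset.

```python
import configparser

# Trimmed example mirroring the models.ini structure above.
INI_TEXT = """
[*]
cache-type-k = q8_0
repeat-penalty = 1.3

[Qwen3-Coder-30B-A3B-Instruct_Q4_creative]
temp = 0.6

[Qwen3-Coder-30B-A3B-Instruct_Q4_lq_extra_ctx]
temp = 0.6
cache-type-k = q4_0
"""

# Treat "[*]" as the defaults section so its keys merge into every preset.
cfg = configparser.ConfigParser(default_section="*")
cfg.read_string(INI_TEXT)

creative = cfg["Qwen3-Coder-30B-A3B-Instruct_Q4_creative"]
lq = cfg["Qwen3-Coder-30B-A3B-Instruct_Q4_lq_extra_ctx"]

print(creative["cache-type-k"])   # inherited from [*]
print(lq["cache-type-k"])         # overridden in the section itself
print(creative["repeat-penalty"]) # shared default applies everywhere
```

This is why you only need to repeat a key in a model section when you want to diverge from the shared defaults.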
