Coding parameters used for Goose and Zed

#3
by dugrema - opened

The model card recommends this for coding tasks:

Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
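For reference, these samplers map directly onto the JSON body of llama.cpp's OpenAI-compatible `/v1/chat/completions` endpoint. A minimal sketch of the request payload, assuming a local llama-server (the prompt is a placeholder; the extra sampler fields beyond the OpenAI set are accepted by llama.cpp's server, not by OpenAI's API):

```python
import json

# Thinking-mode sampling settings quoted from the model card.
payload = {
    "messages": [{"role": "user", "content": "placeholder coding prompt"}],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,            # llama.cpp-specific extension field
    "min_p": 0.0,           # llama.cpp-specific extension field
    "presence_penalty": 0.0,
    "repeat_penalty": 1.0,  # llama.cpp-specific extension field
}

body = json.dumps(payload)
```

You would POST `body` to a running llama-server; the point here is just which knobs the recommendation corresponds to.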

I found that with the UD Q4 version, the model gets stuck in thinking on simple prompts like "Hi", or after a few steps on more complex topics. I just increased the presence penalty to 0.1 and it seemed to work well enough. I had it fix an old, unfinished Python implementation of Pacman with Zed. This is the first medium-size model (<100B params) I've used that does not struggle at all with Zed's edit_file tool. It works REALLY well.
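For anyone wondering why a small presence penalty can break those thinking loops: in OpenAI-style sampling, every token that has already appeared in the output gets a flat subtraction from its logit, so tokens a loop keeps reusing gradually lose probability. A toy sketch of the mechanism (not llama.cpp's actual implementation; dicts stand in for tensors):

```python
def apply_presence_penalty(logits, generated_ids, penalty=0.1):
    """Subtract a flat penalty from the logit of every token already emitted.

    logits: dict mapping token id -> raw logit (toy stand-in for a tensor).
    generated_ids: token ids produced so far.
    """
    seen = set(generated_ids)
    return {tok: (logit - penalty if tok in seen else logit)
            for tok, logit in logits.items()}

# Token 7 has already been emitted (repeatedly), so its logit drops;
# unseen token 9 is untouched. The penalty is flat, not per-occurrence.
adjusted = apply_presence_penalty({7: 2.0, 9: 1.5},
                                  generated_ids=[7, 7, 7], penalty=0.1)
```

With penalty=0.1 the nudge is small, which is presumably why it unsticks loops without noticeably hurting code generation.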

Full params for coding on llama.cpp (latest main-branch build, 8148, which fixes the template warnings):

-fitt 100 --fit-ctx 131072 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.1 --batch-size 2048 --ubatch-size 512 --n-predict 20000

That gives me 30-40 tok/s on an RTX 5060. I figure a presence penalty isn't ideal for code and could have side effects. Right now I just get the occasional stoppage in the middle of a task, and I don't feel like calling up Ralph.
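Throughput figures like these are easy to sanity-check against the token count and wall-clock time llama.cpp prints at the end of a run. A trivial helper (the function name is mine, not llama.cpp's):

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Generation throughput: tokens emitted divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed_s

# e.g. 1200 tokens in 40 s is 30 tok/s, the low end of the range quoted above
rate = tokens_per_second(1200, 40.0)
```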

Does anyone have any better parameter combination for coding?

Same issue here, but it happened on the 35BA3B UD-Q4_K_XL.
I had to switch to Qwen3 Coder Next to finish the rest of the work.

I found that UD-Q6_K_XL works properly with --presence-penalty 0.0, as recommended by Qwen. That's what I'll go with. I get about 30 tokens/s on an RTX 5060 with 16 GB of VRAM and these settings:

-fitt 100 --fit-ctx 65536 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0 --ubatch-size 256 --n-predict 20000

That leaves 23 MoE layers overflowing to RAM; from what I see, that's about 16 GB of additional memory used by llama.cpp.
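A back-of-the-envelope check on that overflow figure, assuming the extra RAM is spread roughly evenly across the offloaded expert layers (illustrative numbers only; real per-layer sizes vary with the quant):

```python
def ram_per_offloaded_layer(total_overflow_gb: float, n_layers: int) -> float:
    """Rough average RAM cost per MoE layer kept on the CPU side."""
    return total_overflow_gb / n_layers

# ~16 GB spread over 23 offloaded layers comes out to roughly 0.7 GB per layer,
# a plausible size for one quantized expert layer of a ~30B-class MoE model.
per_layer = round(ram_per_offloaded_layer(16.0, 23), 2)
```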
