Anyone noticed the difference in sampling parameters between 27B and 35B-A3B (both 3.6)?
#10
by bash99 - opened
For 27B
- Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
- Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
- Instruct (or non-thinking) mode: temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
For 35B-A3B
- Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
- Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
- Instruct (or non-thinking) mode for general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
- Instruct (or non-thinking) mode for reasoning tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
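For reference, here is a minimal sketch of how these values could be plugged into vLLM's SamplingParams when self-hosting. The model path, prompt, and max_tokens are placeholders of mine, not values from the model card.

```python
from vllm import LLM, SamplingParams

# Thinking mode, general tasks, using the 35B-A3B values listed above.
# The model path is a placeholder; point it at your local checkpoint.
sampling = SamplingParams(
    temperature=1.0,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
    presence_penalty=1.5,
    repetition_penalty=1.0,
    max_tokens=4096,  # my own choice, not from the model card
)

llm = LLM(model="/path/to/model")
outputs = llm.generate(
    ["Explain the difference between dense and MoE models."], sampling
)
print(outputs[0].outputs[0].text)
```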
My first question:
For 35B-A3B in thinking mode, general tasks use presence_penalty=1.5 while coding tasks use presence_penalty=0.0.
For 27B in thinking mode, both general and coding tasks use presence_penalty=0.0.
Why is there a difference like this?
My second question:
If I use DashScope or self-host the model and code with Claude Code, should I set temperature=0.6 and presence_penalty=0.0 myself, or does DashScope apply these settings automatically?
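To make the question concrete, this is the kind of explicit override I would otherwise send on every request. This is only a sketch: the base URL reflects my understanding of DashScope's OpenAI-compatible mode (check their docs), the model name is a placeholder, and whether the endpoint honors top_k/min_p/repetition_penalty via extra_body is an assumption on my part.

```python
from openai import OpenAI

# Assumed DashScope OpenAI-compatible endpoint; verify against the DashScope docs.
client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

resp = client.chat.completions.create(
    model="your-model-name",  # placeholder
    messages=[{"role": "user", "content": "Refactor this function to be iterative."}],
    # Thinking mode, precise coding tasks (same values for 27B and 35B-A3B).
    temperature=0.6,
    top_p=0.95,
    presence_penalty=0.0,
    # top_k / min_p / repetition_penalty are not standard OpenAI parameters;
    # passing them through extra_body is an assumption, not confirmed behavior.
    extra_body={"top_k": 20, "min_p": 0.0, "repetition_penalty": 1.0},
)
print(resp.choices[0].message.content)
```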
To your first question: models are trained in different ways, and MoE models may require different sampling settings for optimal performance. 27B isn't an MoE, so it calls for different options.