Anyone noticed the difference in sampling parameters between 27B and 35B-A3B (both 3.6)?

#10
by bash99 - opened

For 27B

  • Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
  • Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
  • Instruct (or non-thinking) mode: temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

For 35B-A3B

  • Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
  • Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
  • Instruct (or non-thinking) mode for general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
  • Instruct (or non-thinking) mode for reasoning tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
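
For easier comparison, here is a small sketch that collects the two sets of recommendations as Python presets. The model keys and dict layout are just my own illustration, not anything official; the values are copied from the lists above.

```python
# Recommended sampling presets as listed in the model cards.
# Keys ("27B", "35B-A3B", preset names) are illustrative, not official identifiers.
SAMPLING_PRESETS = {
    "27B": {
        "thinking_general": dict(temperature=1.0, top_p=0.95, top_k=20,
                                 min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0),
        "thinking_coding":  dict(temperature=0.6, top_p=0.95, top_k=20,
                                 min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0),
        "instruct":         dict(temperature=0.7, top_p=0.80, top_k=20,
                                 min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0),
    },
    "35B-A3B": {
        "thinking_general":   dict(temperature=1.0, top_p=0.95, top_k=20,
                                   min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0),
        "thinking_coding":    dict(temperature=0.6, top_p=0.95, top_k=20,
                                   min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0),
        "instruct_general":   dict(temperature=0.7, top_p=0.8, top_k=20,
                                   min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0),
        "instruct_reasoning": dict(temperature=1.0, top_p=0.95, top_k=20,
                                   min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0),
    },
}
```

The main difference is visible in the thinking-mode general presets: presence_penalty=0.0 for 27B vs. 1.5 for 35B-A3B.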

My first question

35B-A3B thinking mode: presence_penalty=1.5 for general tasks, presence_penalty=0.0 for coding tasks.
27B thinking mode: presence_penalty=0.0 for general tasks, presence_penalty=0.0 for coding tasks.
Why do they differ like this?

My second question

If I use DashScope or self-host the model, and code with Claude Code, should I set temperature=0.6 and presence_penalty=0.0 myself, or does DashScope do it automatically?
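
For context, this is roughly how I would pass the coding settings explicitly to a self-hosted OpenAI-compatible server (e.g. vLLM). The endpoint, API key, and model id below are placeholders, and I'm not sure whether the extra_body fields are honored by DashScope.

```python
from openai import OpenAI

# Placeholder endpoint and credentials for a self-hosted OpenAI-compatible server.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="your-35B-A3B-model",  # placeholder model id
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    # Explicitly pass the recommended coding settings instead of relying on server defaults.
    temperature=0.6,
    top_p=0.95,
    presence_penalty=0.0,
    # Non-standard sampling knobs go through extra_body; whether they are applied
    # depends on the backend (vLLM accepts these; DashScope may ignore them).
    extra_body={"top_k": 20, "min_p": 0.0, "repetition_penalty": 1.0},
)
print(response.choices[0].message.content)
```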


To your first question: models are trained in different ways, and MoE models may need different sampling settings for optimal performance. The 27B is a dense model, not an MoE, so its recommended settings differ.
