Thanks for the Strong Improvements in Coding and Terminal-Bench 2.0

#19
by mfzzzzzz - opened

Really impressive work on Kimi K2.6!

Regarding Terminal-Bench 2.0, the model card mentions that the results were obtained using the default Terminus-2 framework with the JSON parser and preserve thinking mode.

Could you clarify a few details about the evaluation setup?

  • Were per-task timeouts and resource limits strictly followed?
  • Would it be possible to share the evaluation logs or trajectories?

Thanks!

The improvements in coding and terminal-based tasks are very noticeable, and the strong performance on Terminal-Bench 2.0 is especially exciting to see. Thanks for the great work!

mfzzzzzz changed discussion title from Question on Reproducibility of Terminal-Bench 2.0 Results (Kimi K2.6 – 66.7 Score) to Thanks for the Strong Improvements in Coding and Terminal-Bench 2.0
