Thanks for the Strong Improvements in Coding and Terminal-Bench 2.0
#19
by mfzzzzzz - opened
Really impressive work on Kimi K2.6!
Regarding Terminal-Bench 2.0, the model card mentions that results were obtained using the default Terminus-2 framework with the JSON parser and preserve thinking mode.
Could you clarify a few details about the evaluation setup?
- Were per-task timeouts and resource limits strictly followed?
- Would it be possible to share evaluation logs or trajectories?
Thanks!
The improvements in coding and terminal-based tasks are very noticeable, and the strong performance on Terminal-Bench 2.0 is especially exciting to see. Thanks for the great work!
mfzzzzzz changed discussion title from Question on Reproducibility of Terminal-Bench 2.0 Results (Kimi K2.6 – 66.7 Score) to Thanks for the Strong Improvements in Coding and Terminal-Bench 2.0