Can not reproduce evaluation results on SWE-Verified

#63
by cppowboy - opened

Tried to evaluate Qwen3.5 397B on SWE-Bench Verified, using Openhands or SWE-Agent scaffold, temperature 0.6, topp 0.95, topk 20, and max input and output length.

The resolve rate on SWE-Bench Verified turn out to be about 60%. Are there any tricks to evaluate Qwen3.5 on SWE tasks?

cppowboy changed discussion status to closed

Sign up or log in to comment