Cannot reproduce evaluation results on SWE-bench Verified
#63
by cppowboy - opened
I tried to evaluate Qwen3.5 397B on SWE-bench Verified using the OpenHands or SWE-agent scaffold, with temperature 0.6, top_p 0.95, top_k 20, and the maximum input and output lengths.
The resolve rate on SWE-bench Verified turned out to be about 60%. Are there any tricks for evaluating Qwen3.5 on SWE tasks?
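For reference, here is a minimal sketch of the sampling settings described above, written as a plain Python dict with a sanity check. The key names follow the common `temperature` / `top_p` / `top_k` convention and are an assumption; they are not taken from the OpenHands or SWE-agent configuration schema, and `validate_sampling` is a hypothetical helper.

```python
# Sampling settings quoted in this post. The key names are an assumed
# convention, not the actual OpenHands or SWE-agent config keys.
SAMPLING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20}


def validate_sampling(cfg: dict) -> bool:
    """Hypothetical helper: check that the sampling values are in sane ranges."""
    return (
        cfg["temperature"] >= 0.0
        and 0.0 < cfg["top_p"] <= 1.0
        and isinstance(cfg["top_k"], int)
        and cfg["top_k"] > 0
    )


if __name__ == "__main__":
    print(validate_sampling(SAMPLING))
```

Logging a dict like this next to each run would make it easier to confirm that both scaffolds actually received the same settings.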
cppowboy changed discussion status to closed