Identical models but different scores?
#605
by TPH441 - opened
Bobi099/Qwen3.5-27B-heretic is literally just a duplicate of the coder3101/Qwen3.5-27B-heretic repo, but the two have different scores on the leaderboard.
Yeah, I'm not able to test models fully deterministically, so there's a bit of variance between runs. vLLM batching is kind of inherently non-deterministic, thinking models need some randomness to think through unique ideas, and models don't write as well under deterministic settings.
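To give a rough feel for how much two runs of the same model can drift apart, here is a minimal simulation. It is only a sketch and not the leaderboard's actual scoring code: `run_benchmark`, `TRUE_PASS_RATE`, and `N_QUESTIONS` are hypothetical names and numbers, and each question is assumed to pass independently with a fixed probability.

```python
import random
import statistics

def run_benchmark(true_pass_rate, n_questions, seed):
    """Simulate one run: each question passes with probability true_pass_rate.

    Hypothetical stand-in for a real evaluation with sampling enabled.
    """
    rng = random.Random(seed)
    passed = sum(rng.random() < true_pass_rate for _ in range(n_questions))
    return 100.0 * passed / n_questions

TRUE_PASS_RATE = 0.6  # assumed underlying ability, identical for both repos
N_QUESTIONS = 200     # assumed benchmark size

# Score the "same model" 20 times; only the random seed changes.
scores = [run_benchmark(TRUE_PASS_RATE, N_QUESTIONS, seed) for seed in range(20)]
spread = statistics.stdev(scores)

print(f"min={min(scores):.1f} max={max(scores):.1f} stdev={spread:.2f}")
```

Under these assumptions the expected run-to-run standard deviation is about 100 * sqrt(0.6 * 0.4 / 200) ≈ 3.5 points, so a gap of a few points between two identical repos would not be surprising.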
DontPlanToEnd changed discussion status to closed