lm_eval leaderboard benchmark
This version seems to match or even improve performance across general benchmarks! Great work.
Qwen/Qwen3.5-9B:
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard | 1 | none | | acc | ↑ | 0.5490 | ± | 0.0045 |
| | | none | | acc_norm | ↑ | 0.5739 | ± | 0.0052 |
| | | none | | exact_match | ↑ | 0.3965 | ± | 0.0128 |
| | | none | | inst_level_loose_acc | ↑ | 0.6379 | ± | N/A |
| | | none | | inst_level_strict_acc | ↑ | 0.6163 | ± | N/A |
| | | none | | prompt_level_loose_acc | ↑ | 0.5083 | ± | 0.0215 |
| | | none | | prompt_level_strict_acc | ↑ | 0.4806 | ± | 0.0215 |
| - leaderboard_bbh | | none | | acc_norm | ↑ | 0.6190 | ± | 0.0058 |
| - leaderboard_gpqa | | none | | acc_norm | ↑ | 0.4446 | ± | 0.0144 |
| - leaderboard_math_hard | | none | | exact_match | ↑ | 0.3965 | ± | 0.0128 |
| - leaderboard_musr | | none | | acc_norm | ↑ | 0.4339 | ± | 0.0176 |
Jackrong/Qwen3.5-9B-Neo:
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard | 1 | none | | acc | ↑ | 0.5348 | ± | 0.0045 |
| | | none | | acc_norm | ↑ | 0.5784 | ± | 0.0051 |
| | | none | | exact_match | ↑ | 0.4063 | ± | 0.0124 |
| | | none | | inst_level_loose_acc | ↑ | 0.5564 | ± | N/A |
| | | none | | inst_level_strict_acc | ↑ | 0.5264 | ± | N/A |
| | | none | | prompt_level_loose_acc | ↑ | 0.4140 | ± | 0.0212 |
| | | none | | prompt_level_strict_acc | ↑ | 0.3789 | ± | 0.0209 |
| - leaderboard_bbh | | none | | acc_norm | ↑ | 0.6277 | ± | 0.0057 |
| - leaderboard_gpqa | | none | | acc_norm | ↑ | 0.4136 | ± | 0.0143 |
| - leaderboard_math_hard | | none | | exact_match | ↑ | 0.4063 | ± | 0.0124 |
| - leaderboard_musr | | none | | acc_norm | ↑ | 0.4630 | ± | 0.0179 |
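For anyone who wants to reproduce these numbers: the exact command isn't shown above, but a typical lm-evaluation-harness invocation for the leaderboard group looks roughly like this (the model args and paths here are assumptions, not what was actually used):

```shell
# Sketch only: adjust model_args / output_path to your setup.
lm_eval --model hf \
  --model_args pretrained=Qwen/Qwen3.5-9B,dtype=bfloat16 \
  --tasks leaderboard \
  --batch_size auto \
  --output_path results/qwen3.5-9b
```

Swap `pretrained=` for `Jackrong/Qwen3.5-9B-Neo` to get the second table.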
Thanks for running these benchmarks! From a side-by-side comparison, the "Neo" version generally matches the stock model's performance, but I don't think we can confidently say it improves anything, at least in a statistical sense (though I haven't run significance tests).
At most it looks like a trade-off between musr (Neo is marginally higher) and gpqa (stock 9B is higher).
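As a rough sketch of that "statistical sense" point (assuming the reported stderrs are standard errors from independent runs), one can check whether a gap exceeds the combined uncertainty at roughly 95% confidence:

```python
import math

def significant_diff(a, se_a, b, se_b, z=1.96):
    """Two-sided check: is |a - b| larger than z times the combined
    standard error? Assumes independent runs with Gaussian errors."""
    combined_se = math.sqrt(se_a**2 + se_b**2)
    return abs(a - b) > z * combined_se

# gpqa acc_norm: 0.4446 ± 0.0144 (stock) vs 0.4136 ± 0.0143 (Neo)
print(significant_diff(0.4446, 0.0144, 0.4136, 0.0143))  # False

# musr acc_norm: 0.4339 ± 0.0176 (stock) vs 0.4630 ± 0.0179 (Neo)
print(significant_diff(0.4339, 0.0176, 0.4630, 0.0179))  # False
```

Neither the gpqa nor the musr gap clears the ~95% threshold under these assumptions, which is consistent with calling the two models a wash overall.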
Really appreciate you taking the time to run these benchmarks — it means a lot! 🙏
With your permission, I’d love to reference your results in the model card.
I’d say the real improvement here is being able to fix Qwen’s overthinking problem without heavy accuracy loss.
I would be happy to see that; let me know if you need more detailed results!
Thanks for the great model and the benchmark info!
I plan to use the 9B model as a local AI agent via llama.cpp plus proxy tools.
I'm torn between your Neo and Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF; could you give me some advice?
Thanks in advance.