lm_eval leaderboard benchmark

#1 · opened by selimaktas

This version seems to match or even improve performance across the general benchmarks! Great work

Qwen/Qwen3.5-9B:

| Groups                  | Version | Filter | n-shot | Metric                   | Value  | Stderr   |
|-------------------------|---------|--------|--------|--------------------------|--------|----------|
| leaderboard             | 1       | none   |        | acc                      | 0.5490 | ± 0.0045 |
|                         |         | none   |        | acc_norm                 | 0.5739 | ± 0.0052 |
|                         |         | none   |        | exact_match              | 0.3965 | ± 0.0128 |
|                         |         | none   |        | inst_level_loose_acc     | 0.6379 | ± N/A    |
|                         |         | none   |        | inst_level_strict_acc    | 0.6163 | ± N/A    |
|                         |         | none   |        | prompt_level_loose_acc   | 0.5083 | ± 0.0215 |
|                         |         | none   |        | prompt_level_strict_acc  | 0.4806 | ± 0.0215 |
| - leaderboard_bbh       |         | none   |        | acc_norm                 | 0.6190 | ± 0.0058 |
| - leaderboard_gpqa      |         | none   |        | acc_norm                 | 0.4446 | ± 0.0144 |
| - leaderboard_math_hard |         | none   |        | exact_match              | 0.3965 | ± 0.0128 |
| - leaderboard_musr      |         | none   |        | acc_norm                 | 0.4339 | ± 0.0176 |

Jackrong/Qwen3.5-9B-Neo:

| Groups                  | Version | Filter | n-shot | Metric                   | Value  | Stderr   |
|-------------------------|---------|--------|--------|--------------------------|--------|----------|
| leaderboard             | 1       | none   |        | acc                      | 0.5348 | ± 0.0045 |
|                         |         | none   |        | acc_norm                 | 0.5784 | ± 0.0051 |
|                         |         | none   |        | exact_match              | 0.4063 | ± 0.0124 |
|                         |         | none   |        | inst_level_loose_acc     | 0.5564 | ± N/A    |
|                         |         | none   |        | inst_level_strict_acc    | 0.5264 | ± N/A    |
|                         |         | none   |        | prompt_level_loose_acc   | 0.4140 | ± 0.0212 |
|                         |         | none   |        | prompt_level_strict_acc  | 0.3789 | ± 0.0209 |
| - leaderboard_bbh       |         | none   |        | acc_norm                 | 0.6277 | ± 0.0057 |
| - leaderboard_gpqa      |         | none   |        | acc_norm                 | 0.4136 | ± 0.0143 |
| - leaderboard_math_hard |         | none   |        | exact_match              | 0.4063 | ± 0.0124 |
| - leaderboard_musr      |         | none   |        | acc_norm                 | 0.4630 | ± 0.0179 |
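
These metric names match lm-evaluation-harness's `leaderboard` task group (the Open LLM Leaderboard v2 set: BBH, GPQA, IFEval, MATH-hard, MMLU-Pro, MuSR). For anyone who wants to run something similar, here is a minimal sketch via the harness's Python API; the arguments are illustrative placeholders and not necessarily the exact settings behind the tables above:

```python
# Reproduction sketch (illustrative, not the exact settings used for these runs).
# Requires: pip install lm-eval, plus a GPU that fits a 9B model.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                                    # transformers backend
    model_args="pretrained=Jackrong/Qwen3.5-9B-Neo,dtype=bfloat16",
    tasks=["leaderboard"],                                         # bbh, gpqa, ifeval, math_hard, mmlu_pro, musr
    batch_size="auto",
)

# Per-task metrics (acc, acc_norm, exact_match, ...) live under results["results"].
for task, metrics in sorted(results["results"].items()):
    print(task, metrics)
```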

Thanks for running these benchmarks! A side-by-side comparison suggests the "Neo" version generally matches the base model's performance, but I don't think we can confidently say it improves on anything, at least in a statistical sense (I haven't run significance tests):
[image: side-by-side comparison of the benchmark results]

At most it looks like a tradeoff between MuSR (where Neo is marginally higher) and GPQA (where the stock 9B is higher).
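
To put rough numbers on that: treating the reported values and standard errors as independent estimates, a simple two-sample z-score for the GPQA and MuSR deltas stays below the ~1.96 cutoff for 95% confidence. This is only a back-of-the-envelope check, not a proper paired test over the same items:

```python
import math

def z_score(v1, se1, v2, se2):
    """Two-sample z for the difference of two reported means with standard errors."""
    return (v2 - v1) / math.sqrt(se1**2 + se2**2)

# Values and stderrs copied from the tables above (Qwen3.5-9B vs Qwen3.5-9B-Neo).
gpqa = z_score(0.4446, 0.0144, 0.4136, 0.0143)  # ≈ -1.53 (stock higher, not significant)
musr = z_score(0.4339, 0.0176, 0.4630, 0.0179)  # ≈ +1.16 (Neo higher, not significant)
print(f"GPQA z ≈ {gpqa:.2f}, MuSR z ≈ {musr:.2f}")
```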

Really appreciate you taking the time to run these benchmarks — it means a lot! 🙏
With your permission, I’d love to reference your results in the model card.

I’d say the real improvement here is being able to fix Qwen’s overthinking problem without heavy accuracy loss.

I would be happy to see that! Let me know if you need more detailed results.

Thanks for the great model and the benchmark info!
I plan to use the 9B model as a local AI agent via llama.cpp plus proxy tools; the rough shape of the setup is sketched below.
I'm torn between your Neo and Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF, so please give me some advice.
Thanks in advance.
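
A minimal sketch of that local-agent setup, assuming llama.cpp's `llama-server` with its OpenAI-compatible endpoint (the model file name, port, and prompt are placeholders):

```python
# Assumes llama-server is already running with a GGUF, e.g.:
#   llama-server -m Qwen3.5-9B-Neo-Q4_K_M.gguf --port 8080
# which exposes an OpenAI-compatible /v1 endpoint on localhost.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")  # no key needed locally
resp = client.chat.completions.create(
    model="local",  # llama-server serves the loaded GGUF regardless of this name
    messages=[{"role": "user", "content": "List the tools you can call, then wait for instructions."}],
)
print(resp.choices[0].message.content)
```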
