lm_eval leaderboard benchmark
This version seems to match or even improve performance across general benchmarks! Great work.
Qwen/Qwen3.5-9B:
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard | 1 | none | | acc | ↑ | 0.5490 | ± | 0.0045 |
| | | none | | acc_norm | ↑ | 0.5739 | ± | 0.0052 |
| | | none | | exact_match | ↑ | 0.3965 | ± | 0.0128 |
| | | none | | inst_level_loose_acc | ↑ | 0.6379 | ± | N/A |
| | | none | | inst_level_strict_acc | ↑ | 0.6163 | ± | N/A |
| | | none | | prompt_level_loose_acc | ↑ | 0.5083 | ± | 0.0215 |
| | | none | | prompt_level_strict_acc | ↑ | 0.4806 | ± | 0.0215 |
| - leaderboard_bbh | | none | | acc_norm | ↑ | 0.6190 | ± | 0.0058 |
| - leaderboard_gpqa | | none | | acc_norm | ↑ | 0.4446 | ± | 0.0144 |
| - leaderboard_math_hard | | none | | exact_match | ↑ | 0.3965 | ± | 0.0128 |
| - leaderboard_musr | | none | | acc_norm | ↑ | 0.4339 | ± | 0.0176 |
Jackrong/Qwen3.5-9B-Neo:
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard | 1 | none | | acc | ↑ | 0.5348 | ± | 0.0045 |
| | | none | | acc_norm | ↑ | 0.5784 | ± | 0.0051 |
| | | none | | exact_match | ↑ | 0.4063 | ± | 0.0124 |
| | | none | | inst_level_loose_acc | ↑ | 0.5564 | ± | N/A |
| | | none | | inst_level_strict_acc | ↑ | 0.5264 | ± | N/A |
| | | none | | prompt_level_loose_acc | ↑ | 0.4140 | ± | 0.0212 |
| | | none | | prompt_level_strict_acc | ↑ | 0.3789 | ± | 0.0209 |
| - leaderboard_bbh | | none | | acc_norm | ↑ | 0.6277 | ± | 0.0057 |
| - leaderboard_gpqa | | none | | acc_norm | ↑ | 0.4136 | ± | 0.0143 |
| - leaderboard_math_hard | | none | | exact_match | ↑ | 0.4063 | ± | 0.0124 |
| - leaderboard_musr | | none | | acc_norm | ↑ | 0.4630 | ± | 0.0179 |
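For anyone who wants to reproduce these numbers: the exact command isn't shown above, but a typical lm-evaluation-harness invocation for the leaderboard group looks roughly like this (the model args and paths here are assumptions, not what was actually used):

```shell
# Sketch only: adjust model_args / output_path to your setup.
lm_eval --model hf \
  --model_args pretrained=Qwen/Qwen3.5-9B,dtype=bfloat16 \
  --tasks leaderboard \
  --batch_size auto \
  --output_path results/qwen3.5-9b
```

Swap `pretrained=` for `Jackrong/Qwen3.5-9B-Neo` to get the second table.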
Thanks for running these benchmarks! From a side-by-side comparison, the "Neo" version generally matches the stock model's performance, but I don't think we can confidently say it improves anything, at least in a statistical sense (though I haven't run significance tests).
At most it looks like a trade-off between musr (Neo is marginally higher) and gpqa (stock 9B is higher).
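As a rough sketch of that "statistical sense" point (assuming the reported stderrs are standard errors from independent runs), one can check whether a gap exceeds the combined uncertainty at roughly 95% confidence:

```python
import math

def significant_diff(a, se_a, b, se_b, z=1.96):
    """Two-sided check: is |a - b| larger than z times the combined
    standard error? Assumes independent runs with Gaussian errors."""
    combined_se = math.sqrt(se_a**2 + se_b**2)
    return abs(a - b) > z * combined_se

# gpqa acc_norm: 0.4446 ± 0.0144 (stock) vs 0.4136 ± 0.0143 (Neo)
print(significant_diff(0.4446, 0.0144, 0.4136, 0.0143))  # False

# musr acc_norm: 0.4339 ± 0.0176 (stock) vs 0.4630 ± 0.0179 (Neo)
print(significant_diff(0.4339, 0.0176, 0.4630, 0.0179))  # False
```

Neither the gpqa nor the musr gap clears the ~95% threshold under these assumptions, which is consistent with calling the two models a wash overall.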
Really appreciate you taking the time to run these benchmarks — it means a lot! 🙏
With your permission, I’d love to reference your results in the model card.
I’d say the real improvement here is being able to fix Qwen’s overthinking problem without heavy accuracy loss.
I would be happy to see that; let me know if you need more detailed results!
Thanks for the great model and the benchmark info!
I plan to use the 9B model as a local AI agent via llama.cpp plus proxy tools.
I'm torn between your Neo and Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF; could you give me some advice?
Thanks in advance.