Benchmarks

#14
by selimaktas - opened

Hello, I benchmarked this model and the base Qwen3.5-9B using lm-evaluation-harness with the `leaderboard` task, and it seems to score worse on almost every benchmark group.

Crownelius/Crow-9B-HERETIC:

| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|-------|---------|--------|--------|--------|------:|---|------:|
| leaderboard | 1.0 | none | | acc | 0.5258 | ± | 0.0046 |
| | | none | | acc_norm | 0.5696 | ± | 0.0052 |
| | | none | | exact_match | 0.3187 | ± | 0.0118 |
| | | none | | inst_level_loose_acc | 0.5935 | ± | N/A |
| | | none | | inst_level_strict_acc | 0.5600 | ± | N/A |
| | | none | | prompt_level_loose_acc | 0.4399 | ± | 0.0214 |
| | | none | | prompt_level_strict_acc | 0.3974 | ± | 0.0211 |
| - leaderboard_bbh | | none | | acc_norm | 0.6178 | ± | 0.0059 |
| - leaderboard_bbh_boolean_expressions | 1.0 | none | 3 | acc_norm | 0.8440 | ± | 0.0230 |
| - leaderboard_bbh_causal_judgement | 1.0 | none | 3 | acc_norm | 0.5615 | ± | 0.0364 |
| - leaderboard_bbh_date_understanding | 1.0 | none | 3 | acc_norm | 0.7240 | ± | 0.0283 |
| - leaderboard_bbh_disambiguation_qa | 1.0 | none | 3 | acc_norm | 0.6040 | ± | 0.0310 |
| - leaderboard_bbh_formal_fallacies | 1.0 | none | 3 | acc_norm | 0.4640 | ± | 0.0316 |
| - leaderboard_bbh_geometric_shapes | 1.0 | none | 3 | acc_norm | 0.4960 | ± | 0.0317 |
| - leaderboard_bbh_hyperbaton | 1.0 | none | 3 | acc_norm | 0.6560 | ± | 0.0301 |
| - leaderboard_bbh_logical_deduction_five_objects | 1.0 | none | 3 | acc_norm | 0.5720 | ± | 0.0314 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1.0 | none | 3 | acc_norm | 0.5320 | ± | 0.0316 |
| - leaderboard_bbh_logical_deduction_three_objects | 1.0 | none | 3 | acc_norm | 0.7720 | ± | 0.0266 |
| - leaderboard_bbh_movie_recommendation | 1.0 | none | 3 | acc_norm | 0.8600 | ± | 0.0220 |
| - leaderboard_bbh_navigate | 1.0 | none | 3 | acc_norm | 0.7320 | ± | 0.0281 |
| - leaderboard_bbh_object_counting | 1.0 | none | 3 | acc_norm | 0.4640 | ± | 0.0316 |
| - leaderboard_bbh_penguins_in_a_table | 1.0 | none | 3 | acc_norm | 0.6438 | ± | 0.0398 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1.0 | none | 3 | acc_norm | 0.7720 | ± | 0.0266 |
| - leaderboard_bbh_ruin_names | 1.0 | none | 3 | acc_norm | 0.8360 | ± | 0.0235 |
| - leaderboard_bbh_salient_translation_error_detection | 1.0 | none | 3 | acc_norm | 0.6560 | ± | 0.0301 |
| - leaderboard_bbh_snarks | 1.0 | none | 3 | acc_norm | 0.6124 | ± | 0.0366 |
| - leaderboard_bbh_sports_understanding | 1.0 | none | 3 | acc_norm | 0.7880 | ± | 0.0259 |
| - leaderboard_bbh_temporal_sequences | 1.0 | none | 3 | acc_norm | 0.9200 | ± | 0.0172 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1.0 | none | 3 | acc_norm | 0.2560 | ± | 0.0277 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1.0 | none | 3 | acc_norm | 0.1640 | ± | 0.0235 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1.0 | none | 3 | acc_norm | 0.3400 | ± | 0.0300 |
| - leaderboard_bbh_web_of_lies | 1.0 | none | 3 | acc_norm | 0.5520 | ± | 0.0315 |
| - leaderboard_gpqa | | none | | acc_norm | 0.4060 | ± | 0.0142 |
| - leaderboard_gpqa_diamond | 1.0 | none | 0 | acc_norm | 0.4293 | ± | 0.0353 |
| - leaderboard_gpqa_extended | 1.0 | none | 0 | acc_norm | 0.4048 | ± | 0.0210 |
| - leaderboard_gpqa_main | 1.0 | none | 0 | acc_norm | 0.3973 | ± | 0.0231 |
| - leaderboard_ifeval | 3.0 | none | 0 | inst_level_loose_acc | 0.5935 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | 0.5600 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | 0.4399 | ± | 0.0214 |
| | | none | 0 | prompt_level_strict_acc | 0.3974 | ± | 0.0211 |
| - leaderboard_math_hard | | none | | exact_match | 0.3187 | ± | 0.0118 |
| - leaderboard_math_algebra_hard | 3.0 | none | 4 | exact_match | 0.5635 | ± | 0.0284 |
| | | none | 4 | exact_match_original | 0.5570 | ± | 0.0284 |
| - leaderboard_math_counting_and_prob_hard | 3.0 | none | 4 | exact_match | 0.2602 | ± | 0.0397 |
| | | none | 4 | exact_match_original | 0.2683 | ± | 0.0401 |
| - leaderboard_math_geometry_hard | 3.0 | none | 4 | exact_match | 0.1515 | ± | 0.0313 |
| | | none | 4 | exact_match_original | 0.1439 | ± | 0.0307 |
| - leaderboard_math_intermediate_algebra_hard | 3.0 | none | 4 | exact_match | 0.1250 | ± | 0.0198 |
| | | none | 4 | exact_match_original | 0.1286 | ± | 0.0200 |
| - leaderboard_math_num_theory_hard | 3.0 | none | 4 | exact_match | 0.3052 | ± | 0.0372 |
| | | none | 4 | exact_match_original | 0.3052 | ± | 0.0372 |
| - leaderboard_math_prealgebra_hard | 3.0 | none | 4 | exact_match | 0.5130 | ± | 0.0361 |
| | | none | 4 | exact_match_original | 0.5130 | ± | 0.0361 |
| - leaderboard_math_precalculus_hard | 3.0 | none | 4 | exact_match | 0.1185 | ± | 0.0279 |
| | | none | 4 | exact_match_original | 0.1111 | ± | 0.0271 |
| - leaderboard_mmlu_pro | 0.1 | none | 5 | acc | 0.5258 | ± | 0.0046 |
| - leaderboard_musr | | none | | acc_norm | 0.4603 | ± | 0.0180 |
| - leaderboard_musr_murder_mysteries | 1.0 | none | 0 | acc_norm | 0.5480 | ± | 0.0315 |
| - leaderboard_musr_object_placements | 1.0 | none | 0 | acc_norm | 0.3867 | ± | 0.0305 |
| - leaderboard_musr_team_allocation | 1.0 | none | 0 | acc_norm | 0.4480 | ± | 0.0315 |

| Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
|--------|---------|--------|--------|--------|------:|---|------:|
| leaderboard | 1 | none | | acc | 0.5258 | ± | 0.0046 |
| | | none | | acc_norm | 0.5696 | ± | 0.0052 |
| | | none | | exact_match | 0.3187 | ± | 0.0118 |
| | | none | | inst_level_loose_acc | 0.5935 | ± | N/A |
| | | none | | inst_level_strict_acc | 0.5600 | ± | N/A |
| | | none | | prompt_level_loose_acc | 0.4399 | ± | 0.0214 |
| | | none | | prompt_level_strict_acc | 0.3974 | ± | 0.0211 |
| - leaderboard_bbh | | none | | acc_norm | 0.6178 | ± | 0.0059 |
| - leaderboard_gpqa | | none | | acc_norm | 0.4060 | ± | 0.0142 |
| - leaderboard_math_hard | | none | | exact_match | 0.3187 | ± | 0.0118 |
| - leaderboard_musr | | none | | acc_norm | 0.4603 | ± | 0.0180 |

Qwen/Qwen3.5-9B:

| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|-------|---------|--------|--------|--------|------:|---|------:|
| leaderboard | 1.0 | none | | acc | 0.5490 | ± | 0.0045 |
| | | none | | acc_norm | 0.5739 | ± | 0.0052 |
| | | none | | exact_match | 0.3965 | ± | 0.0128 |
| | | none | | inst_level_loose_acc | 0.6379 | ± | N/A |
| | | none | | inst_level_strict_acc | 0.6163 | ± | N/A |
| | | none | | prompt_level_loose_acc | 0.5083 | ± | 0.0215 |
| | | none | | prompt_level_strict_acc | 0.4806 | ± | 0.0215 |
| - leaderboard_bbh | | none | | acc_norm | 0.6190 | ± | 0.0058 |
| - leaderboard_bbh_boolean_expressions | 1.0 | none | 3 | acc_norm | 0.8400 | ± | 0.0232 |
| - leaderboard_bbh_causal_judgement | 1.0 | none | 3 | acc_norm | 0.5775 | ± | 0.0362 |
| - leaderboard_bbh_date_understanding | 1.0 | none | 3 | acc_norm | 0.7680 | ± | 0.0268 |
| - leaderboard_bbh_disambiguation_qa | 1.0 | none | 3 | acc_norm | 0.6200 | ± | 0.0308 |
| - leaderboard_bbh_formal_fallacies | 1.0 | none | 3 | acc_norm | 0.7200 | ± | 0.0285 |
| - leaderboard_bbh_geometric_shapes | 1.0 | none | 3 | acc_norm | 0.4840 | ± | 0.0317 |
| - leaderboard_bbh_hyperbaton | 1.0 | none | 3 | acc_norm | 0.6400 | ± | 0.0304 |
| - leaderboard_bbh_logical_deduction_five_objects | 1.0 | none | 3 | acc_norm | 0.5800 | ± | 0.0313 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1.0 | none | 3 | acc_norm | 0.5320 | ± | 0.0316 |
| - leaderboard_bbh_logical_deduction_three_objects | 1.0 | none | 3 | acc_norm | 0.7680 | ± | 0.0268 |
| - leaderboard_bbh_movie_recommendation | 1.0 | none | 3 | acc_norm | 0.7960 | ± | 0.0255 |
| - leaderboard_bbh_navigate | 1.0 | none | 3 | acc_norm | 0.6480 | ± | 0.0303 |
| - leaderboard_bbh_object_counting | 1.0 | none | 3 | acc_norm | 0.4240 | ± | 0.0313 |
| - leaderboard_bbh_penguins_in_a_table | 1.0 | none | 3 | acc_norm | 0.6712 | ± | 0.0390 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1.0 | none | 3 | acc_norm | 0.7880 | ± | 0.0259 |
| - leaderboard_bbh_ruin_names | 1.0 | none | 3 | acc_norm | 0.8560 | ± | 0.0222 |
| - leaderboard_bbh_salient_translation_error_detection | 1.0 | none | 3 | acc_norm | 0.6760 | ± | 0.0297 |
| - leaderboard_bbh_snarks | 1.0 | none | 3 | acc_norm | 0.6236 | ± | 0.0364 |
| - leaderboard_bbh_sports_understanding | 1.0 | none | 3 | acc_norm | 0.8040 | ± | 0.0252 |
| - leaderboard_bbh_temporal_sequences | 1.0 | none | 3 | acc_norm | 0.9240 | ± | 0.0168 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1.0 | none | 3 | acc_norm | 0.1880 | ± | 0.0248 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1.0 | none | 3 | acc_norm | 0.1520 | ± | 0.0228 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1.0 | none | 3 | acc_norm | 0.3000 | ± | 0.0290 |
| - leaderboard_bbh_web_of_lies | 1.0 | none | 3 | acc_norm | 0.4880 | ± | 0.0317 |
| - leaderboard_gpqa | | none | | acc_norm | 0.4446 | ± | 0.0144 |
| - leaderboard_gpqa_diamond | 1.0 | none | 0 | acc_norm | 0.4394 | ± | 0.0354 |
| - leaderboard_gpqa_extended | 1.0 | none | 0 | acc_norm | 0.4524 | ± | 0.0213 |
| - leaderboard_gpqa_main | 1.0 | none | 0 | acc_norm | 0.4375 | ± | 0.0235 |
| - leaderboard_ifeval | 3.0 | none | 0 | inst_level_loose_acc | 0.6379 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | 0.6163 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | 0.5083 | ± | 0.0215 |
| | | none | 0 | prompt_level_strict_acc | 0.4806 | ± | 0.0215 |
| - leaderboard_math_hard | | none | | exact_match | 0.3965 | ± | 0.0128 |
| - leaderboard_math_algebra_hard | 3.0 | none | 4 | exact_match | 0.5798 | ± | 0.0282 |
| | | none | 4 | exact_match_original | 0.2638 | ± | 0.0252 |
| - leaderboard_math_counting_and_prob_hard | 3.0 | none | 4 | exact_match | 0.3496 | ± | 0.0432 |
| | | none | 4 | exact_match_original | 0.1220 | ± | 0.0296 |
| - leaderboard_math_geometry_hard | 3.0 | none | 4 | exact_match | 0.3182 | ± | 0.0407 |
| | | none | 4 | exact_match_original | 0.2045 | ± | 0.0352 |
| - leaderboard_math_intermediate_algebra_hard | 3.0 | none | 4 | exact_match | 0.1786 | ± | 0.0229 |
| | | none | 4 | exact_match_original | 0.1286 | ± | 0.0200 |
| - leaderboard_math_num_theory_hard | 3.0 | none | 4 | exact_match | 0.4740 | ± | 0.0404 |
| | | none | 4 | exact_match_original | 0.2078 | ± | 0.0328 |
| - leaderboard_math_prealgebra_hard | 3.0 | none | 4 | exact_match | 0.5078 | ± | 0.0361 |
| | | none | 4 | exact_match_original | 0.1762 | ± | 0.0275 |
| - leaderboard_math_precalculus_hard | 3.0 | none | 4 | exact_match | 0.3037 | ± | 0.0397 |
| | | none | 4 | exact_match_original | 0.2296 | ± | 0.0363 |
| - leaderboard_mmlu_pro | 0.1 | none | 5 | acc | 0.5490 | ± | 0.0045 |
| - leaderboard_musr | | none | | acc_norm | 0.4339 | ± | 0.0176 |
| - leaderboard_musr_murder_mysteries | 1.0 | none | 0 | acc_norm | 0.5440 | ± | 0.0316 |
| - leaderboard_musr_object_placements | 1.0 | none | 0 | acc_norm | 0.2891 | ± | 0.0284 |
| - leaderboard_musr_team_allocation | 1.0 | none | 0 | acc_norm | 0.4720 | ± | 0.0316 |

| Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
|--------|---------|--------|--------|--------|------:|---|------:|
| leaderboard | 1 | none | | acc | 0.5490 | ± | 0.0045 |
| | | none | | acc_norm | 0.5739 | ± | 0.0052 |
| | | none | | exact_match | 0.3965 | ± | 0.0128 |
| | | none | | inst_level_loose_acc | 0.6379 | ± | N/A |
| | | none | | inst_level_strict_acc | 0.6163 | ± | N/A |
| | | none | | prompt_level_loose_acc | 0.5083 | ± | 0.0215 |
| | | none | | prompt_level_strict_acc | 0.4806 | ± | 0.0215 |
| - leaderboard_bbh | | none | | acc_norm | 0.6190 | ± | 0.0058 |
| - leaderboard_gpqa | | none | | acc_norm | 0.4446 | ± | 0.0144 |
| - leaderboard_math_hard | | none | | exact_match | 0.3965 | ± | 0.0128 |
| - leaderboard_musr | | none | | acc_norm | 0.4339 | ± | 0.0176 |

Benchmarked with vLLM using the following command:
```
lm_eval \
  --model local-completions \
  --model_args "base_url=http://localhost:8000/v1/completions,pretrained=Crownelius/Crow-9B-HERETIC,tensor_parallel_size=1,add_bos_token=true,trust_remote_code=true,max_length=24576,max_gen_toks=16384,tokenizer=Qwen/Qwen3.5-9B,enable_thinking=True" \
  --tasks leaderboard \
  --batch_size 1024 \
  --seed 42
```
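
For anyone reproducing this: the `local-completions` endpoint above assumes a vLLM server is already listening on port 8000. Something like the following launch should match (a sketch, not the exact command I used; adjust flags to your hardware):

```
# Serve an OpenAI-compatible completions endpoint on port 8000.
# --max-model-len mirrors the max_length passed to lm_eval above.
vllm serve Crownelius/Crow-9B-HERETIC \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 24576 \
  --trust-remote-code
```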

That's right. I also find it worse than the original model when I use Crow-9B for coding. To be fair, it's not best practice to fine-tune on such a broad mix of datasets (writing, coding, etc.): a focused dataset can boost one specific ability, e.g. coding, while an overly diverse dataset tends to make things worse overall.


None of these benchmarks are code-related, and I wouldn't say the degradation is that severe. These are just general tests to compare the model against the original.

Honestly, this is normal when training with two different reasoning types. As long as it's not a huge drop, it's negligible.

I mostly agree, except for the math task.
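
As a rough back-of-envelope check (my own arithmetic, treating the two runs as independent and the reported stderrs as Gaussian), the math gap is well outside the noise:

$$
z = \frac{0.3965 - 0.3187}{\sqrt{0.0128^2 + 0.0118^2}} \approx \frac{0.0778}{0.0174} \approx 4.5
$$

So the `leaderboard_math_hard` drop is about 4.5 combined standard errors, whereas e.g. the BBH gap (0.6178 vs. 0.6190, z ≈ 0.1) is indistinguishable from noise.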
