Benchmarks
Hello, I benchmarked this model and the base Qwen3.5-9B using lm-evaluation-harness with the `leaderboard` task group, and it seems to score worse across most of the benchmarks.

Crownelius/Crow-9B-HERETIC:

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard | 1.0 | none | | acc | ↑ | 0.5258 | ± | 0.0046 |
| | | none | | acc_norm | ↑ | 0.5696 | ± | 0.0052 |
| | | none | | exact_match | ↑ | 0.3187 | ± | 0.0118 |
| | | none | | inst_level_loose_acc | ↑ | 0.5935 | ± | N/A |
| | | none | | inst_level_strict_acc | ↑ | 0.5600 | ± | N/A |
| | | none | | prompt_level_loose_acc | ↑ | 0.4399 | ± | 0.0214 |
| | | none | | prompt_level_strict_acc | ↑ | 0.3974 | ± | 0.0211 |
| - leaderboard_bbh | | none | | acc_norm | ↑ | 0.6178 | ± | 0.0059 |
| - leaderboard_bbh_boolean_expressions | 1.0 | none | 3 | acc_norm | ↑ | 0.8440 | ± | 0.0230 |
| - leaderboard_bbh_causal_judgement | 1.0 | none | 3 | acc_norm | ↑ | 0.5615 | ± | 0.0364 |
| - leaderboard_bbh_date_understanding | 1.0 | none | 3 | acc_norm | ↑ | 0.7240 | ± | 0.0283 |
| - leaderboard_bbh_disambiguation_qa | 1.0 | none | 3 | acc_norm | ↑ | 0.6040 | ± | 0.0310 |
| - leaderboard_bbh_formal_fallacies | 1.0 | none | 3 | acc_norm | ↑ | 0.4640 | ± | 0.0316 |
| - leaderboard_bbh_geometric_shapes | 1.0 | none | 3 | acc_norm | ↑ | 0.4960 | ± | 0.0317 |
| - leaderboard_bbh_hyperbaton | 1.0 | none | 3 | acc_norm | ↑ | 0.6560 | ± | 0.0301 |
| - leaderboard_bbh_logical_deduction_five_objects | 1.0 | none | 3 | acc_norm | ↑ | 0.5720 | ± | 0.0314 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1.0 | none | 3 | acc_norm | ↑ | 0.5320 | ± | 0.0316 |
| - leaderboard_bbh_logical_deduction_three_objects | 1.0 | none | 3 | acc_norm | ↑ | 0.7720 | ± | 0.0266 |
| - leaderboard_bbh_movie_recommendation | 1.0 | none | 3 | acc_norm | ↑ | 0.8600 | ± | 0.0220 |
| - leaderboard_bbh_navigate | 1.0 | none | 3 | acc_norm | ↑ | 0.7320 | ± | 0.0281 |
| - leaderboard_bbh_object_counting | 1.0 | none | 3 | acc_norm | ↑ | 0.4640 | ± | 0.0316 |
| - leaderboard_bbh_penguins_in_a_table | 1.0 | none | 3 | acc_norm | ↑ | 0.6438 | ± | 0.0398 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1.0 | none | 3 | acc_norm | ↑ | 0.7720 | ± | 0.0266 |
| - leaderboard_bbh_ruin_names | 1.0 | none | 3 | acc_norm | ↑ | 0.8360 | ± | 0.0235 |
| - leaderboard_bbh_salient_translation_error_detection | 1.0 | none | 3 | acc_norm | ↑ | 0.6560 | ± | 0.0301 |
| - leaderboard_bbh_snarks | 1.0 | none | 3 | acc_norm | ↑ | 0.6124 | ± | 0.0366 |
| - leaderboard_bbh_sports_understanding | 1.0 | none | 3 | acc_norm | ↑ | 0.7880 | ± | 0.0259 |
| - leaderboard_bbh_temporal_sequences | 1.0 | none | 3 | acc_norm | ↑ | 0.9200 | ± | 0.0172 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1.0 | none | 3 | acc_norm | ↑ | 0.2560 | ± | 0.0277 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1.0 | none | 3 | acc_norm | ↑ | 0.1640 | ± | 0.0235 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1.0 | none | 3 | acc_norm | ↑ | 0.3400 | ± | 0.0300 |
| - leaderboard_bbh_web_of_lies | 1.0 | none | 3 | acc_norm | ↑ | 0.5520 | ± | 0.0315 |
| - leaderboard_gpqa | | none | | acc_norm | ↑ | 0.4060 | ± | 0.0142 |
| - leaderboard_gpqa_diamond | 1.0 | none | 0 | acc_norm | ↑ | 0.4293 | ± | 0.0353 |
| - leaderboard_gpqa_extended | 1.0 | none | 0 | acc_norm | ↑ | 0.4048 | ± | 0.0210 |
| - leaderboard_gpqa_main | 1.0 | none | 0 | acc_norm | ↑ | 0.3973 | ± | 0.0231 |
| - leaderboard_ifeval | 3.0 | none | 0 | inst_level_loose_acc | ↑ | 0.5935 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | ↑ | 0.5600 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | ↑ | 0.4399 | ± | 0.0214 |
| | | none | 0 | prompt_level_strict_acc | ↑ | 0.3974 | ± | 0.0211 |
| - leaderboard_math_hard | | none | | exact_match | ↑ | 0.3187 | ± | 0.0118 |
| - leaderboard_math_algebra_hard | 3.0 | none | 4 | exact_match | ↑ | 0.5635 | ± | 0.0284 |
| | | none | 4 | exact_match_original | ↑ | 0.5570 | ± | 0.0284 |
| - leaderboard_math_counting_and_prob_hard | 3.0 | none | 4 | exact_match | ↑ | 0.2602 | ± | 0.0397 |
| | | none | 4 | exact_match_original | ↑ | 0.2683 | ± | 0.0401 |
| - leaderboard_math_geometry_hard | 3.0 | none | 4 | exact_match | ↑ | 0.1515 | ± | 0.0313 |
| | | none | 4 | exact_match_original | ↑ | 0.1439 | ± | 0.0307 |
| - leaderboard_math_intermediate_algebra_hard | 3.0 | none | 4 | exact_match | ↑ | 0.1250 | ± | 0.0198 |
| | | none | 4 | exact_match_original | ↑ | 0.1286 | ± | 0.0200 |
| - leaderboard_math_num_theory_hard | 3.0 | none | 4 | exact_match | ↑ | 0.3052 | ± | 0.0372 |
| | | none | 4 | exact_match_original | ↑ | 0.3052 | ± | 0.0372 |
| - leaderboard_math_prealgebra_hard | 3.0 | none | 4 | exact_match | ↑ | 0.5130 | ± | 0.0361 |
| | | none | 4 | exact_match_original | ↑ | 0.5130 | ± | 0.0361 |
| - leaderboard_math_precalculus_hard | 3.0 | none | 4 | exact_match | ↑ | 0.1185 | ± | 0.0279 |
| | | none | 4 | exact_match_original | ↑ | 0.1111 | ± | 0.0271 |
| - leaderboard_mmlu_pro | 0.1 | none | 5 | acc | ↑ | 0.5258 | ± | 0.0046 |
| - leaderboard_musr | | none | | acc_norm | ↑ | 0.4603 | ± | 0.0180 |
| - leaderboard_musr_murder_mysteries | 1.0 | none | 0 | acc_norm | ↑ | 0.5480 | ± | 0.0315 |
| - leaderboard_musr_object_placements | 1.0 | none | 0 | acc_norm | ↑ | 0.3867 | ± | 0.0305 |
| - leaderboard_musr_team_allocation | 1.0 | none | 0 | acc_norm | ↑ | 0.4480 | ± | 0.0315 |

| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard | 1 | none | | acc | ↑ | 0.5258 | ± | 0.0046 |
| | | none | | acc_norm | ↑ | 0.5696 | ± | 0.0052 |
| | | none | | exact_match | ↑ | 0.3187 | ± | 0.0118 |
| | | none | | inst_level_loose_acc | ↑ | 0.5935 | ± | N/A |
| | | none | | inst_level_strict_acc | ↑ | 0.5600 | ± | N/A |
| | | none | | prompt_level_loose_acc | ↑ | 0.4399 | ± | 0.0214 |
| | | none | | prompt_level_strict_acc | ↑ | 0.3974 | ± | 0.0211 |
| - leaderboard_bbh | | none | | acc_norm | ↑ | 0.6178 | ± | 0.0059 |
| - leaderboard_gpqa | | none | | acc_norm | ↑ | 0.4060 | ± | 0.0142 |
| - leaderboard_math_hard | | none | | exact_match | ↑ | 0.3187 | ± | 0.0118 |
| - leaderboard_musr | | none | | acc_norm | ↑ | 0.4603 | ± | 0.0180 |

Qwen/Qwen3.5-9B:

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard | 1.0 | none | | acc | ↑ | 0.5490 | ± | 0.0045 |
| | | none | | acc_norm | ↑ | 0.5739 | ± | 0.0052 |
| | | none | | exact_match | ↑ | 0.3965 | ± | 0.0128 |
| | | none | | inst_level_loose_acc | ↑ | 0.6379 | ± | N/A |
| | | none | | inst_level_strict_acc | ↑ | 0.6163 | ± | N/A |
| | | none | | prompt_level_loose_acc | ↑ | 0.5083 | ± | 0.0215 |
| | | none | | prompt_level_strict_acc | ↑ | 0.4806 | ± | 0.0215 |
| - leaderboard_bbh | | none | | acc_norm | ↑ | 0.6190 | ± | 0.0058 |
| - leaderboard_bbh_boolean_expressions | 1.0 | none | 3 | acc_norm | ↑ | 0.8400 | ± | 0.0232 |
| - leaderboard_bbh_causal_judgement | 1.0 | none | 3 | acc_norm | ↑ | 0.5775 | ± | 0.0362 |
| - leaderboard_bbh_date_understanding | 1.0 | none | 3 | acc_norm | ↑ | 0.7680 | ± | 0.0268 |
| - leaderboard_bbh_disambiguation_qa | 1.0 | none | 3 | acc_norm | ↑ | 0.6200 | ± | 0.0308 |
| - leaderboard_bbh_formal_fallacies | 1.0 | none | 3 | acc_norm | ↑ | 0.7200 | ± | 0.0285 |
| - leaderboard_bbh_geometric_shapes | 1.0 | none | 3 | acc_norm | ↑ | 0.4840 | ± | 0.0317 |
| - leaderboard_bbh_hyperbaton | 1.0 | none | 3 | acc_norm | ↑ | 0.6400 | ± | 0.0304 |
| - leaderboard_bbh_logical_deduction_five_objects | 1.0 | none | 3 | acc_norm | ↑ | 0.5800 | ± | 0.0313 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1.0 | none | 3 | acc_norm | ↑ | 0.5320 | ± | 0.0316 |
| - leaderboard_bbh_logical_deduction_three_objects | 1.0 | none | 3 | acc_norm | ↑ | 0.7680 | ± | 0.0268 |
| - leaderboard_bbh_movie_recommendation | 1.0 | none | 3 | acc_norm | ↑ | 0.7960 | ± | 0.0255 |
| - leaderboard_bbh_navigate | 1.0 | none | 3 | acc_norm | ↑ | 0.6480 | ± | 0.0303 |
| - leaderboard_bbh_object_counting | 1.0 | none | 3 | acc_norm | ↑ | 0.4240 | ± | 0.0313 |
| - leaderboard_bbh_penguins_in_a_table | 1.0 | none | 3 | acc_norm | ↑ | 0.6712 | ± | 0.0390 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1.0 | none | 3 | acc_norm | ↑ | 0.7880 | ± | 0.0259 |
| - leaderboard_bbh_ruin_names | 1.0 | none | 3 | acc_norm | ↑ | 0.8560 | ± | 0.0222 |
| - leaderboard_bbh_salient_translation_error_detection | 1.0 | none | 3 | acc_norm | ↑ | 0.6760 | ± | 0.0297 |
| - leaderboard_bbh_snarks | 1.0 | none | 3 | acc_norm | ↑ | 0.6236 | ± | 0.0364 |
| - leaderboard_bbh_sports_understanding | 1.0 | none | 3 | acc_norm | ↑ | 0.8040 | ± | 0.0252 |
| - leaderboard_bbh_temporal_sequences | 1.0 | none | 3 | acc_norm | ↑ | 0.9240 | ± | 0.0168 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1.0 | none | 3 | acc_norm | ↑ | 0.1880 | ± | 0.0248 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1.0 | none | 3 | acc_norm | ↑ | 0.1520 | ± | 0.0228 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1.0 | none | 3 | acc_norm | ↑ | 0.3000 | ± | 0.0290 |
| - leaderboard_bbh_web_of_lies | 1.0 | none | 3 | acc_norm | ↑ | 0.4880 | ± | 0.0317 |
| - leaderboard_gpqa | | none | | acc_norm | ↑ | 0.4446 | ± | 0.0144 |
| - leaderboard_gpqa_diamond | 1.0 | none | 0 | acc_norm | ↑ | 0.4394 | ± | 0.0354 |
| - leaderboard_gpqa_extended | 1.0 | none | 0 | acc_norm | ↑ | 0.4524 | ± | 0.0213 |
| - leaderboard_gpqa_main | 1.0 | none | 0 | acc_norm | ↑ | 0.4375 | ± | 0.0235 |
| - leaderboard_ifeval | 3.0 | none | 0 | inst_level_loose_acc | ↑ | 0.6379 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | ↑ | 0.6163 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | ↑ | 0.5083 | ± | 0.0215 |
| | | none | 0 | prompt_level_strict_acc | ↑ | 0.4806 | ± | 0.0215 |
| - leaderboard_math_hard | | none | | exact_match | ↑ | 0.3965 | ± | 0.0128 |
| - leaderboard_math_algebra_hard | 3.0 | none | 4 | exact_match | ↑ | 0.5798 | ± | 0.0282 |
| | | none | 4 | exact_match_original | ↑ | 0.2638 | ± | 0.0252 |
| - leaderboard_math_counting_and_prob_hard | 3.0 | none | 4 | exact_match | ↑ | 0.3496 | ± | 0.0432 |
| | | none | 4 | exact_match_original | ↑ | 0.1220 | ± | 0.0296 |
| - leaderboard_math_geometry_hard | 3.0 | none | 4 | exact_match | ↑ | 0.3182 | ± | 0.0407 |
| | | none | 4 | exact_match_original | ↑ | 0.2045 | ± | 0.0352 |
| - leaderboard_math_intermediate_algebra_hard | 3.0 | none | 4 | exact_match | ↑ | 0.1786 | ± | 0.0229 |
| | | none | 4 | exact_match_original | ↑ | 0.1286 | ± | 0.0200 |
| - leaderboard_math_num_theory_hard | 3.0 | none | 4 | exact_match | ↑ | 0.4740 | ± | 0.0404 |
| | | none | 4 | exact_match_original | ↑ | 0.2078 | ± | 0.0328 |
| - leaderboard_math_prealgebra_hard | 3.0 | none | 4 | exact_match | ↑ | 0.5078 | ± | 0.0361 |
| | | none | 4 | exact_match_original | ↑ | 0.1762 | ± | 0.0275 |
| - leaderboard_math_precalculus_hard | 3.0 | none | 4 | exact_match | ↑ | 0.3037 | ± | 0.0397 |
| | | none | 4 | exact_match_original | ↑ | 0.2296 | ± | 0.0363 |
| - leaderboard_mmlu_pro | 0.1 | none | 5 | acc | ↑ | 0.5490 | ± | 0.0045 |
| - leaderboard_musr | | none | | acc_norm | ↑ | 0.4339 | ± | 0.0176 |
| - leaderboard_musr_murder_mysteries | 1.0 | none | 0 | acc_norm | ↑ | 0.5440 | ± | 0.0316 |
| - leaderboard_musr_object_placements | 1.0 | none | 0 | acc_norm | ↑ | 0.2891 | ± | 0.0284 |
| - leaderboard_musr_team_allocation | 1.0 | none | 0 | acc_norm | ↑ | 0.4720 | ± | 0.0316 |

| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard | 1 | none | | acc | ↑ | 0.5490 | ± | 0.0045 |
| | | none | | acc_norm | ↑ | 0.5739 | ± | 0.0052 |
| | | none | | exact_match | ↑ | 0.3965 | ± | 0.0128 |
| | | none | | inst_level_loose_acc | ↑ | 0.6379 | ± | N/A |
| | | none | | inst_level_strict_acc | ↑ | 0.6163 | ± | N/A |
| | | none | | prompt_level_loose_acc | ↑ | 0.5083 | ± | 0.0215 |
| | | none | | prompt_level_strict_acc | ↑ | 0.4806 | ± | 0.0215 |
| - leaderboard_bbh | | none | | acc_norm | ↑ | 0.6190 | ± | 0.0058 |
| - leaderboard_gpqa | | none | | acc_norm | ↑ | 0.4446 | ± | 0.0144 |
| - leaderboard_math_hard | | none | | exact_match | ↑ | 0.3965 | ± | 0.0128 |
| - leaderboard_musr | | none | | acc_norm | ↑ | 0.4339 | ± | 0.0176 |
Benchmarked with vLLM using the following command:
```bash
lm_eval \
  --model local-completions \
  --model_args "base_url=http://localhost:8000/v1/completions,pretrained=Crownelius/Crow-9B-HERETIC,tensor_parallel_size=1,add_bos_token=true,trust_remote_code=true,max_length=24576,max_gen_toks=16384,tokenizer=Qwen/Qwen3.5-9B,enable_thinking=True" \
  --tasks leaderboard \
  --batch_size 1024 \
  --seed 42
```
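
For completeness, the command above talks to a vLLM server already running on port 8000; the thread doesn't show how that server was started, so the following is only a sketch of a plausible launch command. The port and `--max-model-len` are inferred from `base_url` and `max_length` in the lm_eval arguments, and the other flags are assumptions.

```bash
# Hypothetical server launch (not from the original post): exposes an
# OpenAI-compatible /v1/completions endpoint on port 8000, matching the
# base_url above; --max-model-len mirrors max_length=24576.
vllm serve Crownelius/Crow-9B-HERETIC \
  --port 8000 \
  --max-model-len 24576 \
  --trust-remote-code
```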
That's right. I think it's worse than the original model when I use Crow-9B for coding. It's probably not best practice to train on such a broad dataset mix covering writing, coding, etc. A dataset focused on one domain can boost a specific ability, for example coding, while an overly diverse dataset can make results worse.
None of these benchmarks are code-related, and I wouldn't say the degradation is that severe. These are just general tests to compare the model against the original.
Honestly, this is normal when training with two different reasoning types. As long as it's not a huge drop, it's negligible.
I mostly agree, except for the math task (leaderboard_math_hard exact_match drops from 0.3965 to 0.3187).