
Substantially lower accuracy on reasoning benchmarks such as GSM8K (1.5%) and MATH-500 (4.2%)

#1
by jvonrad - opened

I've evaluated the model on reasoning benchmarks, and found that it seems to have lost its reasoning ability entirely.

Boomerang-Qwen3-4.9B:

| GSM8K (4-shot) | MATH-500 (4-shot) | GPQA-Diamond (5-shot) |
|---|---|---|
| 1.5 % | 4.2 % | 25 % (chance level for 4-way multiple choice) |

Qwen3-8B-Base:

| GSM8K (4-shot) | MATH-500 (4-shot) | GPQA-Diamond (5-shot) |
|---|---|---|
| 85.4 % | 54.48 % | 43.94 % |

I used LightEval and also evaluated the Qwen3-8B-Base model under the same settings, to make sure the gap isn't due to a missing chat template or instruction tuning.
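For anyone who wants to reproduce this, a roughly equivalent run with lm-evaluation-harness might look like the following. The exact flags (batch size, dtype) are illustrative, not the settings used above:

```shell
# Evaluate the distilled model on GSM8K with 4-shot prompting.
# Swap in Qwen/Qwen3-8B-Base for the baseline comparison.
lm_eval \
  --model hf \
  --model_args pretrained=Harvard-DCML/boomerang-qwen3-4.9B,dtype=bfloat16 \
  --tasks gsm8k \
  --num_fewshot 4 \
  --batch_size 8
```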

Harvard Data-Centric Machine Learning Group org

Thank you for trying out our model! Your results seem correct.

The Harvard-DCML/boomerang-qwen3-4.9B model (like our other distilled models) was distilled on only 2B tokens of The Pile and has not been explicitly trained to recover reasoning capabilities. As a result, while it performs strongly on natural-language tasks (cf. Appendix K.2, Figures 27–30 in the paper), it struggles on generative reasoning tasks (cf. Appendix K.3, Figures 31–34, where we also report GSM8K accuracy matching your findings). These appendices also show that performance on reasoning tasks increases as more student layers are patched with teacher layers. We see similar results when reproducing your gpqa_diamond_n_shot setting in lm-evaluation-harness and interpolating between Harvard-DCML/boomerang-qwen3-4.9B and Qwen/Qwen3-8B-Base:

| Layers patched | gpqa_diamond_n_shot exact-match accuracy |
|---|---|
| 0 (Harvard-DCML/boomerang-qwen3-4.9B) | 0.217 |
| 5 | 0.278 |
| 10 | 0.298 |
| 15 | 0.379 |
| 19 (Qwen/Qwen3-8B-Base) | 0.404 |
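To make the patching idea concrete, here is a toy sketch of the mechanism: replacing the first k layers of a drifted student with the corresponding teacher layers and watching the output approach the teacher's. The "layers" here are stand-in scaling functions, not transformer blocks, and patching the first k in order is an illustrative choice, not necessarily the scheme used in the paper:

```python
def make_layer(scale):
    """Stand-in for a transformer block: a simple scaling function."""
    return lambda x: x * scale

# Teacher: 19 "layers"; student: a distilled copy whose layers drift slightly.
teacher_layers = [make_layer(1.0) for _ in range(19)]
student_layers = [make_layer(0.9) for _ in range(19)]

def patch(student, teacher, k):
    """Return a layer stack whose first k layers come from the teacher."""
    return teacher[:k] + student[k:]

def forward(layers, x):
    for layer in layers:
        x = layer(x)
    return x

# As k grows, the patched model's output approaches the teacher's (1.0).
for k in (0, 5, 10, 15, 19):
    print(k, forward(patch(student_layers, teacher_layers, k), 1.0))
```

With real checkpoints, the analogous step would copy modules from the teacher's layer list into the student's before evaluation.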

To retain reasoning capabilities in the student model during distillation, the training must be adapted accordingly (e.g., by training on reasoning datasets or teacher-generated reasoning traces), which is beyond the scope of our paper.
