AIME 25 Accuracy Discrepancy for GPT-OSS-20B (Reasoning Effort=High)

#57

by jiayi37u - opened Aug 6, 2025

Thank you very much for open-sourcing such a powerful large language model.
I’ve noticed that the community has had difficulty reproducing the results reported in your paper. On AIME 25, the paper states that GPT-OSS-20B (without tools, reasoning effort set to “high”) achieves 91.7% accuracy, whereas our reproduction with vLLM reached only 85.8%. Do you have any suggestions to help us replicate your evaluation results?

Reference link: https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#tool-use
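
To put the gap in context, here is a minimal sketch of the arithmetic. It assumes AIME 25 means the 30 problems of AIME I and II 2025, and it treats each problem as an independent pass/fail trial, which is only a rough approximation. The function names and the `num_runs` parameter are hypothetical; the post does not say how many samples per problem were averaged on either side.

```python
import math

# Accuracies quoted in the post above, in percent.
REPORTED_ACC = 91.7    # from the paper
REPRODUCED_ACC = 85.8  # vLLM reproduction

# Assumption: AIME 25 = AIME I + AIME II 2025, 15 problems each.
NUM_PROBLEMS = 30


def gap_in_points(reported: float, reproduced: float) -> float:
    """Accuracy gap in percentage points, rounded to one decimal."""
    return round(reported - reproduced, 1)


def gap_in_problems(reported: float, reproduced: float, num_problems: int) -> float:
    """The same gap expressed as an average number of problems per pass."""
    return (reported - reproduced) / 100.0 * num_problems


def binomial_std_error(acc_percent: float, num_problems: int, num_runs: int = 1) -> float:
    """Rough standard error (in percentage points) of a mean accuracy.

    Assumes every (problem, run) pair is an independent Bernoulli trial
    with the same success probability. Real problems differ in difficulty,
    so this is only an order-of-magnitude estimate.
    """
    p = acc_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / (num_problems * num_runs))


if __name__ == "__main__":
    print("gap (points):", gap_in_points(REPORTED_ACC, REPRODUCED_ACC))
    print("gap (problems per pass):",
          gap_in_problems(REPORTED_ACC, REPRODUCED_ACC, NUM_PROBLEMS))
    # num_runs values here are illustrative, not taken from the paper.
    for runs in (1, 16):
        print(f"approx. std error with {runs} run(s):",
              binomial_std_error(REPRODUCED_ACC, NUM_PROBLEMS, runs))
```

Under these assumptions the gap is under two problems per pass, so the number of sampled runs averaged per problem likely matters when comparing the two figures.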

jiayi37u changed discussion status to closed Aug 6, 2025
