Reproducibility
Hi! I'm trying to reproduce your results on IMO-AnswerBench and consistently getting higher accuracy (69–73%) than what you report in the paper.
I'm using your vLLM fork on H200 GPUs. Is it possible the numbers in the paper correspond to IMO-ProofBench rather than IMO-AnswerBench? Or is there something else I might be missing in my setup?
Hi @RoflanVglorius, that's super interesting. Are you running on the full 400-question set? It's possible the verifier we used was dropping some of the more complicated ground-truth expressions (I think this was just vanilla math-verify, but I'll take a deeper look). The results in the paper were for IMO-AnswerBench. I'd be very curious to see how it does on IMO-ProofBench, but we haven't had a chance to run it yet.
Thank you for your answer. Yes, I'm running on the full set and using Gemini-3.1-Pro to verify the answers.
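One way to test the verifier hypothesis is to score the same set of predictions twice: once counting every question whose ground truth the verifier could not parse as wrong, and once excluding those questions. The sketch below is only an illustration. The `accuracy` helper and the toy verdict list are hypothetical, the numbers are made up and are not benchmark results, and the thread does not say how dropped expressions were actually counted.

```python
def accuracy(verdicts, unparseable_as_wrong=True):
    """Compute accuracy from per-question verdicts.

    Each verdict is True (answer matched), False (answer did not match),
    or None (the verifier could not parse the ground-truth expression).
    """
    if unparseable_as_wrong:
        # Unparseable ground truths stay in the denominator and count as misses.
        return sum(1 for v in verdicts if v is True) / len(verdicts)
    # Otherwise score only the questions the verifier could actually check.
    scored = [v for v in verdicts if v is not None]
    return sum(scored) / len(scored) if scored else 0.0


# Toy data (assumption, not real results): 6 matches, 2 mismatches, 2 unparseable.
toy_verdicts = [True] * 6 + [False] * 2 + [None] * 2

strict = accuracy(toy_verdicts, unparseable_as_wrong=True)
lenient = accuracy(toy_verdicts, unparseable_as_wrong=False)
print(f"unparseable counted as wrong: {strict:.2f}")
print(f"unparseable excluded:         {lenient:.2f}")
```

If the two numbers differ noticeably on the real 400-question run, the gap between a rule-based verifier and an LLM judge could come from how unparseable ground truths are handled rather than from the model itself.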