Reproducibility

#12

by RoflanVglorius - opened 14 days ago

Hi! I'm trying to reproduce your results on IMO-AnswerBench and consistently getting higher accuracy (69–73%) than what you report in the paper.
I'm using your vLLM fork on H200 GPUs. Is it possible the numbers in the paper correspond to IMO-ProofBench rather than IMO-AnswerBench? Or is there something else I might be missing in my setup?

rawsh

Zyphra org 12 days ago

Hi @RoflanVglorius, that's super interesting. Are you running on the full 400-question set? It's possible the verifier we used was dropping some of the more complicated ground-truth expressions (I think this was just vanilla math-verify, but I'll take a deeper look). The results in the paper were for IMO-AnswerBench. I'd be very curious to see how it does on IMO-ProofBench, but we haven't had a chance to run it yet.

RoflanVglorius

12 days ago

> Hi @RoflanVglorius , that's super interesting, are you running on the full 400 question set?

Thank you for your answer. Yes, I’m running on the full set and using Gemini-3.1-Pro to verify the answers.
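
For anyone comparing the two setups: a rule-based verifier such as math-verify and an LLM judge can disagree mainly on the questions whose ground-truth expression the rule-based verifier cannot parse. Below is a minimal sketch of how that alone can shift the headline number. It is not the evaluation code from the paper; the function name `accuracy_report`, the `None` convention for unparsed items, and the toy verdicts are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def accuracy_report(verdicts: Sequence[Optional[bool]]) -> dict:
    """Summarize accuracy under two conventions for unparsed items.

    Hypothetical convention: each verdict is True (correct), False
    (incorrect), or None (the verifier could not parse the ground truth).
    """
    total = len(verdicts)
    correct = sum(1 for v in verdicts if v is True)
    unparsed = sum(1 for v in verdicts if v is None)
    scored = total - unparsed
    return {
        "total": total,
        "unparsed": unparsed,
        # Unparsed questions stay in the denominator and count as wrong.
        "acc_unparsed_as_wrong": correct / total if total else 0.0,
        # Unparsed questions are removed from the denominator entirely.
        "acc_unparsed_dropped": correct / scored if scored else 0.0,
    }


if __name__ == "__main__":
    # Toy verdicts, not real benchmark results.
    toy = [True] * 6 + [False] * 2 + [None] * 2
    print(accuracy_report(toy))
```

Reporting the count of unparsed questions next to both accuracies would make it easier to tell whether the gap comes from the verifier or from the generation setup.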
