
Reproducibility

#12
by RoflanVglorius - opened

Hi! I'm trying to reproduce your results on IMO-AnswerBench and consistently getting higher accuracy (69–73%) than what you report in the paper.
I'm using your vLLM fork on H200 GPUs. Is it possible the numbers in the paper correspond to IMO-ProofBench rather than IMO-AnswerBench? Or is there something else I might be missing in my setup?

Zyphra org

Hi @RoflanVglorius, that's super interesting. Are you running on the full 400-question set? It's possible the verifier we used was dropping some of the more complicated ground-truth expressions (I think this was just vanilla math-verify, but I'll take a deeper look). The results in the paper were for IMO-AnswerBench. We'd be very curious to see how it does on IMO-ProofBench, but we haven't had a chance to run it yet.
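To make the verifier hypothesis concrete, here is a rough sketch of how the reported accuracy can shift depending on what happens to questions whose ground truth the verifier fails to parse. Everything in it is an assumption for illustration: the `Record` fields, the policy names, and the toy data are made up, and it is not the evaluation code used for the paper.

```python
from dataclasses import dataclass


@dataclass
class Record:
    gold_parsed: bool  # hypothetical flag: did the verifier parse the ground truth?
    correct: bool      # is the model answer actually right (e.g. per a stronger judge)?


def accuracy(records, unparsed_policy):
    """Accuracy under different treatments of unparseable ground truths."""
    if unparsed_policy == "count_wrong":
        # Unparsed ground truth can never match, so the item scores as a miss.
        hits = sum(r.correct and r.gold_parsed for r in records)
        return hits / len(records)
    if unparsed_policy == "exclude":
        # Unparsed items are removed from both numerator and denominator.
        kept = [r for r in records if r.gold_parsed]
        return sum(r.correct for r in kept) / len(kept)
    if unparsed_policy == "judge":
        # A judge that handles every ground truth scores all items.
        return sum(r.correct for r in records) / len(records)
    raise ValueError(f"unknown policy: {unparsed_policy}")


# Toy data, invented for illustration only.
toy = (
    [Record(True, True)] * 6     # parsed and correct
    + [Record(True, False)] * 2  # parsed and wrong
    + [Record(False, True)] * 2  # correct, but ground truth not parsed
)

for policy in ("count_wrong", "exclude", "judge"):
    print(policy, accuracy(toy, policy))
```

On this toy set the three policies give 0.6, 0.75, and 0.8, so a rule-based verifier that silently fails on hard ground truths could plausibly report a lower number than an LLM judge on the same outputs.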

> Hi @RoflanVglorius, that's super interesting, are you running on the full 400 question set?

Thank you for your answer. Yes, I'm running on the full set and using Gemini-3.1-Pro to verify the answers.
