Lower evaluation results

#2
by MianchuWang - opened

Dear Authors,

Thank you for your contribution to this research direction. I'm currently trying to reproduce the GSM8K results reported for Ouro 1.4B R4 and Ouro 2.6B R4, but I'm encountering some difficulties.

I ran the following evaluation code:

import lm_eval
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ByteDance/Ouro-1.4B,trust_remote_code=True,dtype=float32",
    tasks=["gsm8k_cot"],
    num_fewshot=3,
    batch_size=1,
    limit=50,
    device="cuda:0",
)

With this setup, I obtain ~0.5 accuracy for Ouro 1.4B and ~0.6 for Ouro 2.6B. May I ask whether there is anything incorrect in my configuration, or whether I am missing any additional steps required to replicate the reported results?

Thank you for your time and guidance.

Hi @MianchuWang,

It may be too late for you, but for future reference: the main issue in your config is limit=50. Evaluating on only 50 samples introduces high variance (the standard error of an accuracy estimate at that sample size is roughly ±7 percentage points). Remove the limit and run on the full test set to get stable results.
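To make the variance concrete, here is a quick back-of-the-envelope calculation (a standard binomial-proportion estimate, not anything specific to lm-eval):

```python
import math

# Standard error of an accuracy estimate from n samples.
# Accuracy is a binomial proportion with variance p*(1-p)/n,
# which is maximized at p = 0.5 (the worst case).
def accuracy_stderr(n, p=0.5):
    return math.sqrt(p * (1 - p) / n)

print(f"n=50:   +/- {accuracy_stderr(50):.3f}")    # ~ +/- 0.071
print(f"n=1319: +/- {accuracy_stderr(1319):.3f}")  # full GSM8K test set, ~ +/- 0.014
```

So with limit=50 a single run can easily land 5-10 points away from the true accuracy, which is consistent with the gap you observed.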

Additionally, make sure NO chat template is applied to the prompts (as you already did), and report exact match under the flexible-extract filter.

With the full dataset and raw text formatting, I can reproduce all paper results with both vLLM and HF backends using the standard lm_eval settings.
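Putting that together, a sketch of the corrected call (same model path and hardware settings as your original snippet, with only `limit` dropped; the exact result keys depend on your lm-eval version):

```python
import lm_eval

# Same setup as before, but without `limit`, so the full GSM8K test set
# (1319 problems) is evaluated. No chat template is applied by default
# for a base "hf" model with raw text prompts.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ByteDance/Ouro-1.4B,trust_remote_code=True,dtype=float32",
    tasks=["gsm8k_cot"],
    num_fewshot=3,
    batch_size=1,
    device="cuda:0",
)

# Report exact match under the flexible-extract filter.
print(results["results"]["gsm8k_cot"])
```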

Versions:

  • vllm: 0.16.0
  • transformers: 4.57.6
  • lm-eval: 0.4.11
