Loading tokenizer: /tmp/eval/multilingual_32k.model
Loading model: /tmp/eval/best_model.pt
Model loaded: 3.04B parameters on cuda
============================================================
BELEBELE EVALUATION — Multilingual 3B GPT
============================================================
Evaluating EN (eng_Latn)...
[EN] 50/900 — accuracy so far: 22.0%
[EN] 100/900 — accuracy so far: 29.0%
[EN] 150/900 — accuracy so far: 32.7%
[EN] 200/900 — accuracy so far: 30.5%
[EN] 250/900 — accuracy so far: 32.0%
[EN] 300/900 — accuracy so far: 31.7%
[EN] 350/900 — accuracy so far: 32.9%
[EN] 400/900 — accuracy so far: 33.5%
[EN] 450/900 — accuracy so far: 32.9%
[EN] 500/900 — accuracy so far: 31.4%
[EN] 550/900 — accuracy so far: 32.0%
[EN] 600/900 — accuracy so far: 32.2%
[EN] 650/900 — accuracy so far: 32.6%
[EN] 700/900 — accuracy so far: 32.3%
[EN] 750/900 — accuracy so far: 32.7%
[EN] 800/900 — accuracy so far: 32.0%
[EN] 850/900 — accuracy so far: 31.9%
[EN] 900/900 — accuracy so far: 31.8%
✅ EN: 31.8% (286/900)
Evaluating HE (heb_Hebr)...
[HE] 50/900 — accuracy so far: 20.0%
[HE] 100/900 — accuracy so far: 25.0%
[HE] 150/900 — accuracy so far: 24.0%
[HE] 200/900 — accuracy so far: 26.0%
[HE] 250/900 — accuracy so far: 25.2%
[HE] 300/900 — accuracy so far: 25.7%
[HE] 350/900 — accuracy so far: 24.9%
[HE] 400/900 — accuracy so far: 24.8%
[HE] 450/900 — accuracy so far: 24.9%
[HE] 500/900 — accuracy so far: 24.2%
[HE] 550/900 — accuracy so far: 25.1%
[HE] 600/900 — accuracy so far: 25.3%
[HE] 650/900 — accuracy so far: 25.7%
[HE] 700/900 — accuracy so far: 25.4%
[HE] 750/900 — accuracy so far: 25.9%
[HE] 800/900 — accuracy so far: 26.2%
[HE] 850/900 — accuracy so far: 26.8%
[HE] 900/900 — accuracy so far: 27.0%
✅ HE: 27.0% (243/900)
Evaluating AR (arb_Arab)...
[AR] 50/900 — accuracy so far: 28.0%
[AR] 100/900 — accuracy so far: 25.0%
[AR] 150/900 — accuracy so far: 24.7%
[AR] 200/900 — accuracy so far: 29.5%
[AR] 250/900 — accuracy so far: 30.8%
[AR] 300/900 — accuracy so far: 30.0%
[AR] 350/900 — accuracy so far: 28.6%
[AR] 400/900 — accuracy so far: 28.2%
[AR] 450/900 — accuracy so far: 28.7%
[AR] 500/900 — accuracy so far: 27.2%
[AR] 550/900 — accuracy so far: 27.5%
[AR] 600/900 — accuracy so far: 27.0%
[AR] 650/900 — accuracy so far: 27.7%
[AR] 700/900 — accuracy so far: 28.1%
[AR] 750/900 — accuracy so far: 28.9%
[AR] 800/900 — accuracy so far: 29.1%
[AR] 850/900 — accuracy so far: 28.9%
[AR] 900/900 — accuracy so far: 28.4%
✅ AR: 28.4% (256/900)
Evaluating FA (pes_Arab)...
[FA] 50/900 — accuracy so far: 32.0%
[FA] 100/900 — accuracy so far: 33.0%
[FA] 150/900 — accuracy so far: 30.7%
[FA] 200/900 — accuracy so far: 30.5%
[FA] 250/900 — accuracy so far: 28.4%
[FA] 300/900 — accuracy so far: 29.0%
[FA] 350/900 — accuracy so far: 29.4%
[FA] 400/900 — accuracy so far: 30.0%
[FA] 450/900 — accuracy so far: 30.2%
[FA] 500/900 — accuracy so far: 30.8%
[FA] 550/900 — accuracy so far: 30.7%
[FA] 600/900 — accuracy so far: 30.3%
[FA] 650/900 — accuracy so far: 29.5%
[FA] 700/900 — accuracy so far: 29.4%
[FA] 750/900 — accuracy so far: 28.4%
[FA] 800/900 — accuracy so far: 27.8%
[FA] 850/900 — accuracy so far: 28.0%
[FA] 900/900 — accuracy so far: 28.2%
✅ FA: 28.2% (254/900)
============================================================
OVERALL: 28.9% (1039/3600)
Random baseline: 25.0%
============================================================
Results saved to /tmp/eval/belebele_results.json
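
The overall figure and the comparison against the 25.0% random baseline can be re-derived from the per-language counts printed above. The following is a minimal sketch, not the evaluation script itself: the `counts` dictionary is copied by hand from the log, and `z_score` is a hypothetical helper that uses a normal approximation to the binomial to express how far each accuracy sits above chance.

```python
import math

# Correct/total counts copied by hand from the log above (not read from the JSON file).
counts = {"EN": (286, 900), "HE": (243, 900), "AR": (256, 900), "FA": (254, 900)}
BASELINE = 0.25  # random baseline reported in the log


def z_score(correct, total, p0=BASELINE):
    """Normal-approximation z-score of the observed accuracy against chance p0."""
    se = math.sqrt(p0 * (1 - p0) / total)  # standard error under the null
    return (correct / total - p0) / se


total_correct = sum(c for c, _ in counts.values())
total_items = sum(n for _, n in counts.values())
overall = total_correct / total_items

for lang, (c, n) in counts.items():
    print(f"{lang}: {c / n:.1%}  z={z_score(c, n):+.2f}")
print(f"OVERALL: {overall:.1%} ({total_correct}/{total_items})  "
      f"z={z_score(total_correct, total_items):+.2f}")
```

Under this approximation, HE sits roughly 1.4 standard errors above chance, while EN is more than 4 above, so the per-language gaps over the baseline are not equally well supported by 900 questions each.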