Swiss AI Initiative

Team

university

https://www.swiss-ai.org/

swiss-ai

Activity Feed

Recent Activity

rikelue published a dataset about 2 months ago

swiss-ai/harmbench_copyright_classifier_hashes

Papers

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

rikelue updated a dataset 19 days ago

swiss-ai/hallulens

Viewer • Updated 19 days ago • 14.6k • 117

rikelue new activity in swiss-ai/hallulens 19 days ago

Update precise wiki

#1 opened 19 days ago by rikelue

published 4 datasets about 2 months ago

submitted a paper to Daily Papers about 2 months ago

Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

Paper • 2602.22207 • Published Feb 25 • 43

hannayukhymenko

posted an update about 2 months ago

Do you translate your benchmarks from English correctly? 🤔
Turns out, for many languages it is much harder than you can imagine!

Introducing Recovered in Translation 🌍 together with @aalexandrov
https://ritranslation.insait.ai

Translating benchmarks is a painful process that requires a lot of manual inspection and adjustment. You start by setting up the whole pipeline and adapting it to every format type, including task specifics. Some massive translated benchmarks already exist, but they still contain simple (and sometimes silly) bugs that can hurt evaluations :( We present a novel automated translation framework to help with that!

Eastern and Southern European languages have richer linguistic structures than English, and for benchmarks that rely heavily on grammatical coherence, machine translation risks harming evaluations. We found that the grammatical structure of a translated question can leak the answer or be misleading. Some benchmarks are also simply outdated and need to be retranslated with newer and better models.

We present a framework with novel test-time scaling methods that let you control time and cost while reducing the need for human-in-the-loop verification. While working on the Ukrainian-focused MamayLM models, we had to translate 10+ benchmarks in a short span of time. Finding human evaluators is costly and time-consuming, and the same goes for professional translators. With our pipeline we were able to do it in 3 days🏎️
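To make the idea of trading cost for quality concrete, here is a minimal sketch of best-of-N selection, one generic form of test-time scaling. It is not taken from the paper and may differ from the methods the framework actually uses. The `generate` and `score` callables are hypothetical placeholders for a translation model and a quality estimator, and `n` is the knob that sets how much time and money each item costs.

```python
from typing import Callable


def best_of_n(
    source: str,
    generate: Callable[[str, int], str],
    score: Callable[[str, str], float],
    n: int,
) -> str:
    """Return the highest-scoring of n candidate translations.

    generate(source, i) is a hypothetical call that produces the i-th
    candidate; score(source, candidate) is a hypothetical quality
    estimate where higher is better. A larger n costs more calls.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(source, i) for i in range(n)]
    # max() keeps the first candidate when scores tie.
    return max(candidates, key=lambda cand: score(source, cand))


# Toy stand-ins so the sketch runs without any model (assumptions only).
def toy_generate(source: str, i: int) -> str:
    return ["aa", "abcd", "abc"][i % 3]


def toy_score(source: str, candidate: str) -> float:
    # Pretend that a length close to the source means a better translation.
    return -abs(len(candidate) - len(source))


if __name__ == "__main__":
    for budget in (1, 3):
        print(budget, best_of_n("xyz", toy_generate, toy_score, budget))
```

With a real model, raising `n` spends more compute per item in exchange for a better chance that at least one candidate is acceptable, which is the kind of time and cost control described above.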

We hope our findings will help enable stronger multilingual evaluation and development. We are releasing all produced benchmarks on Hugging Face, together with the source code and the arXiv paper 🤗

Paper: Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets (2602.22207)
Code: https://github.com/insait-institute/ritranslation
Benchmarks: https://huggingface.co/collections/INSAIT-Institute/multilingual-benchmarks