arXiv:2604.11290

Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation

Published on Apr 13
Submitted by Lj V. Miranda on Apr 14

AI-generated summary

Effective multilingual teacher models for synthetic data generation are identified through systematic evaluation of data quality metrics rather than model size alone, with findings showing that prompt diversity and response fluency better predict student model performance.

Abstract

Synthesizing supervised finetuning (SFT) data from language models (LMs) to teach smaller models multilingual tasks has become increasingly common. However, teacher model selection is often ad hoc, typically defaulting to the largest available option, even though such models may have significant capability gaps in non-English languages. This practice can result in poor-quality synthetic data and suboptimal student downstream performance. In this work, we systematically characterize what makes an effective multilingual teacher. We combine intrinsic measures of data quality with extrinsic student model performance in a metric we call Polyglot Score, evaluating 10 LMs across 6 typologically diverse languages, generating over 1.4M SFT examples, and training 240 student models. Among the models tested, Gemma 3 27B and Aya Expanse 32B emerge as consistently effective teachers across different student base model families. Further analyses reveal that model scale alone does not significantly predict teacher effectiveness; instead, data qualities such as prompt diversity, length, and response fluency capture over 93.3% of the variance in intrinsic data quality and predict student performance. Finally, we provide practical recommendations, including matching the model families of teacher-student pairs and translating from or responding to existing prompts, which can yield improvements for less-resourced languages. We hope that our work advances data-centric research in multilingual synthetic data and LM development.
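
The abstract does not give the actual construction of Polyglot Score, so the sketch below is only a minimal Python illustration of the general idea: compute a few intrinsic signals over a synthetic SFT set (a bigram-based prompt-diversity proxy, normalized response length, and an externally supplied fluency score) and mix them with an extrinsic student evaluation score. Every metric definition, the equal weighting, and the fluency / student_accuracy inputs are assumptions for illustration, not the paper's method.

# Illustrative sketch only; the paper's actual Polyglot Score formulation is not given in the abstract.
# The metric choices, equal weighting, and the externally supplied fluency score are assumptions.
from dataclasses import dataclass
from statistics import mean

@dataclass
class SFTExample:
    prompt: str
    response: str

def distinct_bigram_ratio(texts):
    # Prompt-diversity proxy: unique bigrams divided by total bigrams across all prompts.
    bigrams = []
    for text in texts:
        tokens = text.split()
        bigrams.extend(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams) if bigrams else 0.0

def polyglot_style_score(examples, fluency, student_accuracy, length_cap=512.0):
    # Toy combination: average three intrinsic signals, then mix equally with the
    # extrinsic student evaluation score (all values assumed to lie in [0, 1]).
    diversity = distinct_bigram_ratio([ex.prompt for ex in examples])
    avg_len = mean(len(ex.response.split()) for ex in examples)
    length = min(avg_len / length_cap, 1.0)
    intrinsic = mean([diversity, length, fluency])
    return 0.5 * intrinsic + 0.5 * student_accuracy

if __name__ == "__main__":
    data = [
        SFTExample("Translate to French: 'good morning'", "Bonjour"),
        SFTExample("Summarize the water cycle in Swahili.", "Mzunguko wa maji ni ..."),
    ]
    # fluency stands in for an external judge score; student_accuracy for a held-out eval.
    print(round(polyglot_style_score(data, fluency=0.8, student_accuracy=0.62), 3))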

Community

Lj V. Miranda (paper author and paper submitter):

We systematically characterize what makes a good teacher model for multilingual synthetic data generation, generating over 1.4M SFT instances and training 100+ student models across six languages.
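
For readers unfamiliar with this setup, the following is a toy sketch of what generating SFT data from a teacher model looks like in code. The teacher_generate function and the prompt lists are hypothetical placeholders, and the real pipeline (batching, filtering, 1.4M-scale generation) is not shown.

# Minimal sketch of teacher-to-student SFT data generation as described above.
# teacher_generate is a hypothetical stand-in for whatever API serves the teacher model.
import json

def teacher_generate(prompt: str, language: str) -> str:
    # Placeholder: in practice this would call the teacher LM (e.g., via an inference API).
    return f"[{language} response to: {prompt}]"

def build_sft_dataset(prompts_by_language: dict[str, list[str]]) -> list[dict]:
    # Collect one (prompt, response) record per prompt, per language.
    records = []
    for language, prompts in prompts_by_language.items():
        for prompt in prompts:
            records.append({
                "language": language,
                "prompt": prompt,
                "response": teacher_generate(prompt, language),
            })
    return records

if __name__ == "__main__":
    dataset = build_sft_dataset({
        "sw": ["Eleza mzunguko wa maji."],
        "tl": ["Ipaliwanag ang photosynthesis."],
    })
    print(json.dumps(dataset, ensure_ascii=False, indent=2))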

Models citing this paper: 8

Datasets citing this paper: 1

Spaces citing this paper: 0

Collections including this paper: 1