Model Card for LabelPigeon

Model Details

Model Description

LabelPigeon is a machine translation model fine-tuned from NLLB-200 to jointly perform translation and cross-lingual label projection. By utilizing simple XML tags (e.g., <a>, <b>) to mark annotated spans in the source text, LabelPigeon translates the text while simultaneously transferring labels to the target language.

  • Developed by: Thennal D K, Chris Biemann, and Hans Ole Hatzel (Language Technology Group, University of Hamburg)
  • Model type: Sequence-to-Sequence Encoder-Decoder (Transformer)
  • Language(s) (NLP): Supports the 200+ languages of the NLLB base model, optimized for translating from English to diverse target languages.

Model Sources

Uses

Direct Use

LabelPigeon is designed to translate annotated datasets from one language to another. It is primarily trained on EN→XX and XX→EN translations, and is directly applicable for translating data for downstream NLP tasks that rely on span-level labels, including:

  • Named Entity Recognition (NER)
  • Extractive Question Answering (QA)
  • Coreference Resolution (CR)
  • Event Argument Extraction

A simple usage example is given below:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("thennal/nllb-200-3.3B-labelpigeon")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B")

# Prepare tagged input
tokenizer.src_lang = "eng_Latn"
source_text = "The <a>Eiffel Tower</a> is located in <b>Paris</b>, <c>France</c>."
inputs = tokenizer(source_text, return_tensors="pt")

# Translate with label projection; set forced_bos_token_id to the target language
outputs = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"))
translated = tokenizer.decode(outputs[0], skip_special_tokens=True)
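The translated output keeps the XML tags around the projected spans, so a small post-processing step can recover the labeled spans with character offsets. The helper below is a minimal sketch (not part of the released model code) that assumes well-formed, non-nested tags of the form <a>…</a>:

```python
import re

def extract_spans(tagged_text):
    """Strip <a>...</a>-style tags from a translation and return the plain
    text plus (tag, span, start, end) tuples with character offsets into
    the untagged text. Assumes well-formed, non-nested tags."""
    pattern = re.compile(r"<([a-z])>(.*?)</\1>")
    pieces, spans, last = [], [], 0
    for m in pattern.finditer(tagged_text):
        pieces.append(tagged_text[last:m.start()])
        start = sum(len(p) for p in pieces)
        spans.append((m.group(1), m.group(2), start, start + len(m.group(2))))
        pieces.append(m.group(2))
        last = m.end()
    pieces.append(tagged_text[last:])
    return "".join(pieces), spans
```

For example, `extract_spans("Der <a>Eiffelturm</a> steht in <b>Paris</b>.")` yields the plain sentence together with the projected entity spans and their offsets.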

Training Details

Training Data

The model was fine-tuned on a modified version of the Salesforce Localization XML MT dataset. The training mixture utilized parallel texts between English and three high-resource languages: German, Russian, and Chinese.

Original UI/formatting tags in the dataset were replaced with generic alphabetical tags (<a>, <b>, etc.) based on their order of appearance. Examples containing no tags were filtered out, resulting in roughly 150,000 parallel training samples.
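The tag normalization described above can be sketched as follows. This is a hypothetical reconstruction of the preprocessing, not the authors' released script; it assumes non-nested, well-formed tags and maps each distinct tag name to a generic letter in order of first appearance:

```python
import re

def genericize_tags(text):
    """Replace original tag names with generic alphabetical tags (<a>, <b>, ...)
    based on their order of first appearance. Assumes well-formed, non-nested
    tags; closing tags reuse the letter assigned to their opening tag."""
    mapping = {}

    def repl(m):
        closing, name = m.group(1), m.group(2)
        if name not in mapping:
            mapping[name] = chr(ord("a") + len(mapping))
        return f"<{closing}{mapping[name]}>"

    return re.sub(r"<(/?)([^<>/\s]+)>", repl, text)
```

For instance, `genericize_tags("Click <ph>Save</ph>, then <b1>OK</b1>.")` produces `"Click <a>Save</a>, then <b>OK</b>."`.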

Training Hyperparameters

  • Training regime: bfloat16 mixed precision
  • Learning rate: 1e-3
  • Batch size: 8 (gradient accumulation steps = 2, effective batch size = 16)
  • Scheduler: Inverse square root
  • Weight Decay: 0.01
  • Warmup: 5% of total steps
  • Epochs: 1 (9,091 steps)
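For reference, the schedule above can be sketched as the common formulation of inverse square root decay with linear warmup. The warmup step count here (455 ≈ 5% of 9,091 steps) is inferred from the hyperparameters listed above, not taken from the training code:

```python
def inverse_sqrt_lr(step, peak_lr=1e-3, warmup_steps=455):
    """Linear warmup to peak_lr over warmup_steps, then inverse
    square root decay. warmup_steps ~= 5% of the 9,091 total steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5
```

At step 1,820 (four times the warmup length), for example, the learning rate has decayed to half the peak value.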

Citation

@misc{k2026justusexmlrevisiting,
      title={Just Use XML: Revisiting Joint Translation and Label Projection}, 
      author={Thennal D K and Chris Biemann and Hans Ole Hatzel},
      year={2026},
      eprint={2603.12021},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.12021}, 
}