Model Card for LabelPigeon
Model Details
Model Description
LabelPigeon is a machine translation model fine-tuned from NLLB-200 to jointly perform translation and cross-lingual label projection. Simple XML tags (e.g., <a>, <b>) mark annotated spans in the source text; LabelPigeon translates the text while simultaneously projecting those span labels into the target language.
- Developed by: Thennal D K, Chris Biemann, and Hans Ole Hatzel (Language Technology Group, University of Hamburg)
- Model type: Sequence-to-Sequence Encoder-Decoder (Transformer)
- Language(s) (NLP): Supports the 200+ languages of the NLLB base model, optimized for translating from English to diverse target languages.
Model Sources
- Repository: https://github.com/thennal10/LabelPigeon
- Paper: arXiv:2603.12021
Uses
Direct Use
LabelPigeon is designed to translate annotated datasets from one language to another. It is primarily trained on EN→XX and XX→EN translation and is directly applicable for translating data for downstream NLP tasks that rely on span-level labels, including:
- Named Entity Recognition (NER)
- Extractive Question Answering (QA)
- Coreference Resolution (CR)
- Event Argument Extraction
A simple usage example is given below:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Load the model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("thennal/nllb-200-3.3B-labelpigeon")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B")
# Prepare tagged input
tokenizer.src_lang = "eng_Latn"
source_text = "The <a>Eiffel Tower</a> is located in <b>Paris</b>, <c>France</c>."
inputs = tokenizer(source_text, return_tensors="pt")
# Translate with label projection; forced_bos_token_id must be set to the target language code
outputs = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"))
translated = tokenizer.decode(outputs[0], skip_special_tokens=True)
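The decoded output contains the same generic tags around the translated spans. These can be parsed back into character-level spans for downstream use. A minimal sketch of such a parser (the helper name and regex are illustrative, not part of the model's API):

```python
import re

def extract_spans(tagged_text):
    """Strip generic XML tags (<a>, <b>, ...) from a translated string,
    returning the clean text plus {tag: (start, end)} character spans."""
    spans = {}
    open_starts = {}
    clean_parts = []
    pos = 0   # position in the clean (tag-free) text
    last = 0  # position in the tagged text
    for m in re.finditer(r"<(/?)([a-z])>", tagged_text):
        clean_parts.append(tagged_text[last:m.start()])
        pos += m.start() - last
        last = m.end()
        tag = m.group(2)
        if m.group(1) == "":          # opening tag: remember span start
            open_starts[tag] = pos
        else:                         # closing tag: record the span
            spans[tag] = (open_starts[tag], pos)
    clean_parts.append(tagged_text[last:])
    return "".join(clean_parts), spans

clean, spans = extract_spans("Der <a>Eiffelturm</a> steht in <b>Paris</b>.")
# clean  -> "Der Eiffelturm steht in Paris."
# spans  -> {"a": (4, 14), "b": (24, 29)}
```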
Training Details
Training Data
The model was fine-tuned on a modified version of the Salesforce Localization XML MT dataset. The training mixture utilized parallel texts between English and three high-resource languages: German, Russian, and Chinese.
Original UI/formatting tags in the dataset were replaced with generic alphabetical tags (<a>, <b>, etc.) based on their order of appearance. Examples containing no tags were filtered out, resulting in roughly 150,000 parallel training samples.
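The tag normalization step described above can be sketched as follows. This is a hedged approximation (the function name and regex are illustrative; the actual preprocessing may differ), assuming the original tags appear as simple XML elements without attributes:

```python
import re
import string

def genericize_tags(text):
    """Replace original XML tag names with generic single letters
    (<a>, <b>, ...) in order of first appearance."""
    mapping = {}
    letters = iter(string.ascii_lowercase)

    def rename(match):
        slash, name = match.group(1), match.group(2)
        if name not in mapping:
            mapping[name] = next(letters)
        return f"<{slash}{mapping[name]}>"

    return re.sub(r"<(/?)([\w-]+)>", rename, text)

genericize_tags("Click <ph>Save</ph> in the <uicontrol>File</uicontrol> menu.")
# -> "Click <a>Save</a> in the <b>File</b> menu."
```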
Training Hyperparameters
- Training regime: bfloat16 mixed precision
- Learning rate: 1e-3
- Batch size: 8 (with Gradient Accumulation steps = 2, Effective Batch Size = 16)
- Scheduler: Inverse square root
- Weight Decay: 0.01
- Warmup: 5% of total steps
- Epochs: 1 (9,091 steps)
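The hyperparameters above roughly correspond to a configuration like the following. This is a sketch, not the authors' exact training script; the key names mirror Hugging Face `TrainingArguments` (e.g., `lr_scheduler_type="inverse_sqrt"`), and nothing beyond the listed values should be assumed:

```python
# Hyperparameters from the list above, expressed as a plain config dict.
training_config = {
    "learning_rate": 1e-3,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,   # effective batch size = 16
    "lr_scheduler_type": "inverse_sqrt",
    "weight_decay": 0.01,
    "warmup_ratio": 0.05,               # 5% of total steps
    "num_train_epochs": 1,              # 9,091 steps total
    "bf16": True,                       # bfloat16 mixed precision
}

effective_batch_size = (training_config["per_device_train_batch_size"]
                        * training_config["gradient_accumulation_steps"])
```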
Citation
@misc{k2026justusexmlrevisiting,
  title={Just Use XML: Revisiting Joint Translation and Label Projection},
  author={Thennal D K and Chris Biemann and Hans Ole Hatzel},
  year={2026},
  eprint={2603.12021},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.12021},
}