Model Card for LabelPigeon

Model Details

Model Description

LabelPigeon is a machine translation model fine-tuned from NLLB-200 to jointly perform translation and cross-lingual label projection. By utilizing simple XML tags (e.g., <a>, <b>) to mark annotated spans in the source text, LabelPigeon translates the text while simultaneously transferring labels to the target language.

  • Developed by: Thennal D K, Chris Biemann, and Hans Ole Hatzel (Language Technology Group, University of Hamburg)
  • Model type: Sequence-to-Sequence Encoder-Decoder (Transformer)
  • Language(s) (NLP): Supports the 200+ languages of the NLLB base model, optimized for translating from English to diverse target languages.

Model Sources

Uses

Direct Use

LabelPigeon is designed to translate annotated datasets from one language to another. It is primarily trained on EN→XX and XX→EN translations, and is directly applicable for translating data for downstream NLP tasks that rely on span-level labels, including:

  • Named Entity Recognition (NER)
  • Extractive Question Answering (QA)
  • Coreference Resolution (CR)
  • Event Argument Extraction

A simple usage example is given below:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("thennal/nllb-200-3.3B-labelpigeon")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B")

# Prepare tagged input
tokenizer.src_lang = "eng_Latn"
source_text = "The <a>Eiffel Tower</a> is located in <b>Paris</b>, <c>France</c>."
inputs = tokenizer(source_text, return_tensors="pt")

# Translate with label projection; set forced_bos_token_id to the target language
outputs = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"))
translated = tokenizer.decode(outputs[0], skip_special_tokens=True)
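The translated output keeps the XML tags around the projected spans, so a small post-processing step can recover the labeled spans with character offsets. The helper below is a minimal sketch (not part of the released model code) that assumes well-formed, non-nested tags of the form <a>…</a>:

```python
import re

def extract_spans(tagged_text):
    """Strip <a>...</a>-style tags from a translation and return the plain
    text plus (tag, span, start, end) tuples with character offsets into
    the untagged text. Assumes well-formed, non-nested tags."""
    pattern = re.compile(r"<([a-z])>(.*?)</\1>")
    pieces, spans, last = [], [], 0
    for m in pattern.finditer(tagged_text):
        pieces.append(tagged_text[last:m.start()])
        start = sum(len(p) for p in pieces)
        spans.append((m.group(1), m.group(2), start, start + len(m.group(2))))
        pieces.append(m.group(2))
        last = m.end()
    pieces.append(tagged_text[last:])
    return "".join(pieces), spans
```

For example, `extract_spans("Der <a>Eiffelturm</a> steht in <b>Paris</b>.")` yields the plain sentence together with the projected entity spans and their offsets.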

Training Details

Training Data

The model was fine-tuned on a modified version of the Salesforce Localization XML MT dataset. The training mixture utilized parallel texts between English and three high-resource languages: German, Russian, and Chinese.

Original UI/formatting tags in the dataset were replaced with generic alphabetical tags (<a>, <b>, etc.) based on their order of appearance. Examples containing no tags were filtered out, resulting in roughly 150,000 parallel training samples.
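The tag normalization described above can be sketched as follows. This is a hypothetical reconstruction of the preprocessing, not the authors' released script; it assumes non-nested, well-formed tags and maps each distinct tag name to a generic letter in order of first appearance:

```python
import re

def genericize_tags(text):
    """Replace original tag names with generic alphabetical tags (<a>, <b>, ...)
    based on their order of first appearance. Assumes well-formed, non-nested
    tags; closing tags reuse the letter assigned to their opening tag."""
    mapping = {}

    def repl(m):
        closing, name = m.group(1), m.group(2)
        if name not in mapping:
            mapping[name] = chr(ord("a") + len(mapping))
        return f"<{closing}{mapping[name]}>"

    return re.sub(r"<(/?)([^<>/\s]+)>", repl, text)
```

For instance, `genericize_tags("Click <ph>Save</ph>, then <b1>OK</b1>.")` produces `"Click <a>Save</a>, then <b>OK</b>."`.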

Training Hyperparameters

  • Training regime: bfloat16 mixed precision
  • Learning rate: 1e-3
  • Batch size: 8 (gradient accumulation steps = 2, effective batch size = 16)
  • Scheduler: Inverse square root
  • Weight Decay: 0.01
  • Warmup: 5% of total steps
  • Epochs: 1 (9,091 steps)
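For reference, the schedule above can be sketched as the common formulation of inverse square root decay with linear warmup. The warmup step count here (455 ≈ 5% of 9,091 steps) is inferred from the hyperparameters listed above, not taken from the training code:

```python
def inverse_sqrt_lr(step, peak_lr=1e-3, warmup_steps=455):
    """Linear warmup to peak_lr over warmup_steps, then inverse
    square root decay. warmup_steps ~= 5% of the 9,091 total steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5
```

At step 1,820 (four times the warmup length), for example, the learning rate has decayed to half the peak value.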

Citation

@misc{k2026justusexmlrevisiting,
      title={Just Use XML: Revisiting Joint Translation and Label Projection}, 
      author={Thennal D K and Chris Biemann and Hans Ole Hatzel},
      year={2026},
      eprint={2603.12021},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.12021}, 
}