XLM-RoBERTa model pre-trained on 2.5 TB of filtered CommonCrawl data covering 100 languages.

This model was fine-tuned on The Latin Library corpus (~15M tokens).
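The fine-tuned model can be queried through the standard masked-language-modelling pipeline. A minimal usage sketch (the example sentence is illustrative; input is lowercased to match the uncased training data, and XLM-RoBERTa's mask token is `<mask>`):

```python
from transformers import pipeline

# Load the fine-tuned checkpoint for masked-token prediction.
fill_mask = pipeline(
    "fill-mask",
    model="Cicciokr/XLM-Roberta-Base-Latin-Uncased",
)

# Lowercase input, since the model is uncased.
predictions = fill_mask("gallia est omnis divisa in partes <mask>.")
for p in predictions:
    print(p["token_str"], round(p["score"], 4))
```

Each prediction is a dict with the filled token (`token_str`), its probability (`score`), and the completed sentence (`sequence`).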

The dataset was cleaned as follows:

  • Removal of all "pseudo-Latin" text ("Lorem ipsum ...").
  • Sentence splitting and normalisation with CLTK.
  • Deduplication of the corpus.
  • Lowercasing of all text.
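The cleaning steps above can be sketched as a small pipeline. This is an illustrative reconstruction, not the exact script: the real pipeline uses CLTK for sentence splitting, for which a simple regex stands in here, and the `clean_corpus` helper name is assumed.

```python
import re

def clean_corpus(documents):
    """Sketch of the cleaning steps: drop pseudo-Latin filler,
    split into sentences (regex stand-in for CLTK), lowercase,
    and deduplicate at the sentence level."""
    seen = set()
    cleaned = []
    for doc in documents:
        # Remove "pseudo-Latin" placeholder text.
        if "lorem ipsum" in doc.lower():
            continue
        # Naive sentence split on terminal punctuation.
        for sent in re.split(r"(?<=[.!?])\s+", doc.strip()):
            sent = sent.lower()
            if sent and sent not in seen:  # deduplicate
                seen.add(sent)
                cleaned.append(sent)
    return cleaned
```

Deduplicating after lowercasing ensures that case-only variants of the same sentence are collapsed as well.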
Model size: 0.3B params · Tensor type: F32 · Format: Safetensors
Model tree for Cicciokr/XLM-Roberta-Base-Latin-Uncased
