---
license: cc-by-sa-4.0
datasets:
- procesaur/ZNANJE
- procesaur/STARS
- procesaur/Vikipedija
- procesaur/Vikizvornik
- jerteh/SrpELTeC
- procesaur/kisobran
language:
- sr
- hr
base_model:
- FacebookAI/xlm-roberta-base
---

# XLMali

Вишејезични модел, 279 милиона параметара

Обучаван над корпусима српског и српскохрватског језика - 20 милијарди речи

Једнака подршка уноса на ћирилици и латиници!

Multilingual model, 279 million parameters

Trained on Serbian and Serbo-Croatian corpora - 20 billion words

Equal support for Cyrillic and Latin input!
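
To illustrate what equal script support means in practice, here is a minimal sketch (not part of the model) of the one-to-one correspondence between Serbian Cyrillic and Latin orthographies; the model accepts either script directly, so the `to_latin` helper below is purely illustrative and not an API of this repository.

```python
# Hypothetical helper: maps Serbian Cyrillic letters to their Latin
# counterparts (uppercase is folded to lowercase for brevity).
CYR2LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

def to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic text to the Latin alphabet."""
    return "".join(CYR2LAT.get(ch.lower(), ch) for ch in text)

print(to_latin("мачка"))  # mačka
print(to_latin("пас и свемир"))  # pas i svemir
```

Either form of a word ("мачка" or "mačka") can be fed to the model and should be handled equivalently.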

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="te-sla/teslaXLM")
# The input must contain the model's mask token:
unmasker("Kada bi čovek znao gde će pasti on bi <mask>.")
```

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
from torch import LongTensor, no_grad
from scipy import spatial

tokenizer = AutoTokenizer.from_pretrained("te-sla/teslaXLM")
model = AutoModelForMaskedLM.from_pretrained("te-sla/teslaXLM", output_hidden_states=True)

x = " pas"
y = " mačka"
z = " svemir"

tensor_x = LongTensor(tokenizer.encode(x, add_special_tokens=False)).unsqueeze(0)
tensor_y = LongTensor(tokenizer.encode(y, add_special_tokens=False)).unsqueeze(0)
tensor_z = LongTensor(tokenizer.encode(z, add_special_tokens=False)).unsqueeze(0)

model.eval()
with no_grad():
    # Mean-pool the last hidden layer so multi-token words also yield one vector
    vektor_x = model(input_ids=tensor_x).hidden_states[-1].squeeze(0).mean(dim=0)
    vektor_y = model(input_ids=tensor_y).hidden_states[-1].squeeze(0).mean(dim=0)
    vektor_z = model(input_ids=tensor_z).hidden_states[-1].squeeze(0).mean(dim=0)

# "pas" (dog) should be closer to "mačka" (cat) than to "svemir" (universe)
print(spatial.distance.cosine(vektor_x, vektor_y))
print(spatial.distance.cosine(vektor_x, vektor_z))
```
## Evaluation

Евалуација XLMR-base модела за српски језик

Serbian XLMR-base models evaluation results

- Author: Mihailo Škorić (@procesaur)
- Author: Saša Petalinkar (@tanor)
- Computation: TESLA project (@te-sla)

## Citation

```bibtex
@incollection{skoric2025:juznoslovenskijezici,
  author    = {Škorić, Mihailo and Petalinkar, Saša},
  orcid     = {0000-0003-4811-8692 and 0009-0007-9664-3594},
  title     = {Quality Textual Corpora and New South Slavic Language Models},
  license   = {https://creativecommons.org/licenses/by/4.0/},
  booktitle = {Proceedings of the International Conference South Slavic Languages in the Digital Environment JuDig: Thematic Collection of Papers},
  editor    = {Moskovljević Popović, Jasmina and Stanković, Ranka},
  isbn      = {978-86-6153-791-2},
  series    = {South Slavic Languages in the Digital Environment JuDig},
  publisher = {University of Belgrade — Faculty of Philology},
  address   = {Belgrade},
  year      = {2025},
  volume    = {1},
  pages     = {337--348},
  note      = {19},
  doi       = {10.18485/judig.2025.1.ch19},
  doiurl    = {http://doi.fil.bg.ac.rs/volume.php?pt=eb_ser&issue=judig-2025-1&i=19},
  url       = {http://doi.fil.bg.ac.rs/pdf/eb_ser/judig/2025-1/judig-2025-1-ch19.pdf}
}
```

Истраживање jе спроведено уз подршку Фонда за науку Републике Србиjе, #7276, Text Embeddings – Serbian Language Applications – TESLA

This research was supported by the Science Fund of the Republic of Serbia, #7276, Text Embeddings – Serbian Language Applications – TESLA