COMBO-NLP Model for Icelandic
Model Description
This is an Icelandic-language model based on COMBO-NLP, an open-source natural language preprocessing system. It performs:
- sentence segmentation (via LAMBO)
- tokenisation (via LAMBO)
- part-of-speech tagging
- morphological analysis
- lemmatisation
- dependency parsing
The Icelandic model uses FacebookAI/xlm-roberta-base as its base encoder and is trained on UD_Icelandic-IcePaHC (UD v2.17).
Evaluation
Evaluation was performed on the UD_Icelandic-IcePaHC test split using the standard CoNLL 2018 evaluation script (conll18_ud_eval.py).
Two evaluation rows are reported:
- Full-text (F1): raw text is segmented by LAMBO, then parsed and compared against the gold annotation. This measures end-to-end pipeline performance, including segmentation quality.
- Aligned accuracy: accuracy computed only on tokens that the segmenter identified correctly (aligned tokens). This measures parsing quality independently of segmentation errors.
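The difference between the two rows can be illustrated with a minimal sketch. It assumes the usual definitions behind the CoNLL 2018 metrics: precision and recall are computed against the system and gold totals, while aligned accuracy divides by the number of aligned words only. The `Counts` class and all numbers are hypothetical and are not taken from this evaluation.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    gold_total: int     # items in the gold standard
    system_total: int   # items produced by the pipeline
    aligned_total: int  # system words that align with a gold word
    correct: int        # aligned items whose annotation matches gold


def f1(c: Counts) -> float:
    """Full-text F1: penalises both segmentation and annotation errors."""
    precision = c.correct / c.system_total if c.system_total else 0.0
    recall = c.correct / c.gold_total if c.gold_total else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def aligned_accuracy(c: Counts) -> float:
    """Accuracy restricted to words the segmenter identified correctly."""
    return c.correct / c.aligned_total if c.aligned_total else 0.0


# Hypothetical counts, for illustration only (not from the evaluation above).
counts = Counts(gold_total=1000, system_total=1010, aligned_total=990, correct=950)
print(f"F1 = {100 * f1(counts):.2f}")
print(f"Aligned accuracy = {100 * aligned_accuracy(counts):.2f}")
```

Since the number of aligned words cannot exceed either the gold or the system total, aligned accuracy is never lower than F1, which matches the pattern in the reported scores.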
Morphosyntactic Tagging
| Metric | Tokens | Sentences | Words | UPOS | XPOS | UFeats | AllTags | Lemmas |
|---|---|---|---|---|---|---|---|---|
| Full-text (F1) | 99.77 | 94.33 | 99.71 | 97.00 | 93.78 | 90.62 | 86.51 | 96.55 |
| Aligned accuracy | n/a | n/a | n/a | 97.28 | 94.05 | 90.88 | 86.76 | 96.82 |
Dependency Parsing
| Metric | UAS | LAS | CLAS | MLAS | BLEX |
|---|---|---|---|---|---|
| Full-text (F1) | 88.52 | 85.16 | 80.38 | 67.52 | 77.29 |
| Aligned accuracy | 88.77 | 85.40 | 80.57 | 67.68 | 77.47 |
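For readers unfamiliar with the attachment scores, the following is a minimal sketch of UAS and LAS on a toy sentence, assuming identical tokenisation in gold and prediction. The data and the `attachment_scores` helper are hypothetical. The official script additionally ignores relation subtypes, and CLAS, MLAS and BLEX apply further conditions that are not shown here.

```python
# Each word is (head_index, dependency_relation); index 0 denotes the root.
# Toy data for illustration only, not real model output.
gold = [(2, "amod"), (3, "nsubj"), (0, "root"), (5, "case"), (3, "obl")]
pred = [(2, "amod"), (3, "obj"), (0, "root"), (3, "case"), (3, "obl")]


def attachment_scores(gold, pred):
    """Return (UAS, LAS) as percentages for two equally long word lists."""
    if len(gold) != len(pred):
        raise ValueError("this sketch assumes identical tokenisation")
    if not gold:
        return 0.0, 0.0
    # UAS: the predicted head matches the gold head.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head and the relation label match.
    labelled_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100 * head_hits / n, 100 * labelled_hits / n


uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.1f}, LAS = {las:.1f}")
```

In this toy case the second word has the right head but the wrong label, so it counts towards UAS only; the fourth word has the wrong head and counts towards neither.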
Usage
Install the library from PyPI (assuming you have a virtual environment created):
pip install combo-nlp
Install the LAMBO segmenter (only needed when passing raw text strings to COMBO):
pip install --index-url https://pypi.clarin-pl.eu/ lambo
from combo import COMBO
# Load a pre-trained model with corresponding Lambo segmenter
nlp = COMBO("Icelandic")
# Parse raw text (handles sentence splitting + tokenization)
result = nlp("Fljóti brúni refurinn hleypur yfir lata hundinn.")
# Inspect results
for sentence in result:
    for token in sentence:
        print(f"{token.form:<15} {token.lemma:<15} {token.upos:<8} head={token.head} {token.deprel}")
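Once a sentence has been parsed, the dependency tree can be navigated through the `head` attribute. The sketch below relies only on the token attributes used above (`form`, `lemma`, `upos`, `head`, `deprel`) and assumes that `head` is an integer following the CoNLL-U convention (1-based position of the governor, 0 for the root). The helper functions are hypothetical and not part of COMBO-NLP, and the stand-in tokens are hand-written so that the example runs without downloading a model.

```python
from types import SimpleNamespace


def root_tokens(sentence):
    """Return the tokens attached to the artificial root (head == 0)."""
    return [token for token in sentence if token.head == 0]


def dependents_of(sentence, head_index):
    """Return tokens whose head is the 1-based position `head_index`."""
    return [token for token in sentence if token.head == head_index]


# Stand-in tokens for illustration; these are NOT real model output.
demo_sentence = [
    SimpleNamespace(form="Refurinn", lemma="refur", upos="NOUN", head=2, deprel="nsubj"),
    SimpleNamespace(form="hleypur", lemma="hlaupa", upos="VERB", head=0, deprel="root"),
    SimpleNamespace(form=".", lemma=".", upos="PUNCT", head=2, deprel="punct"),
]

for root in root_tokens(demo_sentence):
    print("root:", root.form)
for dep in dependents_of(demo_sentence, 2):
    print(f"  {dep.deprel}: {dep.form}")
```

If the assumption about `head` holds, the same helpers should work on the sentences in `result` in place of `demo_sentence`.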
Refer to the COMBO-NLP documentation for installation and usage instructions:
- https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp
- https://gitlab.clarin-pl.eu/syntactic-tools/lambo
License
The training data is licensed under CC BY-SA 4.0 (cc-by-sa-4.0), as specified by the Universal Dependencies treebank it is derived from. For the full license terms, refer to the LICENSE.txt file in the treebank repository (listed under Resources).
Citation
If you use this model, please cite:
Ulewicz, M., Jabłońska, M., Klimaszewski, M., Przybyła, P., Pszenny, Ł., Rybak, P., Wiącek, M., & Wróblewska, A. (2026). COMBO-NLP Models Trained on UD v2.17. Zenodo. https://doi.org/10.5281/zenodo.19650523
@software{combo_nlp_2026,
author = {Ulewicz, Michał and Jabłońska, Maja and Klimaszewski, Mateusz and Przybyła, Piotr and Pszenny, Łukasz and Rybak, Piotr and Wiącek, Martyna and Wróblewska, Alina},
title = {{COMBO-NLP} Models Trained on {UD} v2.17},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.19650523},
url = {https://doi.org/10.5281/zenodo.19650523}
}
Treebank References
If you use the Icelandic IcePaHC treebank data, please also cite:
@inproceedings{arnardottir-etal-2020-universal,
title = "A {U}niversal {D}ependencies Conversion Pipeline for a {P}enn-format Constituency Treebank",
author = "Arnard{\'o}ttir, {\TH}{\'o}runn and
Hafsteinsson, Hinrik and
Sigur{\dh}sson, Einar Freyr and
Bjarnad{\'o}ttir, Krist{\'\i}n and
Ingason, Anton Karl and
J{\'o}nsd{\'o}ttir, Hildur and
Steingr{\'\i}msson, Stein{\th}{\'o}r",
booktitle = "Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.udw-1.3",
pages = "16--25",
abstract = "The topic of this paper is a rule-based pipeline for converting constituency treebanks based on the Penn Treebank format to Universal Dependencies (UD). We describe an Icelandic constituency treebank, its annotation scheme and the UD scheme. The conversion is discussed, the methods used to deliver a fully automated UD corpus and complications involved. To show its applicability to corpora in different languages, we extend the pipeline and convert a Faroese constituency treebank to a UD corpus. The result is an open-source conversion tool, published under an Apache 2.0 license, applicable to a Penn-style treebank for conversion to a UD corpus, along with the two new UD corpora.",
}
@inproceedings{arnardottir-etal-2023-evaluating,
title = "Evaluating a {U}niversal {D}ependencies Conversion Pipeline for {I}celandic",
author = "Arnard{\'o}ttir, {\TH}{\'o}runn and
Hafsteinsson, Hinrik and
Jasonarson, Atli and
Ingason, Anton and
Steingr{\'\i}msson, Stein{\th}{\'o}r",
editor = {Alum{\"a}e, Tanel and
Fishel, Mark},
booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
month = may,
year = "2023",
address = "T{\'o}rshavn, Faroe Islands",
publisher = "University of Tartu Library",
url = "https://aclanthology.org/2023.nodalida-1.69",
pages = "698--704",
abstract = "We describe the evaluation and development of a rule-based treebank conversion tool, UDConverter, which converts treebanks from the constituency-based PPCHE annotation scheme to the dependency-based Universal Dependencies (UD) scheme. The tool has already been used in the production of three UD treebanks, although no formal evaluation of the tool has been carried out as of yet. By manually correcting new output files from the converter and comparing them to the raw output, we measured the labeled attachment score (LAS) and unlabeled attachment score (UAS) of the converted texts. We obtain an LAS of 82.87 and a UAS of 87.91. In comparison to other tools, UDConverter currently provides the best results in automatic UD treebank creation for Icelandic.",
}
Resources
- COMBO-NLP: https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp
- LAMBO: https://gitlab.clarin-pl.eu/syntactic-tools/lambo
- UD_Icelandic-IcePaHC: https://github.com/UniversalDependencies/UD_Icelandic-IcePaHC