COMBO-NLP Model for Turkish
Model Description
This is a Turkish-language model based on COMBO-NLP, an open-source natural language preprocessing system. It performs:
- sentence segmentation (via LAMBO)
- tokenisation (via LAMBO)
- part-of-speech tagging
- morphological analysis
- lemmatisation
- dependency parsing
The Turkish model uses FacebookAI/xlm-roberta-base as its base encoder and is trained on UD_Turkish-FrameNet (UD v2.17).
Evaluation
Evaluation was performed on the UD_Turkish-FrameNet test split using the official CoNLL 2018 shared task evaluation script.
Two evaluation rows are reported:
- Full-text (F1): raw text is segmented by LAMBO, then parsed and compared against the gold annotation. This measures end-to-end pipeline performance, including segmentation quality.
- Aligned accuracy: accuracy computed only on tokens that the segmenter identified correctly (aligned tokens). This isolates tagging and parsing quality from segmentation errors.
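The difference between the two rows can be sketched from four counts per metric. The snippet below is a minimal illustration, assuming the counting scheme of the CoNLL 2018 evaluation script (gold items, system items, correct items, and aligned items); the `Counts` class and the numbers in the example are hypothetical and are not taken from this model's evaluation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Counts:
    """Counts for one metric (hypothetical container, for illustration only)."""
    gold_total: int                      # items in the gold annotation
    system_total: int                    # items produced by the pipeline
    correct: int                         # items that match the gold annotation
    aligned_total: Optional[int] = None  # items whose segmentation matches gold


def f1(c: Counts) -> float:
    """Full-text F1 in percent: harmonic mean of precision and recall."""
    denominator = c.gold_total + c.system_total
    return 100 * 2 * c.correct / denominator if denominator else 0.0


def aligned_accuracy(c: Counts) -> Optional[float]:
    """Accuracy in percent over aligned items only; None when undefined."""
    if not c.aligned_total:
        return None
    return 100 * c.correct / c.aligned_total


# Invented counts: 100 gold words, 100 system words, 96 aligned, 90 correct.
example = Counts(gold_total=100, system_total=100, correct=90, aligned_total=96)
print(f1(example))                # 90.0
print(aligned_accuracy(example))  # 93.75
```

Because segmentation errors lower the number of aligned items, aligned accuracy is never lower than the corresponding full-text score when the gold and system totals are equal, which matches the pattern in the tables below.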
Morphosyntactic Tagging
| Metric | Tokens | Sentences | Words | UPOS | XPOS | UFeats | AllTags | Lemmas |
|---|---|---|---|---|---|---|---|---|
| Full-text (F1) | 99.69 | 100.00 | 99.69 | 95.68 | 99.69 | 91.52 | 90.70 | 94.52 |
| Aligned accuracy | n/a | n/a | n/a | 95.97 | 100.00 | 91.80 | 90.98 | 94.81 |
Dependency Parsing
| Metric | UAS | LAS | CLAS | MLAS | BLEX |
|---|---|---|---|---|---|
| Full-text (F1) | 93.36 | 85.33 | 82.94 | 73.54 | 77.10 |
| Aligned accuracy | 93.65 | 85.59 | 83.17 | 73.74 | 77.31 |
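For readers unfamiliar with the first two columns: UAS (unlabeled attachment score) is the share of words that receive the correct head, and LAS (labeled attachment score) additionally requires the correct dependency relation. The sketch below illustrates the idea on already aligned words; `attachment_scores` is a hypothetical helper, and it compares relation labels exactly as given, whereas the official script ignores relation subtypes.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency relation) for one word


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent for two equally long lists of arcs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must describe the same words")
    if not gold:
        raise ValueError("at least one word is required")
    # UAS counts words whose head matches; LAS also requires the same relation.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    labeled_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100 * head_hits / total, 100 * labeled_hits / total


# Invented four-word example: three heads are right, two of them with the right label.
gold = [(2, "nsubj"), (0, "root"), (4, "amod"), (2, "obj")]
predicted = [(2, "nsubj"), (0, "root"), (4, "det"), (3, "obj")]
uas, las = attachment_scores(gold, predicted)
print(uas, las)  # 75.0 50.0
```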
Usage
Install the library from PyPI (assuming you have a virtual environment created):
pip install combo-nlp
Install the LAMBO segmenter (only needed when passing raw text strings to COMBO):
pip install --index-url https://pypi.clarin-pl.eu/ lambo
from combo import COMBO
# Load a pre-trained model with corresponding Lambo segmenter
nlp = COMBO("Turkish")
# Parse raw text (handles sentence splitting + tokenization)
result = nlp("Çevik kahverengi tilki tembel köpeğin üzerinden atlar.")
# Inspect results
for sentence in result:
    for token in sentence:
        print(f"{token.form:<15} {token.lemma:<15} {token.upos:<8} head={token.head} {token.deprel}")
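To turn the numeric `head` values printed above into readable dependency edges, a small helper can resolve each index to the head's word form. This is a sketch and not part of the COMBO-NLP API: `dependency_edges` is a hypothetical helper that relies only on the `form`, `head`, and `deprel` attributes used in the loop above, and it assumes `head` is a 1-based position within the sentence, with 0 marking the root (the CoNLL-U convention). The stand-in tokens are hand-written so the snippet runs without downloading a model; they are not model output.

```python
from types import SimpleNamespace


def dependency_edges(sentence):
    """Return (dependent form, relation, head form) triples for one sentence.

    Assumption: token.head is a 1-based index into the sentence and 0 is the root.
    """
    tokens = list(sentence)
    edges = []
    for token in tokens:
        head = int(token.head)
        head_form = "ROOT" if head == 0 else tokens[head - 1].form
        edges.append((token.form, token.deprel, head_form))
    return edges


# Hand-written stand-in for one parsed sentence (not real model output).
demo = [
    SimpleNamespace(form="Tilki", head=2, deprel="nsubj"),
    SimpleNamespace(form="atlar", head=0, deprel="root"),
    SimpleNamespace(form=".", head=2, deprel="punct"),
]

for dependent, relation, head in dependency_edges(demo):
    print(f"{dependent:<10} --{relation}--> {head}")
```

With a loaded model, the same helper could be applied to each sentence in `result`, provided the assumption about `head` holds for the objects COMBO returns.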
Refer to the COMBO-NLP and LAMBO documentation for further installation and usage instructions:
- https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp
- https://gitlab.clarin-pl.eu/syntactic-tools/lambo
License
The license, cc-by-sa-4.0, is derived from the training data, the Universal Dependencies treebank UD_Turkish-FrameNet. For the full license terms, refer to the LICENSE.txt file in the treebank repository listed under Resources.
Citation
If you use this model, please cite:
Ulewicz, M., Jabłońska, M., Klimaszewski, M., Przybyła, P., Pszenny, Ł., Rybak, P., Wiącek, M., & Wróblewska, A. (2026). COMBO-NLP Models Trained on UD v2.17. Zenodo. https://doi.org/10.5281/zenodo.19650523
@software{combo_nlp_2026,
author = {Ulewicz, Michał and Jabłońska, Maja and Klimaszewski, Mateusz and Przybyła, Piotr and Pszenny, Łukasz and Rybak, Piotr and Wiącek, Martyna and Wróblewska, Alina},
title = {{COMBO-NLP} Models Trained on {UD} v2.17},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.19650523},
url = {https://doi.org/10.5281/zenodo.19650523}
}
Resources
- COMBO-NLP: https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp
- LAMBO: https://gitlab.clarin-pl.eu/syntactic-tools/lambo
- UD_Turkish-FrameNet: https://github.com/UniversalDependencies/UD_Turkish-FrameNet