COMBO-NLP Model for Chinese

Model Description

This is a Chinese-language model based on COMBO-NLP, an open-source natural language preprocessing system. It performs:

  • sentence segmentation (via LAMBO)
  • tokenisation (via LAMBO)
  • part-of-speech tagging
  • morphological analysis
  • lemmatisation
  • dependency parsing

The Chinese model uses FacebookAI/xlm-roberta-base as its base encoder and is trained on UD_Chinese-GSDSimp (UD v2.17).

Evaluation

Evaluation was performed on the UD_Chinese-GSDSimp test split using the standard CoNLL 2018 eval script.

Two evaluation rows are reported:

  • Full-text (F1): raw text is segmented by LAMBO, then parsed and compared against gold — measures end-to-end pipeline performance including segmentation quality.
  • Aligned accuracy: accuracy on correctly segmented (aligned) tokens — measures parsing quality on tokens that were correctly identified by the segmenter.
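To make the difference between the two rows concrete, here is a simplified sketch of both views over character-offset token spans. This is our own illustration, not the official CoNLL 2018 evaluation script (which also handles multiword tokens and alignment details):

```python
def segmentation_f1(gold_spans, system_spans):
    """F1 over token spans: a system token counts as correct only if its
    exact (start, end) offsets appear in the gold segmentation."""
    matched = len(set(gold_spans) & set(system_spans))
    precision = matched / len(system_spans)
    recall = matched / len(gold_spans)
    return 2 * precision * recall / (precision + recall)

def aligned_accuracy(gold, system):
    """Accuracy of a label (e.g. UPOS) computed only on tokens whose spans
    the segmenter got right, so segmentation errors are excluded."""
    aligned = set(gold) & set(system)
    correct = sum(gold[span] == system[span] for span in aligned)
    return correct / len(aligned)

# Toy example: gold has three tokens; the system merges the last two,
# so only the first token aligns.
gold = {(0, 2): "ADJ", (2, 3): "PART", (3, 5): "NOUN"}
system = {(0, 2): "ADJ", (2, 5): "NOUN"}
print(round(segmentation_f1(gold.keys(), system.keys()), 2))  # 0.4
print(aligned_accuracy(gold, system))                         # 1.0
```

The toy numbers show how the full-text row can be dragged down by segmentation while the aligned row stays high, which is exactly the pattern in the Chinese tables below.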

Morphosyntactic Tagging

Metric            Tokens  Sentences  Words  UPOS   XPOS   UFeats  AllTags  Lemmas
Full-text (F1)     86.66      97.49  86.66  84.19  84.20   86.14    83.69   86.08
Aligned accuracy    0.00       0.00   0.00  97.15  97.17   99.40    96.57   99.33

Dependency Parsing

Metric            UAS    LAS    CLAS   MLAS   BLEX
Full-text (F1)    68.02  66.05  61.81  58.51  61.22
Aligned accuracy  78.49  76.21  75.20  71.19  74.49
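For readers unfamiliar with the attachment metrics, a hedged sketch of how UAS and LAS are computed (simplified; the official script additionally handles token alignment, and CLAS/MLAS/BLEX restrict scoring to content words):

```python
def uas_las(gold, pred):
    """gold/pred: per-token (head, deprel) pairs, aligned one-to-one.
    UAS counts correct heads; LAS requires head AND relation correct."""
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return uas, las

# Toy 3-token sentence: all heads correct, one relation label wrong.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
uas, las = uas_las(gold, pred)
print(round(uas, 2), round(las, 2))  # 1.0 0.67
```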

Usage

Install the library from PyPI (assuming you have a virtual environment created):

pip install combo-nlp
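If you do not have a virtual environment yet, one standard way to create and activate one (general Python practice on POSIX shells, not COMBO-specific):

```shell
# Create an isolated environment for COMBO and activate it
python3 -m venv .venv
. .venv/bin/activate
```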

Install the LAMBO segmenter (only needed when passing raw text strings to COMBO):

pip install --index-url https://pypi.clarin-pl.eu/ lambo

Then, in Python:

from combo import COMBO

# Load a pre-trained model with corresponding Lambo segmenter
nlp = COMBO("Chinese")

# Parse raw text (handles sentence splitting + tokenization)
result = nlp("敏捷的棕色狐狸跳过了懒狗。")

# Inspect results
for sentence in result:
    for token in sentence:
        print(f"{token.form:<15} {token.lemma:<15} {token.upos:<8} head={token.head}  {token.deprel}")
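Each token exposes `form`, `lemma`, `upos`, `head`, and `deprel`, so downstream code can walk the dependency tree directly. A minimal sketch using a stand-in token class (the attribute names mirror CoNLL-U fields; the 1-based `id` index is an assumption, and the real COMBO token objects may name it differently):

```python
from dataclasses import dataclass

@dataclass
class Token:
    id: int        # 1-based position in the sentence (assumed attribute)
    form: str
    head: int      # 0 means the token attaches to the artificial ROOT
    deprel: str

def dependency_triples(sentence):
    """Turn a parsed sentence into (head form, relation, dependent form)."""
    forms = {tok.id: tok.form for tok in sentence}
    forms[0] = "ROOT"
    return [(forms[tok.head], tok.deprel, tok.form) for tok in sentence]

# Toy parse of "狐狸 跳过 懒狗" ("the fox jumped over the lazy dog")
sent = [
    Token(1, "狐狸", 2, "nsubj"),
    Token(2, "跳过", 0, "root"),
    Token(3, "懒狗", 2, "obj"),
]
print(dependency_triples(sent))
# [('跳过', 'nsubj', '狐狸'), ('ROOT', 'root', '跳过'), ('跳过', 'obj', '懒狗')]
```

The same loop works on the `result` object above by substituting a COMBO sentence for `sent`.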

Refer to the COMBO-NLP documentation for further installation and usage instructions.

License

The training data is licensed under CC-BY-SA-4.0 and is derived from the Universal Dependencies treebanks. For the full license terms, refer to the LICENSE.txt file in the corresponding treebank repository.

Citation

If you use this model, please cite:

Ulewicz, M., Jabłońska, M., Klimaszewski, M., Przybyła, P., Pszenny, Ł., Rybak, P., Wiącek, M., & Wróblewska, A. (2026). COMBO-NLP Models Trained on UD v2.17. Zenodo. https://doi.org/10.5281/zenodo.19650523

@software{combo_nlp_2026,
  author    = {Ulewicz, Michał and Jabłońska, Maja and Klimaszewski, Mateusz and Przybyła, Piotr and Pszenny, Łukasz and Rybak, Piotr and Wiącek, Martyna and Wróblewska, Alina},
  title     = {{COMBO-NLP} Models Trained on {UD} v2.17},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19650523},
  url       = {https://doi.org/10.5281/zenodo.19650523}
}

Resources

  • Training dataset: UD_Chinese-GSDSimp (UD v2.17)
  • Model repository: clarin-pl/combo-nlp-xlm-roberta-base-chinese-gsdsimp-ud2.17