kathnlp-xlmr · Katharevousa Greek dependency parser
kathnlp-xlmr is a fine-tuned XLM-RoBERTa base model for Universal-Dependencies-style morphological tagging and dependency parsing of Katharevousa Greek, the archaizing official register used in 20th-century Greek law, administration, and parliamentary discourse.
- Paper: arXiv:2605.22978
- Code: https://github.com/gmikros/katharevousa-nlp-tooling
- Dataset: gmikros/kathnlp-treebank
- Status: v0.1 research preview; annotations automatically validated, philologist adjudication in progress.
Results
Evaluated on the fixed 340-sentence test split (seed 42) of the kathnlp Katharevousa treebank.
| Metric | Score |
|---|---|
| UPOS accuracy | 0.8893 |
| DEPREL F1 (weighted) | 0.7250 |
| UAS | 0.6098 |
| LAS | 0.5162 |
The parser outperforms every off-the-shelf Greek and Ancient Greek baseline tested in the paper; the strongest external system, spaCy Greek, reaches 0.4183 LAS against 0.5162 here. See the accompanying paper for the full benchmark, including mBERT, custom-trained Stanza, and the feature-based baseline.
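For readers less familiar with the attachment metrics in the table, the following sketch shows how UAS and LAS are conventionally defined over gold and predicted (head, label) pairs. It illustrates the standard definitions only and is not the evaluation script used in the paper; the function name `attachment_scores` and the toy data are hypothetical, and whether the paper excludes punctuation from scoring is not stated here.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label); head 0 denotes the root


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) over one flat list of tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")
    # UAS: the predicted head matches the gold head.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head and the dependency label match.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return uas_hits / len(gold), las_hits / len(gold)


# Toy example: 4 tokens, 3 correct heads, 2 of those with the correct label.
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
pred = [(2, "det"), (3, "obj"), (0, "root"), (2, "obj")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # prints UAS=0.75 LAS=0.50
```

LAS can never exceed UAS, since a labeled match requires the head to be correct first; the gap between the two rows of the table reflects tokens attached to the right head with the wrong relation.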
Usage
The recommended loader is the kathnlp package, which downloads the weights and reconstructs the custom parser architecture in one call.
pip install git+https://github.com/gmikros/katharevousa-nlp-tooling.git
from kathnlp.hub import load_from_hub
parser = load_from_hub("gmikros/kathnlp-xlmr") # add device="cuda" for GPU
text = "Ἡ Κυβέρνησις παρακαλεῖται νά ἀποδεχθῇ τό αἴτημα τοῦ χωρίου."
for tok in parser.parse(text):
    print(tok.id, tok.form, tok.upos, tok.head, tok.deprel)
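If the parse needs to be written out for UD tooling, one option is to serialize the tokens as CoNLL-U. The sketch below assumes only the five token attributes printed in the snippet above (`id`, `form`, `upos`, `head`, `deprel`), and further assumes that `head` is an integer with 0 for the root. `Tok` and `to_conllu` are hypothetical names introduced for illustration, with `Tok` standing in for the parser's real token type so the example runs without the model.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Tok:
    """Stand-in for a parser token; only the five attributes shown above are assumed."""
    id: int
    form: str
    upos: str
    head: int
    deprel: str


def to_conllu(tokens: Iterable, text: str = "") -> str:
    """Render tokens as one CoNLL-U sentence block (unknown columns become '_')."""
    lines = [f"# text = {text}"] if text else []
    for t in tokens:
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        cols = [str(t.id), t.form, "_", t.upos, "_", "_", str(t.head), t.deprel, "_", "_"]
        lines.append("\t".join(cols))
    return "\n".join(lines) + "\n"


# Hand-written toy tokens, not real parser output.
toy = [Tok(1, "Ἡ", "DET", 2, "det"), Tok(2, "Κυβέρνησις", "NOUN", 0, "root")]
print(to_conllu(toy, text="Ἡ Κυβέρνησις"))
```

With the real parser, `to_conllu(parser.parse(text), text=text)` would be the corresponding call, provided the token objects expose the attributes as assumed.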
Raw artifacts
If you would rather load the weights manually:
from huggingface_hub import snapshot_download
local_dir = snapshot_download(repo_id="gmikros/kathnlp-xlmr")
# Use:
# {local_dir}/encoder/ — fine-tuned XLM-R encoder (HF format)
# {local_dir}/tokenizer/ — XLM-R tokenizer (HF format)
# {local_dir}/parser_heads.pt — custom UPOS, arc, head, and relation heads
# {local_dir}/metadata.json — UPOS / DEPREL label maps and training config
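Before wiring the raw artifacts into custom code, it can help to confirm that the snapshot has the layout listed above. The sketch below is one way to do that; `check_snapshot` is a hypothetical helper, and because the key names inside metadata.json are not documented here, it only lists whatever top-level keys it finds. The demo runs against a fake temporary directory so it works offline.

```python
import json
import tempfile
from pathlib import Path

# Artifact names as listed above.
EXPECTED = ["encoder", "tokenizer", "parser_heads.pt", "metadata.json"]


def check_snapshot(local_dir: str) -> dict:
    """Report which expected artifacts exist and list metadata.json's top-level keys."""
    root = Path(local_dir)
    report = {name: (root / name).exists() for name in EXPECTED}
    meta_path = root / "metadata.json"
    if meta_path.is_file():
        with meta_path.open(encoding="utf-8") as fh:
            meta = json.load(fh)
        # Only list the keys; their names are not documented here.
        report["metadata_keys"] = sorted(meta) if isinstance(meta, dict) else []
    return report


# Demo on a fake directory so the sketch runs without downloading anything.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "encoder").mkdir()
    (Path(tmp) / "metadata.json").write_text(json.dumps({"example_key": 1}), encoding="utf-8")
    demo_report = check_snapshot(tmp)
print(demo_report)
```

With a real download, `check_snapshot(local_dir)` would take the path returned by `snapshot_download`.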
Architecture
The parser adds a small set of custom heads on top of the XLM-R encoder:
- a UPOS classifier (linear over the encoder's hidden state),
- two arc projections (dep_arc, head_arc) plus a root attention parameter to score every potential head for every word,
- a relation classifier that consumes the head and dependent representations to predict the dependency label.
This is not a vanilla AutoModelForTokenClassification, which is why we ship a small loader (kathnlp.hub.load_from_hub) rather than relying on transformers auto-discovery.
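To make the arc-scoring idea concrete, here is a minimal dependency-free sketch of how two projections and a root vector can score every candidate head for every word. It rests on loud assumptions: the scoring function is taken to be a plain dot product (the model card does not say whether it is dot-product, biaffine, or something else), decoding is a greedy argmax that does not guarantee a well-formed tree, and the relation classifier is left out. All names (`score_heads`, `greedy_heads`, `root_vec`) and numbers are hypothetical; the repository code is the authoritative implementation.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def score_heads(dep_vecs: List[Vector], head_vecs: List[Vector],
                root_vec: Vector) -> List[List[float]]:
    """scores[i][j]: score of candidate head j for word i; column 0 is the root."""
    scores = []
    for i, dep in enumerate(dep_vecs):
        row = [dot(dep, root_vec)]  # candidate 0: attach word i to the root
        for j, head in enumerate(head_vecs):
            # A word cannot be its own head, so mask the diagonal.
            row.append(float("-inf") if i == j else dot(dep, head))
        scores.append(row)
    return scores


def greedy_heads(scores: List[List[float]]) -> List[int]:
    """Pick the highest-scoring head per word (0 = root, k = k-th word, 1-based)."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy 2-dimensional projections for a three-word sentence (made-up numbers).
dep_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.5]]
head_vecs = [[0.2, 0.1], [2.0, 0.0], [0.0, 0.5]]
root_vec = [0.0, 3.0]
heads = greedy_heads(score_heads(dep_vecs, head_vecs, root_vec))
print(heads)  # prints [2, 0, 2]
```

In this toy case the second word attaches to the root and the other two attach to it. In a real model the vectors would come from applying the two projections to the encoder's hidden states.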
Training
- Base model: xlm-roberta-base
- Data: 1,357 training sentences from gmikros/kathnlp-treebank (seed 42).
- Epochs: 3, batch size 4, learning rate 2 × 10⁻⁵, weight decay 0.01.
- Max sequence length: 256 subword tokens.
- Loss weights: UPOS 1.0, arc 1.8, relation 1.2.
The exact training script lives at scripts/train_transformer_parser.py in the repository.
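As a rough illustration of what the loss weights above imply, the sketch below combines three per-task losses into one objective. It assumes the weights enter as a simple weighted sum, which the model card does not state explicitly. `LossWeights` and `combined_loss` are hypothetical names, and the training script remains the authoritative reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Loss weights as listed in the Training section."""
    upos: float = 1.0
    arc: float = 1.8
    rel: float = 1.2


def combined_loss(upos_loss: float, arc_loss: float, rel_loss: float,
                  w: LossWeights = LossWeights()) -> float:
    """Weighted sum of the three task losses (assumed combination rule)."""
    return w.upos * upos_loss + w.arc * arc_loss + w.rel * rel_loss


# Made-up per-batch loss values, for illustration only.
total = combined_loss(upos_loss=0.5, arc_loss=1.0, rel_loss=2.0)
print(round(total, 4))
```

Under this reading, an error in head attachment costs 1.8 times as much as an equally sized UPOS error, which pushes training toward the arc scorer.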
Limitations
- Trained on 1,357 sentences from 1976–1977 parliamentary questions; transfer to other Katharevousa genres (decrees, newspapers, earlier legal texts) is untested.
- Test split is small (340 sentences / 4,093 tokens).
- Annotations are automatically validated rather than expert-adjudicated; an expert-reviewed v0.2 release is planned.
- Sentence length must fit within 256 XLM-R subword tokens (~50–80 Greek words); longer inputs are truncated.
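Because over-long inputs are truncated rather than rejected, it may be worth flagging them before parsing. The sketch below is one way to do so with a pluggable subword counter. `flag_long_sentences` and `rough_count` are hypothetical helpers, and the 4-subwords-per-word estimate is only a crude stand-in consistent with the ~50–80 word range quoted above. In practice the counter could be something like `lambda s: len(tokenizer(s)["input_ids"])` using the shipped XLM-R tokenizer, though whether the 256 limit includes special tokens is not stated here.

```python
from typing import Callable, Iterable, List, Tuple

MAX_SUBWORDS = 256  # limit stated in the Limitations section


def flag_long_sentences(sentences: Iterable[str],
                        count_subwords: Callable[[str], int],
                        limit: int = MAX_SUBWORDS) -> List[Tuple[int, int]]:
    """Return (index, subword_count) for every sentence that exceeds the limit."""
    flagged = []
    for idx, sent in enumerate(sentences):
        n = count_subwords(sent)
        if n > limit:
            flagged.append((idx, n))
    return flagged


def rough_count(sentence: str) -> int:
    """Crude stand-in: assume about 4 subwords per whitespace-separated word."""
    return 4 * len(sentence.split())


sents = ["Ἡ Κυβέρνησις παρακαλεῖται.", " ".join(["λέξις"] * 100)]
too_long = flag_long_sentences(sents, rough_count)
print(too_long)  # prints [(1, 400)]
```

Flagged sentences can then be split at clause boundaries or reviewed by hand, since truncating them silently would leave their trailing tokens unparsed.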
License
Released under Apache 2.0.
Citation
@misc{mikrosfitsilis2026kathnlp,
title = {A Reproducible Universal Dependencies-Style Pipeline for
Katharevousa Greek Parliamentary Text},
author = {Mikros, George and Fitsilis, Fotios},
year = {2026},
eprint = {2605.22978},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2605.22978}
}