kathnlp-xlmr · Katharevousa Greek dependency parser

kathnlp-xlmr is a fine-tuned XLM-RoBERTa base model for Universal-Dependencies-style morphological tagging and dependency parsing of Katharevousa Greek, the archaizing official register used in 20th-century Greek law, administration, and parliamentary discourse.

Results

All scores are computed on the fixed 340-sentence test split (4,093 tokens, seed 42) of the kathnlp Katharevousa treebank.

Metric                 Score
UPOS accuracy          0.8893
DEPREL F1 (weighted)   0.7250
UAS                    0.6098
LAS                    0.5162

The model outperforms every off-the-shelf Greek and Ancient Greek baseline tested in the paper; the strongest external system, spaCy Greek, reaches 0.4183 LAS against 0.5162 here. See the accompanying paper for the full benchmark, including mBERT, custom-trained Stanza, and the feature-based baseline.
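For readers less familiar with the attachment metrics, the sketch below shows how UAS and LAS are conventionally computed from aligned gold and predicted (head, label) pairs. It is a minimal illustration and not the paper's evaluation script; the function name attachment_scores and the toy data are hypothetical.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) over aligned token lists.

    UAS counts tokens whose predicted head matches the gold head;
    LAS additionally requires the dependency label to match.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    full_hits = sum(g == p for g, p in zip(gold, pred))
    return head_hits / len(gold), full_hits / len(gold)


# Toy example (invented data): 4 tokens, 3 correct heads, 2 correct head+label pairs.
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
pred = [(2, "det"), (3, "obj"), (0, "root"), (2, "obj")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=0.75 LAS=0.50
```

Because LAS requires both the head and the label to be right, it can never exceed UAS, which matches the ordering in the table above.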

Usage

The recommended loader is the kathnlp package, which downloads the weights and reconstructs the custom parser architecture in one call.

pip install git+https://github.com/gmikros/katharevousa-nlp-tooling.git

from kathnlp.hub import load_from_hub

parser = load_from_hub("gmikros/kathnlp-xlmr")  # add device="cuda" for GPU

text = "Ἡ Κυβέρνησις παρακαλεῖται νά ἀποδεχθῇ τό αἴτημα τοῦ χωρίου."
for tok in parser.parse(text):
    print(tok.id, tok.form, tok.upos, tok.head, tok.deprel)
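If you want the output in a tabular form, the token attributes printed above map directly onto CoNLL-U columns. The helper below is a rough sketch: to_conllu is a hypothetical name, it relies only on the five attributes shown in the loop (id, form, upos, head, deprel), and it writes "_" for every other column. The stand-in tokens and their annotations are invented so that the snippet runs without downloading the model.

```python
from types import SimpleNamespace
from typing import Iterable


def to_conllu(tokens: Iterable) -> str:
    """Render tokens as CoNLL-U-style rows, using "_" for columns not shown above."""
    rows = []
    for tok in tokens:
        # Column order: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        cols = [tok.id, tok.form, "_", tok.upos, "_", "_", tok.head, tok.deprel, "_", "_"]
        rows.append("\t".join(str(c) for c in cols))
    return "\n".join(rows) + "\n"


# Stand-in tokens (invented annotations) in place of parser.parse(text).
demo = [
    SimpleNamespace(id=1, form="Ἡ", upos="DET", head=2, deprel="det"),
    SimpleNamespace(id=2, form="Κυβέρνησις", upos="NOUN", head=0, deprel="root"),
]
print(to_conllu(demo), end="")
```

With the real parser, passing parser.parse(text) in place of demo should work as long as the returned tokens expose the same attributes.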

Raw artifacts

If you would rather load the weights manually:

from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="gmikros/kathnlp-xlmr")
# Use:
#   {local_dir}/encoder/                 — fine-tuned XLM-R encoder (HF format)
#   {local_dir}/tokenizer/               — XLM-R tokenizer (HF format)
#   {local_dir}/parser_heads.pt          — custom UPOS, arc, head, and relation heads
#   {local_dir}/metadata.json            — UPOS / DEPREL label maps and training config
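Before wiring the artifacts together by hand, it can help to look at what metadata.json actually contains. The sketch below only lists the top-level keys and makes no assumption about their names; top_level_keys is a hypothetical helper, not part of kathnlp.

```python
import json
from pathlib import Path


def top_level_keys(metadata_path):
    """Return the sorted top-level keys of a JSON metadata file."""
    with Path(metadata_path).open(encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, dict):
        raise TypeError("expected a JSON object at the top level")
    return sorted(data)


# Example, assuming `local_dir` comes from snapshot_download above:
# print(top_level_keys(Path(local_dir) / "metadata.json"))
```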

Architecture

The parser adds a small set of custom heads on top of the XLM-R encoder:

  • a UPOS classifier (linear over the encoder's hidden state),
  • two arc projections (dep_arc, head_arc) plus a root attention parameter to score every potential head for every word,
  • a relation classifier that consumes the head and dependent representations to predict the dependency label.

This is not a vanilla AutoModelForTokenClassification, which is why we ship a small loader (kathnlp.hub.load_from_hub) rather than relying on transformers auto-discovery.
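To make the arc-scoring idea concrete, here is a dependency-free sketch of how two projections and a root vector can score every candidate head for every word. It is an assumption-laden illustration, not the repository's implementation: the dot-product scoring, the way the root vector enters, the greedy decoding, and all names (arc_scores, greedy_heads, w_dep, w_head, root) are hypothetical, and the real heads operate on XLM-R hidden states rather than toy vectors.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def arc_scores(hidden: List[Vector], w_dep: Matrix, w_head: Matrix, root: Vector) -> List[List[float]]:
    """Score every candidate head for every word.

    Column 0 of each row is ROOT; column j (j >= 1) is word j as head.
    """
    deps = [matvec(w_dep, h) for h in hidden]    # word in its dependent role
    heads = [matvec(w_head, h) for h in hidden]  # word in its head role
    table = []
    for i, d in enumerate(deps):
        row = [dot(d, root)]
        for j, h in enumerate(heads):
            # A word cannot head itself, so mask the diagonal.
            row.append(float("-inf") if i == j else dot(d, h))
        table.append(row)
    return table


def greedy_heads(scores: List[List[float]]) -> List[int]:
    """Pick the best-scoring head per word (0 = ROOT); no tree constraint is enforced."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy input: three words with 2-dimensional "hidden states" and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [2.0, 2.0], [0.0, 1.0]]
print(greedy_heads(arc_scores(hidden, identity, identity, root=[1.0, 1.0])))  # [2, 0, 2]
```

In the toy run, words 1 and 3 attach to word 2, and word 2 attaches to ROOT. A relation classifier would then take each (dependent, chosen head) pair and predict the label.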

Training

  • Base model: xlm-roberta-base
  • Data: 1,357 training sentences from gmikros/kathnlp-treebank (seed 42).
  • Epochs: 3, batch size 4, learning rate 2 × 10⁻⁵, weight decay 0.01.
  • Max sequence length: 256 subword tokens.
  • Loss weights: UPOS 1.0, arc 1.8, relation 1.2.

The exact training script lives at scripts/train_transformer_parser.py in the repository.
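As a rough guide to what the loss weights mean, the sketch below combines three per-task losses using the weights listed above. It assumes a plain weighted sum, which is the most common arrangement but is not confirmed here; the training script is authoritative. The name combined_loss and the per-batch values are invented for illustration.

```python
# Weights taken from the Training list above.
LOSS_WEIGHTS = {"upos": 1.0, "arc": 1.8, "relation": 1.2}


def combined_loss(losses, weights=LOSS_WEIGHTS):
    """Weighted sum of per-task losses; assumes a plain linear combination."""
    missing = set(weights) - set(losses)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * losses[name] for name in weights)


# Invented per-batch values, for illustration only.
total = combined_loss({"upos": 0.40, "arc": 1.10, "relation": 0.90})
print(round(total, 2))
```

Under this reading, an error in arc prediction costs 1.8 times as much as an equally sized UPOS error, so training pushes hardest on head attachment.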

Limitations

  • Trained on 1,357 sentences from 1976–1977 parliamentary questions; transfer to other Katharevousa genres (decrees, newspapers, earlier legal texts) is untested.
  • Test split is small (340 sentences / 4,093 tokens).
  • Annotations are automatically validated rather than expert-adjudicated; an expert-reviewed v0.2 release is planned.
  • Sentence length must fit within 256 XLM-R subword tokens (~50–80 Greek words); longer inputs are truncated.
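Since over-long inputs are truncated rather than rejected, it may be worth checking sentence length before parsing. The helper below is a sketch under stated assumptions: will_be_truncated is a hypothetical name, the tokenizer is passed in as any callable that returns a list of subwords, and reserving two positions for XLM-R's <s> and </s> markers is a guess about how the 256-token limit is counted.

```python
from typing import Callable, List

MAX_SUBWORDS = 256  # limit stated in the Limitations list


def will_be_truncated(
    text: str,
    tokenize: Callable[[str], List[str]],
    reserved_special: int = 2,
) -> bool:
    """Return True if `text` would exceed the subword budget.

    `reserved_special` assumes two positions go to XLM-R's <s> and </s> markers;
    set it to 0 if the limit is meant to exclude special tokens.
    """
    return len(tokenize(text)) + reserved_special > MAX_SUBWORDS


# Whitespace splitting stands in for the real subword tokenizer here.
print(will_be_truncated("Ἡ Κυβέρνησις παρακαλεῖται", str.split))  # False
```

With the downloaded artifacts, the tokenize method of a Hugging Face tokenizer loaded from {local_dir}/tokenizer/ could be passed as the tokenize argument, so that the count reflects real XLM-R subwords rather than whitespace-separated words.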

License

Released under Apache 2.0.

Citation

@misc{mikrosfitsilis2026kathnlp,
  title         = {A Reproducible Universal Dependencies-Style Pipeline for
                   Katharevousa Greek Parliamentary Text},
  author        = {Mikros, George and Fitsilis, Fotios},
  year          = {2026},
  eprint        = {2605.22978},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2605.22978}
}