Ancient Greek POS Tagger (XLM-RoBERTa)

This model is a fine-tuned version of xlm-roberta-base for Part-of-Speech (POS) tagging on Ancient Greek. It was trained on the Universal Dependencies Ancient Greek Perseus treebank.

It predicts Universal POS (UPOS) tags (17-label tagset) and achieves an overall test accuracy of 91.38%.

Model Details

  • Base Model: xlm-roberta-base
  • Language: Ancient Greek (grc)
  • Task: Token Classification (Part-of-Speech Tagging)
  • Label Set: 17 Universal Dependencies POS tags (ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X)
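When working with the model programmatically, the tagset above maps to integer class ids. The sketch below shows one plausible mapping, assuming alphabetical order as listed; the authoritative mapping is stored in the model's `config.id2label` and should be read from there in practice.

```python
# The 17 UPOS tags as listed above. The id ordering here is an assumption
# for illustration; the real mapping lives in the model's config.id2label.
UPOS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]

id2label = dict(enumerate(UPOS_TAGS))
label2id = {tag: i for i, tag in id2label.items()}

print(len(UPOS_TAGS))  # 17
print(id2label[7])     # NOUN
```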

How to Use

from transformers import pipeline

# Load the POS tagging pipeline
nlp = pipeline(
    "token-classification",
    model="qbnguyen/ancient-greek-pos-xlmr",
    aggregation_strategy="first",
)

# Example: Opening line of the Iliad
text = "Μῆνιν ἄειδε θεά Πηληϊάδεω Ἀχιλῆος"
predictions = nlp(text)

for word in predictions:
    print(f"{word['word']:<15} -> {word['entity_group']}")

Expected output:

Μῆνιν           -> NOUN
ἄειδε           -> VERB
θεά             -> NOUN
Πηληϊάδεω       -> NOUN
Ἀχιλῆος         -> NOUN

Training Data

The model was fine-tuned on the UD_Ancient_Greek-Perseus dataset.
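UD treebanks such as UD_Ancient_Greek-Perseus are distributed in CoNLL-U format, where the surface form and UPOS tag sit in columns 2 and 4. A minimal sketch of extracting (token, UPOS) training pairs — the sample sentence and its annotations below are illustrative, not copied from the treebank:

```python
# Illustrative CoNLL-U fragment (not an actual treebank excerpt).
# Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
SAMPLE = """\
# text = Μῆνιν ἄειδε θεά
1\tΜῆνιν\tμῆνις\tNOUN\t_\t_\t2\tobj\t_\t_
2\tἄειδε\tἀείδω\tVERB\t_\t_\t0\troot\t_\t_
3\tθεά\tθεά\tNOUN\t_\t_\t2\tvocative\t_\t_
"""

def read_conllu(text):
    """Yield (form, upos) pairs, skipping comment and blank lines."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        yield cols[1], cols[3]  # FORM, UPOS

pairs = list(read_conllu(SAMPLE))
print(pairs)  # [('Μῆνιν', 'NOUN'), ('ἄειδε', 'VERB'), ('θεά', 'NOUN')]
```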

Training Procedure

Hyperparameters

  • Epochs: 5
  • Effective Batch Size: 16 (per-device batch size 2 × gradient accumulation steps 8)
  • Learning Rate: 2e-05
  • Max Sequence Length: 128
  • Optimizer: AdamW
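The effective batch size above follows from gradient accumulation: gradients from several small forward/backward passes are summed before a single optimizer step, so the optimizer effectively sees one larger batch. A quick check of the arithmetic:

```python
# With accumulation, the optimizer steps once every
# gradient_accumulation_steps micro-batches, so the effective batch is
# per_device_batch_size * gradient_accumulation_steps.
per_device_batch_size = 2
gradient_accumulation_steps = 8
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 16
```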

Evaluation Results

  • Overall Accuracy: 91.38%
  • Weighted Average F1: 92.18%
  • Macro Average F1: 73.68%

Per-Tag Metrics

Tag     Precision  Recall  F1-Score  Support
ADJ        0.7742  0.7617    0.7679     1796
ADP        0.9812  0.9881    0.9846     1423
ADV        0.9341  0.7602    0.8382     2481
AUX        0.9247  0.9214    0.9231      280
CCONJ      0.7611  0.8728    0.8131      668
DET        0.9979  0.9882    0.9930     2367
INTJ       0.9688  0.8857    0.9254       35
NOUN       0.9247  0.9361    0.9304     4489
NUM        0.0789  0.7500    0.1429        4
PART       0.0000  0.0000    0.0000        0
PRON       0.8624  0.8517    0.8570     1119
PUNCT      0.9991  0.9991    0.9991     2306
SCONJ      0.9161  0.9016    0.9088      315
VERB       0.9725  0.9659    0.9692     3406
X          0.0000  0.0000    0.0000        1

Note: PART and X had only 0 and 1 supporting examples, respectively, in the evaluation split, which accounts for their scores of 0.
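The gap between the weighted F1 (92.18%) and the macro F1 (73.68%) is explained by these rare tags: the macro average weighs every tag equally, so the zero and near-zero scores for PART, X, and NUM pull it down sharply, while the weighted average is dominated by high-support tags. Recomputing both averages from the table's 15 rows (PROPN and SYM do not appear, presumably because they had zero support) reproduces the reported numbers to rounding:

```python
# Per-tag (F1, support) pairs taken from the table above.
per_tag = {
    "ADJ": (0.7679, 1796), "ADP": (0.9846, 1423), "ADV": (0.8382, 2481),
    "AUX": (0.9231, 280), "CCONJ": (0.8131, 668), "DET": (0.9930, 2367),
    "INTJ": (0.9254, 35), "NOUN": (0.9304, 4489), "NUM": (0.1429, 4),
    "PART": (0.0, 0), "PRON": (0.8570, 1119), "PUNCT": (0.9991, 2306),
    "SCONJ": (0.9088, 315), "VERB": (0.9692, 3406), "X": (0.0, 1),
}

# Macro: unweighted mean over tags. Weighted: mean weighted by support.
macro_f1 = sum(f1 for f1, _ in per_tag.values()) / len(per_tag)
total_support = sum(s for _, s in per_tag.values())
weighted_f1 = sum(f1 * s for f1, s in per_tag.values()) / total_support

print(round(macro_f1, 4))     # 0.7368
print(round(weighted_f1, 4))  # 0.9218
```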

Framework Versions

  • Transformers
  • PyTorch
  • Datasets
  • Tokenizers