Ancient Greek POS Tagger (XLM-RoBERTa)

This model is a fine-tuned version of xlm-roberta-base for Part-of-Speech (POS) tagging on Ancient Greek. It was trained on the Universal Dependencies Ancient Greek Perseus treebank.

It predicts Universal POS (UPOS) tags (17-label tagset) and achieves an overall test accuracy of 91.38%.

Model Details

  • Base Model: xlm-roberta-base
  • Language: Ancient Greek (grc)
  • Task: Token Classification (Part-of-Speech Tagging)
  • Label Set: 17 Universal Dependencies POS tags (ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X)
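When working with the model programmatically, the tagset above maps to integer class ids. The sketch below shows one plausible mapping, assuming alphabetical order as listed; the authoritative mapping is stored in the model's `config.id2label` and should be read from there in practice.

```python
# The 17 UPOS tags as listed above. The id ordering here is an assumption
# for illustration; the real mapping lives in the model's config.id2label.
UPOS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]

id2label = dict(enumerate(UPOS_TAGS))
label2id = {tag: i for i, tag in id2label.items()}

print(len(UPOS_TAGS))  # 17
print(id2label[7])     # NOUN
```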

How to Use

from transformers import pipeline

# Load the POS tagging pipeline
nlp = pipeline(
    "token-classification",
    model="qbnguyen/ancient-greek-pos-xlmr",
    aggregation_strategy="first",
)

# Example: Opening line of the Iliad
text = "Μῆνιν ἄειδε θεά Πηληϊάδεω Ἀχιλῆος"
predictions = nlp(text)

for word in predictions:
    print(f"{word['word']:<15} -> {word['entity_group']}")

Expected output:

Μῆνιν           -> NOUN
ἄειδε           -> VERB
θεά             -> NOUN
Πηληϊάδεω       -> NOUN
Ἀχιλῆος         -> NOUN

Training Data

The model was fine-tuned on the UD_Ancient_Greek-Perseus dataset.
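UD treebanks such as UD_Ancient_Greek-Perseus are distributed in CoNLL-U format, where the surface form and UPOS tag sit in columns 2 and 4. A minimal sketch of extracting (token, UPOS) training pairs — the sample sentence and its annotations below are illustrative, not copied from the treebank:

```python
# Illustrative CoNLL-U fragment (not an actual treebank excerpt).
# Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
SAMPLE = """\
# text = Μῆνιν ἄειδε θεά
1\tΜῆνιν\tμῆνις\tNOUN\t_\t_\t2\tobj\t_\t_
2\tἄειδε\tἀείδω\tVERB\t_\t_\t0\troot\t_\t_
3\tθεά\tθεά\tNOUN\t_\t_\t2\tvocative\t_\t_
"""

def read_conllu(text):
    """Yield (form, upos) pairs, skipping comment and blank lines."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        yield cols[1], cols[3]  # FORM, UPOS

pairs = list(read_conllu(SAMPLE))
print(pairs)  # [('Μῆνιν', 'NOUN'), ('ἄειδε', 'VERB'), ('θεά', 'NOUN')]
```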

Training Procedure

Hyperparameters

  • Epochs: 5
  • Effective Batch Size: 16 (per-device batch size 2 × gradient accumulation steps 8)
  • Learning Rate: 2e-05
  • Max Sequence Length: 128
  • Optimizer: AdamW
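The effective batch size above follows from gradient accumulation: gradients from several small forward/backward passes are summed before a single optimizer step, so the optimizer effectively sees one larger batch. A quick check of the arithmetic:

```python
# With accumulation, the optimizer steps once every
# gradient_accumulation_steps micro-batches, so the effective batch is
# per_device_batch_size * gradient_accumulation_steps.
per_device_batch_size = 2
gradient_accumulation_steps = 8
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 16
```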

Evaluation Results

  • Overall Accuracy: 91.38%
  • Weighted Average F1: 92.18%
  • Macro Average F1: 73.68%

Per-Tag Metrics

Tag     Precision  Recall  F1-Score  Support
ADJ        0.7742  0.7617    0.7679     1796
ADP        0.9812  0.9881    0.9846     1423
ADV        0.9341  0.7602    0.8382     2481
AUX        0.9247  0.9214    0.9231      280
CCONJ      0.7611  0.8728    0.8131      668
DET        0.9979  0.9882    0.9930     2367
INTJ       0.9688  0.8857    0.9254       35
NOUN       0.9247  0.9361    0.9304     4489
NUM        0.0789  0.7500    0.1429        4
PART       0.0000  0.0000    0.0000        0
PRON       0.8624  0.8517    0.8570     1119
PUNCT      0.9991  0.9991    0.9991     2306
SCONJ      0.9161  0.9016    0.9088      315
VERB       0.9725  0.9659    0.9692     3406
X          0.0000  0.0000    0.0000        1

Note: PART and X had only 0 and 1 supporting examples, respectively, in the evaluation split, which accounts for their scores of 0.
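The gap between the weighted F1 (92.18%) and the macro F1 (73.68%) is explained by these rare tags: the macro average weighs every tag equally, so the zero and near-zero scores for PART, X, and NUM pull it down sharply, while the weighted average is dominated by high-support tags. Recomputing both averages from the table's 15 rows (PROPN and SYM do not appear, presumably because they had zero support) reproduces the reported numbers to rounding:

```python
# Per-tag (F1, support) pairs taken from the table above.
per_tag = {
    "ADJ": (0.7679, 1796), "ADP": (0.9846, 1423), "ADV": (0.8382, 2481),
    "AUX": (0.9231, 280), "CCONJ": (0.8131, 668), "DET": (0.9930, 2367),
    "INTJ": (0.9254, 35), "NOUN": (0.9304, 4489), "NUM": (0.1429, 4),
    "PART": (0.0, 0), "PRON": (0.8570, 1119), "PUNCT": (0.9991, 2306),
    "SCONJ": (0.9088, 315), "VERB": (0.9692, 3406), "X": (0.0, 1),
}

# Macro: unweighted mean over tags. Weighted: mean weighted by support.
macro_f1 = sum(f1 for f1, _ in per_tag.values()) / len(per_tag)
total_support = sum(s for _, s in per_tag.values())
weighted_f1 = sum(f1 * s for f1, s in per_tag.values()) / total_support

print(round(macro_f1, 4))     # 0.7368
print(round(weighted_f1, 4))  # 0.9218
```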

Framework Versions

  • Transformers
  • PyTorch
  • Datasets
  • Tokenizers