# Ancient Greek POS Tagger (XLM-RoBERTa)
This model is a fine-tuned version of xlm-roberta-base for Part-of-Speech (POS) tagging on Ancient Greek. It was trained on the Universal Dependencies Ancient Greek Perseus treebank.
It predicts Universal POS (UPOS) tags (17-label tagset) and achieves an overall test accuracy of 91.38%.
## Model Details

- Base Model: `xlm-roberta-base`
- Language: Ancient Greek (`grc`)
- Task: Token Classification (Part-of-Speech Tagging)
- Label Set: 17 Universal Dependencies POS tags (ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X)
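For token classification, each of the 17 tags above corresponds to an integer class id in the model's config. A minimal sketch of how such a mapping is built (the id order here is illustrative; the model's actual `config.json` is authoritative):

```python
# The 17 UPOS tags from the Universal Dependencies tagset, as listed above.
UPOS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]

# label2id / id2label dictionaries of the kind stored in a model config.
label2id = {tag: i for i, tag in enumerate(UPOS_TAGS)}
id2label = {i: tag for tag, i in label2id.items()}

print(label2id["NOUN"], id2label[16])
```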
## How to Use

```python
from transformers import pipeline

# Load the POS tagging pipeline
nlp = pipeline(
    "token-classification",
    model="your-username/your-model-name",
    aggregation_strategy="first",
)

# Example: opening line of the Iliad
text = "Μῆνιν ἄειδε θεά Πηληϊάδεω Ἀχιλῆος"
predictions = nlp(text)

for word in predictions:
    print(f"{word['word']:<15} -> {word['entity_group']}")
```
Expected output:

```
Μῆνιν           -> NOUN
ἄειδε           -> VERB
θεά             -> NOUN
Πηληϊάδεω       -> NOUN
Ἀχιλῆος         -> NOUN
```
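The `aggregation_strategy="first"` option merges subword pieces back into whole words and keeps the tag predicted for each word's first piece. A pure-Python sketch of that grouping logic (the pieces, word ids, and per-piece tags below are made up for illustration, not actual model output):

```python
# Hypothetical subword pieces for "Μῆνιν ἄειδε", with the word each
# piece belongs to and a per-piece predicted tag.
pieces = [
    {"token": "▁Μῆ",  "word_id": 0, "tag": "NOUN"},
    {"token": "νιν",  "word_id": 0, "tag": "X"},     # continuation piece, ignored
    {"token": "▁ἄει", "word_id": 1, "tag": "VERB"},
    {"token": "δε",   "word_id": 1, "tag": "PART"},  # continuation piece, ignored
]

def aggregate_first(pieces):
    """Group pieces by word_id; keep the tag of each word's first piece."""
    words = []
    for p in pieces:
        if words and words[-1]["word_id"] == p["word_id"]:
            # Continuation piece: extend the surface form, keep the first tag.
            words[-1]["word"] += p["token"].lstrip("▁")
        else:
            words.append({"word_id": p["word_id"],
                          "word": p["token"].lstrip("▁"),
                          "entity_group": p["tag"]})
    return words

print(aggregate_first(pieces))
```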
## Training Data
The model was fine-tuned on the UD_Ancient_Greek-Perseus dataset.
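UD treebanks such as Perseus are distributed in CoNLL-U format, where column 2 holds the word form and column 4 the UPOS tag. A minimal parser sketch (the two-token sentence below is illustrative, not taken from the treebank):

```python
# A tiny CoNLL-U fragment: comment line, then one 10-column row per token.
sample = "\n".join([
    "# text = Μῆνιν ἄειδε",
    "1\tΜῆνιν\tμῆνις\tNOUN\t_\t_\t0\troot\t_\t_",
    "2\tἄειδε\tἀείδω\tVERB\t_\t_\t1\t_\t_\t_",
])

def read_conllu(text):
    """Extract (form, upos) pairs, skipping comments and blank lines."""
    pairs = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        pairs.append((cols[1], cols[3]))  # FORM and UPOS columns
    return pairs

print(read_conllu(sample))
```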
## Training Procedure

### Hyperparameters
- Epochs: 5
- Effective Batch Size: 16 (Batch Size 2 × Gradient Accumulation 8)
- Learning Rate: 2e-05
- Max Sequence Length: 128
- Optimizer: AdamW
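With subword tokenization, a word may split into several pieces but carries only one gold tag. The standard preprocessing for token-classification fine-tuning labels the first piece of each word and masks the rest with `-100` so the loss ignores them. A self-contained sketch, taking a precomputed `word_ids` list of the kind returned by a fast tokenizer's `word_ids()` method:

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Assign each piece a label: the word's label on its first piece,
    ignore_index on continuation pieces and special tokens."""
    labels, prev = [], None
    for wid in word_ids:
        if wid is None:          # special token (e.g. <s>, </s>, padding)
            labels.append(ignore_index)
        elif wid != prev:        # first piece of a new word
            labels.append(word_labels[wid])
        else:                    # continuation piece of the same word
            labels.append(ignore_index)
        prev = wid
    return labels

# Two words, the second split into two pieces; 7 and 15 are hypothetical
# class ids for NOUN and VERB.
print(align_labels([None, 0, 1, 1, None], [7, 15]))
```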
## Evaluation Results
- Overall Accuracy: 91.38%
- Weighted Average F1: 92.18%
- Macro Average F1: 73.68%
### Per-Tag Metrics
| Tag | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| ADJ | 0.7742 | 0.7617 | 0.7679 | 1796 |
| ADP | 0.9812 | 0.9881 | 0.9846 | 1423 |
| ADV | 0.9341 | 0.7602 | 0.8382 | 2481 |
| AUX | 0.9247 | 0.9214 | 0.9231 | 280 |
| CCONJ | 0.7611 | 0.8728 | 0.8131 | 668 |
| DET | 0.9979 | 0.9882 | 0.9930 | 2367 |
| INTJ | 0.9688 | 0.8857 | 0.9254 | 35 |
| NOUN | 0.9247 | 0.9361 | 0.9304 | 4489 |
| NUM | 0.0789 | 0.7500 | 0.1429 | 4 |
| PART | 0.0000 | 0.0000 | 0.0000 | 0 |
| PRON | 0.8624 | 0.8517 | 0.8570 | 1119 |
| PUNCT | 0.9991 | 0.9991 | 0.9991 | 2306 |
| SCONJ | 0.9161 | 0.9016 | 0.9088 | 315 |
| VERB | 0.9725 | 0.9659 | 0.9692 | 3406 |
| X | 0.0000 | 0.0000 | 0.0000 | 1 |
Note: PART and X had support of 0 and 1, respectively, in the evaluation split, so their scores are reported as 0; these zero rows account for much of the gap between the weighted and macro F1 averages.
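The gap between the two averages follows directly from how they are defined: weighted F1 scales each class by its support, while macro F1 counts every class equally, so the zero-score PART and X rows and the tiny NUM class pull the macro average down. Recomputing both from the per-tag table above:

```python
# (tag, f1, support) copied from the per-tag metrics table.
rows = [
    ("ADJ", 0.7679, 1796), ("ADP", 0.9846, 1423), ("ADV", 0.8382, 2481),
    ("AUX", 0.9231, 280), ("CCONJ", 0.8131, 668), ("DET", 0.9930, 2367),
    ("INTJ", 0.9254, 35), ("NOUN", 0.9304, 4489), ("NUM", 0.1429, 4),
    ("PART", 0.0, 0), ("PRON", 0.8570, 1119), ("PUNCT", 0.9991, 2306),
    ("SCONJ", 0.9088, 315), ("VERB", 0.9692, 3406), ("X", 0.0, 1),
]

# Macro: unweighted mean over classes; weighted: mean scaled by support.
macro_f1 = sum(f1 for _, f1, _ in rows) / len(rows)
total = sum(s for _, _, s in rows)
weighted_f1 = sum(f1 * s for _, f1, s in rows) / total

print(f"macro F1:    {macro_f1:.4f}")   # ~0.7368, matching the reported 73.68%
print(f"weighted F1: {weighted_f1:.4f}")  # ~0.9218, matching the reported 92.18%
```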
## Framework Versions
- Transformers
- PyTorch
- Datasets
- Tokenizers