Model Overview
This repository contains a model fine-tuned from XLM-RoBERTa-base for part-of-speech tagging and lemmatization of (normalised) Old Icelandic texts.
The model was trained on all the available MENOTA texts edited by Andrea de Leeuw van Weenen (AM 132 fol., AM 519 a 4to, and AM 677 4to).
These make up around 75% of the currently available MENOTA texts that are normalised, lemmatized, and (at least partially) POS-tagged.
Model Details
- Model ID: OICE-LePOS-XLM-R-UD
- Language: Old Icelandic (non)
- Tasks: 1) UPOS token classification (17 classes), 2) lemmatization via edit-script seq2seq
- Base Model: XLM-RoBERTa-base
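To illustrate what an edit script for lemmatization can look like: instead of generating the lemma character by character, a script describes how to transform the word form into its lemma. The script format this model actually uses is not documented in this card, so the sketch below assumes a simple suffix scheme (lowercase, strip N trailing characters, append a string). `EditScript`, `make_script`, and `apply_script` are hypothetical names, not part of the repository.

```python
from typing import NamedTuple


class EditScript(NamedTuple):
    """Assumed script format: strip trailing characters, then append a suffix."""
    strip: int   # number of trailing characters to remove from the form
    append: str  # string to add after stripping


def make_script(form: str, lemma: str) -> EditScript:
    """Derive a suffix edit script from a (form, lemma) training pair."""
    form = form.lower()
    # Length of the longest common prefix of form and lemma.
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return EditScript(strip=len(form) - i, append=lemma[i:])


def apply_script(form: str, script: EditScript) -> str:
    """Apply an edit script to a word form to obtain the lemma."""
    form = form.lower()
    return form[: len(form) - script.strip] + script.append


script = make_script("skapaði", "skapa")
print(script)                           # script learned from one pair
print(apply_script("kallaði", script))  # same script reused on another verb
print(apply_script("himin", make_script("himin", "himinn")))
```

The point of the design is that many word forms share the same script, so the lemmatizer can generalise to lemmas it has never seen.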
Training Data
Corpus size: 397,668 word tokens across 27,740 text chunks.
Training-validation-test split: 80-10-10.
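As a rough illustration of an 80-10-10 split over the text chunks, here is a minimal sketch. It assumes the split is done at chunk level after a seeded shuffle; the card does not say how the actual split was made, and `split_chunks` and the seed value are hypothetical.

```python
import random


def split_chunks(chunks, seed=42, train_pct=80, val_pct=10):
    """Shuffle chunks reproducibly and cut them into train/validation/test."""
    chunks = list(chunks)
    random.Random(seed).shuffle(chunks)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n_train = len(chunks) * train_pct // 100
    n_val = len(chunks) * val_pct // 100
    train = chunks[:n_train]
    val = chunks[n_train:n_train + n_val]
    test = chunks[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


# Chunk indices stand in for the 27,740 text chunks of the corpus.
train, val, test = split_chunks(range(27740))
print(len(train), len(val), len(test))
```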
Sources: AM 132 fol., AM 519 a 4to, and AM 677 4to, edited and annotated by Andrea de Leeuw van Weenen.
Training
See the training notebook for details.
Performance Metrics
| Category | Accuracy |
|---|---|
| POS | 0.9928 |
| Lemma | 0.9394 |
(Personally, I think it performs worse than I would have expected on real-world "out-of-domain" data.)
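For reference, accuracy figures of this kind are typically computed per token, as the share of tokens whose predicted label matches the gold label. The sketch below assumes that definition; the evaluation code in the training notebook may differ, and the labels used here are made-up toy data, not model output.

```python
def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy example (invented labels): one of seven tokens is tagged incorrectly.
gold_upos = ["ADJ", "NOUN", "VERB", "NOUN", "CCONJ", "NOUN", "PUNCT"]
pred_upos = ["ADJ", "NOUN", "VERB", "NOUN", "CCONJ", "PROPN", "PUNCT"]
print(round(accuracy(gold_upos, pred_upos), 4))
```

The same function works for lemma accuracy by passing lemma strings instead of UPOS labels.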
Notes
- MENOTA POS tags were mapped to Universal Dependencies universal part-of-speech tags (UPOS), listed here in order of their frequency in the training data:
| MENOTA Tag | UD Tag | Notes |
|---|---|---|
| xVB | VERB | |
| xNC | NOUN | |
| xCU | CCONJ | Can be either CCONJ or SCONJ! |
| xAV | ADV | |
| xAP | ADP | |
| xPE | PRON | |
| xDD | DET | |
| xNP | PROPN | |
| xAJ | ADJ | |
| xRP | PART | |
| xDP | DET | |
| xVP | PART | |
| xCC | CCONJ | |
| xPI | PRON | |
| xIM | PART | |
| xNA | NUM | |
| xCS | SCONJ | |
| xPQ | PRON | |
| xNO | NUM | |
| xUA | — | Rarely used |
| xIT | INTJ | |
| xPA | — | Occurs < 30 times |
| xEX | — | Occurs once |
xCU is defined in the MENOTA guidelines as "used if the encoder (in rather unusual cases) cannot decide whether a word is a conjunction or a subjunction" (Ch. 11, 11.5.11). It can be mapped to either CCONJ or SCONJ in UD. For this model, it is mapped to CCONJ, but users should be aware of this ambiguity when using the model.
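The mapping in the table can be written down as a plain lookup, which is handy when converting MENOTA-annotated data yourself. This is a sketch transcribed from the table, not code from the repository; `MENOTA_TO_UPOS`, `AMBIGUOUS_TAGS`, and `to_upos` are hypothetical names, and tags shown with no UD equivalent in the table are mapped to `None` here as an assumption.

```python
# Transcribed from the mapping table; None marks tags with no UD tag listed.
MENOTA_TO_UPOS = {
    "xVB": "VERB", "xNC": "NOUN", "xCU": "CCONJ", "xAV": "ADV",
    "xAP": "ADP", "xPE": "PRON", "xDD": "DET", "xNP": "PROPN",
    "xAJ": "ADJ", "xRP": "PART", "xDP": "DET", "xVP": "PART",
    "xCC": "CCONJ", "xPI": "PRON", "xIM": "PART", "xNA": "NUM",
    "xCS": "SCONJ", "xPQ": "PRON", "xNO": "NUM", "xUA": None,
    "xIT": "INTJ", "xPA": None, "xEX": None,
}

# xCU could equally be SCONJ; flag it so downstream code can treat it with care.
AMBIGUOUS_TAGS = {"xCU"}


def to_upos(menota_tag):
    """Return (upos, is_ambiguous) for a MENOTA tag; upos is None if unmapped."""
    return MENOTA_TO_UPOS.get(menota_tag), menota_tag in AMBIGUOUS_TAGS


print(to_upos("xCU"))
print(to_upos("xVB"))
print(to_upos("xEX"))
```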
- The model is currently being evaluated for out-of-domain performance.
- More granular tags are a work in progress.
Usage
Try it out in Google Colab (change runtime to GPU!).
# Install
pip install torch transformers huggingface_hub scikit-learn tqdm pandas
# Import packages
from huggingface_hub import snapshot_download
import os
os.makedirs("oice_lepos", exist_ok=True)  # does not fail if the directory already exists
# Download model
snapshot_download("NKCZ/old-icelandic-lemma-pos-xlm-r-ud", local_dir="oice_lepos")
# Change to the directory with the model (relative path, so it also works outside Colab)
os.chdir("oice_lepos")
# Import utils from the script
from OICE_LePOS_XLM_R_UD import OldIcelandicNLPTrainer
# Initialize the model
oice = OldIcelandicNLPTrainer.from_pretrained("oi_model")
# Predict from a word list
results = oice.predict(["Almáttigr", "guð", "skapaði", "himin", "ok", "jǫrð", "."]) #Example from Prose Edda
for result in results:
    print(result)
#{'word': 'Almáttigr', 'upos': 'ADJ', 'lemma': 'almáttigr'}
#{'word': 'guð', 'upos': 'NOUN', 'lemma': 'guð'}
#...
# Predict from a raw string
results = oice.predict_text("Almáttigr guð skapaði himin ok jǫrð ok alla þá hluti er þeim fylgja")
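Going by the commented output above, each prediction is a dict with `word`, `upos`, and `lemma` keys. Under that assumption, the results can be turned into tab-separated text for spreadsheets or corpus tools. `to_tsv` is a hypothetical helper, and `sample_results` simply repeats the two example rows shown above so the snippet runs without the model.

```python
def to_tsv(results):
    """Render prediction dicts as tab-separated text with a header row."""
    lines = ["word\tupos\tlemma"]
    for r in results:
        lines.append(f"{r['word']}\t{r['upos']}\t{r['lemma']}")
    return "\n".join(lines)


# Stand-in for real model output, copied from the example output above.
sample_results = [
    {"word": "Almáttigr", "upos": "ADJ", "lemma": "almáttigr"},
    {"word": "guð", "upos": "NOUN", "lemma": "guð"},
]
print(to_tsv(sample_results))
```

With the model loaded, passing the list returned by `predict` in place of `sample_results` should work the same way, provided the output keys match.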