Model Overview

This repository contains a model fine-tuned from XLM-RoBERTa-base for part-of-speech tagging and lemmatization of normalised Old Icelandic texts.

The model was trained on all available MENOTA texts edited by Andrea de Leeuw van Weenen (AM 132 fol., AM 519 a 4to, and AM 677 4to).

This is around 75% of the currently available MENOTA texts that are normalised, lemmatized, and (at least partially) POS-tagged.

Model Details

Model ID: OICE-LePOS-XLM-R-UD
Language: Old Icelandic (non)
Task: Token classification with two outputs: 1) UPOS tagging (17 classes), 2) lemmatization via edit-script prediction (seq2seq)
Base Model: XLM-RoBERTa-base
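
The model card does not document the exact edit-script format. As a rough illustration of the idea, a lemma can be predicted as a small transformation of the word form instead of being generated character by character. The sketch below assumes a simple suffix-based script (strip N trailing characters, then append a suffix); the function names are hypothetical and the model's actual scripts may differ.

```python
def make_edit_script(form: str, lemma: str) -> tuple[int, str]:
    """Return (number of trailing characters to strip, suffix to append).

    Assumption: a suffix-only script computed on the lowercased form.
    This is an illustrative scheme, not the model's documented format.
    """
    form_l = form.lower()
    # Length of the longest common prefix of the lowercased form and the lemma.
    common = 0
    for a, b in zip(form_l, lemma):
        if a != b:
            break
        common += 1
    return len(form_l) - common, lemma[common:]


def apply_edit_script(form: str, script: tuple[int, str]) -> str:
    """Apply a (strip, suffix) script to a word form to obtain its lemma."""
    strip, suffix = script
    base = form.lower()
    return base[: len(base) - strip] + suffix


script = make_edit_script("skapaði", "skapa")
print(script)                                # (2, '')
print(apply_edit_script("skapaði", script))  # skapa
```

Framing lemmatization this way turns it into a classification problem over a limited set of scripts, which is why it can share an encoder with the UPOS tagger.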

Training Data

Corpus size: 397,668 word tokens across 27,740 text chunks.
Training-validation-test split: 80-10-10.
Sources: AM 132 fol., AM 519 a 4to, and AM 677 4to, edited and annotated by Andrea de Leeuw van Weenen.
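
For orientation, an 80-10-10 split over the chunk count above works out as follows. This is a minimal sketch that assumes the split was made at chunk level after shuffling; the seed and the `split_chunks` helper are hypothetical, and the training notebook remains the authoritative source.

```python
import random


def split_chunks(chunks, seed=0):
    """Shuffle chunks and split them 80/10/10 into train, validation, test.

    Assumption: chunk-level split with a fixed seed; the seed is illustrative.
    """
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = n * 8 // 10  # first 80%
    val_end = n * 9 // 10    # next 10%; the remainder is the test set
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


# Integers stand in for the 27,740 text chunks.
train, val, test = split_chunks(range(27740))
print(len(train), len(val), len(test))  # 22192 2774 2774
```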

Training

See the training notebook for details.

Performance Metrics

Category	Accuracy
POS	0.9928
Lemma	0.9394
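
The card does not state how these scores were computed. Assuming they are plain token-level accuracies (the share of tokens whose predicted label exactly matches the gold label), the calculation would look like the sketch below; `token_accuracy` is a hypothetical helper, not part of the released script.

```python
def token_accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


gold_tags = ["NOUN", "VERB", "ADJ", "NOUN"]
pred_tags = ["NOUN", "VERB", "ADV", "NOUN"]
print(token_accuracy(gold_tags, pred_tags))  # 0.75
```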

(Personally, I think it does not perform as well as I expected on real-world out-of-domain data.)

Notes

MENOTA POS tags were mapped to Universal Dependencies universal part-of-speech tags (UPOS), listed here in descending order of frequency in the training data:

MENOTA Tag	UD Tag	Notes
xVB	VERB
xNC	NOUN
xCU	CCONJ	Can be either CCONJ or SCONJ!
xAV	ADV
xAP	ADP
xPE	PRON
xDD	DET
xNP	PROPN
xAJ	ADJ
xRP	PART
xDP	DET
xVP	PART
xCC	CCONJ
xPI	PRON
xIM	PART
xNA	NUM
xCS	SCONJ
xPQ	PRON
xNO	NUM
xUA	—	Rarely used
xIT	INTJ
xPA	—	Occurs < 30 times
xEX	—	Occurs once

xCU is defined in the MENOTA guidelines as "used if the encoder (in rather unusual cases) cannot decide whether a word is a conjunction or a subjunction" (Ch. 11, 11.5.11). It can be mapped to either CCONJ or SCONJ in UD. For this model it is mapped to CCONJ, so users should keep this ambiguity in mind when interpreting CCONJ predictions.
The model is currently under evaluation for out-of-domain performance.
Support for more granular tags is a work in progress.
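
The mapping in the table above can be written as a simple lookup, which is handy when converting other MENOTA-annotated data for comparison with the model's output. This is a sketch only: the `menota_to_upos` helper and its `xcu_as` parameter are hypothetical, and returning `None` for the unmapped tags (xUA, xPA, xEX) is an assumption, since the card does not say how those tokens were handled in training.

```python
from typing import Optional

# MENOTA tag -> UPOS tag, following the table in this model card.
# None marks tags the card lists without a UD equivalent.
MENOTA_TO_UPOS = {
    "xVB": "VERB", "xNC": "NOUN", "xCU": "CCONJ", "xAV": "ADV",
    "xAP": "ADP", "xPE": "PRON", "xDD": "DET", "xNP": "PROPN",
    "xAJ": "ADJ", "xRP": "PART", "xDP": "DET", "xVP": "PART",
    "xCC": "CCONJ", "xPI": "PRON", "xIM": "PART", "xNA": "NUM",
    "xCS": "SCONJ", "xPQ": "PRON", "xNO": "NUM", "xIT": "INTJ",
    "xUA": None, "xPA": None, "xEX": None,
}


def menota_to_upos(tag: str, xcu_as: str = "CCONJ") -> Optional[str]:
    """Map a MENOTA tag to UPOS; xCU can optionally be sent to SCONJ."""
    if tag == "xCU":
        if xcu_as not in ("CCONJ", "SCONJ"):
            raise ValueError("xcu_as must be 'CCONJ' or 'SCONJ'")
        return xcu_as
    # Unknown or unmapped tags yield None (an assumption of this sketch).
    return MENOTA_TO_UPOS.get(tag)


print(menota_to_upos("xNC"))                  # NOUN
print(menota_to_upos("xCU"))                  # CCONJ
print(menota_to_upos("xCU", xcu_as="SCONJ"))  # SCONJ
```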

Usage

Try it out in Google Colab (change runtime to GPU!).

# Install (in a notebook cell such as Colab, prefix the command with "!")
pip install torch transformers huggingface_hub scikit-learn tqdm pandas

# Import packages
from huggingface_hub import snapshot_download
import os
os.makedirs("oice_lepos", exist_ok=True)  # does not fail if the directory already exists

# Download model
snapshot_download("NKCZ/old-icelandic-lemma-pos-xlm-r-ud", local_dir="oice_lepos")

# Change to the directory with the model (a relative path also works outside Colab)
os.chdir("oice_lepos")

# Import utils from the script
from OICE_LePOS_XLM_R_UD import OldIcelandicNLPTrainer

# Initialize the model
oice = OldIcelandicNLPTrainer.from_pretrained("oi_model")

# Predict from a word list
results = oice.predict(["Almáttigr", "guð", "skapaði", "himin", "ok", "jǫrð", "."]) #Example from Prose Edda
for result in results:
  print(result)

#{'word': 'Almáttigr', 'upos': 'ADJ', 'lemma': 'almáttigr'}
#{'word': 'guð', 'upos': 'NOUN', 'lemma': 'guð'}
#...

# Predict from a raw string
results = oice.predict_text("Almáttigr guð skapaði himin ok jǫrð ok alla þá hluti er þeim fylgja")
