Model Overview

This repository contains a model fine-tuned from XLM-RoBERTa-Base for part-of-speech (POS) tagging and lemmatization of normalised Old Icelandic texts.

The model was trained on all available MENOTA texts edited and annotated by Andrea de Leeuw van Weenen (AM 132 fol., AM 519 a 4to, and AM 677 4to).

This amounts to around 75% of the currently available MENOTA texts that are normalised, lemmatized, and (at least partially) POS-tagged.

Model Details

  • Model ID: OICE-LePOS-XLM-R-UD
  • Language: Old Icelandic (non)
  • Task: Token classification: 1) UPOS tagging (17 classes), 2) lemmatization via edit-script seq2seq
  • Base Model: XLM-RoBERTa-Base
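
To illustrate what an edit script for lemmatization can look like, here is a minimal sketch. It assumes a simple suffix-based script (strip N characters from the end of the lowercased form, then append a suffix). The function names and the script format are hypothetical and are not taken from this repository; the model's actual edit scripts may be defined differently.

```python
from typing import Tuple

# Hypothetical suffix-based edit script: (chars to strip, suffix to append).
# This is an illustration only, not the format used by the model.


def make_edit_script(form: str, lemma: str) -> Tuple[int, str]:
    """Derive a script that turns the lowercased form into the lemma."""
    form = form.lower()  # simplification: proper nouns would need case handling
    prefix_len = 0
    for a, b in zip(form, lemma):
        if a != b:
            break
        prefix_len += 1
    # Everything after the shared prefix is stripped, then the lemma's tail is added.
    return len(form) - prefix_len, lemma[prefix_len:]


def apply_edit_script(form: str, script: Tuple[int, str]) -> str:
    """Apply a (strip, suffix) script to a word form."""
    strip, suffix = script
    form = form.lower()
    return form[: len(form) - strip] + suffix


script = make_edit_script("skapaði", "skapa")
print(script)
print(apply_edit_script("kallaði", script))
```

Framing lemmatization as predicting a script rather than the lemma string lets one learned class generalize across words: the script derived from "skapaði" also lemmatizes "kallaði".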

Training Data

  • Corpus size: 397668 word tokens across 27740 text chunks.

  • Training-validation-test split: 80-10-10.

  • Sources: AM 132 fol., AM 519 a 4to, and AM 677 4to, edited and annotated by Andrea de Leeuw van Weenen.
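
As a rough illustration of the 80-10-10 split over the 27740 chunks, the sketch below shuffles placeholder chunk IDs and cuts them into three parts. Whether the actual split was shuffled, stratified, or seeded is not stated here, so the `split_chunks` helper, the seed, and the use of integer IDs in place of real chunks are all assumptions; see the training notebook for the real procedure.

```python
import random


def split_chunks(chunks, seed=0):
    """Shuffle chunks and split them 80/10/10 (assumed procedure, not the author's)."""
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 8 // 10  # integer arithmetic avoids float rounding
    n_val = n // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Placeholder chunk IDs standing in for the 27740 text chunks.
train, val, test = split_chunks(range(27740))
print(len(train), len(val), len(test))
```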

Training

See the training notebook for details.

Performance Metrics

Category    Accuracy
POS         0.9928
Lemma       0.9394

(Personally, I think it does not perform as well as I would have expected on real-world "out-of-domain" data.)

Notes

MENOTA Tag   UD Tag   Notes
xVB          VERB
xNC          NOUN
xCU          CCONJ    Can be either CCONJ or SCONJ!
xAV          ADV
xAP          ADP
xPE          PRON
xDD          DET
xNP          PROPN
xAJ          ADJ
xRP          PART
xDP          DET
xVP          PART
xCC          CCONJ
xPI          PRON
xIM          PART
xNA          NUM
xCS          SCONJ
xPQ          PRON
xNO          NUM
xUA          -        Rarely used
xIT          INTJ
xPA          -        Occurs < 30 times
xEX          -        Occurs once

  • xCU is defined in the MENOTA guidelines as "used if the encoder (in rather unusual cases) cannot decide whether a word is a conjunction or a subjunction" (Ch. 11, 11.5.11). In UD it can be mapped to either CCONJ or SCONJ. This model maps it to CCONJ; users should keep this ambiguity in mind when using the model.

  • The model is currently under evaluation for out-of-domain performance.

  • More granular tags are a work in progress.
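
The tag mapping in the table above can be written as a lookup table. This is a sketch transcribed from the table; the names `MENOTA_TO_UD`, `UNMAPPED`, and `menota_to_ud` are hypothetical and are not taken from the repository's script. Tags for which the table lists no UD tag are left unmapped rather than guessed.

```python
# MENOTA -> UD mapping, transcribed from the table in this model card.
MENOTA_TO_UD = {
    "xVB": "VERB", "xNC": "NOUN", "xAV": "ADV", "xAP": "ADP",
    "xPE": "PRON", "xDD": "DET", "xNP": "PROPN", "xAJ": "ADJ",
    "xRP": "PART", "xDP": "DET", "xVP": "PART", "xCC": "CCONJ",
    "xPI": "PRON", "xIM": "PART", "xNA": "NUM", "xCS": "SCONJ",
    "xPQ": "PRON", "xNO": "NUM", "xIT": "INTJ",
    # Ambiguous in MENOTA (conjunction or subjunction); mapped to CCONJ here.
    "xCU": "CCONJ",
}

# Tags for which the table lists no UD tag.
UNMAPPED = {"xUA", "xPA", "xEX"}


def menota_to_ud(tag):
    """Return the UD tag for a MENOTA tag, or None if the table lists none."""
    return MENOTA_TO_UD.get(tag)


print(menota_to_ud("xCU"))
print(menota_to_ud("xUA"))
```

Keeping xCU on its own line makes it easy to switch the mapping to SCONJ if that fits a given use case better.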

Usage

Try it out in Google Colab (change runtime to GPU!).

# Install (the leading "!" runs a shell command in a Colab/Jupyter cell; omit it in a terminal)
!pip install torch transformers huggingface_hub scikit-learn tqdm pandas

# Import packages
from huggingface_hub import snapshot_download
import os

# Create the target directory (no error if it already exists)
os.makedirs("oice_lepos", exist_ok=True)

# Download model
snapshot_download("NKCZ/old-icelandic-lemma-pos-xlm-r-ud", local_dir="oice_lepos")

# Change to the directory with the model (in Colab this resolves to /content/oice_lepos)
os.chdir("oice_lepos")

# Import utils from the script
from OICE_LePOS_XLM_R_UD import OldIcelandicNLPTrainer

# Initialize the model
oice = OldIcelandicNLPTrainer.from_pretrained("oi_model")

# Predict from a word list
results = oice.predict(["Almáttigr", "guð", "skapaði", "himin", "ok", "jǫrð", "."]) #Example from Prose Edda
for result in results:
  print(result)

#{'word': 'Almáttigr', 'upos': 'ADJ', 'lemma': 'almáttigr'}
#{'word': 'guð', 'upos': 'NOUN', 'lemma': 'guð'}
#...

# Predict from a raw string
results = oice.predict_text("Almáttigr guð skapaði himin ok jǫrð ok alla þá hluti er þeim fylgja")
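
The predictions can be saved for later inspection. The sketch below writes a list of result dictionaries to tab-separated text, using the `word`, `upos`, and `lemma` keys shown in the example output above. The `results_to_tsv` helper is hypothetical, and it is an assumption that `predict_text` returns the same structure as `predict`.

```python
import csv
import io


def results_to_tsv(results):
    """Serialize a list of {'word', 'upos', 'lemma'} dicts as TSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["word", "upos", "lemma"])
    for r in results:
        writer.writerow([r["word"], r["upos"], r["lemma"]])
    return buf.getvalue()


# Sample results copied from the example output above.
sample = [
    {"word": "Almáttigr", "upos": "ADJ", "lemma": "almáttigr"},
    {"word": "guð", "upos": "NOUN", "lemma": "guð"},
]
print(results_to_tsv(sample))
```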