metadata
tags:
  - spacy
  - latin
  - sentence-segmentation
language:
  - la
license: mit
model-index:
  - name: la_senter
    results:
      - task:
          name: SENTS
          type: token-classification
        metrics:
          - name: Sentences F-Score
            type: f_score
            value: 0.9971
          - name: Sentences Precision
            type: precision
            value: 0.9966
          - name: Sentences Recall
            type: recall
            value: 0.9976

la_senter

Latin sentence segmentation model for LatinCy.

| Feature | Description |
| --- | --- |
| Name | la_senter |
| Version | 3.9.0 |
| spaCy | >=3.8.0,<3.9.0 |
| Default Pipeline | senter |
| Components | senter |
| Sources | UD_Latin-Perseus, UD_Latin-PROIEL, UD_Latin-ITTB, UD_Latin-LLCT, UD_Latin-UDante |
| License | MIT |
| Author | Patrick J. Burns |

Install

pip install https://huggingface.co/latincy/la_senter/resolve/main/la_senter-3.9.0-py3-none-any.whl

Usage

import spacy

nlp = spacy.load("la_senter")

doc = nlp("Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae.")
for sent in doc.sents:
    print(sent.text)
# Gallia est omnis divisa in partes tres.
# Quarum unam incolunt Belgae.

doc = nlp("Iphicles, frater Herculis, magna voce exclamavit; sed Hercules ipse, fortissimus puer, haudquaquam territus est.")
for sent in doc.sents:
    print(sent.text)
# Iphicles, frater Herculis, magna voce exclamavit;
# sed Hercules ipse, fortissimus puer, haudquaquam territus est.
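
For more than a handful of texts, spaCy's `nlp.pipe` processes inputs in batches instead of one call per text. The following is a minimal sketch, not part of this model's documented interface: `split_sentences` and its `batch_size` default are hypothetical names chosen for illustration, and the function assumes it is handed an already loaded pipeline.

```python
from typing import Any, Iterable, List


def split_sentences(nlp: Any, texts: Iterable[str], batch_size: int = 64) -> List[List[str]]:
    """Return one list of sentence strings per input text.

    `nlp` is assumed to be a loaded spaCy pipeline, e.g. spacy.load("la_senter").
    The helper name and the batch size are illustrative choices.
    """
    results = []
    # nlp.pipe yields Doc objects in input order and batches internally.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        results.append([sent.text for sent in doc.sents])
    return results
```

With the model installed, `split_sentences(spacy.load("la_senter"), texts)` should return the same sentences as looping over `doc.sents` for each text separately.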

What's new in 3.9.0

  • Case-insensitive segmentation: correctly handles lowercased and mixed-case input
  • Sentence splitting on semicolons and colons
  • Bracketed reference handling: a reference marker is not split from the text that follows it, so [2] O tempora... is treated as one sentence
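
One way to check the case-insensitive behaviour on your own text is to compare sentence boundaries before and after lowercasing. The sketch below is illustrative only: `sentence_starts` and `case_stable` are hypothetical helper names, `nlp` is assumed to be the loaded `la_senter` pipeline, and the comparison assumes that lowercasing does not change the length of the text (true for ordinary Latin-alphabet input).

```python
from typing import Any, List


def sentence_starts(nlp: Any, text: str) -> List[int]:
    """Character offsets at which the pipeline starts each sentence."""
    return [sent.start_char for sent in nlp(text).sents]


def case_stable(nlp: Any, text: str) -> bool:
    """True if lowercasing the input leaves every sentence boundary in place."""
    lowered = text.lower()
    if len(lowered) != len(text):
        # Offsets are not comparable if lowercasing changed the length.
        raise ValueError("lowercasing changed the text length")
    return sentence_starts(nlp, text) == sentence_starts(nlp, lowered)
```

Comparing character offsets instead of sentence strings keeps the check independent of the casing of the output itself.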

Accuracy

| Type | Score (%) |
| --- | --- |
| SENTS_F | 99.71 |
| SENTS_P | 99.66 |
| SENTS_R | 99.76 |

Evaluated on the held-out test split of the combined UD treebanks.
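
The three scores are related: assuming SENTS_F is the usual F1 measure, it is the harmonic mean of precision and recall. The short check below recomputes it from the reported SENTS_P and SENTS_R; `f_score` is a hypothetical helper written for this illustration, not a spaCy function.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported values from the Accuracy table, as percentages.
sents_p = 99.66
sents_r = 99.76

sents_f = f_score(sents_p, sents_r)
print(round(sents_f, 2))  # 99.71
```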

Intended use

Sentence segmentation of well-punctuated Latin text from digital editions, corpora, and scholarly sources. Not designed for punctuation-free text (scriptura continua).
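
Because the model is intended for well-punctuated text, it can help to screen input before segmenting it. The following is a rough heuristic sketch, not something the model provides: `punctuation_rate`, `looks_punctuated`, the set of marks, and the threshold are all assumptions chosen for illustration.

```python
SENTENCE_MARKS = ".;:?!"  # assumed set of marks; adjust to the edition's conventions


def punctuation_rate(text: str) -> float:
    """Sentence-level punctuation marks per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    marks = sum(text.count(mark) for mark in SENTENCE_MARKS)
    return marks / len(words)


def looks_punctuated(text: str, min_rate: float = 0.02) -> bool:
    """Heuristic: True if the text has at least `min_rate` marks per word."""
    return punctuation_rate(text) >= min_rate


print(looks_punctuated("Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae."))  # True
print(looks_punctuated("gallia est omnis divisa in partes tres quarum unam incolunt belgae"))  # False
```

Texts that fail such a check are candidates for manual review or a different tool, not for this pipeline.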

Training

Trained on the five Universal Dependencies Latin treebanks listed under Sources (Perseus, PROIEL, ITTB, LLCT, UDante) using spaCy's senter component.