tags:
  - spacy
  - latin
  - sentence-segmentation
language:
  - la
license: mit
model-index:
  - name: la_senter
    results:
      - task:
          name: SENTS
          type: token-classification
        metrics:
          - name: Sentences F-Score
            type: f_score
            value: 0.9971
          - name: Sentences Precision
            type: precision
            value: 0.9966
          - name: Sentences Recall
            type: recall
            value: 0.9976

la_senter

Latin sentence segmentation model for LatinCy.

| Feature | Description |
| --- | --- |
| Name | `la_senter` |
| Version | `3.9.0` |
| spaCy | `>=3.8.0,<3.9.0` |
| Default Pipeline | `senter` |
| Components | `senter` |
| Sources | UD_Latin-Perseus, UD_Latin-PROIEL, UD_Latin-ITTB, UD_Latin-LLCT, UD_Latin-UDante |
| License | `MIT` |
| Author | Patrick J. Burns |

Install

pip install https://huggingface.co/latincy/la_senter/resolve/main/la_senter-3.9.0-py3-none-any.whl
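After installing, it can be useful to confirm that the package is visible to the current Python environment before calling `spacy.load`. spaCy pipeline wheels install as ordinary importable packages named after the model, so a minimal check (the helper name `is_installed` is hypothetical, not part of spaCy or LatinCy) might look like this:

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    # find_spec returns None when no top-level package of that name exists.
    return importlib.util.find_spec(package) is not None


# Prints True once the la_senter wheel has been installed, False otherwise.
print(is_installed("la_senter"))
```

If this prints `False`, the wheel was most likely installed into a different environment than the one running the script.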

Usage

import spacy

nlp = spacy.load("la_senter")

doc = nlp("Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae.")
for sent in doc.sents:
    print(sent.text)
# Gallia est omnis divisa in partes tres.
# Quarum unam incolunt Belgae.

doc = nlp("Iphicles, frater Herculis, magna voce exclamavit; sed Hercules ipse, fortissimus puer, haudquaquam territus est.")
for sent in doc.sents:
    print(sent.text)
# Iphicles, frater Herculis, magna voce exclamavit;
# sed Hercules ipse, fortissimus puer, haudquaquam territus est.
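When aligning sentences back to a digital edition, character offsets are often more useful than the sentence text alone. The sketch below collects them from `doc.sents` using the standard spaCy `Span` attributes `start_char`, `end_char`, and `text`. The helper names `sentence_offsets` and `demo` are illustrative only, and `demo` assumes that spaCy and `la_senter` are installed.

```python
from typing import List, Tuple


def sentence_offsets(doc) -> List[Tuple[int, int, str]]:
    """Return (start_char, end_char, text) for each sentence in a spaCy Doc."""
    return [(sent.start_char, sent.end_char, sent.text) for sent in doc.sents]


def demo() -> None:
    """Print sentence offsets for a short passage (needs spaCy and la_senter)."""
    import spacy  # imported here so the helper above works without spaCy

    nlp = spacy.load("la_senter")
    doc = nlp("Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae.")
    for start, end, text in sentence_offsets(doc):
        print(start, end, text)
```

Calling `demo()` prints one line per sentence; slicing the original string with `text[start:end]` recovers each sentence exactly.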

What's new in 3.9.0

- Case-insensitive segmentation: correctly handles lowercased and mixed-case input.
- Sentence splitting on semicolons and colons.
- Bracketed reference handling: `[2] O tempora...` is treated as one sentence.

Accuracy

| Type | Score (%) |
| --- | --- |
| `SENTS_F` | 99.71 |
| `SENTS_P` | 99.66 |
| `SENTS_R` | 99.76 |

Evaluated on the held-out test split of the combined UD treebanks.
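The three scores are related: the F-score is the harmonic mean of precision and recall. As a quick consistency check on the reported figures (the function name `f_score` is just for illustration):

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported values from the accuracy table, in percent.
sents_p = 99.66
sents_r = 99.76

sents_f = f_score(sents_p, sents_r)
print(round(sents_f, 2))  # 99.71
```

The result matches the reported `SENTS_F` to two decimal places.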

Intended use

Sentence segmentation of well-punctuated Latin text from digital editions, corpora, and scholarly sources. Not designed for punctuation-free text (scriptura continua).
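Because the model expects punctuated input, a caller may want a rough pre-check before segmenting. The following is a heuristic sketch only, not part of `la_senter`: the helper names and the `min_density` threshold of 0.02 are assumptions chosen for illustration and should be tuned for a given corpus.

```python
import re

# Marks that commonly end or divide sentences in edited Latin texts.
_SENTENCE_PUNCT = re.compile(r"[.;:?!]")


def punctuation_density(text: str) -> float:
    """Sentence-level punctuation marks per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return len(_SENTENCE_PUNCT.findall(text)) / len(words)


def looks_punctuated(text: str, min_density: float = 0.02) -> bool:
    """Heuristic: True if the text has at least min_density marks per word.

    The default threshold is an arbitrary illustrative value.
    """
    return punctuation_density(text) >= min_density


print(looks_punctuated("Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae."))
print(looks_punctuated("gallia est omnis divisa in partes tres quarum unam incolunt belgae"))
```

The first call prints `True` and the second `False`. Text that fails the check is a candidate for manual review or a different segmentation approach.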

Training

Trained on the five Universal Dependencies Latin treebanks listed under Sources, using spaCy's `senter` component.