metadata
tags:
- spacy
- latin
- sentence-segmentation
language:
- la
license: mit
model-index:
- name: la_senter
  results:
  - task:
      name: SENTS
      type: token-classification
    metrics:
    - name: Sentences F-Score
      type: f_score
      value: 0.9971
    - name: Sentences Precision
      type: precision
      value: 0.9966
    - name: Sentences Recall
      type: recall
      value: 0.9976
la_senter
Latin sentence segmentation model for LatinCy.
| Feature | Description |
|---|---|
| Name | la_senter |
| Version | 3.9.0 |
| spaCy | >=3.8.0,<3.9.0 |
| Default Pipeline | senter |
| Components | senter |
| Sources | UD_Latin-Perseus, UD_Latin-PROIEL, UD_Latin-ITTB, UD_Latin-LLCT, UD_Latin-UDante |
| License | MIT |
| Author | Patrick J. Burns |
Install
pip install https://huggingface.co/latincy/la_senter/resolve/main/la_senter-3.9.0-py3-none-any.whl
Usage
import spacy

nlp = spacy.load("la_senter")

doc = nlp("Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae.")
for sent in doc.sents:
    print(sent.text)
# Gallia est omnis divisa in partes tres.
# Quarum unam incolunt Belgae.

doc = nlp("Iphicles, frater Herculis, magna voce exclamavit; sed Hercules ipse, fortissimus puer, haudquaquam territus est.")
for sent in doc.sents:
    print(sent.text)
# Iphicles, frater Herculis, magna voce exclamavit;
# sed Hercules ipse, fortissimus puer, haudquaquam territus est.
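For more than a handful of texts, it may be convenient to stream them through the pipeline in batches. The sketch below wraps spaCy's `Language.pipe` in a small helper; `segment_texts` and its `batch_size` default are illustrative names and values, not part of this package, and `nlp` is assumed to be the pipeline returned by `spacy.load("la_senter")`.

```python
from typing import Iterable, Iterator, List


def segment_texts(nlp, texts: Iterable[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield one list of sentence strings per input text.

    `nlp` is assumed to be a loaded spaCy pipeline that sets sentence
    boundaries, e.g. the object returned by spacy.load("la_senter").
    """
    # Language.pipe processes the texts in batches, which is usually
    # faster than calling nlp(text) once per text in a loop.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        yield [sent.text for sent in doc.sents]
```

With the model installed, `list(segment_texts(spacy.load("la_senter"), texts))` would return the sentences for each text in input order.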
What's new in 3.9.0
- Case-insensitive segmentation: correctly handles lowercased and mixed-case input
- Sentence splitting on semicolons and colons
- Bracketed reference handling: `[2] O tempora...` is treated as one sentence
Accuracy
| Type | Score |
|---|---|
| SENTS_F | 99.71 |
| SENTS_P | 99.66 |
| SENTS_R | 99.76 |
Evaluated on the held-out test split of the combined UD treebanks.
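As a quick consistency check on the reported scores, the snippet below recomputes the F-score from precision and recall, assuming it is the usual harmonic mean of the two (F1). The helper name `f_score` is illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for this model.
sents_p = 0.9966
sents_r = 0.9976

sents_f = f_score(sents_p, sents_r)
print(f"{sents_f:.4f}")  # 0.9971
```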
Intended use
Sentence segmentation of well-punctuated Latin text from digital editions, corpora, and scholarly sources. Not designed for punctuation-free text (scriptura continua).
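Because the model expects punctuated input, it can help to screen texts before segmenting them. The following is a rough heuristic sketch, not part of the model: `looks_punctuated`, the set of punctuation marks, and the threshold of 60 tokens per mark are all assumptions chosen for illustration and would need tuning on real data.

```python
import re

# Marks treated as possible sentence boundaries (an assumption for this sketch).
SENTENCE_MARKS = re.compile(r"[.!?;:]")


def looks_punctuated(text: str, max_tokens_per_mark: int = 60) -> bool:
    """Return True if the text has at least one boundary mark per
    `max_tokens_per_mark` whitespace-separated tokens (hypothetical threshold)."""
    tokens = text.split()
    if not tokens:
        return False
    marks = len(SENTENCE_MARKS.findall(text))
    # No marks at all, or very long stretches without one, suggests
    # unpunctuated input such as scriptura continua.
    return marks > 0 and len(tokens) / marks <= max_tokens_per_mark
```

Texts that fail the check can be routed to manual review or a different segmentation strategy instead of being passed to the model.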
Training
Trained on five Universal Dependencies Latin treebanks using spaCy's senter component.