library_name: transformers
license: apache-2.0
base_model: google/byt5-small
language:
  - sw
  - zu
  - lg
  - ny
  - sn
  - kg
  - ki
  - kam
  - suk
  - mer
  - ln
  - nso
  - xh
  - nyf
  - rn
  - rw
tags:
  - bantu
  - morphology
  - multilingual
  - low-resource
  - character-level
  - byt5
  - african-languages
  - swahili
  - zulu
  - kikuyu
datasets:
  - mutisya/bantu-words-26-03-v3.5

BantuMorph v7

BantuMorph is a character-level transformer for morphological analysis across 16 Bantu languages. Given a word in any of the supported languages, it can jointly extract the lemma and morphological features, segment the word into morphemes, lemmatize it, or predict its noun class. It can also generate an inflected form from a lemma plus features.

The model is trained on 80,765 morphological paradigms across the 16 languages and operates directly on UTF-8 bytes rather than on a subword vocabulary, so the agglutinative morphology of Bantu languages is never broken up by word-piece tokenization artifacts.
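
To make "byte-level" concrete, the sketch below maps a word to one ID per UTF-8 byte. The `+ 3` offset and the trailing end-of-sequence ID `1` follow the standard ByT5 tokenizer, which reserves IDs 0 to 2 for special tokens; that this checkpoint keeps the default tokenizer configuration is an assumption, so compare against `tokenizer(word).input_ids` before relying on it.

```python
def byte_level_ids(word: str, offset: int = 3, eos_id: int = 1) -> list[int]:
    """Map a word to ByT5-style IDs: one ID per UTF-8 byte, shifted by `offset`.

    `offset` and `eos_id` mirror the stock ByT5 tokenizer (assumed, not read
    from this checkpoint).
    """
    return [b + offset for b in word.encode("utf-8")] + [eos_id]


ids = byte_level_ids("ninasoma")
print(len(ids))  # 9: eight bytes plus the end-of-sequence ID
```

Because every byte is its own token, sequence length grows with word length in bytes, not with the number of morphemes.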

Quick summary

Property	Value
Architecture	ByT5-small (encoder-decoder, byte-level)
Parameters	300M
Languages	16 Bantu languages
Tasks	5 (extract, segment, lemmatize, nounclass, complete)
Base model	`google/byt5-small`
License	Apache-2.0

Languages

Code	Language	Guthrie zone	Approx. speakers (M)
swh	Swahili	G42	200
zul	Zulu	S42	12
xho	Xhosa	S41	8
sna	Shona	S10	9
nso	N. Sotho	S32	4
nya	Chichewa	N31	14
kik	Kikuyu	E51	8
kam	Kamba	E55	5
mer	Kimeru	E53	4
nyf	Giriama	E72a	0.6
kin	Kinyarwanda	JD61	12
run	Kirundi	JD62	9
lug	Luganda	JE15	8
kon	Kongo	H16	5
lin	Lingala	C40	40
suk	Kisukuma	F21	5

What the model does

BantuMorph supports five morphological tasks, each invoked through a task prefix on the input.

Task 1 — Extract (lemma + features)

Joint lemmatization and feature prediction.

Input:  swh-extract: ninasoma
Output: soma V;PRS;1;SG
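
The extract output packs the lemma and the feature bundle into one string. A minimal parsing sketch follows, assuming the features are always the last whitespace-separated token and are joined by semicolons, as in the example above. `Analysis` and `parse_extract` are hypothetical helper names, not part of the model's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    lemma: str
    features: tuple[str, ...]


def parse_extract(output: str) -> Analysis:
    """Split 'soma V;PRS;1;SG' into a lemma and a tuple of feature tags."""
    # rpartition splits on the LAST space, so the features are the final token.
    lemma, _, feats = output.strip().rpartition(" ")
    if not lemma:
        raise ValueError(f"expected '<lemma> <features>', got {output!r}")
    return Analysis(lemma, tuple(f for f in feats.split(";") if f))


print(parse_extract("soma V;PRS;1;SG"))
```

Given the reported accuracy on this task, it is worth treating a `ValueError` here as a signal of malformed model output rather than as a bug in the caller.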

Task 2 — Segment

Morpheme boundary detection.

Input:  swh-segment: ninasoma
Output: ni-na-soma
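
The segmented string can be turned into a morpheme list by splitting on the hyphen. The optional sanity check below assumes segmentation is purely surface-level, so that concatenating the morphemes reproduces the input word (true for the example above, but not confirmed for every language). `split_morphemes` is a hypothetical helper.

```python
from typing import Optional


def split_morphemes(segmented: str, word: Optional[str] = None) -> list[str]:
    """Split 'ni-na-soma' into ['ni', 'na', 'soma'].

    If `word` is given, verify that the morphemes concatenate back to it.
    """
    morphemes = [m for m in segmented.strip().split("-") if m]
    if word is not None and "".join(morphemes) != word:
        raise ValueError(f"{segmented!r} does not reassemble to {word!r}")
    return morphemes


print(split_morphemes("ni-na-soma", word="ninasoma"))  # ['ni', 'na', 'soma']
```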

Task 3 — Lemmatize

Extract the citation form.

Input:  swh-lemmatize: ninasoma
Output: soma

Task 4 — Noun class

Predict the Bantu noun class for a noun.

Input:  swh-nounclass: mtoto
Output: BANTU1
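
Downstream code usually wants the class as a number. The sketch below assumes every noun-class output has the form `BANTU<number>`; only `BANTU1` is shown in this card, so the pattern is an assumption, and `noun_class_number` is a hypothetical helper.

```python
import re

_NOUN_CLASS_TAG = re.compile(r"BANTU(\d+)")


def noun_class_number(tag: str) -> int:
    """Convert a tag such as 'BANTU1' to the integer noun class 1."""
    match = _NOUN_CLASS_TAG.fullmatch(tag.strip())
    if match is None:
        raise ValueError(f"not a BANTU<number> tag: {tag!r}")
    return int(match.group(1))


print(noun_class_number("BANTU1"))  # 1
```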

Task 5 — Complete (inflection)

Generate an inflected form from a lemma and features.

Input:  swh-complete: soma [V;PRS;1;SG]
Output: ninasoma
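
Complete is the inverse of extract: the extract output carries exactly the lemma and features that the complete prompt needs. The following sketch rewrites one into the other as a pure string transformation, assuming the formats shown in the two examples. `complete_prompt_from_extract` is a hypothetical helper.

```python
def complete_prompt_from_extract(lang: str, extract_output: str) -> str:
    """Turn 'soma V;PRS;1;SG' into '<lang>-complete: soma [V;PRS;1;SG]'."""
    lemma, _, feats = extract_output.strip().rpartition(" ")
    if not lemma or not feats:
        raise ValueError(f"expected '<lemma> <features>', got {extract_output!r}")
    return f"{lang}-complete: {lemma} [{feats}]"


print(complete_prompt_from_extract("swh", "soma V;PRS;1;SG"))
# swh-complete: soma [V;PRS;1;SG]
```

Feeding the result back to the model and comparing with the original word gives a cheap round-trip consistency check, though a mismatch does not say which of the two tasks went wrong.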

How to use

from transformers import T5ForConditionalGeneration, AutoTokenizer

# Load the byte-level tokenizer and the fine-tuned ByT5 checkpoint.
model_id = "thiomi/bantumorph-v7"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

def run(prompt: str) -> str:
    """Run one '<lang>-<task>: <input>' prompt and return the decoded output."""
    # max_length counts byte tokens (plus end-of-sequence), not words.
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=128)
    outputs = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Examples, one per task; the comments show the documented outputs.
print(run("swh-extract: ninasoma"))      # 'soma V;PRS;1;SG'
print(run("swh-segment: ninasoma"))      # 'ni-na-soma'
print(run("swh-lemmatize: ninasoma"))    # 'soma'
print(run("swh-nounclass: mtoto"))       # 'BANTU1'
print(run("swh-complete: soma [V;PRS;1;SG]"))  # 'ninasoma'

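The task prefix is the language's three-letter ISO 639-3 code followed by the task name, joined by a hyphen and terminated by a colon and a space (e.g. `swh-extract: `). The supported codes are the ones in the Languages table above (e.g. swh, kik, zul), not the shorter codes used in the card metadata.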
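
A small prompt builder can catch typos in the language code or task name before they reach the model. This is a sketch: the code set is copied from the Languages table and the task names from the summary table, while `build_prompt` itself is a hypothetical helper rather than part of the released package.

```python
# ISO 639-3 codes from the Languages table and task names from the summary.
LANGS = frozenset({
    "swh", "zul", "xho", "sna", "nso", "nya", "kik", "kam",
    "mer", "nyf", "kin", "run", "lug", "kon", "lin", "suk",
})
TASKS = frozenset({"extract", "segment", "lemmatize", "nounclass", "complete"})


def build_prompt(lang: str, task: str, text: str) -> str:
    """Return '<lang>-<task>: <text>' after validating the language and task."""
    if lang not in LANGS:
        raise ValueError(f"unsupported language code: {lang!r}")
    if task not in TASKS:
        raise ValueError(f"unsupported task: {task!r}")
    return f"{lang}-{task}: {text.strip()}"


print(build_prompt("kik", "segment", "mũndũ"))  # kik-segment: mũndũ
```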

Evaluation

Evaluated on a held-out test set of 4,687 examples spanning all 16 languages and all 5 tasks (~290 examples per language on average, stratified by task).

Per-task accuracy

Task	Accuracy
segment	96.1%
nounclass	87.8%
lemmatize	82.3%
complete	60.7%
extract	42.9%
Overall	67.1%
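
The overall figure is not the plain average of the five task rows. A quick check using only the numbers in the table:

```python
# Per-task accuracies (percent), copied from the table above.
PER_TASK = {
    "segment": 96.1,
    "nounclass": 87.8,
    "lemmatize": 82.3,
    "complete": 60.7,
    "extract": 42.9,
}


def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean: every task counts equally, regardless of test-set size."""
    return sum(scores.values()) / len(scores)


print(f"{macro_average(PER_TASK):.2f}")  # 73.96
```

The unweighted mean is about 74%, so the reported 67.1% is presumably weighted by the number of test examples per task, with the harder tasks contributing more examples. The per-task counts are not given here, so that reading is an assumption.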

Per-language accuracy (best to worst)

Language	Accuracy
Shona	94.4%
Chichewa	89.6%
Luganda	85.6%
Swahili	83.2%
Kongo	80.9%
(most other languages)	60–80%
Northern Sotho	44.7%

For the full per-task × per-language breakdown, see the BantuMorph paper.

Notes on the evaluation

Accuracy is exact-match on the model output, so any deviation from the reference string counts as an error. For segmentation specifically, ~45% of "errors" on common training vocabulary are valid alternative segmentations rather than incorrect ones; see the BantuMorph paper for the over-segmentation analysis.
Languages with smaller training corpora (Northern Sotho, Xhosa, Kirundi, Kinyarwanda) tend to underperform languages with larger corpora.
The hardest task is extract (42.9%) because of the large feature space; the easiest is segment (96.1%).
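
For reference, exact-match accuracy can be sketched as below. Stripping surrounding whitespace before comparison is an assumption; the card does not say how the authors normalize outputs.

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference string exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty test set")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# An alternative segmentation is scored as wrong even if a linguist would accept it.
print(exact_match_accuracy(["ni-na-soma", "ni-nasoma"], ["ni-na-soma", "ni-na-soma"]))  # 0.5
```

This is why the segmentation figure can understate quality: the metric has no notion of an acceptable alternative analysis.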

Training data

BantuMorph v7 was trained on 80,765 morphological paradigms drawn from:

UniMorph Bantu paradigm collections for the languages that have them
LLM-generated paradigm extensions from related Bantu languages, validated by community linguists
Cross-lingual transfer paradigms from high-resource Bantu languages (primarily Swahili, Zulu, and Luganda)

Data was split 85% train / 10% validation / 5% test, with care taken to ensure speaker-disjoint and lemma-disjoint splits where possible.
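
One common way to get a lemma-disjoint split is to assign each lemma to a split by hashing it, so every inflected form of that lemma lands on the same side. The sketch below illustrates that idea with the 85/10/5 proportions; it is not the authors' documented procedure, and `split_for_lemma` is a hypothetical helper.

```python
import hashlib


def split_for_lemma(lemma: str, train: float = 0.85, val: float = 0.10) -> str:
    """Deterministically map a lemma to 'train', 'validation', or 'test'."""
    digest = hashlib.sha256(lemma.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "validation"
    return "test"


# Illustrative rows (lemma, inflected form): both share the lemma, so both
# are assigned to the same split.
rows = [("soma", "ninasoma"), ("soma", "unasoma")]
print({split_for_lemma(lemma) for lemma, _ in rows})
```

The proportions hold only approximately, and more so for lemmas than for examples, since lemmas with large paradigms pull more rows into whichever split they hash to.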

Limitations

Not a substitute for native-speaker validation. The model is a useful starting point for morphological annotation, but generated outputs should be reviewed by speakers or linguists for any high-stakes use.
Accuracy varies sharply by language. The 16 languages have very different amounts of training data; overall accuracy ranges from 94.4% (Shona) to 44.7% (Northern Sotho).
Out-of-distribution loanwords. The model can over-apply Bantu morphological templates to loanwords from English, Arabic, French, or Portuguese. Filtering loanwords is an open problem; see the related mutisya/bantu-words-26-03-v3.5 dataset for one approach.
No tone marking. The model treats text at the byte level and does not explicitly encode lexical tone. For tonal languages like Luganda, tonal distinctions are missing from both input and output.
Limited orthographic coverage. Trained on standard Latin orthography for each language. Variant spellings (especially in less-standardized languages) may underperform.
Single-word inputs. Each task expects a single word; running on multi-word phrases or full sentences will produce unreliable results.
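
Because each task expects a single word, longer text has to be broken up before prompting. The sketch below does this with naive whitespace splitting and removal of leading and trailing ASCII punctuation, which is an assumption and will mishandle languages or texts that need a real tokenizer. `single_word_prompts` is a hypothetical helper.

```python
import string


def single_word_prompts(lang: str, task: str, text: str) -> list[str]:
    """Build one '<lang>-<task>: <word>' prompt per whitespace-separated word."""
    # strip() only touches the edges, so word-internal apostrophes
    # (as in "ng'ombe") are preserved.
    words = (w.strip(string.punctuation) for w in text.split())
    return [f"{lang}-{task}: {w}" for w in words if w]


print(single_word_prompts("swh", "segment", "Ninasoma kitabu."))
# ['swh-segment: Ninasoma', 'swh-segment: kitabu']
```

Each word is then analysed without sentence context, so forms that are ambiguous in isolation stay ambiguous.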

Intended use

BantuMorph is intended for:

Computational linguistics research on Bantu languages
Prototyping morphological analyzers for under-resourced Bantu languages via cross-lingual transfer
Educational tools that need morphological breakdown (lemmatization, segmentation, noun class)
Pre-processing for downstream NLP pipelines (information retrieval, search, named entity recognition)

It is not intended for:

Production speech-to-text or translation systems on its own
Definitive linguistic analysis without human review
Sociolinguistic or dialect-specific analysis

Related work

Zero-shot morphological discovery — applies BantuMorph to Giriama with only 91 labeled paradigms. arxiv:2604.22723
Neural recovery of historical lexical structure — uses BantuMorph embeddings to recover Proto-Bantu cognate structure. arxiv:2604.22730

Citation

If you use BantuMorph in your work, please cite:

@misc{mutisya2026bantumorph,
  title  = {BantuMorph: A Character-Level Transformer for Morphological Analysis Across 16 Bantu Languages},
  author = {Hillary Mutisya and John Mugane},
  year   = {2026},
  note   = {Forthcoming on arXiv. Model available at \url{https://huggingface.co/thiomi/bantumorph-v7}}
}

Model card authors

Hillary Mutisya, John Mugane

Contact

For issues, questions, or collaboration, please open an issue on the model repository or contact the authors directly.