---
library_name: transformers
license: apache-2.0
base_model: google/byt5-small
language:
- sw
- zu
- lg
- ny
- sn
- kg
- ki
- kam
- suk
- mer
- ln
- nso
- xh
- nyf
- rn
- rw
tags:
- bantu
- morphology
- multilingual
- low-resource
- character-level
- byt5
- african-languages
- swahili
- zulu
- kikuyu
datasets:
- mutisya/bantu-words-26-03-v3.5
---
# BantuMorph v7
BantuMorph is a character-level transformer for morphological analysis across 16 Bantu languages. Given a word in any of the supported languages, it can extract the lemma and morphological features, segment the word into morphemes, predict the noun class, or generate inflected forms from a lemma plus features.
The model is trained on 80,765 morphological paradigms across the 16 languages and operates over byte-level input, which lets it handle the rich agglutinative morphology of Bantu languages without word-piece tokenization artifacts.
## Quick summary
| Property | Value |
|---|---|
| Architecture | ByT5-small (encoder-decoder, operating on UTF-8 bytes) |
| Parameters | 300M |
| Languages | 16 Bantu languages |
| Tasks | 5 (extract, segment, lemmatize, nounclass, complete) |
| Base model | [`google/byt5-small`](https://huggingface.co/google/byt5-small) |
| License | Apache-2.0 |
## Languages
| Code | Language | Guthrie zone | Approx. speakers (M) |
|------|----------|--------------|---------------------|
| swh | Swahili | G42 | 200 |
| zul | Zulu | S42 | 12 |
| xho | Xhosa | S41 | 8 |
| sna | Shona | S10 | 9 |
| nso | Northern Sotho | S32 | 4 |
| nya | Chichewa | N31 | 14 |
| kik | Kikuyu | E51 | 8 |
| kam | Kamba | E55 | 5 |
| mer | Kimeru | E54 | 4 |
| nyf | Giriama | E72b | 0.6 |
| kin | Kinyarwanda | JD61 | 12 |
| run | Kirundi | JD62 | 9 |
| lug | Luganda | JE15 | 8 |
| kon | Kongo | H16 | 5 |
| lin | Lingala | C40 | 40 |
| suk | Kisukuma | F21 | 5 |
## What the model does
BantuMorph supports five morphological tasks, each invoked through a task prefix on the input.
### Task 1 — Extract (lemma + features)
Joint lemmatization and feature prediction.
```
Input: swh-extract: ninasoma
Output: soma V;PRS;1;SG
```
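The extract output packs the lemma and a semicolon-separated feature bundle into one string, so downstream code usually needs to split it. A minimal sketch, assuming the feature bundle is always the last whitespace-separated field; `Analysis` and `parse_extract` are hypothetical helper names, not part of the model or of `transformers`:

```python
from typing import List, NamedTuple


class Analysis(NamedTuple):
    lemma: str
    features: List[str]


def parse_extract(output: str) -> Analysis:
    """Split an extract-task output such as 'soma V;PRS;1;SG'.

    Assumption: the last whitespace-separated field is the feature
    bundle and everything before it is the lemma.
    """
    lemma, _, bundle = output.strip().rpartition(" ")
    if not lemma:
        # No space found: treat the whole string as a lemma without features.
        return Analysis(lemma=bundle, features=[])
    features = [f for f in bundle.split(";") if f]
    return Analysis(lemma=lemma.strip(), features=features)


print(parse_extract("soma V;PRS;1;SG"))
# Analysis(lemma='soma', features=['V', 'PRS', '1', 'SG'])
```

Returning a named tuple keeps the lemma and the features addressable by name while still unpacking like a plain pair.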
### Task 2 — Segment
Morpheme boundary detection.
```
Input: swh-segment: ninasoma
Output: ni-na-soma
```
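A segmentation like the one above can be turned into a morpheme list and sanity-checked against the input word. The sketch below assumes the hyphen is the only boundary marker and that the surface word itself contains no hyphen; `split_morphemes` is a hypothetical helper name:

```python
from typing import List


def split_morphemes(segmented: str, word: str) -> List[str]:
    """Split a segment-task output like 'ni-na-soma' into morphemes.

    Raises ValueError if the morphemes do not concatenate back to `word`,
    which flags outputs where the model added, dropped, or changed characters.
    """
    morphemes = [m for m in segmented.strip().split("-") if m]
    if "".join(morphemes) != word:
        raise ValueError(f"{segmented!r} does not spell {word!r}")
    return morphemes


print(split_morphemes("ni-na-soma", "ninasoma"))  # ['ni', 'na', 'soma']
```

The round-trip check is cheap and catches generation slips before they reach an annotation pipeline.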
### Task 3 — Lemmatize
Extract the citation form.
```
Input: swh-lemmatize: ninasoma
Output: soma
```
### Task 4 — Noun class
Predict the Bantu noun class for a noun.
```
Input: swh-nounclass: mtoto
Output: BANTU1
```
### Task 5 — Complete (inflection)
Generate an inflected form from a lemma and features.
```
Input: swh-complete: soma [V;PRS;1;SG]
Output: ninasoma
```
## How to use
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "thiomi/bantumorph-v7"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)


def run(prompt: str) -> str:
    """Run one task-prefixed prompt and return the decoded output string."""
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=128)
    outputs = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# Examples (expected outputs shown in the trailing comments)
print(run("swh-extract: ninasoma"))            # 'soma V;PRS;1;SG'
print(run("swh-segment: ninasoma"))            # 'ni-na-soma'
print(run("swh-lemmatize: ninasoma"))          # 'soma'
print(run("swh-nounclass: mtoto"))             # 'BANTU1'
print(run("swh-complete: soma [V;PRS;1;SG]"))  # 'ninasoma'
```
The task prefix is the language's three-letter ISO 639-3 code followed by the task name, joined by a hyphen and followed by a colon and a space (e.g. `swh-extract: `). The supported codes are listed in the Languages table above (e.g. `swh`, `kik`, `zul`).
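Building prompts by hand invites typos in the prefix, so a small validating helper can be useful. This is a sketch only: the language codes and task names are copied from this card, while `build_prompt` and its arguments are hypothetical names, and the bracketed feature syntax for `complete` is inferred from the single example shown above.

```python
from typing import Optional, Sequence

# Codes and task names copied from this card; the helper itself is illustrative.
LANGS = {"swh", "zul", "xho", "sna", "nso", "nya", "kik", "kam",
         "mer", "nyf", "kin", "run", "lug", "kon", "lin", "suk"}
TASKS = {"extract", "segment", "lemmatize", "nounclass", "complete"}


def build_prompt(lang: str, task: str, word: str,
                 features: Optional[Sequence[str]] = None) -> str:
    """Return '<lang>-<task>: <word>', plus ' [F1;F2;...]' for the complete task."""
    if lang not in LANGS:
        raise ValueError(f"unsupported language code: {lang}")
    if task not in TASKS:
        raise ValueError(f"unsupported task: {task}")
    prompt = f"{lang}-{task}: {word}"
    if task == "complete":
        if not features:
            raise ValueError("the complete task needs a feature list")
        prompt += " [" + ";".join(features) + "]"
    return prompt


print(build_prompt("swh", "segment", "ninasoma"))
# swh-segment: ninasoma
print(build_prompt("swh", "complete", "soma", ["V", "PRS", "1", "SG"]))
# swh-complete: soma [V;PRS;1;SG]
```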
## Evaluation
Evaluated on a held-out test set of 4,687 examples spanning all 16 languages and all 5 tasks (~290 examples per language on average, stratified by task).
### Per-task accuracy
| Task | Accuracy |
|---|---|
| segment | **96.1%** |
| nounclass | 87.8% |
| lemmatize | 82.3% |
| complete | 60.7% |
| extract | 42.9% |
| **Overall** | **67.1%** |
### Per-language accuracy (best to worst)
| Language | Accuracy |
|---|---|
| Shona | 94.4% |
| Chichewa | 89.6% |
| Luganda | 85.6% |
| Swahili | 83.2% |
| Kongo | 80.9% |
| (most other languages) | 60–80% |
| Northern Sotho | 44.7% |
For the full per-task × per-language breakdown, see the BantuMorph paper.
### Notes on the evaluation
- Accuracy is exact match on the model output. For segmentation specifically, about 45% of the "errors" on common training vocabulary are valid alternative segmentations rather than incorrect ones; see the BantuMorph paper for the over-segmentation analysis.
- Languages with smaller training corpora (Northern Sotho, Xhosa, Kirundi, Kinyarwanda) tend to underperform languages with larger corpora.
- The hardest task is `extract` because of the large feature space; the easiest is `segment`.
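Exact-match scoring as described in these notes can be sketched as follows. The card does not say how the overall figure is aggregated; the sketch uses a micro-average over examples, which is an assumption. `exact_match_report` is a hypothetical helper, and the rows are made-up toy data, not drawn from the test set.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def exact_match_report(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """rows: (task, prediction, gold). Returns per-task and overall accuracy."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for task, pred, gold in rows:
        totals[task] += 1
        # Exact match after trimming surrounding whitespace only.
        hits[task] += int(pred.strip() == gold.strip())
    report = {task: hits[task] / totals[task] for task in totals}
    n = sum(totals.values())
    # Micro-average: every example counts equally, so large tasks dominate.
    report["overall"] = sum(hits.values()) / n if n else 0.0
    return report


# Toy rows for illustration only.
rows = [
    ("segment", "ni-na-soma", "ni-na-soma"),
    ("segment", "m-toto", "m-toto"),
    ("extract", "soma V;PRS;1;SG", "soma V;PRS;1;SG"),
    ("extract", "soma V;PST;1;SG", "soma V;PRS;1;SG"),
]
print(exact_match_report(rows))
# {'segment': 1.0, 'extract': 0.5, 'overall': 0.75}
```

Because a single wrong feature fails the whole `extract` string, exact match is a strict metric for that task.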
## Training data
BantuMorph v7 was trained on 80,765 morphological paradigms drawn from:
- **UniMorph** Bantu paradigm collections for the languages that have them
- **LLM-generated paradigm extensions** from related Bantu languages, validated by community linguists
- **Cross-lingual transfer paradigms** from high-resource Bantu languages (primarily Swahili, Zulu, and Luganda)
Data was split 85% train / 10% validation / 5% test, with care taken to ensure speaker-disjoint and lemma-disjoint splits where possible.
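One common way to get a lemma-disjoint split is to assign each (language, lemma) pair to a split by hashing it, so every inflected form of a lemma lands in the same split. The sketch below illustrates that idea under stated assumptions; it is not the procedure actually used for this dataset, and the `(lang, lemma, form)` record layout and the name `split_by_lemma` are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, str]  # (lang, lemma, form): an assumed layout


def split_by_lemma(records: Iterable[Record],
                   train: float = 0.85,
                   val: float = 0.10) -> Dict[str, List[Record]]:
    """Assign every record of a given (lang, lemma) pair to the same split."""
    splits: Dict[str, List[Record]] = {"train": [], "val": [], "test": []}
    for rec in records:
        lang, lemma, _form = rec
        digest = hashlib.sha256(f"{lang}\t{lemma}".encode("utf-8")).digest()
        # Map the first 8 bytes of the hash to a stable number in [0, 1).
        u = int.from_bytes(digest[:8], "big") / 2**64
        if u < train:
            name = "train"
        elif u < train + val:
            name = "val"
        else:
            name = "test"
        splits[name].append(rec)
    return splits


# Toy records for illustration only.
toy = [("swh", "soma", "ninasoma"), ("swh", "soma", "unasoma"), ("swh", "mtoto", "mtoto")]
for name, items in split_by_lemma(toy).items():
    print(name, items)
```

Hash-based assignment is deterministic across runs, but the 85/10/5 proportions hold only approximately, and at the lemma level rather than the example level.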
## Limitations
- **Not a substitute for native-speaker validation.** The model is a useful starting point for morphological annotation, but generated outputs should be reviewed by speakers or linguists for any high-stakes use.
- **Accuracy varies sharply by language.** The 16 languages have very different amounts of training data; overall accuracy ranges from 94.4% (Shona) to 44.7% (Northern Sotho).
- **Out-of-distribution loanwords.** The model can over-apply Bantu morphological templates to loanwords from English, Arabic, French, or Portuguese. Filtering loanwords is an open problem; see the related `mutisya/bantu-words-26-03-v3.5` dataset for one approach.
- **No tone marking.** The model treats text at the byte level and does not explicitly encode lexical tone. For tonal languages like Luganda, tonal distinctions are missing from both input and output.
- **Limited orthographic coverage.** Trained on standard Latin orthography for each language. Variant spellings (especially in less-standardized languages) may underperform.
- **Single-word inputs.** Each task expects a single word; running on multi-word phrases or full sentences will produce unreliable results.
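Given the single-word limitation, running text has to be split into words before it is sent to the model. A minimal sketch, assuming whitespace- and punctuation-delimited words and lowercased input; both are assumptions, since the card does not state the casing of the training data, and `words_to_prompts` is a hypothetical helper:

```python
import re
from typing import List

# Runs of letters, optionally joined by an internal apostrophe or hyphen
# (e.g. "ng'ombe"); digits and other punctuation are dropped.
_WORD = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def words_to_prompts(text: str, lang: str, task: str) -> List[str]:
    """Turn running text into one task-prefixed prompt per word."""
    return [f"{lang}-{task}: {w.lower()}" for w in _WORD.findall(text)]


print(words_to_prompts("Ninasoma kitabu.", "swh", "segment"))
# ['swh-segment: ninasoma', 'swh-segment: kitabu']
```

Each word is analysed without its sentence context, so forms that are ambiguous in isolation remain ambiguous.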
## Intended use
BantuMorph is intended for:
- Computational linguistics research on Bantu languages
- Prototyping morphological analyzers for under-resourced Bantu languages via cross-lingual transfer
- Educational tools that need morphological breakdown (lemmatization, segmentation, noun class)
- Pre-processing for downstream NLP pipelines (information retrieval, search, named entity recognition)
It is not intended for:
- Production speech-to-text or translation systems on its own
- Definitive linguistic analysis without human review
- Sociolinguistic or dialect-specific analysis
## Related work
- **Zero-shot morphological discovery** — applies BantuMorph to Giriama with only 91 labeled paradigms. [arxiv:2604.22723](https://arxiv.org/abs/2604.22723)
- **Neural recovery of historical lexical structure** — uses BantuMorph embeddings to recover Proto-Bantu cognate structure. [arxiv:2604.22730](https://arxiv.org/abs/2604.22730)
## Citation
If you use BantuMorph in your work, please cite:
```bibtex
@misc{mutisya2026bantumorph,
title = {BantuMorph: A Character-Level Transformer for Morphological Analysis Across 16 Bantu Languages},
author = {Hillary Mutisya and John Mugane},
year = {2026},
note = {Forthcoming on arXiv. Model available at \url{https://huggingface.co/thiomi/bantumorph-v7}}
}
```
## Model card authors
Hillary Mutisya, John Mugane
## Contact
For issues, questions, or collaboration, please open an issue on the model repository or contact the authors directly.