---
library_name: transformers
license: apache-2.0
base_model: google/byt5-small
language:
  - sw
  - zu
  - lg
  - ny
  - sn
  - kg
  - ki
  - kam
  - suk
  - mer
  - ln
  - nso
  - xh
  - nyf
  - rn
  - rw
tags:
  - bantu
  - morphology
  - multilingual
  - low-resource
  - character-level
  - byt5
  - african-languages
  - swahili
  - zulu
  - kikuyu
datasets:
  - mutisya/bantu-words-26-03-v3.5
---

# BantuMorph v7

BantuMorph is a character-level transformer for morphological analysis across 16 Bantu languages. Given a word in any of the supported languages, it can extract the lemma and morphological features, segment the word into morphemes, predict the noun class, or generate inflected forms from a lemma plus features.

The model is trained on 80,765 morphological paradigms across the 16 languages and operates over byte-level input, which lets it handle the rich agglutinative morphology of Bantu languages without word-piece tokenization artifacts.
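
As a rough illustration of what byte-level input means, the sketch below reproduces in plain Python the id mapping used by the stock ByT5 tokenizer: each UTF-8 byte is shifted by 3 to leave room for the pad, EOS, and unknown tokens. The helper name `byt5_ids` is ours, and the sketch assumes this checkpoint keeps the base tokenizer unchanged.

```python
def byt5_ids(text: str) -> list[int]:
    """Mimic ByT5 tokenization: one id per UTF-8 byte, offset by 3, plus EOS."""
    offset = 3   # ids 0, 1, 2 are reserved for <pad>, </s>, <unk>
    eos_id = 1
    return [b + offset for b in text.encode("utf-8")] + [eos_id]

word = "m\u0169nd\u0169"  # Kikuyu "mũndũ"; each "ũ" takes two bytes in UTF-8
ids = byt5_ids(word)
print(len(word), len(ids))  # 5 characters, 8 ids (7 bytes + EOS)
```

Because sequence length is counted in bytes, words with diacritics consume more positions than their character count suggests.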

## Quick summary

| Property | Value |
|---|---|
| Architecture | ByT5-small (encoder-decoder, byte-level) |
| Parameters | 300M |
| Languages | 16 Bantu languages |
| Tasks | 5 (extract, segment, lemmatize, nounclass, complete) |
| Base model | [`google/byt5-small`](https://huggingface.co/google/byt5-small) |
| License | Apache-2.0 |

## Languages

| ISO 639-3 code | Language | Guthrie zone | Approx. speakers (M) |
|------|----------|--------------|---------------------|
| swh | Swahili | G42 | 200 |
| zul | Zulu | S42 | 12 |
| xho | Xhosa | S41 | 8 |
| sna | Shona | S10 | 9 |
| nso | N. Sotho | S32 | 4 |
| nya | Chichewa | N31 | 14 |
| kik | Kikuyu | E51 | 8 |
| kam | Kamba | E55 | 5 |
| mer | Kimeru | E54 | 4 |
| nyf | Giriama | E72b | 0.6 |
| kin | Kinyarwanda | JD61 | 12 |
| run | Kirundi | JD62 | 9 |
| lug | Luganda | JE15 | 8 |
| kon | Kongo | H16 | 5 |
| lin | Lingala | C40 | 40 |
| suk | Kisukuma | F21 | 5 |

## What the model does

BantuMorph supports five morphological tasks, each invoked through a task prefix on the input.

### Task 1 — Extract (lemma + features)

Joint lemmatization and feature prediction.

```
Input:  swh-extract: ninasoma
Output: soma V;PRS;1;SG
```
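
Downstream code usually needs the lemma and the feature bundle separately. The following is a minimal sketch, assuming the output always has the `<lemma> <TAG;TAG;...>` shape shown above; `parse_extract` and `Analysis` are illustrative names, not part of the model's API.

```python
from typing import NamedTuple

class Analysis(NamedTuple):
    lemma: str
    features: tuple[str, ...]

def parse_extract(output: str) -> Analysis:
    """Split 'soma V;PRS;1;SG' into a lemma and a tuple of feature tags."""
    # Split on the last space: the feature bundle in the example contains no spaces.
    lemma, _, bundle = output.strip().rpartition(" ")
    if not lemma:
        raise ValueError(f"expected '<lemma> <features>', got {output!r}")
    return Analysis(lemma, tuple(tag for tag in bundle.split(";") if tag))

print(parse_extract("soma V;PRS;1;SG"))
```

Raising on malformed output, rather than guessing, makes it easier to spot cases where the model drifts from the expected format.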

### Task 2 — Segment

Morpheme boundary detection.

```
Input:  swh-segment: ninasoma
Output: ni-na-soma
```

### Task 3 — Lemmatize

Extract the citation form.

```
Input:  swh-lemmatize: ninasoma
Output: soma
```

### Task 4 — Noun class

Predict the Bantu noun class for a noun.

```
Input:  swh-nounclass: mtoto
Output: BANTU1
```

### Task 5 — Complete (inflection)

Generate an inflected form from a lemma and features.

```
Input:  swh-complete: soma [V;PRS;1;SG]
Output: ninasoma
```

## How to use

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

model_id = "thiomi/bantumorph-v7"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

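def run(prompt: str) -> str:
    """Run one task-prefixed prompt through the model and return the decoded output."""
    # ByT5 works on UTF-8 bytes, so max_length and max_new_tokens count bytes, not words.
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=128)
    outputs = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)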

# Examples
print(run("swh-extract: ninasoma"))      # 'soma V;PRS;1;SG'
print(run("swh-segment: ninasoma"))      # 'ni-na-soma'
print(run("swh-lemmatize: ninasoma"))    # 'soma'
print(run("swh-nounclass: mtoto"))       # 'BANTU1'
print(run("swh-complete: soma [V;PRS;1;SG]"))  # 'ninasoma'
```
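
For more than a handful of words, batching prompts is usually faster than calling `run` in a loop. The sketch below is one way to do that. `run_batch`, `chunks`, and the `batch_size` default are our own choices; the tokenizer and model are passed in as arguments so the function works with the objects loaded above.

```python
from typing import Any, Iterator, Sequence

def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def run_batch(prompts: Sequence[str], tokenizer: Any, model: Any,
              batch_size: int = 32) -> list[str]:
    """Generate one output string per prompt, preserving input order."""
    results: list[str] = []
    for batch in chunks(prompts, batch_size):
        # padding=True pads every prompt in the batch to the longest one.
        inputs = tokenizer(list(batch), return_tensors="pt", padding=True,
                           truncation=True, max_length=128)
        outputs = model.generate(**inputs, max_new_tokens=64)
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return results
```

A call such as `run_batch(["swh-segment: ninasoma", "swh-lemmatize: ninasoma"], tokenizer, model)` should return the outputs in the same order as the prompts.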

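Each prompt has the form `<code>-<task>: <word>`: the language's ISO 639-3 code and the task name joined by a hyphen, then a colon, a space, and the input word. The `complete` task additionally takes the feature bundle in square brackets after the lemma. The supported language codes are listed in the Languages table above (e.g. `swh`, `kik`, `zul`).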
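
Building prompts by hand is easy to get wrong, so a small helper can validate the pieces first. This is a sketch rather than an official API: `build_prompt` is a hypothetical name, the code and task sets are copied from the tables in this card, and the single-word check reflects the limitation noted later in the card.

```python
from typing import Optional

LANGS = {"swh", "zul", "xho", "sna", "nso", "nya", "kik", "kam",
         "mer", "nyf", "kin", "run", "lug", "kon", "lin", "suk"}
TASKS = {"extract", "segment", "lemmatize", "nounclass", "complete"}

def build_prompt(lang: str, task: str, word: str,
                 features: Optional[str] = None) -> str:
    """Return '<lang>-<task>: <word>', plus ' [<features>]' for the complete task."""
    if lang not in LANGS:
        raise ValueError(f"unsupported language code: {lang!r}")
    if task not in TASKS:
        raise ValueError(f"unsupported task: {task!r}")
    if not word or any(ch.isspace() for ch in word):
        raise ValueError("each prompt takes exactly one word")
    if (task == "complete") != (features is not None):
        raise ValueError("features are required for 'complete' and only for it")
    suffix = f" [{features}]" if features is not None else ""
    return f"{lang}-{task}: {word}{suffix}"

print(build_prompt("swh", "complete", "soma", "V;PRS;1;SG"))
# swh-complete: soma [V;PRS;1;SG]
```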

## Evaluation

Evaluated on a held-out test set of 4,687 examples spanning all 16 languages and all 5 tasks (~290 examples per language on average, stratified by task).

### Per-task accuracy

| Task | Accuracy |
|---|---|
| segment | **96.1%** |
| nounclass | 87.8% |
| lemmatize | 82.3% |
| complete | 60.7% |
| extract | 42.9% |
| **Overall** | **67.1%** |

### Per-language accuracy (best to worst)

| Language | Accuracy |
|---|---|
| Shona | 94.4% |
| Chichewa | 89.6% |
| Luganda | 85.6% |
| Swahili | 83.2% |
| Kongo | 80.9% |
| (most other languages) | 60–80% |
| Northern Sotho | 44.7% |

For the full per-task × per-language breakdown, see the BantuMorph paper.

### Notes on the evaluation

- Accuracy is exact-match on the model output. For segmentation specifically, ~45% of "errors" on common training vocabulary are actually valid alternative segmentations rather than incorrect ones — see the BantuMorph paper for the over-segmentation analysis.
- Languages with smaller training corpora (Northern Sotho, Xhosa, Kirundi, Kinyarwanda) tend to underperform languages with larger corpora.
- The hardest task is `extract` because of the large feature space; the easiest is `segment`.
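
To make the metric concrete, here is a minimal sketch of exact-match scoring grouped by task. It is not the authors' evaluation script; `exact_match_by_task` is a hypothetical helper and the rows are toy data invented for illustration.

```python
from collections import defaultdict

def exact_match_by_task(rows):
    """rows: iterable of (task, prediction, gold). Returns ({task: accuracy}, overall)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for task, pred, gold in rows:
        totals[task] += 1
        hits[task] += int(pred == gold)  # strict string equality, no partial credit
    per_task = {task: hits[task] / totals[task] for task in totals}
    overall = sum(hits.values()) / sum(totals.values())  # weighted by example count
    return per_task, overall

# Toy rows for illustration only; these are not taken from the test set.
rows = [
    ("segment", "ni-na-soma", "ni-na-soma"),
    ("segment", "m-toto", "m-toto"),
    ("extract", "soma V;PRS;1;SG", "soma V;PRS;1;SG"),
    ("extract", "soma V;PST;1;SG", "soma V;PRS;1;SG"),
    ("extract", "toto N;SG", "mtoto N;SG"),
]
per_task, overall = exact_match_by_task(rows)
print(per_task, overall)
```

An overall score computed this way weights every example equally, so it differs from the unweighted mean of the per-task scores whenever tasks have different numbers of examples. The unweighted mean of the five per-task accuracies reported above is about 74.0%, so the reported 67.1% would be consistent with example-weighted averaging over tasks of unequal size; the card does not state which averaging was used.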

## Training data

BantuMorph v7 was trained on 80,765 morphological paradigms drawn from:

- **UniMorph** Bantu paradigm collections for the languages that have them
- **LLM-generated paradigm extensions** from related Bantu languages, validated by community linguists
- **Cross-lingual transfer paradigms** from high-resource Bantu languages (primarily Swahili, Zulu, and Luganda)

Data was split 85% train / 10% validation / 5% test, with care taken to ensure speaker-disjoint and lemma-disjoint splits where possible.
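
One common way to get lemma-disjoint splits is to assign each lemma, rather than each word form, to a split. The sketch below does this with a stable hash. It is an assumption about how such a split could be built, not a description of the authors' procedure, and `split_for` is a hypothetical helper. The 85/10/5 proportions hold only approximately, in expectation over many lemmas.

```python
import hashlib

def split_for(lemma: str, lang: str) -> str:
    """Deterministically map a (language, lemma) pair to train/validation/test."""
    digest = hashlib.sha256(f"{lang}:{lemma}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    if bucket < 85:
        return "train"
    if bucket < 95:
        return "validation"
    return "test"

# Every form of the same lemma lands in the same split.
forms = [("swh", "soma", "ninasoma"), ("swh", "soma", "soma"), ("swh", "mtoto", "mtoto")]
for lang, lemma, form in forms:
    print(form, split_for(lemma, lang))
```

Hashing the lemma instead of shuffling rows keeps the assignment reproducible across runs and guarantees that no inflected form of a test lemma appears in training.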

## Limitations

- **Not a substitute for native-speaker validation.** The model is a useful starting point for morphological annotation, but generated outputs should be reviewed by speakers or linguists for any high-stakes use.
- **Accuracy varies sharply by language.** The 16 languages have very different amounts of training data; overall accuracy ranges from 94.4% (Shona) to 44.7% (Northern Sotho).
- **Out-of-distribution loanwords.** The model can over-apply Bantu morphological templates to loanwords from English, Arabic, French, or Portuguese. Filtering loanwords is an open problem; see the related `mutisya/bantu-words-26-03-v3.5` dataset for one approach.
- **No tone marking.** The model treats text at the byte level and does not explicitly encode lexical tone. For tonal languages like Luganda, tonal distinctions are missing from both input and output.
- **Limited orthographic coverage.** Trained on standard Latin orthography for each language. Variant spellings (especially in less-standardized languages) may underperform.
- **Single-word inputs.** Each task expects a single word; running on multi-word phrases or full sentences will produce unreliable results.

## Intended use

BantuMorph is intended for:

- Computational linguistics research on Bantu languages
- Prototyping morphological analyzers for under-resourced Bantu languages via cross-lingual transfer
- Educational tools that need morphological breakdown (lemmatization, segmentation, noun class)
- Pre-processing for downstream NLP pipelines (information retrieval, search, named entity recognition)

It is not intended for:

- Production speech-to-text or translation systems on its own
- Definitive linguistic analysis without human review
- Sociolinguistic or dialect-specific analysis

## Related work

- **Zero-shot morphological discovery** — applies BantuMorph to Giriama with only 91 labeled paradigms. [arxiv:2604.22723](https://arxiv.org/abs/2604.22723)
- **Neural recovery of historical lexical structure** — uses BantuMorph embeddings to recover Proto-Bantu cognate structure. [arxiv:2604.22730](https://arxiv.org/abs/2604.22730)

## Citation

If you use BantuMorph in your work, please cite:

```bibtex
@misc{mutisya2026bantumorph,
  title  = {BantuMorph: A Character-Level Transformer for Morphological Analysis Across 16 Bantu Languages},
  author = {Hillary Mutisya and John Mugane},
  year   = {2026},
  note   = {Forthcoming on arXiv. Model available at \url{https://huggingface.co/thiomi/bantumorph-v7}}
}
```

## Model card authors

Hillary Mutisya, John Mugane

## Contact

For issues, questions, or collaboration, please open an issue on the model repository or contact the authors directly.