| --- |
| library_name: transformers |
| license: apache-2.0 |
| base_model: google/byt5-small |
| language: |
| - sw |
| - zu |
| - lg |
| - ny |
| - sn |
| - kg |
| - ki |
| - kam |
| - suk |
| - mer |
| - ln |
| - nso |
| - xh |
| - nyf |
| - rn |
| - rw |
| tags: |
| - bantu |
| - morphology |
| - multilingual |
| - low-resource |
| - character-level |
| - byt5 |
| - african-languages |
| - swahili |
| - zulu |
| - kikuyu |
| datasets: |
| - mutisya/bantu-words-26-03-v3.5 |
| --- |
| |
| # BantuMorph v7 |
|
|
| BantuMorph is a character-level transformer for morphological analysis across 16 Bantu languages. Given a word in any of the supported languages, it can jointly extract the lemma and morphological features, return the lemma alone, segment the word into morphemes, or predict the noun class. Given a lemma plus features, it can also generate the inflected form. |
|
|
| The model is trained on 80,765 morphological paradigms across the 16 languages and operates over byte-level input, which lets it handle the rich agglutinative morphology of Bantu languages without word-piece tokenization artifacts. |
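To make "byte-level input" concrete, the sketch below mimics how a ByT5-style tokenizer maps text to ids: each UTF-8 byte becomes one id, shifted by 3 to leave room for the pad, end-of-sequence, and unknown tokens, and the end-of-sequence id (1) is appended. The helper name `byt5_style_ids` is illustrative and not part of the model's API; in practice the tokenizer loaded from the model repository does this for you.

```python
def byt5_style_ids(text: str) -> list[int]:
    """Approximate ByT5 tokenization: UTF-8 bytes shifted by 3, plus EOS (id 1)."""
    special_token_offset = 3  # ids 0, 1, 2 are pad, EOS, and unknown
    eos_id = 1
    return [b + special_token_offset for b in text.encode("utf-8")] + [eos_id]


ids = byt5_style_ids("ninasoma")
print(len(ids))  # 9: eight bytes plus the EOS token
```

Because the unit is the byte rather than the character, a non-ASCII letter such as the Kikuyu `ĩ` occupies two ids instead of one.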
|
|
| ## Quick summary |
|
|
| | Property | Value | |
| |---|---| |
| | Architecture | ByT5-small (encoder-decoder, byte-level input) | |
| | Parameters | 300M | |
| | Languages | 16 Bantu languages | |
| | Tasks | 5 (extract, segment, lemmatize, nounclass, complete) | |
| | Base model | [`google/byt5-small`](https://huggingface.co/google/byt5-small) | |
| | License | Apache-2.0 | |
|
|
| ## Languages |
|
|
| | Code | Language | Guthrie zone | Approx. speakers (M) | |
| |------|----------|--------------|---------------------| |
| | swh | Swahili | G42 | 200 | |
| | zul | Zulu | S42 | 12 | |
| | xho | Xhosa | S41 | 8 | |
| | sna | Shona | S10 | 9 | |
| | nso | Northern Sotho | S32 | 4 | |
| | nya | Chichewa | N31 | 14 | |
| | kik | Kikuyu | E51 | 8 | |
| | kam | Kamba | E55 | 5 | |
| | mer | Kimeru | E54 | 4 | |
| | nyf | Giriama | E72b | 0.6 | |
| | kin | Kinyarwanda | JD61 | 12 | |
| | run | Kirundi | JD62 | 9 | |
| | lug | Luganda | JE15 | 8 | |
| | kon | Kongo | H16 | 5 | |
| | lin | Lingala | C40 | 40 | |
| | suk | Kisukuma | F21 | 5 | |
|
|
| ## What the model does |
|
|
| BantuMorph supports five morphological tasks, each invoked through a task prefix on the input. |
|
|
| ### Task 1 — Extract (lemma + features) |
|
|
| Joint lemmatization and feature prediction. |
|
|
| ``` |
| Input: swh-extract: ninasoma |
| Output: soma V;PRS;1;SG |
| ``` |
|
|
| ### Task 2 — Segment |
|
|
| Morpheme boundary detection. |
|
|
| ``` |
| Input: swh-segment: ninasoma |
| Output: ni-na-soma |
| ``` |
|
|
| ### Task 3 — Lemmatize |
|
|
| Extract the citation form. |
|
|
| ``` |
| Input: swh-lemmatize: ninasoma |
| Output: soma |
| ``` |
|
|
| ### Task 4 — Noun class |
|
|
| Predict the Bantu noun class for a noun. |
|
|
| ``` |
| Input: swh-nounclass: mtoto |
| Output: BANTU1 |
| ``` |
|
|
| ### Task 5 — Complete (inflection) |
|
|
| Generate an inflected form from a lemma and features. |
|
|
| ``` |
| Input: swh-complete: soma [V;PRS;1;SG] |
| Output: ninasoma |
| ``` |
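The outputs shown above are plain strings, so downstream code usually has to split them. The following is a minimal sketch, assuming outputs always follow the formats in the examples: a lemma, a space, then semicolon-separated tags for `extract`; hyphen-separated morphemes for `segment`; and `BANTU` plus a class number for `nounclass`. The function names are hypothetical, and real model output may deviate from these shapes, in which case the parsers return `None` rather than guess.

```python
from typing import NamedTuple, Optional


class Analysis(NamedTuple):
    lemma: str
    features: list[str]


def parse_extract(output: str) -> Optional[Analysis]:
    """Split 'soma V;PRS;1;SG' into a lemma and a list of feature tags."""
    parts = output.strip().split(maxsplit=1)
    if len(parts) != 2:
        return None  # output does not match the assumed 'lemma TAGS' shape
    lemma, tags = parts
    return Analysis(lemma, [t for t in tags.split(";") if t])


def parse_segment(output: str) -> list[str]:
    """Split 'ni-na-soma' into its morphemes."""
    return [m for m in output.strip().split("-") if m]


def parse_nounclass(output: str) -> Optional[int]:
    """Turn 'BANTU1' into the integer 1; None if the label looks different."""
    label = output.strip()
    if label.startswith("BANTU") and label[5:].isdigit():
        return int(label[5:])
    return None


print(parse_extract("soma V;PRS;1;SG"))
print(parse_segment("ni-na-soma"))
print(parse_nounclass("BANTU1"))
```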
|
|
| ## How to use |
|
|
| ```python |
| from transformers import T5ForConditionalGeneration, AutoTokenizer |
| |
| model_id = "thiomi/bantumorph-v7" |
| tokenizer = AutoTokenizer.from_pretrained(model_id) |
| model = T5ForConditionalGeneration.from_pretrained(model_id) |
| |
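| def run(prompt: str) -> str: |
|     # Tokenize a single prompt; inputs longer than 128 tokens are truncated. |
|     inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=128) |
|     outputs = model.generate(**inputs, max_new_tokens=64) |
|     # Decode the first (and only) generated sequence, dropping pad/EOS tokens. |
|     return tokenizer.decode(outputs[0], skip_special_tokens=True) |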
| |
| # Examples |
| print(run("swh-extract: ninasoma")) # 'soma V;PRS;1;SG' |
| print(run("swh-segment: ninasoma")) # 'ni-na-soma' |
| print(run("swh-lemmatize: ninasoma")) # 'soma' |
| print(run("swh-nounclass: mtoto")) # 'BANTU1' |
| print(run("swh-complete: soma [V;PRS;1;SG]")) # 'ninasoma' |
| ``` |
|
|
| The task prefix is the language's three-letter ISO 639-3 code and the task name joined by a hyphen, followed by a colon and a space (e.g. `swh-segment: `). Use the codes from the Languages table above (such as `swh`, `kik`, `zul`), not the shorter codes in the card metadata. |
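A small helper can assemble prompts and catch typos in the prefix before they reach the model. This is a sketch only: `build_prompt` is a hypothetical name, the language and task sets are copied from this card, and the bracketed feature format for `complete` is assumed from the single Task 5 example.

```python
from typing import Optional, Sequence

LANG_CODES = {
    "swh", "zul", "xho", "sna", "nso", "nya", "kik", "kam",
    "mer", "nyf", "kin", "run", "lug", "kon", "lin", "suk",
}
TASKS = {"extract", "segment", "lemmatize", "nounclass", "complete"}


def build_prompt(lang: str, task: str, word: str,
                 features: Optional[Sequence[str]] = None) -> str:
    """Return a prompt such as 'swh-segment: ninasoma'."""
    if lang not in LANG_CODES:
        raise ValueError(f"unsupported language code: {lang!r}")
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if task == "complete":
        if not features:
            raise ValueError("the complete task needs a feature list")
        # Assumed format: lemma followed by [TAG;TAG;...]
        return f"{lang}-{task}: {word} [{';'.join(features)}]"
    return f"{lang}-{task}: {word}"


print(build_prompt("swh", "segment", "ninasoma"))
print(build_prompt("swh", "complete", "soma", ["V", "PRS", "1", "SG"]))
```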
|
|
| ## Evaluation |
|
|
| Evaluated on a held-out test set of 4,687 examples spanning all 16 languages and all 5 tasks (~290 examples per language on average, stratified by task). |
|
|
| ### Per-task accuracy |
|
|
| | Task | Accuracy | |
| |---|---| |
| | segment | **96.1%** | |
| | nounclass | 87.8% | |
| | lemmatize | 82.3% | |
| | complete | 60.7% | |
| | extract | 42.9% | |
| | **Overall** | **67.1%** | |
|
|
| ### Per-language accuracy (best to worst) |
|
|
| | Language | Accuracy | |
| |---|---| |
| | Shona | 94.4% | |
| | Chichewa | 89.6% | |
| | Luganda | 85.6% | |
| | Swahili | 83.2% | |
| | Kongo | 80.9% | |
| | (most other languages) | 60–80% | |
| | Northern Sotho | 44.7% | |
|
|
| For the full per-task × per-language breakdown, see the BantuMorph paper. |
|
|
| ### Notes on the evaluation |
|
|
| - Accuracy is exact match between the model output and the reference string. For segmentation specifically, ~45% of "errors" on common training vocabulary are valid alternative segmentations rather than incorrect ones; see the BantuMorph paper for the over-segmentation analysis. |
| - Languages with smaller training corpora (Northern Sotho, Xhosa, Kirundi, Kinyarwanda) tend to underperform languages with larger corpora. |
| - The hardest task is `extract` (42.9%): the feature space is large, and an exact match requires both the lemma and every feature tag to be correct. The easiest is `segment` (96.1%). |
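For readers who want to score their own predictions the same way, exact-match accuracy per task can be computed as sketched below. This is not the authors' evaluation script: the function name, the `(task, prediction, reference)` record layout, and the whitespace stripping are assumptions, and the records are toy data invented for illustration.

```python
from collections import defaultdict


def exact_match_accuracy(records):
    """records: iterable of (task, prediction, reference) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, pred, ref in records:
        total[task] += 1
        # Exact match: the whole output string must equal the reference.
        correct[task] += int(pred.strip() == ref.strip())
    per_task = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_task, overall


# Toy records, not real model output.
records = [
    ("segment", "ni-na-soma", "ni-na-soma"),
    ("segment", "ni-nasoma", "ni-na-soma"),
    ("extract", "soma V;PRS;1;SG", "soma V;PRS;1;SG"),
    ("extract", "soma V;PST;1;SG", "soma V;PRS;1;SG"),
    ("extract", "soma V;PRS;2;SG", "soma V;PRS;1;SG"),
]
per_task, overall = exact_match_accuracy(records)
print(per_task)
print(overall)
```

An overall score computed this way is weighted by the number of examples per task, so it need not equal the unweighted mean of the per-task accuracies.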
|
|
| ## Training data |
|
|
| BantuMorph v7 was trained on 80,765 morphological paradigms drawn from: |
|
|
| - **UniMorph** Bantu paradigm collections for the languages that have them |
| - **LLM-generated paradigm extensions** from related Bantu languages, validated by community linguists |
| - **Cross-lingual transfer paradigms** from high-resource Bantu languages (primarily Swahili, Zulu, and Luganda) |
|
|
| Data was split 85% train / 10% validation / 5% test, with care taken to ensure speaker-disjoint and lemma-disjoint splits where possible. |
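One common way to obtain a lemma-disjoint split is to shuffle the lemmas rather than the individual forms, then send every form of a lemma to the same partition. The sketch below illustrates that idea and is not the authors' actual procedure; the `lemma` field name, the function name, and the fixed seed are assumptions.

```python
import random


def lemma_disjoint_split(examples, ratios=(0.85, 0.10, 0.05), seed=0):
    """examples: list of dicts with a 'lemma' key. Returns (train, val, test)."""
    lemmas = sorted({ex["lemma"] for ex in examples})
    random.Random(seed).shuffle(lemmas)
    n_train = round(len(lemmas) * ratios[0])
    n_val = round(len(lemmas) * ratios[1])
    # Assign each lemma to exactly one partition; the remainder goes to test.
    bucket = {}
    for i, lemma in enumerate(lemmas):
        if i < n_train:
            bucket[lemma] = "train"
        elif i < n_train + n_val:
            bucket[lemma] = "val"
        else:
            bucket[lemma] = "test"
    splits = {"train": [], "val": [], "test": []}
    for ex in examples:
        splits[bucket[ex["lemma"]]].append(ex)
    return splits["train"], splits["val"], splits["test"]


# Toy data: 100 made-up lemmas with two forms each.
data = [{"lemma": f"lemma{i}", "form": f"form{i}_{j}"}
        for i in range(100) for j in range(2)]
train, val, test = lemma_disjoint_split(data)
print(len(train), len(val), len(test))
```

Note that the ratios here apply to lemmas, so the share of examples in each partition can drift from 85/10/5 when paradigm sizes vary.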
|
|
| ## Limitations |
|
|
| - **Not a substitute for native-speaker validation.** The model is a useful starting point for morphological annotation, but generated outputs should be reviewed by speakers or linguists for any high-stakes use. |
| - **Accuracy varies sharply by language.** The 16 languages have very different amounts of training data; overall accuracy ranges from 94.4% (Shona) to 44.7% (Northern Sotho). |
| - **Out-of-distribution loanwords.** The model can over-apply Bantu morphological templates to loanwords from English, Arabic, French, or Portuguese. Filtering loanwords is an open problem; see the related `mutisya/bantu-words-26-03-v3.5` dataset for one approach. |
| - **No tone marking.** The model treats text at the byte level and does not explicitly encode lexical tone. For tonal languages like Luganda, tonal distinctions are missing from both input and output. |
| - **Limited orthographic coverage.** Trained on standard Latin orthography for each language. Variant spellings (especially in less-standardized languages) may underperform. |
| - **Single-word inputs.** Each task expects a single word (for `complete`, a single lemma plus its feature bundle); running on multi-word phrases or full sentences will produce unreliable results. |
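The single-word restriction is easy to check before calling the model. The sketch below flags multi-word inputs and prompts that would exceed the 128-token limit used in the usage example, relying on the fact that a ByT5 tokenizer emits one token per UTF-8 byte plus an end-of-sequence token. The function name and messages are hypothetical.

```python
MAX_INPUT_TOKENS = 128  # matches max_length in the usage example


def check_prompt(prompt: str) -> list[str]:
    """Return a list of problems found in a prompt; empty means it looks fine."""
    problems = []
    _prefix, sep, payload = prompt.partition(": ")
    if not sep or not payload.strip():
        problems.append("missing 'lang-task: ' prefix or empty input")
        return problems
    # Ignore a trailing [FEATURES] bundle (complete task) when counting words.
    word_part = payload.split("[", 1)[0].strip()
    if len(word_part.split()) != 1:
        problems.append("expected exactly one word")
    # One token per UTF-8 byte, plus one end-of-sequence token.
    if len(prompt.encode("utf-8")) + 1 > MAX_INPUT_TOKENS:
        problems.append("prompt would be truncated")
    return problems


print(check_prompt("swh-segment: ninasoma"))
print(check_prompt("swh-segment: ninasoma kitabu"))
```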
|
|
| ## Intended use |
|
|
| BantuMorph is intended for: |
|
|
| - Computational linguistics research on Bantu languages |
| - Prototyping morphological analyzers for under-resourced Bantu languages via cross-lingual transfer |
| - Educational tools that need morphological breakdown (lemmatization, segmentation, noun class) |
| - Pre-processing for downstream NLP pipelines (information retrieval, search, named entity recognition) |
|
|
| It is not intended for: |
|
|
| - Production speech-to-text or translation systems on its own |
| - Definitive linguistic analysis without human review |
| - Sociolinguistic or dialect-specific analysis |
|
|
| ## Related work |
|
|
| - **Zero-shot morphological discovery** — applies BantuMorph to Giriama with only 91 labeled paradigms. [arxiv:2604.22723](https://arxiv.org/abs/2604.22723) |
| - **Neural recovery of historical lexical structure** — uses BantuMorph embeddings to recover Proto-Bantu cognate structure. [arxiv:2604.22730](https://arxiv.org/abs/2604.22730) |
|
|
| ## Citation |
|
|
| If you use BantuMorph in your work, please cite: |
|
|
| ```bibtex |
| @misc{mutisya2026bantumorph, |
| title = {BantuMorph: A Character-Level Transformer for Morphological Analysis Across 16 Bantu Languages}, |
| author = {Hillary Mutisya and John Mugane}, |
| year = {2026}, |
| note = {Forthcoming on arXiv. Model available at \url{https://huggingface.co/thiomi/bantumorph-v7}} |
| } |
| ``` |
|
|
| ## Model card authors |
|
|
| Hillary Mutisya, John Mugane |
|
|
| ## Contact |
|
|
| For issues, questions, or collaboration, please open an issue on the model repository or contact the authors directly. |
|
|