---
library_name: transformers
license: apache-2.0
base_model: google/byt5-small
language:
- sw
- zu
- lg
- ny
- sn
- kg
- ki
- kam
- suk
- mer
- ln
- nso
- xh
- nyf
- rn
- rw
tags:
- bantu
- morphology
- multilingual
- low-resource
- character-level
- byt5
- african-languages
- swahili
- zulu
- kikuyu
datasets:
- mutisya/bantu-words-26-03-v3.5
---

# BantuMorph v7

BantuMorph is a character-level transformer for morphological analysis across 16 Bantu languages. Given a word in any of the supported languages, it can extract the lemma and morphological features, segment the word into morphemes, predict the noun class, or generate inflected forms from a lemma plus features.

The model is trained on 80,765 morphological paradigms across the 16 languages and operates over byte-level input, which lets it handle the rich agglutinative morphology of Bantu languages without word-piece tokenization artifacts.

## Quick summary

| Property | Value |
|---|---|
| Architecture | ByT5-small (encoder-decoder, character-level) |
| Parameters | 300M |
| Languages | 16 Bantu languages |
| Tasks | 5 (extract, segment, lemmatize, nounclass, complete) |
| Base model | [`google/byt5-small`](https://huggingface.co/google/byt5-small) |
| License | Apache-2.0 |

## Languages

| Code | Language | Guthrie zone | Approx. speakers (M) |
|------|----------|--------------|----------------------|
| swh | Swahili | G42 | 200 |
| zul | Zulu | S42 | 12 |
| xho | Xhosa | S41 | 8 |
| sna | Shona | S10 | 9 |
| nso | N. Sotho | S32 | 4 |
| nya | Chichewa | N31 | 14 |
| kik | Kikuyu | E51 | 8 |
| kam | Kamba | E55 | 5 |
| mer | Kimeru | E54 | 4 |
| nyf | Giriama | E72b | 0.6 |
| kin | Kinyarwanda | J61 | 12 |
| run | Kirundi | JD62 | 9 |
| lug | Luganda | JE15 | 8 |
| kon | Kongo | H16 | 5 |
| lin | Lingala | C40 | 40 |
| suk | Kisukuma | F21 | 5 |

## What the model does

BantuMorph supports five morphological tasks, each invoked through a task prefix on the input.
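The prefix convention can be wrapped in a small helper. A minimal sketch follows; the function name `build_prompt` and its validation are illustrative additions for this card, not part of the model's API:

```python
# Sketch of BantuMorph's "<lang>-<task>: <word>" prompt format.
# The helper name and the error handling are illustrative, not model API.

TASKS = {"extract", "segment", "lemmatize", "nounclass", "complete"}

def build_prompt(lang: str, task: str, word: str) -> str:
    """Return a task prompt such as 'swh-segment: ninasoma'."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    return f"{lang}-{task}: {word}"
```

For example, `build_prompt("swh", "complete", "soma [V;PRS;1;SG]")` yields `'swh-complete: soma [V;PRS;1;SG]'`.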
### Task 1 — Extract (lemma + features)

Joint lemmatization and feature prediction.

```
Input:  swh-extract: ninasoma
Output: soma V;PRS;1;SG
```

### Task 2 — Segment

Morpheme boundary detection.

```
Input:  swh-segment: ninasoma
Output: ni-na-soma
```

### Task 3 — Lemmatize

Extract the citation form.

```
Input:  swh-lemmatize: ninasoma
Output: soma
```

### Task 4 — Noun class

Predict the Bantu noun class for a noun.

```
Input:  swh-nounclass: mtoto
Output: BANTU1
```

### Task 5 — Complete (inflection)

Generate an inflected form from a lemma and features.

```
Input:  swh-complete: soma [V;PRS;1;SG]
Output: ninasoma
```

## How to use

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

model_id = "thiomi/bantumorph-v7"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

def run(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt", padding=True,
                       truncation=True, max_length=128)
    outputs = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Examples
print(run("swh-extract: ninasoma"))            # 'soma V;PRS;1;SG'
print(run("swh-segment: ninasoma"))            # 'ni-na-soma'
print(run("swh-lemmatize: ninasoma"))          # 'soma'
print(run("swh-nounclass: mtoto"))             # 'BANTU1'
print(run("swh-complete: soma [V;PRS;1;SG]"))  # 'ninasoma'
```

The task prefix is the language's three-letter ISO 639-3 code followed by the task name, separated by a hyphen. The supported language codes are listed in the table above (e.g. `swh-`, `kik-`, `zul-`).

## Evaluation

Evaluated on a held-out test set of 4,687 examples spanning all 16 languages and all 5 tasks (~290 examples per language on average, stratified by task).
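All scores reported here are exact-match accuracy on the decoded model output. As a minimal sketch of the metric (the function below is illustrative, not the evaluation harness itself):

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference string."""
    if len(predictions) != len(references):
        raise ValueError("prediction/reference length mismatch")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```

Note that exact match gives no partial credit: a segmentation differing by a single boundary counts as fully wrong.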
### Per-task accuracy

| Task | Accuracy |
|---|---|
| segment | **96.1%** |
| nounclass | 87.8% |
| lemmatize | 82.3% |
| complete | 60.7% |
| extract | 42.9% |
| **Overall** | **67.1%** |

### Per-language accuracy (best to worst)

| Language | Accuracy |
|---|---|
| Shona | 94.4% |
| Chichewa | 89.6% |
| Luganda | 85.6% |
| Swahili | 83.2% |
| Kongo | 80.9% |
| (most other languages) | 60–80% |
| Northern Sotho | 44.7% |

For the full per-task × per-language breakdown, see the BantuMorph paper.

### Notes on the evaluation

- Accuracy is exact-match on the model output. For segmentation specifically, ~45% of "errors" on common training vocabulary are valid alternative segmentations rather than true mistakes — see the BantuMorph paper for the over-segmentation analysis.
- Languages with smaller training corpora (Northern Sotho, Xhosa, Kirundi, Kinyarwanda) tend to underperform those with larger corpora.
- The hardest task is `extract`, because of its large feature space; the easiest is `segment`.

## Training data

BantuMorph v7 was trained on 80,765 morphological paradigms drawn from:

- **UniMorph** Bantu paradigm collections for the languages that have them
- **LLM-generated paradigm extensions** from related Bantu languages, validated by community linguists
- **Cross-lingual transfer paradigms** from high-resource Bantu languages (primarily Swahili, Zulu, and Luganda)

Data was split 85% train / 10% validation / 5% test, with care taken to ensure speaker-disjoint and lemma-disjoint splits where possible.

## Limitations

- **Not a substitute for native-speaker validation.** The model is a useful starting point for morphological annotation, but generated outputs should be reviewed by speakers or linguists for any high-stakes use.
- **Accuracy varies sharply by language.** The 16 languages have very different amounts of training data; overall performance ranges from ~95% (Shona) to ~45% (Northern Sotho).
- **Out-of-distribution loanwords.** The model can over-apply Bantu morphological templates to loanwords from English, Arabic, French, or Portuguese. Filtering loanwords is an open problem; see the related v3.5 dataset for one approach.
- **No tone marking.** The model treats text at the byte level and does not explicitly encode lexical tone. For tonal languages like Luganda, tonal distinctions are missing from both input and output.
- **Limited orthographic coverage.** The model is trained on the standard Latin orthography of each language and may underperform on variant spellings, especially in less-standardized languages.
- **Single-word inputs.** Each task expects a single word; running it on multi-word phrases or full sentences will produce unreliable results.

## Intended use

BantuMorph is intended for:

- Computational linguistics research on Bantu languages
- Prototyping morphological analyzers for under-resourced Bantu languages via cross-lingual transfer
- Educational tools that need morphological breakdown (lemmatization, segmentation, noun class)
- Pre-processing for downstream NLP pipelines (information retrieval, search, named entity recognition)

It is not intended for:

- Production speech-to-text or translation systems on its own
- Definitive linguistic analysis without human review
- Sociolinguistic or dialect-specific analysis

## Related work

- **Zero-shot morphological discovery** — applies BantuMorph to Giriama with only 91 labeled paradigms. [arXiv:2604.22723](https://arxiv.org/abs/2604.22723)
- **Neural recovery of historical lexical structure** — uses BantuMorph embeddings to recover Proto-Bantu cognate structure. [arXiv:2604.22730](https://arxiv.org/abs/2604.22730)

## Citation

If you use BantuMorph in your work, please cite:

```bibtex
@misc{mutisya2026bantumorph,
  title  = {BantuMorph: A Character-Level Transformer for Morphological Analysis Across 16 Bantu Languages},
  author = {Hillary Mutisya and John Mugane},
  year   = {2026},
  note   = {Forthcoming on arXiv.
          Model available at \url{https://huggingface.co/thiomi/bantumorph-v7}}
}
```

## Model card authors

Hillary Mutisya, John Mugane

## Contact

For issues, questions, or collaboration, please open an issue on the model repository or contact the authors directly.