MzansiLM
Part of the SALLM collection (South African Language Models), a project for low-resource African language modeling.
MzansiLM is a 125M-parameter decoder-only language model trained from scratch on MzansiText, a multilingual corpus covering all eleven official South African languages.
Parameters: 125,008,384
Architecture: LlamaForCausalLM
Hidden size: 512
Intermediate size: 1536
Hidden layers: 30
Attention heads: 9
Key/value heads: 3
Context length: 2048
RoPE theta: 10000.0
RMSNorm epsilon: 1e-5
Tied embeddings: true
Attention implementation: flash_attention_2

MzansiLM uses a custom BPE tokenizer with a vocabulary size of 65,536.
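The stated parameter count follows from the configuration values above. A back-of-the-envelope check, assuming standard Llama weight shapes (tied input/output embeddings, grouped-query attention, and head_dim = hidden_size // num_attention_heads as in LlamaConfig's default):

```python
# Arithmetic sanity check of the parameter count from the config values above.
# Assumes Llama-style weights: tied embeddings, GQA, no attention/MLP biases.
vocab, hidden, inter, layers, heads, kv_heads = 65536, 512, 1536, 30, 9, 3
head_dim = hidden // heads  # 512 // 9 = 56

embed = vocab * hidden                        # tied embedding matrix, counted once
attn = (2 * hidden * heads * head_dim         # q_proj and o_proj
        + 2 * hidden * kv_heads * head_dim)   # k_proj and v_proj
mlp = 3 * hidden * inter                      # gate, up, and down projections
norms = 2 * hidden                            # two RMSNorms per layer
total = embed + layers * (attn + mlp + norms) + hidden  # plus final RMSNorm

print(total)  # 125008384
```

The count matches exactly, which confirms the embeddings are tied and that the non-divisible head configuration (9 heads over a hidden size of 512) uses a head dimension of 56.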
Special tokens: [BOS] = 0, [EOS] = 1, [PAD] = 2, [UNK] = 3
Normalization: NFD
Pre-tokenization: ByteLevel
Post-processing (single sequence): [BOS] $A [EOS]
Post-processing (sequence pair): [BOS] $A [EOS] [BOS] $B [EOS]

The model was trained on MzansiText and covers all eleven official South African languages:
af, en, nso, sot, ssw, tsn, tso, ven, xho, zul, nbl
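The tokenizer pipeline described above can be sketched with the `tokenizers` library. This is a hypothetical reconstruction of the pipeline shape, not the released tokenizer: the real vocabulary is trained on MzansiText, whereas here the BPE model is left empty so only the special tokens and processing steps are shown.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, processors

# Sketch of the described pipeline: NFD normalization, byte-level
# pre-tokenization, and [BOS]/[EOS] wrapping via template post-processing.
# The BPE vocabulary is empty here; the released tokenizer's 65,536 merges
# would come from training on MzansiText.
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.normalizer = normalizers.NFD()
tok.pre_tokenizer = pre_tokenizers.ByteLevel()

# Adding the special tokens in this order reproduces the card's ids 0-3.
tok.add_special_tokens(["[BOS]", "[EOS]", "[PAD]", "[UNK]"])
tok.post_processor = processors.TemplateProcessing(
    single="[BOS] $A [EOS]",
    pair="[BOS] $A [EOS] [BOS] $B [EOS]",
    special_tokens=[("[BOS]", tok.token_to_id("[BOS]")),
                    ("[EOS]", tok.token_to_id("[EOS]"))],
)
```

With this setup, `tok.token_to_id("[BOS]")` is 0 and `tok.token_to_id("[EOS]")` is 1, matching the id assignments listed above.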
MzansiLM is a research model for pretraining, fine-tuning, and evaluation on South African languages. It is intended as a reproducible baseline for language modeling and downstream task adaptation.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the released checkpoint from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("anrilombard/mzansilm-125m")
model = AutoModelForCausalLM.from_pretrained("anrilombard/mzansilm-125m")

# Generate a continuation for an isiXhosa greeting
inputs = tokenizer("Molo!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Please cite the paper:
@misc{lombard2026mzansitextmzansilmopencorpus,
title={MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages},
author={Anri Lombard and Simbarashe Mawere and Temi Aina and Ethan Wolff and Sbonelo Gumede and Elan Novick and Francois Meyer and Jan Buys},
year={2026},
eprint={2603.20732},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.20732},
}
Apache License 2.0