---
license: apache-2.0
---
# AfriBERT Kenya: Domain-Adapted Language Model
`Rogendo/afribert-kenya-adapted` is the result of continued pre-training of `castorini/afriberta_large` on a Kenyan-language corpus using Masked Language Modeling (MLM).
It is optimised for Kenyan text: formal Swahili, Nairobi Sheng, M-PESA financial language, CPIMS child-protection terminology, and the English-Swahili code-switching used in everyday Kenyan communication.
## What is Domain-Adaptive Pre-Training (DAPT)?
The base AfriBERTa model (`castorini/afriberta_large`) was trained on African newswire and Wikipedia. While it understands Swahili well, it has never seen:
- Sheng slang (`msee`, `poa`, `si poa`, `sawa kabisa`)
- M-PESA vocabulary (`Fuliza`, `Lipa na M-PESA`, `float`, `till number`)
- CPIMS child-protection terminology (`ustawi wa jamii`, `OVC`, `case worker`, `safe house`)
- Kenyan WhatsApp code-switching patterns
DAPT is an intermediate training step between the base pretrained model and task-specific fine-tuning. It continues MLM pre-training on domain text so the model builds a richer internal representation of these patterns before learning any downstream task.
## Training Data
The model was trained on four complementary sources totalling approximately 39 million tokens:
| # | Source | Type | Est. Tokens | Repeat | Purpose |
|---|---|---|---|---|---|
| 1 | Swahili Wikipedia (`wikimedia/wikipedia`, `20231101.sw`) | Encyclopedic prose | ~22M | ×1 | Foundational standard Swahili: proper nouns, formal syntax, factual text |
| 2 | MasakhaNEWS (`masakhane/masakhanews`, `swa`) | East African journalism | ~1M | ×3 | Formal East African reporting style; Kenyan political, economic, social vocabulary |
| 3 | Synthetic Sheng/code-switch corpus (`master_mlm_corpus.txt`) | Synthetic | ~1M | ×10 | Nairobi Sheng, M-PESA transactions, CPIMS case notes, English-Swahili switches |
| 4 | WhatsApp CPIMS chat (field worker exports) | Real conversational | ~30K | ×20 | Authentic CPIMS field worker language: the highest-value domain signal |

Note: CC-100 Swahili (`uonlp/CulturaX`) was available but disabled in the final run; sources 3 and 4 were repeated at high frequency so the model sees Kenyan domain text proportionally more often than generic Wikipedia.
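The repeat weighting above can be sketched as a simple corpus-mixing step. This is an illustration only: the repeat factors follow the table, but the file contents and the actual build script for this model are not published.

```python
# Minimal sketch of repeat-weighted corpus mixing (illustrative only;
# the real build script for this model is not published).
def build_mixed_corpus(sources):
    """sources: list of (lines, repeat_factor) pairs.

    Each source's lines are duplicated `repeat_factor` times so that
    high-value domain text is over-represented relative to Wikipedia.
    """
    mixed = []
    for lines, repeat in sources:
        mixed.extend(lines * repeat)
    return mixed

# Placeholder one-line "corpora" standing in for the real sources.
wikipedia = ["Akiolojia ni somo ..."]        # repeated x1
masakhanews = ["Serikali imetangaza ..."]    # repeated x3
synthetic = ["Tuma pesa kwa Fuliza ..."]     # repeated x10
whatsapp = ["Mtoto anahitaji msaada ..."]    # repeated x20

corpus = build_mixed_corpus(
    [(wikipedia, 1), (masakhanews, 3), (synthetic, 10), (whatsapp, 20)]
)
print(len(corpus))  # 34 lines: 1 + 3 + 10 + 20
```

In the real run each source has thousands to millions of lines, but the proportions work the same way: the ~30K-token WhatsApp source ends up contributing ~600K effective tokens.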
### Synthetic Corpus (Source 3)
`master_mlm_corpus.txt` is a hand-crafted synthetic corpus covering:
- M-PESA transactions: sending, receiving, Fuliza overdraft, Lipa na M-PESA, Buy Goods
- CPIMS case language: intake forms, referrals, OVC (Orphans and Vulnerable Children), safe-house placements, court orders
- Sheng vocabulary: Nairobi urban slang integrated into Swahili sentences
- English-Swahili code-switching: meeting minutes, office messages, WhatsApp style
### WhatsApp CPIMS Chat (Source 4)
Real WhatsApp export from a CPIMS field support group (`whatsappchat-Bungoma.txt`). Messages were filtered to remove media attachments and very short messages (<20 characters). This source was up-sampled ×20 because it contains the highest-quality real-world signal for the target domain despite its small size.
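The filtering step can be sketched in a few lines of Python. This is a hedged reconstruction, not the actual preprocessing code: the line-format regex and media placeholder are assumptions based on standard WhatsApp text exports.

```python
import re

# Sketch of the WhatsApp export cleaning step: strip metadata, drop media
# placeholders, drop very short messages (<20 characters). The timestamp
# format and "<Media omitted>" marker are assumptions based on standard
# WhatsApp text exports, not the actual script used for this model.
TIMESTAMP = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4},? \d{1,2}:\d{2}.*? - [^:]+: ")

def clean_whatsapp_export(lines, min_chars=20):
    kept = []
    for line in lines:
        msg = TIMESTAMP.sub("", line).strip()  # strip "date, time - sender: "
        if "<Media omitted>" in msg:           # media attachment placeholder
            continue
        if len(msg) < min_chars:               # drop very short messages
            continue
        kept.append(msg)
    return kept

raw = [
    "12/03/24, 09:15 - Jane: <Media omitted>",
    "12/03/24, 09:16 - Jane: sawa",
    "12/03/24, 09:17 - Jane: Mtoto huyu anahitaji msaada wa haraka leo",
]
print(clean_whatsapp_export(raw))
```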
## Training Configuration
| Parameter | Value |
|---|---|
| Base model | castorini/afriberta_large |
| Training objective | Masked Language Modeling (MLM) |
| Masking probability | 15% |
| Block / sequence length | 128 tokens |
| Batch size | 64 (NVIDIA A40, bf16) |
| Epochs | 3 |
| Learning rate | 1e-4 |
| Weight decay | 0.01 |
| Warmup | 6% of total steps |
| Hardware | NVIDIA A40 (48 GB VRAM) |
| Precision | bfloat16 |
| Training time | ~25.7 minutes |
| Eval split | 5% held-out |
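The 15% masking objective follows BERT's standard 80/10/10 rule; in a transformers training loop this is handled by `DataCollatorForLanguageModeling(mlm_probability=0.15)`. The pure-Python sketch below illustrates that rule only: the `MASK_ID` and `VOCAB_SIZE` constants are placeholder assumptions, not the actual tokenizer values.

```python
import random

MASK_ID = 4          # placeholder [MASK] token id, for illustration only
VOCAB_SIZE = 70_000  # placeholder vocabulary size, for illustration only

def mask_tokens(token_ids, mlm_prob=0.15, rng=random):
    """BERT-style MLM masking: select ~15% of positions; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged.
    Labels are -100 everywhere except the selected positions, so the
    loss only covers the tokens the model must reconstruct."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels.append(tok)               # predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = MASK_ID          # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # else 10%: keep the original token
        else:
            labels.append(-100)              # ignored by the loss
    return inputs, labels
```

Positions labelled −100 are skipped by the cross-entropy loss, so only the selected ~15% of tokens contribute to the MLM training signal.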
## Results
All evaluations were run on CPU, comparing `castorini/afriberta_large` (base) against `Rogendo/afribert-kenya-adapted` (adapted). Pseudo-perplexity is computed via sequential token masking: each token in the sentence is masked one at a time, and the model's log-probability for the correct token is accumulated.
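Once the per-token log-probabilities have been collected, pseudo-perplexity reduces to a simple aggregate. The helper below sketches only that arithmetic; gathering the log-probs requires one masked forward pass per token, which is not shown here.

```python
import math

def pseudo_perplexity(token_logprobs):
    """Pseudo-perplexity from per-token log-probabilities.

    Each entry is log P(token_i | sentence with token_i masked),
    read off the MLM head after masking one position at a time.
    PPPL = exp(-mean log-prob): lower means the model finds the
    sentence less surprising.
    """
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)

# If the model assigned probability 0.5 to every token, PPPL would be 2.
print(pseudo_perplexity([math.log(0.5)] * 6))  # approximately 2.0
```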
### MLM Perplexity (lower = better)
| Domain | Sentence (truncated) | Base PPL | Adapted PPL | Δ |
|---|---|---|---|---|
| M-PESA | "Tuma pesa kwa kutumia nambari ya simu..." | 2.4 | 2.7 | +0.3 ❌ |
| CPIMS child protection | "Mtoto aliripotiwa kwa ofisi ya ustawi wa jamii..." | 8.0 | 7.8 | −0.2 ✅ |
| Sheng / Nairobi urban | "Msee alikuwa poa sana, akanisaidia kupata kazi..." | 11.3 | 3.8 | −7.5 ✅ |
| East African news | "Serikali imetangaza mpango mpya wa kukuza uchumi..." | 3.3 | 3.1 | −0.2 ✅ |
| Standard Swahili | "Akiolojia ni somo linalohusu mabaki ya tamaduni..." | 6.8 | 4.0 | −2.8 ✅ |
| English-Swahili code-switch | "Tulifanya meeting jana na manager akasema project..." | 28.6 | 16.9 | −11.7 ✅ |
| Child welfare | "Watoto wengi wanakabiliwa na changamoto za elimu..." | 2.0 | 1.9 | −0.1 ✅ |
| Financial savings | "Ninahitaji kuweka akiba yangu salama kupitia akaunti..." | 5.1 | 6.7 | +1.6 ❌ |
| **Average** | | 8.4 | 5.9 | −2.6 (30.4% improvement) |
Final MLM training perplexity: 5.39 (3 epochs, evaluated on 5% held-out set)
The two sentences where the adapted model is marginally worse (M-PESA and financial savings) both contain very common, unambiguous Swahili that the base model already predicts near-perfectly. The largest gains are exactly where expected: Sheng (−66%) and English-Swahili code-switching (−41%).
### Masked Token Prediction
Top-5 predictions per test, comparing base vs adapted model:
[Standard Swahili – Wikipedia style]
Akiolojia ni somo linalohusu mabaki ya [tamaduni] za watu wa nyakati zilizopita.
| Rank | Base AfriBERT | Score | Adapted | Score |
|---|---|---|---|---|
| 1 | tabia | 0.175 | tabia | 0.363 |
| 2 | fikra | 0.159 | picha | 0.108 |
| 3 | kazi | 0.065 | jamii | 0.085 |
| 4 | jamii | 0.059 | roho | 0.046 |
| 5 | akili | 0.056 | kazi | 0.027 |
Both models agree on *tabia*; the adapted model is more confident (0.363 vs 0.175).
[East African news – formal]
Serikali imetangaza mpango mpya wa kukuza [uchumi] wa taifa kupitia biashara ya kimataifa.
| Rank | Base AfriBERT | Score | Adapted | Score |
|---|---|---|---|---|
| 1 | uchumi | 0.985 | uchumi | 0.973 |
| 2 | utalii | 0.005 | pato | 0.011 |
| 3 | pato | 0.002 | utalii | 0.007 |
Both models nail the correct answer with very high confidence: standard formal Swahili is well-represented in both.
[M-PESA domain – financial]
Tuma [pesa] kwa kutumia nambari ya simu kupitia huduma ya M-PESA.
| Rank | Base AfriBERT | Score | Adapted | Score |
|---|---|---|---|---|
| 1 | ujumbe (message) | 0.122 | simu | 0.210 |
| 2 | neno (word) | 0.092 | twe | 0.093 |
| 3 | malipo | 0.079 | pesa ✅ | 0.053 |
| 4 | simu | 0.070 | sana | 0.035 |
| 5 | nasi | 0.025 | pia | 0.027 |
The adapted model places *pesa* in its top 3; the base model puts *ujumbe* (message) first, showing it does not understand the M-PESA transaction context.
[CPIMS domain – child protection]
Mtoto aliripotiwa kwa ofisi ya [ustawi] wa jamii baada ya kudhulumiwa nyumbani.
| Rank | Base AfriBERT | Score | Adapted | Score |
|---|---|---|---|---|
| 1 | ustawi | 0.908 | ustawi | 0.943 |
| 2 | Ustawi | 0.025 | mkuu | 0.010 |
| 3 | mfuko | 0.010 | usalama | 0.009 |
Both models strongly predict *ustawi*: child-welfare language does appear in Wikipedia. The adapted model is slightly more confident.
[Sheng / code-switching – Nairobi urban]
Msee alikuwa poa sana, akanisaidia kupata [kazi] ya ofisi.
| Rank | Base AfriBERT | Score | Adapted | Score |
|---|---|---|---|---|
| 1 | huduma | 0.164 | pesa | 0.189 |
| 2 | majukumu | 0.057 | emergency | 0.185 |
| 3 | sehemu | 0.055 | huduma | 0.069 |
| 4 | mahitaji | 0.041 | elimu | 0.029 |
| 5 | kazi ✅ | 0.037 | kazi ✅ | 0.019 |
The base model puts *kazi* at rank 5 (3.7%). The adapted model's top predictions (*pesa*, *emergency*) reflect the CPIMS domain context: it has learned that *msee* in an urban/office setting relates to financial or emergency help.
[WhatsApp CPIMS – field worker message]
Mtoto huyu ana umri wa miaka kumi na mbili na anahitaji [msaada] wa haraka.
| Rank | Base AfriBERT | Score | Adapted | Score |
|---|---|---|---|---|
| 1 | msaada | 0.774 | msaada | 0.892 |
| 2 | upasuaji | 0.120 | usaidizi | 0.030 |
| 3 | ushauri | 0.042 | upasuaji | 0.022 |
Both models strongly predict *msaada* (help/assistance). The adapted model is significantly more confident (0.892 vs 0.774): it has seen this phrasing repeatedly in the WhatsApp CPIMS data.
[English-Swahili code-switch]
Tulifanya meeting jana na manager akasema [project] itakuwa ready wiki ijayo.
| Rank | Base AfriBERT | Score | Adapted | Score |
|---|---|---|---|---|
| 1 | timu (team/sports) | 0.146 | system | 0.334 |
| 2 | ligi (league) | 0.048 | team | 0.104 |
| 3 | klabu (club) | 0.043 | family | 0.041 |
| 4 | Arsenal ⚽ | 0.033 | process | 0.034 |
| 5 | kazi | 0.033 | salary | 0.022 |
The clearest demonstration of domain shift. The base model interprets *meeting* + *manager* as a football context (Arsenal, league, club). The adapted model correctly reads it as an office/work context: *system*, *team*, *process*, and *salary* are all semantically appropriate English loanwords.
## Downstream Use: CPIMS Multi-Task Classifier
This model was used as the base encoder for `Rogendo/cpims-nlp-intent-urgency`, a multi-task classifier trained on 1,465 CPIMS support messages to predict:
- Intent (63 classes): login issues, password reset, data entry, escaped children, arrests, referrals, etc.
- Urgency (3 classes): high / medium / low
Results after full fine-tuning on the adapted base:
| Task | F1 Score |
|---|---|
| Intent classification (63 classes) | 74.5% |
| Urgency classification | 84.8% |
Compared to the previous version trained on `distilbert-base-uncased` with 271 rows, intent F1 went from 46% to 74.5%.
## Use Cases & Practical Domains
This model is designed for any NLP task involving Kenyan language text. It provides a stronger starting point than a generic multilingual model wherever the input contains Swahili, Sheng, code-switching, or Kenyan institutional vocabulary.
### 1. Child Protection & Social Work (CPIMS)
The primary motivation for this model. Kenya's Child Protection Information Management System (CPIMS) generates a high volume of support requests, case notes, and field reports written by social workers, case managers, and NGO staff, often in a mix of English, Swahili, and Sheng.
Practical tasks:
| Task | Description | Example input |
|---|---|---|
| Help-desk intent classification | Route incoming support messages to the correct team or knowledge-base article | "Siwezi kuingia system, password yangu imekwisha" → PasswordReset |
| Urgency triage | Flag messages that need immediate human escalation (child at risk, abuse, missing child) | "Mtoto amekimbia safe house usiku huu" → urgent |
| Case note sentiment | Detect frustration or distress in field worker messages to trigger supervisor review | "Nimejaribu mara nyingi kupata msaada lakini hakuna anayejibu" → negative |
| Entity extraction (NER) | Extract names, locations, case IDs, and child ages from free-text case notes | "Amina, miaka 9, Kibera, Case ID CP-2024-0471" |
| Automated case routing | Predict which department or OVC program a case should be assigned to | Based on case note text |
### 2. Financial Services & M-PESA
M-PESA is Kenya's dominant mobile money platform used by over 30 million Kenyans. Customer support queries, fraud reports, and transaction disputes are frequently written in Swahili or code-switched language that generic models mishandle.
Practical tasks:
| Task | Description | Example input |
|---|---|---|
| Transaction dispute classification | Categorise dispute type: wrong number, reversal, Fuliza, till payment, paybill | "Nilituma pesa nambari mbaya, naomba reverse" |
| Fraud signal detection | Detect social-engineering scripts, phishing attempts, SIM-swap language | "Uko na nambari ya siri ya M-PESA? Niambie utatumia" |
| Customer sentiment analysis | Measure customer satisfaction from M-PESA helpline transcripts | Post-interaction classification |
| FAQ intent matching | Match a customer query to the nearest self-service FAQ answer | Semantic similarity over a FAQ corpus |
| Agent response quality scoring | Score whether a customer service agent's response was appropriate | Given query + response pairs |
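FAQ intent matching of the kind listed above is typically a nearest-neighbour search over sentence embeddings. The sketch below shows only the retrieval step, using cosine similarity over placeholder vectors; in practice each vector would come from pooling this encoder's hidden states for a query or FAQ entry.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def best_faq(query_vec, faq_vecs):
    """Index of the FAQ entry whose embedding is closest to the query."""
    scores = [cosine(query_vec, v) for v in faq_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

# Placeholder 3-d vectors standing in for pooled encoder outputs.
faqs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.2, 0.2, 1.0]]
query = [0.1, 0.9, 0.0]       # closest to FAQ entry 1
print(best_faq(query, faqs))  # 1
```

With real embeddings the FAQ vectors are computed once offline, and each incoming customer query is embedded and matched at serving time.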
### 3. Healthcare & Community Health Workers (CHWs)
Community Health Workers in Kenya file visit reports and referral notes, often verbally transcribed or typed on low-end phones in mixed Swahili/English.
Practical tasks:
| Task | Description | Example input |
|---|---|---|
| Symptom extraction | Extract reported symptoms from CHW visit notes | "Mtoto ana homa kali na kukohoa sana tangu jana" |
| Referral urgency classification | Triage referral notes: emergency, routine, follow-up | "Mama mjamzito ana maumivu makali, nahitaji ambulance sasa" → emergency |
| Facility routing | Predict whether a patient should go to dispensary, health centre, or county hospital | Based on symptom description |
| Health campaign text classification | Classify community feedback on health campaigns (vaccination, family planning) | SMS/WhatsApp response categorisation |
### 4. Education & EdTech
Kenya's education sector uses a blend of English instruction and Swahili explanation, especially in lower grades. Many EdTech platforms serving rural Kenya receive student questions in Sheng or code-switched text.
Practical tasks:
| Task | Description | Example input |
|---|---|---|
| Student question topic classification | Route a question to the right subject tutor or resource | "Sijui kusolve equation hii, pia sina calculator" |
| Learner frustration detection | Flag messages indicating confusion or disengagement | "Sielewi hata kidogo, imefail mara tatu" |
| Automatic feedback categorisation | Classify teacher or parent feedback on school platforms | SMS / app reviews |
| Readability scoring | Score educational content for appropriateness at different grade levels | Given a paragraph of Swahili text |
### 5. Government & Civic Services
Kenya's e-citizen platforms, county service desks, and public feedback systems receive queries and complaints in everyday Kenyan language.
Practical tasks:
| Task | Description | Example input |
|---|---|---|
| Service request classification | Route citizen petitions/complaints to the correct county department | "Barabara ya kwetu ina mashimo makubwa sana, lini mtarekebisha?" |
| Complaint sentiment & severity | Detect strongly negative or potentially viral citizen complaints | Social media monitoring |
| Language identification | Detect whether a message is Swahili, Sheng, English, or code-switched | Pre-routing in multi-language systems |
| Policy document Q&A | Answer questions grounded in Swahili government policy documents | Retrieval-augmented generation (RAG) with this encoder |
### 6. Media, Social Listening & Misinformation
Twitter/X, Facebook, and WhatsApp in Kenya carry a large volume of Kenyan Sheng and code-switched content that standard multilingual models struggle to classify.
Practical tasks:
| Task | Description | Example input |
|---|---|---|
| Hate speech / harmful content detection | Detect Sheng-coded hate speech or incitement that generic models miss | Election-period social media moderation |
| Rumour / misinformation flagging | Classify claims as verified, unverified, or disputed | WhatsApp forward monitoring |
| Topic classification | Assign news articles or social posts to categories (politics, economy, sports, health) | Media monitoring dashboards |
| Sentiment analysis | Measure public sentiment on policy announcements, brands, or events | Code-switched Twitter/X data |
## Fine-tuning Guide
This model can be fine-tuned with as few as 200–500 labelled examples per class for most classification tasks, because DAPT has already adapted the internal representations to the target domain.
### Recommended fine-tuning tasks by architecture
| Architecture | Suitable for | HuggingFace class |
|---|---|---|
| Sequence classification | Intent, sentiment, urgency, topic, routing | AutoModelForSequenceClassification |
| Token classification | NER (names, locations, case IDs, symptoms) | AutoModelForTokenClassification |
| Multi-task (shared encoder + multiple heads) | Intent + urgency simultaneously | Custom (see jenga_ai SDK) |
| Question answering | Policy/FAQ grounding | AutoModelForQuestionAnswering |
| Sentence similarity | Semantic search, FAQ matching | Add a pooling head + contrastive loss |
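The pooling head mentioned in the last row is usually attention-mask-aware mean pooling over the encoder's last hidden states. A minimal sketch on toy Python lists, so the shape logic is visible without loading the model (real code would do the same with torch tensors):

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors into one sentence vector, ignoring padding.

    hidden_states:  list of token vectors, shape (seq_len, dim)
    attention_mask: 1 for real tokens, 0 for padding positions
    """
    dim = len(hidden_states[0])
    summed = [0.0] * dim
    count = 0
    for vec, m in zip(hidden_states, attention_mask):
        if m:  # only real tokens contribute to the mean
            count += 1
            for j in range(dim):
                summed[j] += vec[j]
    return [s / count for s in summed]

# Two real tokens and one padding token: the pad vector is ignored.
states = [[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]]
mask = [1, 1, 0]
print(mean_pool(states, mask))  # [2.0, 3.0]
```

The resulting sentence vectors can then be trained with a contrastive loss (e.g. pairs of paraphrased FAQ questions pulled together, unrelated pairs pushed apart).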
### Minimum data guidelines
| Task complexity | Approx. labelled examples needed |
|---|---|
| Binary classification (2 classes) | 100–300 per class |
| Multi-class (5–15 classes) | 150–400 per class |
| Multi-class (15–63 classes) | 200–500 per class |
| NER (token-level) | 500–1,000 sentences with full annotation |
| Multi-task (2 heads) | Same as above per task head |

These estimates are based on domain-adapted models. A generic multilingual base model would need 3–5× more data to reach equivalent performance on Kenyan text.
### Fine-tuning with HuggingFace Trainer
```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

model_name = "Rogendo/afribert-kenya-adapted"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3  # e.g. urgency: low / medium / high
)

training_args = TrainingArguments(
    output_dir="my-kenya-classifier",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,  # standard fine-tuning LR
    warmup_ratio=0.1,
    eval_strategy="epoch",  # "evaluation_strategy" on transformers < 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    bf16=True,  # use bf16 on A100/A40/H100
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # your tokenized datasets
    eval_dataset=eval_dataset,
    processing_class=tokenizer,  # "tokenizer=" on transformers < 4.46
)
trainer.train()
```
### Fine-tuning with jenga_ai SDK (multi-task)
```yaml
# cpims_config.yaml
model:
  base_model: Rogendo/afribert-kenya-adapted
  max_seq_len: 128
tasks:
  - name: intent
    task_type: multi_class_classification
    num_labels: 63
    label_column: intent
  - name: urgency
    task_type: multi_class_classification
    num_labels: 3
    label_column: urgency
training:
  epochs: 5
  batch_size: 16
  learning_rate: 2.0e-5
  output_dir: results/cpims-v2
```

```bash
python -m jenga_ai train --config cpims_config.yaml
```
## Usage
### Single mask prediction
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("Rogendo/afribert-kenya-adapted")
model = AutoModelForMaskedLM.from_pretrained("Rogendo/afribert-kenya-adapted")
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Real Sheng sentence with a single mask
results = fill_mask(
    f"Oya, twendeni zetu, kuna {tokenizer.mask_token} flani ameniudhi. "
    f"Uyo msee aliiba doh zangu most."
)
for r in results:
    print(f"{r['token_str']:<20} {r['score']:.3f}")
```
### Multiple masks (one position at a time)
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("Rogendo/afribert-kenya-adapted")
model = AutoModelForMaskedLM.from_pretrained("Rogendo/afribert-kenya-adapted")
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Multiple mask tokens: the pipeline returns a list of lists, one per mask position
results = fill_mask(
    f"Oya, twendeni zetu, kuna {tokenizer.mask_token} flani ameniudhi. "
    f"Uyo msee ameniibia {tokenizer.mask_token} zangu mingi sana nikimpata "
    f"{tokenizer.mask_token} sana, hadi atawacha kunibeba ufala."
)
for mask_predictions in results:
    print("--- New Mask ---")
    for r in mask_predictions:
        print(f"{r['token_str']:<20} {r['score']:.3f}")
```
### As a base model for fine-tuning (jenga_ai SDK)
```yaml
# experiment_config.yaml
model:
  base_model: Rogendo/afribert-kenya-adapted
  max_seq_len: 128
tasks:
  - name: intent
    task_type: multi_class_classification
    num_labels: 63
  - name: urgency
    task_type: multi_class_classification
    num_labels: 3
```
## Limitations
- **Not suited to purely formal Standard Swahili tasks**: the up-sampling of Sheng and code-switched text slightly shifts the model away from pure encyclopedic Swahili. Use `castorini/afriberta_large` directly for tasks that involve only formal Swahili prose.
- **Sheng is not standardised**: spelling varies by writer; the model reflects the patterns in the training WhatsApp data, which may not generalise to all Sheng dialects (Mombasa Sheng differs from Nairobi Sheng).
- **Small WhatsApp corpus**: source 4 (real CPIMS field chat) is only ~30K tokens before repetition. Up-sampling compensates but does not replace volume.
- **Private model**: the model is currently private on the HuggingFace Hub. Access requires a token with read permission on the `Rogendo` organisation.
## Citation
If you use this model, please cite the base model:
```bibtex
@inproceedings{ogueji-etal-2021-small,
  title     = {Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages},
  author    = {Ogueji, Kelechi and Zhu, Yuxin and Lin, Jimmy},
  booktitle = {Proceedings of the 1st Workshop on Multilingual Representation Learning},
  year      = {2021},
}
```
## Author
**Rogendo**: built as part of the JengaAI CPIMS NLP pipeline for Kenyan child-protection support systems.