---
license: apache-2.0
---

AfriBERT Kenya – Domain-Adapted Language Model

Rogendo/afribert-kenya-adapted is the result of continued pre-training of castorini/afriberta_large on a Kenyan-language corpus using Masked Language Modeling (MLM).

It is optimised for Kenyan text: formal Swahili, Nairobi Sheng, M-PESA financial language, CPIMS child-protection terminology, and English-Swahili code-switching as used in everyday Kenyan communication.


What is Domain-Adaptive Pre-Training (DAPT)?

Standard AfriBERT was trained on African newswire and Wikipedia. While it understands Swahili well, it has never seen:

  • Sheng slang (msee, poa, si poa, sawa kabisa)
  • M-PESA vocabulary (Fuliza, Lipa na M-PESA, float, till number)
  • CPIMS child-protection terminology (ustawi wa jamii, OVC, case worker, safe house)
  • Kenyan WhatsApp code-switching patterns

DAPT is an intermediate training step between the base pretrained model and task-specific fine-tuning. It continues MLM pre-training on domain text so the model builds a richer internal representation of these patterns before learning any downstream task.
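The masking objective itself is simple to illustrate. The sketch below is pure Python for illustration only (actual training uses HuggingFace's `DataCollatorForLanguageModeling`); it applies the standard BERT-style rule: select ~15% of positions, replace 80% of those with [MASK], 10% with a random token, and leave 10% unchanged. The model is trained to recover the original token at every selected position.

```python
import random

MASK_TOKEN = "[MASK]"
SAMPLE_VOCAB = ["pesa", "simu", "mtoto", "kazi"]  # toy vocabulary for illustration

def mlm_mask(tokens, mask_prob=0.15, seed=0):
    """BERT-style MLM masking: labels[i] holds the original token at
    each selected position (the prediction target); None elsewhere."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = MASK_TOKEN                # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs[i] = rng.choice(SAMPLE_VOCAB)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels
```

During DAPT the loss is computed only at the selected positions, so the model keeps refining its contextual representations of domain text without needing any labels.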


Training Data

The model was trained on four complementary sources totalling approximately 39 million tokens:

| # | Source | Type | Est. Tokens | Repeat | Purpose |
|---|--------|------|-------------|--------|---------|
| 1 | Swahili Wikipedia (wikimedia/wikipedia, 20231101.sw) | Encyclopedic prose | ~22M | ×1 | Foundational standard Swahili: proper nouns, formal syntax, factual text |
| 2 | MasakhaNEWS (masakhane/masakhanews, swa) | East African journalism | ~1M | ×3 | Formal East African reporting style; Kenyan political, economic, social vocabulary |
| 3 | Synthetic Sheng/Code-switch corpus (master_mlm_corpus.txt) | Synthetic | ~1M | ×10 | Nairobi Sheng, M-PESA transactions, CPIMS case notes, English-Swahili switches |
| 4 | WhatsApp CPIMS chat (field worker exports) | Real conversational | ~30K | ×20 | Authentic CPIMS field worker language: the highest-value domain signal |

Note: CC-100 Swahili (uonlp/CulturaX) was available but disabled in the final run; sources 3 and 4 were repeated at high frequency so the model sees Kenyan domain text proportionally more than generic Wikipedia.
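In practice these repeat factors amount to duplicating each source before shuffling. A minimal sketch (the function name is hypothetical, not part of the training code):

```python
import random

def build_mlm_corpus(sources, seed=42):
    """Combine (lines, repeat_factor) pairs into one training corpus.
    Small in-domain sources (Sheng, WhatsApp CPIMS) are duplicated so
    the model sees them proportionally more often than Wikipedia."""
    corpus = []
    for lines, repeat in sources:
        corpus.extend(lines * repeat)      # duplicate the whole source
    random.Random(seed).shuffle(corpus)    # mix sources before blocking
    return corpus
```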

Synthetic Corpus (Source 3)

master_mlm_corpus.txt is a hand-crafted synthetic corpus covering:

  • M-PESA transactions – sending, receiving, Fuliza overdraft, Lipa na M-PESA, Buy Goods
  • CPIMS case language – intake forms, referrals, OVC (Orphans and Vulnerable Children), safe-house placements, court orders
  • Sheng vocabulary – Nairobi urban slang integrated into Swahili sentences
  • English-Swahili code-switching – meeting minutes, office messages, WhatsApp style

WhatsApp CPIMS Chat (Source 4)

Real WhatsApp export from a CPIMS field support group (whatsappchat-Bungoma.txt). Messages were filtered to remove media attachments and very short messages (<20 characters). This source was up-sampled ×20 because it contains the highest-quality real-world signal for the target domain despite its small size.
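A filtering step like the one described might look as follows. This is a sketch assuming the common "date, time - Sender: message" export line format; the actual preprocessing script is not published:

```python
def clean_whatsapp_export(lines, min_chars=20):
    """Filter a WhatsApp .txt export down to usable message text.
    Drops media placeholders and messages shorter than min_chars."""
    kept = []
    for line in lines:
        # Strip the "date, time - Sender:" prefix if present.
        msg = line.split(": ", 1)[-1].strip()
        if "<Media omitted>" in msg or len(msg) < min_chars:
            continue
        kept.append(msg)
    return kept
```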


Training Configuration

| Parameter | Value |
|---|---|
| Base model | castorini/afriberta_large |
| Training objective | Masked Language Modeling (MLM) |
| Masking probability | 15% |
| Block / sequence length | 128 tokens |
| Batch size | 64 (NVIDIA A40, bf16) |
| Epochs | 3 |
| Learning rate | 1e-4 |
| Weight decay | 0.01 |
| Warmup | 6% of total steps |
| Hardware | NVIDIA A40 (48 GB VRAM) |
| Precision | bfloat16 |
| Training time | ~25.7 minutes |
| Eval split | 5% held-out |
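The 128-token block length corresponds to the usual concatenate-and-split MLM preprocessing: the tokenised corpus is treated as one long stream and cut into fixed-length blocks. A sketch over raw token IDs (function name hypothetical):

```python
def chunk_into_blocks(token_ids, block_size=128):
    """Concatenate-and-split preprocessing for MLM: pack the token
    stream into fixed-length blocks, dropping the trailing remainder."""
    n = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, n, block_size)]
```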

Results

All evaluations were run on CPU, comparing castorini/afriberta_large (base) against Rogendo/afribert-kenya-adapted (adapted). Pseudo-perplexity is computed via sequential token masking: each token in the sentence is masked one at a time, and the model's log-probability for the correct token is accumulated.
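The pseudo-perplexity procedure can be written down independently of any particular model. Here `log_prob` is a placeholder for a masked-LM forward pass that scores position i with that token masked out:

```python
import math

def pseudo_perplexity(tokens, log_prob):
    """Mask one position at a time, accumulate log P(true token | rest),
    and exponentiate the negative mean. `log_prob(tokens, i)` stands in
    for a masked-LM forward pass scoring position i."""
    total = sum(log_prob(tokens, i) for i in range(len(tokens)))
    return math.exp(-total / len(tokens))
```

With a real checkpoint, `log_prob(tokens, i)` would replace `tokens[i]` with the mask token, run the model, and read the log-softmax value of the true token at position i.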

MLM Perplexity (lower = better)

| Domain | Sentence (truncated) | Base PPL | Adapted PPL | Δ |
|---|---|---|---|---|
| M-PESA | "Tuma pesa kwa kutumia nambari ya simu..." | 2.4 | 2.7 | +0.3 ↑ |
| CPIMS child protection | "Mtoto aliripotiwa kwa ofisi ya ustawi wa jamii..." | 8.0 | 7.8 | −0.2 ↓ |
| Sheng / Nairobi urban | "Msee alikuwa poa sana, akanisaidia kupata kazi..." | 11.3 | 3.8 | −7.5 ↓ |
| East African news | "Serikali imetangaza mpango mpya wa kukuza uchumi..." | 3.3 | 3.1 | −0.2 ↓ |
| Standard Swahili | "Akiolojia ni somo linalohusu mabaki ya tamaduni..." | 6.8 | 4.0 | −2.8 ↓ |
| English-Swahili code-switch | "Tulifanya meeting jana na manager akasema project..." | 28.6 | 16.9 | −11.7 ↓ |
| Child welfare | "Watoto wengi wanakabiliwa na changamoto za elimu..." | 2.0 | 1.9 | −0.1 ↓ |
| Financial savings | "Ninahitaji kuweka akiba yangu salama kupitia akaunti..." | 5.1 | 6.7 | +1.6 ↑ |
| **Average** | | 8.4 | 5.9 | −2.6 (30.4% improvement) |

Final MLM perplexity after 3 epochs: 5.39 (measured on the 5% held-out set)

The two sentences where the adapted model is marginally worse (M-PESA and financial savings) both contain very common, unambiguous Swahili that the base model already predicts near-perfectly. The largest gains are exactly where expected: Sheng (−66%) and English-Swahili code-switching (−41%).

Masked Token Prediction

Top-5 predictions per test, comparing base vs adapted model:

[Standard Swahili – Wikipedia style] Akiolojia ni somo linalohusu mabaki ya [tamaduni] za watu wa nyakati zilizopita.

| Rank | Base AfriBERT | Score | Adapted | Score |
|---|---|---|---|---|
| 1 | tabia | 0.175 | tabia | 0.363 |
| 2 | fikra | 0.159 | picha | 0.108 |
| 3 | kazi | 0.065 | jamii | 0.085 |
| 4 | jamii | 0.059 | roho | 0.046 |
| 5 | akili | 0.056 | kazi | 0.027 |

Both models agree on tabia; adapted is more confident (0.363 vs 0.175).


[East African news – formal] Serikali imetangaza mpango mpya wa kukuza [uchumi] wa taifa kupitia biashara ya kimataifa.

| Rank | Base AfriBERT | Score | Adapted | Score |
|---|---|---|---|---|
| 1 | uchumi | 0.985 | uchumi | 0.973 |
| 2 | utalii | 0.005 | pato | 0.011 |
| 3 | pato | 0.002 | utalii | 0.007 |

Both models nail the correct answer with very high confidence – standard formal Swahili is well represented in both.


[M-PESA domain – financial] Tuma [pesa] kwa kutumia nambari ya simu kupitia huduma ya M-PESA.

| Rank | Base AfriBERT | Score | Adapted | Score |
|---|---|---|---|---|
| 1 | ujumbe (message) | 0.122 | simu | 0.210 |
| 2 | neno (word) | 0.092 | twe | 0.093 |
| 3 | malipo | 0.079 | pesa ✓ | 0.053 |
| 4 | simu | 0.070 | sana | 0.035 |
| 5 | nasi | 0.025 | pia | 0.027 |

The adapted model places pesa in its top 3; the base model puts ujumbe (message) first – showing it does not understand the M-PESA transaction context.


[CPIMS domain – child protection] Mtoto aliripotiwa kwa ofisi ya [ustawi] wa jamii baada ya kudhulumiwa nyumbani.

| Rank | Base AfriBERT | Score | Adapted | Score |
|---|---|---|---|---|
| 1 | ustawi | 0.908 | ustawi | 0.943 |
| 2 | Ustawi | 0.025 | mkuu | 0.010 |
| 3 | mfuko | 0.010 | usalama | 0.009 |

Both models strongly predict ustawi – child-welfare language appears in Wikipedia. The adapted model is slightly more confident.


[Sheng / code-switching – Nairobi urban] Msee alikuwa poa sana, akanisaidia kupata [kazi] ya ofisi.

| Rank | Base AfriBERT | Score | Adapted | Score |
|---|---|---|---|---|
| 1 | huduma | 0.164 | pesa | 0.189 |
| 2 | majukumu | 0.057 | emergency | 0.185 |
| 3 | sehemu | 0.055 | huduma | 0.069 |
| 4 | mahitaji | 0.041 | elimu | 0.029 |
| 5 | kazi ✓ | 0.037 | kazi ✓ | 0.019 |

Both models rank kazi fifth (base 3.7%, adapted 1.9%), but the adapted model's top predictions (pesa, emergency) reflect the CPIMS domain context – it has learned that msee in an urban/office context relates to financial or emergency help.


[WhatsApp CPIMS – field worker message] Mtoto huyu ana umri wa miaka kumi na mbili na anahitaji [msaada] wa haraka.

| Rank | Base AfriBERT | Score | Adapted | Score |
|---|---|---|---|---|
| 1 | msaada | 0.774 | msaada | 0.892 |
| 2 | upasuaji | 0.120 | usaidizi | 0.030 |
| 3 | ushauri | 0.042 | upasuaji | 0.022 |

Both models strongly predict msaada (help/assistance). The adapted model is significantly more confident (0.892 vs 0.774) – it has seen this exact phrasing repeatedly in the WhatsApp CPIMS data.


[English-Swahili code-switch] Tulifanya meeting jana na manager akasema [project] itakuwa ready wiki ijayo.

| Rank | Base AfriBERT | Score | Adapted | Score |
|---|---|---|---|---|
| 1 | timu (team/sports) | 0.146 | system | 0.334 |
| 2 | ligi (league) | 0.048 | team | 0.104 |
| 3 | klabu (club) | 0.043 | family | 0.041 |
| 4 | Arsenal ⚽ | 0.033 | process | 0.034 |
| 5 | kazi | 0.033 | salary | 0.022 |

The clearest demonstration of domain shift. The base model interprets meeting + manager as a football context (Arsenal, league, club). The adapted model correctly reads it as an office/work context: system, team, process, and salary are all semantically appropriate English loanwords.


Downstream Use – CPIMS Multi-Task Classifier

This model was used as the base encoder for Rogendo/cpims-nlp-intent-urgency, a multi-task classifier trained on 1,465 CPIMS support messages to predict:

  • Intent (63 classes): login issues, password reset, data entry, escaped children, arrests, referrals, etc.
  • Urgency (3 classes): high / medium / low

Results after full fine-tuning on the adapted base:

| Task | F1 Score |
|---|---|
| Intent classification (63 classes) | 74.5% |
| Urgency classification | 84.8% |

Compared to the previous version trained on distilbert-base-uncased with 271 rows: Intent F1 went from 46% → 74.5%.


Use Cases & Practical Domains

This model is designed for any NLP task involving Kenyan language text. It provides a stronger starting point than a generic multilingual model wherever the input contains Swahili, Sheng, code-switching, or Kenyan institutional vocabulary.

1. Child Protection & Social Work (CPIMS)

The primary motivation for this model. Kenya's Child Protection Information Management System (CPIMS) generates a high volume of support requests, case notes, and field reports written by social workers, case managers, and NGO staff – often in a mix of English, Swahili, and Sheng.

Practical tasks:

| Task | Description | Example input |
|---|---|---|
| Help-desk intent classification | Route incoming support messages to the correct team or knowledge-base article | "Siwezi kuingia system, password yangu imekwisha" → PasswordReset |
| Urgency triage | Flag messages that need immediate human escalation (child at risk, abuse, missing child) | "Mtoto amekimbia safe house usiku huu" → urgent |
| Case note sentiment | Detect frustration or distress in field worker messages to trigger supervisor review | "Nimejaribu mara nyingi kupata msaada lakini hakuna anayejibu" → negative |
| Entity extraction (NER) | Extract names, locations, case IDs, and child ages from free-text case notes | "Amina, miaka 9, Kibera, Case ID CP-2024-0471" |
| Automated case routing | Predict which department or OVC program a case should be assigned to | Based on case note text |

2. Financial Services & M-PESA

M-PESA is Kenya's dominant mobile money platform used by over 30 million Kenyans. Customer support queries, fraud reports, and transaction disputes are frequently written in Swahili or code-switched language that generic models mishandle.

Practical tasks:

| Task | Description | Example input |
|---|---|---|
| Transaction dispute classification | Categorise dispute type: wrong number, reversal, Fuliza, till payment, paybill | "Nilituma pesa nambari mbaya, naomba reverse" |
| Fraud signal detection | Detect social-engineering scripts, phishing attempts, SIM-swap language | "Uko na nambari ya siri ya M-PESA? Niambie utatumia" |
| Customer sentiment analysis | Measure customer satisfaction from M-PESA helpline transcripts | Post-interaction classification |
| FAQ intent matching | Match a customer query to the nearest self-service FAQ answer | Semantic similarity over a FAQ corpus |
| Agent response quality scoring | Score whether a customer service agent's response was appropriate | Given query + response pairs |

3. Healthcare & Community Health Workers (CHWs)

Community Health Workers in Kenya file visit reports and referral notes, often verbally transcribed or typed on low-end phones in mixed Swahili/English.

Practical tasks:

| Task | Description | Example input |
|---|---|---|
| Symptom extraction | Extract reported symptoms from CHW visit notes | "Mtoto ana homa kali na kukohoa sana tangu jana" |
| Referral urgency classification | Triage referral notes: emergency, routine, follow-up | "Mama mjamzito ana maumivu makali, nahitaji ambulance sasa" → emergency |
| Facility routing | Predict whether a patient should go to a dispensary, health centre, or county hospital | Based on symptom description |
| Health campaign text classification | Classify community feedback on health campaigns (vaccination, family planning) | SMS/WhatsApp response categorisation |

4. Education & EdTech

Kenya's education sector uses a blend of English instruction and Swahili explanation, especially in lower grades. Many EdTech platforms serving rural Kenya receive student questions in Sheng or code-switched text.

Practical tasks:

| Task | Description | Example input |
|---|---|---|
| Student question topic classification | Route a question to the right subject tutor or resource | "Sijui kusolve equation hii, pia sina calculator" |
| Learner frustration detection | Flag messages indicating confusion or disengagement | "Sielewi hata kidogo, imefail mara tatu" |
| Automatic feedback categorisation | Classify teacher or parent feedback on school platforms | SMS / app reviews |
| Readability scoring | Score educational content for appropriateness at different grade levels | Given a paragraph of Swahili text |

5. Government & Civic Services

Kenya's e-citizen platforms, county service desks, and public feedback systems receive queries and complaints in everyday Kenyan language.

Practical tasks:

| Task | Description | Example input |
|---|---|---|
| Service request classification | Route citizen petitions/complaints to the correct county department | "Barabara ya kwetu ina mashimo makubwa sana, lini mtarekebisha?" |
| Complaint sentiment & severity | Detect strongly negative or potentially viral citizen complaints | Social media monitoring |
| Language identification | Detect whether a message is Swahili, Sheng, English, or code-switched | Pre-routing in multi-language systems |
| Policy document Q&A | Answer questions grounded in Swahili government policy documents | Retrieval-augmented generation (RAG) with this encoder |

6. Media, Social Listening & Misinformation

Twitter/X, Facebook, and WhatsApp in Kenya carry a large volume of Kenyan Sheng and code-switched content that standard multilingual models struggle to classify.

Practical tasks:

| Task | Description | Example input |
|---|---|---|
| Hate speech / harmful content detection | Detect Sheng-coded hate speech or incitement that generic models miss | Election-period social media moderation |
| Rumour / misinformation flagging | Classify claims as verified, unverified, or disputed | WhatsApp forward monitoring |
| Topic classification | Assign news articles or social posts to categories (politics, economy, sports, health) | Media monitoring dashboards |
| Sentiment analysis | Measure public sentiment on policy announcements, brands, or events | Code-switched Twitter/X data |

Fine-tuning Guide

This model can be fine-tuned with as few as 200–500 labelled examples per class for most classification tasks, because DAPT has already adapted the internal representations to the target domain.

Recommended fine-tuning tasks by architecture

| Architecture | Suitable for | HuggingFace class |
|---|---|---|
| Sequence classification | Intent, sentiment, urgency, topic, routing | AutoModelForSequenceClassification |
| Token classification | NER (names, locations, case IDs, symptoms) | AutoModelForTokenClassification |
| Multi-task (shared encoder + multiple heads) | Intent + urgency simultaneously | Custom (see jenga_ai SDK) |
| Question answering | Policy/FAQ grounding | AutoModelForQuestionAnswering |
| Sentence similarity | Semantic search, FAQ matching | Add a pooling head + contrastive loss |
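The pooling head for sentence similarity can be as simple as masked mean pooling over the encoder's token vectors. A pure-Python sketch (real code would operate on torch tensors):

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors into one sentence vector, skipping
    padding positions where attention_mask is 0."""
    dim = len(token_vectors[0])
    sums, count = [0.0] * dim, 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                sums[d] += vec[d]
    return [s / max(count, 1) for s in sums]
```

A contrastive loss (e.g. multiple-negatives ranking) over pairs of these pooled vectors then turns the encoder into a sentence-similarity model.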

Minimum data guidelines

| Task complexity | Approx. labelled examples needed |
|---|---|
| Binary classification (2 classes) | 100–300 per class |
| Multi-class (5–15 classes) | 150–400 per class |
| Multi-class (15–63 classes) | 200–500 per class |
| NER (token-level) | 500–1,000 sentences with full annotation |
| Multi-task (2 heads) | Same as above per task head |

These estimates are based on domain-adapted models. A generic multilingual base model would need 3–5× more data to reach equivalent performance on Kenyan text.

Fine-tuning with HuggingFace Trainer

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

model_name = "Rogendo/afribert-kenya-adapted"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3  # e.g. urgency: low / medium / high
)

training_args = TrainingArguments(
    output_dir          = "my-kenya-classifier",
    num_train_epochs    = 5,
    per_device_train_batch_size = 16,
    learning_rate       = 2e-5,       # standard fine-tuning LR
    warmup_ratio        = 0.1,
    eval_strategy       = "epoch",    # `evaluation_strategy` on older transformers
    save_strategy       = "epoch",
    load_best_model_at_end = True,
    bf16                = True,       # use bf16 on A100/A40/H100
)

trainer = Trainer(
    model            = model,
    args             = training_args,
    train_dataset    = train_dataset,   # your pre-tokenised datasets
    eval_dataset     = eval_dataset,
    processing_class = tokenizer,       # `tokenizer=` on transformers < 4.46
)
trainer.train()
```

Fine-tuning with jenga_ai SDK (multi-task)

```yaml
# cpims_config.yaml
model:
  base_model: Rogendo/afribert-kenya-adapted
  max_seq_len: 128

tasks:
  - name: intent
    task_type: multi_class_classification
    num_labels: 63
    label_column: intent

  - name: urgency
    task_type: multi_class_classification
    num_labels: 3
    label_column: urgency

training:
  epochs: 5
  batch_size: 16
  learning_rate: 2.0e-5
  output_dir: results/cpims-v2
```

```shell
python -m jenga_ai train --config cpims_config.yaml
```

Usage

Single mask prediction

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("Rogendo/afribert-kenya-adapted")
model = AutoModelForMaskedLM.from_pretrained("Rogendo/afribert-kenya-adapted")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Real Sheng sentence - single mask
results = fill_mask(f"Oya, twendeni zetu, kuna {tokenizer.mask_token} flani ameniudhi. Uyo msee aliiba doh zangu most.")
for r in results:
    print(f"{r['token_str']:<20} {r['score']:.3f}")
```

Multiple masks (one position at a time)

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("Rogendo/afribert-kenya-adapted")
model = AutoModelForMaskedLM.from_pretrained("Rogendo/afribert-kenya-adapted")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Multiple [MASK] tokens - pipeline returns a list of lists, one per mask position
results = fill_mask(
    f"Oya, twendeni zetu, kuna {tokenizer.mask_token} flani ameniudhi. "
    f"Uyo msee ameniibia {tokenizer.mask_token} zangu mingi sana nikimpata "
    f"{tokenizer.mask_token} sana, hadi atawacha kunibeba ufala."
)

for mask_predictions in results:
    print("--- New Mask ---")
    for r in mask_predictions:
        print(f"{r['token_str']:<20} {r['score']:.3f}")
```

As a base model for fine-tuning (jenga_ai SDK)

```yaml
# experiment_config.yaml
model:
  base_model: Rogendo/afribert-kenya-adapted
  max_seq_len: 128

tasks:
  - name: intent
    task_type: multi_class_classification
    num_labels: 63
  - name: urgency
    task_type: multi_class_classification
    num_labels: 3
```

Limitations

  • Not suitable for formal Standard Swahili tasks alone β€” the up-sampling of Sheng and code-switched text slightly shifts the model away from pure encyclopedic Swahili. Use castorini/afriberta_large directly for tasks that only involve formal Swahili prose.
  • Sheng is not standardised β€” spelling varies by writer; the model reflects the patterns in the training WhatsApp data which may not generalise to all Sheng dialects (Mombasa Sheng differs from Nairobi Sheng).
  • Small WhatsApp corpus β€” source 4 (real CPIMS field chat) is only ~30K tokens before repetition. Up-sampling compensates but does not replace volume.
  • Private model β€” the model is currently private on HuggingFace Hub. Access requires a token with read permission on the Rogendo organisation.

Citation

If you use this model, please cite the base model:

```bibtex
@inproceedings{ogueji-etal-2021-small,
  title     = {Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages},
  author    = {Ogueji, Kelechi and Zhu, Yuxin and Lin, Jimmy},
  booktitle = {Proceedings of the 1st Workshop on Multilingual Representation Learning},
  year      = {2021},
}
```

Author

Rogendo – built as part of the JengaAI CPIMS NLP pipeline for Kenyan child-protection support systems.
