# Igbo ASR: MMS-1B Adapter + KenLM (28.13% WER)
The highest-accuracy Igbo automatic speech recognition system published to date. It fine-tunes only 2.5M parameters (0.25%) of Meta's MMS-1B-all via adapter training, then combines the fine-tuned model with a 5-gram KenLM language model for beam-search rescoring.
## Key Results
| Configuration | WER | CER |
|---|---|---|
| Stock MMS-1B-all (greedy) | ~48% | ~16% |
| Stock MMS + KenLM | 38.49% | 13.00% |
| Fine-tuned adapter (greedy) | ~35% | ~9% |
| Fine-tuned adapter + KenLM | 28.13% | 7.26% |
| NaijaVoices (previous SOTA) | 40.28% | – |
A 12-point absolute WER improvement over the best published Igbo ASR result.
## Model Details
| Property | Value |
|---|---|
| Base model | facebook/mms-1b-all |
| Architecture | wav2vec2 + CTC with adapter layers |
| Trainable parameters | 2.5M (0.25% of 1B total) |
| Vocabulary | 153 characters (expanded from stock 89) |
| Training data | African Voices Igbo (105,317 clips, ~255 hours, 447 speakers) |
| Training time | ~8.5 hours on 1× H100 80GB |
| Training cost | ~$13 |
| Precision | bfloat16 |
| Critical hyperparameter | ctc_zero_infinity=True |
## KenLM Language Model
| Property | Value |
|---|---|
| Type | 5-gram KenLM |
| Training text | 70,033 deduplicated Igbo sentences |
| Vocabulary | 92,486 unique words |
| Beam search | width=100, α=0.5, β=1.5 |
| Contribution | −7 pp WER improvement |
## The CTC Stability Fix
Six consecutive training runs crashed with NaN loss before we identified the fix. The root cause: when expanding the vocabulary (89 → 153 chars), the randomly initialized classification head creates alignment paths with zero probability → infinite CTC loss → corrupted gradients.
The fix: `ctc_zero_infinity=True` in PyTorch's `CTCLoss`. HuggingFace defaults to `False`; Meta's fairseq always uses `True`.
| Run | Config | Result |
|---|---|---|
| v1-v3 | Various lr/precision | NaN at step ~900 |
| v4-v6 | NaN-safe wrappers | Band-aid failures |
| v7 | bf16 + ctc_zero_infinity=True | 28.13% WER |
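The failure mode, and why the flag saves the run, can be reproduced in isolation with PyTorch's functional CTC loss. This is a standalone sketch; the shapes are illustrative, not the actual training configuration:

```python
import torch
import torch.nn.functional as F

T, N, C = 10, 1, 153  # time steps, batch size, vocab size (illustrative)
log_probs = F.log_softmax(torch.randn(T, N, C), dim=-1)

# A target longer than the input has no valid CTC alignment path at all,
# mimicking the zero-probability paths a freshly initialized 153-class
# head can produce.
targets = torch.randint(1, C, (N, 20))
input_lengths = torch.full((N,), T)
target_lengths = torch.full((N,), 20)

# Default behaviour (zero_infinity=False): the loss is inf, and the
# backward pass then poisons every gradient with NaN.
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss)  # tensor(inf)

# zero_infinity=True clamps the infinite loss to zero, so the offending
# sample is effectively skipped instead of killing the run.
safe = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                  zero_infinity=True)
print(safe)  # tensor(0.)
```

This is why the NaN-safe wrappers of v4-v6 could only mask the symptom: by the time the loss is NaN, the gradients are already corrupted; the clamp has to happen inside the CTC loss itself.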
## Tone-Normalized WER
21.1% of evaluation references contain tone marks that CTC-based ASR cannot produce, inflating WER by ~2.1 pp. We recommend tone-normalized WER as the primary metric:
| Normalization | WER | Correction |
|---|---|---|
| Raw | 28.13% | – |
| Tone-normalized (primary) | ~26.0% | −2.1 pp |
| Fully-normalized | ~25.0% | −3.1 pp |
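A minimal sketch of tone normalization using only the standard library. It assumes the tone diacritics are the combining acute, grave, and macron marks, while the phonemic sub-dots of ị/ọ/ụ and the dot of ṅ must be kept; the example word is illustrative:

```python
import unicodedata

# Combining marks treated as tone: acute, grave, macron (assumption)
TONE_MARKS = {"\u0301", "\u0300", "\u0304"}

def strip_tones(text: str) -> str:
    """Remove tone diacritics, keeping phonemic dots (ị, ọ, ụ, ṅ)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)

print(strip_tones("ụ́lọ̀"))  # -> ụlọ  (tones removed, sub-dots kept)
```

Applying `strip_tones` to both reference and hypothesis before scoring yields the tone-normalized WER above.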
## Usage
```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from safetensors.torch import load_file
import torch

# Load processor and base model
processor = Wav2Vec2Processor.from_pretrained("path/to/asr_checkpoint")
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/mms-1b-all",
    vocab_size=processor.tokenizer.vocab_size,  # 153, NOT len(tokenizer)
    ignore_mismatched_sizes=True,
)

# Overlay adapter weights on the base checkpoint
adapter_weights = load_file("path/to/adapter.ibo.safetensors")
state = model.state_dict()
state.update(adapter_weights)
model.load_state_dict(state)
model.eval()

# Transcribe (audio_array: 16 kHz mono waveform as a numpy array)
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy decoding
pred_ids = torch.argmax(logits, dim=-1)
text = processor.batch_decode(pred_ids)[0]

# Or beam search with KenLM (recommended, −7 pp WER)
from pyctcdecode import build_ctcdecoder
# See GitHub repo for full beam search setup
```
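As a rough sketch of the KenLM decode, continuing from the variables in the snippet above (the label ordering, file path, and α/β values here are assumptions based on the settings reported earlier, not the repo's exact setup):

```python
# Requires pyctcdecode and kenlm; `processor` and `logits` come from the
# Usage snippet above.
from pyctcdecode import build_ctcdecoder

# Labels must match the model's output classes, in vocab-index order
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="path/to/igbo_5gram.bin",
    alpha=0.5,  # LM weight (from the table above)
    beta=1.5,   # word-insertion bonus
)

# pyctcdecode expects a (time, vocab) array for a single utterance
text = decoder.decode(logits[0].cpu().numpy(), beam_width=100)
```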
## API
A FastAPI server is available at `api/server.py`:
- `POST /transcribe`: upload audio → `{text, text_toned, duration_seconds, inference_time}`
- `POST /diacriticize`: plain text → text with restored Igbo diacritics
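A hypothetical client call against a local deployment (the host, port, and multipart field name are assumptions; check `api/server.py` for the actual interface):

```python
# Hypothetical client; host, port, and the "file" field are assumptions
import requests

with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/transcribe",
        files={"file": ("sample.wav", f, "audio/wav")},
    )
resp.raise_for_status()
print(resp.json()["text"])
```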
## Weights
Model weights are not hosted on this repository. See the GitHub repo for access instructions.
Checkpoint files:
- `adapter.ibo.safetensors` (9 MB): adapter weights (overlay on the MMS-1B-all base)
- `config.json`, `vocab.json`, `tokenizer_config.json`, `preprocessor_config.json`: processor configs
- `igbo_5gram.bin` (26 MB): KenLM 5-gram language model
## Hard-Won Lessons
- Do NOT use `target_lang="ibo"` when loading: it loads the stock adapter (89-char vocab), which conflicts with our trained adapter (153-char vocab)
- Use `vocab_size=processor.tokenizer.vocab_size` (153), NOT `len(processor.tokenizer)` (155)
- bf16 works fine for CTC: the infinity problem is mathematical, not one of numerical precision
- Full fine-tune adds almost nothing: unfreezing all 960M params yields only +0.67 pp (27.46% WER). The bottleneck is data diversity and LM quality, not model capacity.
## License
This model is released under CC-BY-NC-SA 4.0.
- BY: You must give appropriate credit
- NC: Non-commercial use only (upstream MMS-1B-all is CC-BY-NC 4.0)
- SA: Derivatives must use the same license
## Citation
```bibtex
@misc{chimezie2026igboasr,
  title={From 48\% to 28\%: Building Igbo ASR with Adapter Fine-Tuning, KenLM Rescoring, and a TTS Feedback Loop},
  author={Chimezie, Emmanuel},
  year={2026},
  url={https://github.com/chimezie90/igbotts}
}
```
## Author
Emmanuel Chimezie β Mexkoy Labs