# Igbo ASR: MMS-1B Adapter + KenLM (28.13% WER)
The highest-accuracy Igbo automatic speech recognition system published to date. It fine-tunes only 2.5M parameters (0.25%) of Meta's MMS-1B-all via adapter training, then combines the fine-tuned model with a 5-gram KenLM language model for beam-search rescoring.
## Key Results
| Configuration | WER | CER |
|---|---|---|
| Stock MMS-1B-all (greedy) | ~48% | ~16% |
| Stock MMS + KenLM | 38.49% | 13.00% |
| Fine-tuned adapter (greedy) | ~35% | ~9% |
| Fine-tuned adapter + KenLM | 28.13% | 7.26% |
| NaijaVoices (previous SOTA) | 40.28% | – |
A 12-point absolute WER improvement over the best published Igbo ASR result.
## Model Details
| Property | Value |
|---|---|
| Base model | facebook/mms-1b-all |
| Architecture | wav2vec2 + CTC with adapter layers |
| Trainable parameters | 2.5M (0.25% of 1B total) |
| Vocabulary | 153 characters (expanded from stock 89) |
| Training data | African Voices Igbo (105,317 clips, ~255 hours, 447 speakers) |
| Training time | ~8.5 hours on 1× H100 80GB |
| Training cost | ~$13 |
| Precision | bfloat16 |
| Critical hyperparameter | ctc_zero_infinity=True |
## KenLM Language Model
| Property | Value |
|---|---|
| Type | 5-gram KenLM |
| Training text | 70,033 deduplicated Igbo sentences |
| Vocabulary | 92,486 unique words |
| Beam search | width=100, α=0.5, β=1.5 |
| Contribution | −7 pp WER improvement |
## The CTC Stability Fix
Six consecutive training runs crashed with NaN loss before we identified the fix. The root cause: when expanding the vocabulary (89 → 153 chars), the randomly initialized classification head creates alignment paths with zero probability → infinite CTC loss → corrupted gradients.
The fix: `ctc_zero_infinity=True` in PyTorch's `CTCLoss`. HuggingFace defaults to `False`; Meta's fairseq always uses `True`.
| Run | Config | Result |
|---|---|---|
| v1-v3 | Various lr/precision | NaN at step ~900 |
| v4-v6 | NaN-safe wrappers | Band-aid failures |
| v7 | bf16 + ctc_zero_infinity=True | 28.13% WER |
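The failure mode, and why the flag saves the run, can be reproduced in isolation with PyTorch's functional CTC loss. This is a standalone sketch; the shapes are illustrative, not the actual training configuration:

```python
import torch
import torch.nn.functional as F

T, N, C = 10, 1, 153  # time steps, batch size, vocab size (illustrative)
log_probs = F.log_softmax(torch.randn(T, N, C), dim=-1)

# A target longer than the input has no valid CTC alignment path at all,
# mimicking the zero-probability paths a freshly initialized 153-class
# head can produce.
targets = torch.randint(1, C, (N, 20))
input_lengths = torch.full((N,), T)
target_lengths = torch.full((N,), 20)

# Default behaviour (zero_infinity=False): the loss is inf, and the
# backward pass then poisons every gradient with NaN.
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss)  # tensor(inf)

# zero_infinity=True clamps the infinite loss to zero, so the offending
# sample is effectively skipped instead of killing the run.
safe = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                  zero_infinity=True)
print(safe)  # tensor(0.)
```

This is why the NaN-safe wrappers of v4-v6 could only mask the symptom: by the time the loss is NaN, the gradients are already corrupted; the clamp has to happen inside the CTC loss itself.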
## Tone-Normalized WER
21.1% of evaluation references contain tone marks that CTC-based ASR cannot produce, inflating WER by ~2.1 pp. We recommend tone-normalized WER as the primary metric:
| Normalization | WER | Correction |
|---|---|---|
| Raw | 28.13% | – |
| Tone-normalized (primary) | ~26.0% | −2.1 pp |
| Fully-normalized | ~25.0% | −3.1 pp |
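A minimal sketch of tone normalization using only the standard library. It assumes the tone diacritics are the combining acute, grave, and macron marks, while the phonemic sub-dots of ị/ọ/ụ and the dot of ṅ must be kept; the example word is illustrative:

```python
import unicodedata

# Combining marks treated as tone: acute, grave, macron (assumption)
TONE_MARKS = {"\u0301", "\u0300", "\u0304"}

def strip_tones(text: str) -> str:
    """Remove tone diacritics, keeping phonemic dots (ị, ọ, ụ, ṅ)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)

print(strip_tones("ụ́lọ̀"))  # -> ụlọ  (tones removed, sub-dots kept)
```

Applying `strip_tones` to both reference and hypothesis before scoring yields the tone-normalized WER above.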
## Usage
```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from safetensors.torch import load_file
import torch

# Load processor and base model
processor = Wav2Vec2Processor.from_pretrained("path/to/asr_checkpoint")
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/mms-1b-all",
    vocab_size=processor.tokenizer.vocab_size,  # 153, NOT len(tokenizer)
    ignore_mismatched_sizes=True,
)

# Overlay adapter weights on the base checkpoint
adapter_weights = load_file("path/to/adapter.ibo.safetensors")
state = model.state_dict()
state.update(adapter_weights)
model.load_state_dict(state)
model.eval()

# Transcribe (audio_array: 16 kHz mono waveform as a numpy array)
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy decoding
pred_ids = torch.argmax(logits, dim=-1)
text = processor.batch_decode(pred_ids)[0]

# Or beam search with KenLM (recommended, −7 pp WER)
from pyctcdecode import build_ctcdecoder
# See GitHub repo for full beam search setup
```
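As a rough sketch of the KenLM decode, continuing from the variables in the snippet above (the label ordering, file path, and α/β values here are assumptions based on the settings reported earlier, not the repo's exact setup):

```python
# Requires pyctcdecode and kenlm; `processor` and `logits` come from the
# Usage snippet above.
from pyctcdecode import build_ctcdecoder

# Labels must match the model's output classes, in vocab-index order
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="path/to/igbo_5gram.bin",
    alpha=0.5,  # LM weight (from the table above)
    beta=1.5,   # word-insertion bonus
)

# pyctcdecode expects a (time, vocab) array for a single utterance
text = decoder.decode(logits[0].cpu().numpy(), beam_width=100)
```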
## API
A FastAPI server is available at `api/server.py`:
- `POST /transcribe`: upload audio → `{text, text_toned, duration_seconds, inference_time}`
- `POST /diacriticize`: plain text → text with restored Igbo diacritics
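A hypothetical client call against a local deployment (the host, port, and multipart field name are assumptions; check `api/server.py` for the actual interface):

```python
# Hypothetical client; host, port, and the "file" field are assumptions
import requests

with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/transcribe",
        files={"file": ("sample.wav", f, "audio/wav")},
    )
resp.raise_for_status()
print(resp.json()["text"])
```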
## Weights
Model weights are not hosted on this repository. See the GitHub repo for access instructions.
Checkpoint files:
- `adapter.ibo.safetensors` (9 MB): adapter weights (overlay on the MMS-1B-all base)
- `config.json`, `vocab.json`, `tokenizer_config.json`, `preprocessor_config.json`: processor configs
- `igbo_5gram.bin` (26 MB): KenLM 5-gram language model
## Hard-Won Lessons
- Do NOT use `target_lang="ibo"` when loading: it loads the stock adapter (89-char vocab), which conflicts with our trained adapter (153-char vocab)
- Use `vocab_size=processor.tokenizer.vocab_size` (153), NOT `len(processor.tokenizer)` (155)
- bf16 works fine for CTC: the infinity problem is mathematical, not one of numerical precision
- Full fine-tune adds almost nothing: unfreezing all 960M params yields only +0.67 pp (27.46% WER). The bottleneck is data diversity and LM quality, not model capacity.
## License
This model is released under CC-BY-NC-SA 4.0.
- BY: You must give appropriate credit
- NC: Non-commercial use only (upstream MMS-1B-all is CC-BY-NC 4.0)
- SA: Derivatives must use the same license
## Citation
```bibtex
@misc{chimezie2026igboasr,
  title={From 48\% to 28\%: Building Igbo ASR with Adapter Fine-Tuning, KenLM Rescoring, and a TTS Feedback Loop},
  author={Chimezie, Emmanuel},
  year={2026},
  url={https://github.com/chimezie90/igbotts}
}
```
## Author
Emmanuel Chimezie β Mexkoy Labs