Igbo ASR: MMS-1B Adapter + KenLM (28.13% WER)

The highest-accuracy Igbo automatic speech recognition system to date. It fine-tunes only 2.5M parameters (0.25%) of Meta's MMS-1B-all via adapter training and combines the result with a 5-gram KenLM language model for beam-search rescoring.

Paper: From 48% to 28%: Building Igbo ASR with Adapter Fine-Tuning, KenLM Rescoring, and a TTS Feedback Loop

Key Results

| Configuration | WER | CER |
|---|---|---|
| Stock MMS-1B-all (greedy) | ~48% | ~16% |
| Stock MMS + KenLM | 38.49% | 13.00% |
| Fine-tuned adapter (greedy) | ~35% | ~9% |
| Fine-tuned adapter + KenLM | 28.13% | 7.26% |
| NaijaVoices (previous SOTA) | 40.28% | — |

12-point improvement over the best published Igbo ASR result.

Model Details

| Property | Value |
|---|---|
| Base model | `facebook/mms-1b-all` |
| Architecture | wav2vec2 + CTC with adapter layers |
| Trainable parameters | 2.5M (0.25% of 1B total) |
| Vocabulary | 153 characters (expanded from stock 89) |
| Training data | African Voices Igbo (105,317 clips, ~255 hours, 447 speakers) |
| Training time | ~8.5 hours on 1× H100 80GB |
| Training cost | ~$13 |
| Precision | bfloat16 |
| Critical hyperparameter | `ctc_zero_infinity=True` |
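For reference, an adapter-only setup along these lines follows Hugging Face's MMS adapter fine-tuning recipe. This is a sketch, not the exact training script; the training hyperparameters are not reproduced here.

```python
# Sketch of adapter-only fine-tuning setup (HF MMS adapter recipe).
# Training arguments and data pipeline are omitted; this only shows
# which parameters end up trainable.
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/mms-1b-all",
    vocab_size=153,                  # expanded Igbo vocabulary
    ignore_mismatched_sizes=True,    # new CTC head for the new vocab
    ctc_zero_infinity=True,          # the stability fix described below
)

model.init_adapter_layers()          # fresh, randomly initialized adapter
model.freeze_base_model()            # freeze all ~1B base parameters
for param in model._get_adapters().values():
    param.requires_grad = True       # train only the ~2.5M adapter weights
```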

KenLM Language Model

| Property | Value |
|---|---|
| Type | 5-gram KenLM |
| Training text | 70,033 deduplicated Igbo sentences |
| Vocabulary | 92,486 unique words |
| Beam search | width=100, α=0.5, β=1.5 |
| Contribution | −7 pp WER improvement |
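A model like this can be built with KenLM's standard pipeline. A sketch — file names are assumptions, and `lmplz`/`build_binary` must be compiled from the KenLM sources:

```shell
# Train a 5-gram model on one sentence per line, then binarize it
# for fast loading at decode time. File names are placeholders.
lmplz -o 5 < igbo_sentences.txt > igbo_5gram.arpa
build_binary igbo_5gram.arpa igbo_5gram.bin
```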

The CTC Stability Fix

Six consecutive training runs crashed with NaN loss before we identified the fix. The root cause: after expanding the vocabulary (89 → 153 chars), the randomly initialized classification head assigns zero probability to some required alignment paths, so the CTC loss becomes infinite and the gradients are corrupted.

The fix: ctc_zero_infinity=True in PyTorch's CTCLoss. HuggingFace defaults to False; Meta's fairseq always uses True.

| Run | Config | Result |
|---|---|---|
| v1-v3 | Various lr/precision | NaN at step ~900 |
| v4-v6 | NaN-safe wrappers | Band-aid failures |
| v7 | bf16 + `ctc_zero_infinity=True` | 28.13% WER |
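The failure mode is easy to reproduce in isolation: give `torch.nn.CTCLoss` a target longer than the input allows, so no valid alignment exists. A minimal sketch:

```python
import torch
import torch.nn as nn

# A length-4 target with only 2 input frames has no valid CTC alignment,
# so the loss is infinite unless zero_infinity=True zeroes it (and its grads).
log_probs = torch.randn(2, 1, 5).log_softmax(-1)  # (time=2, batch=1, vocab=5)
targets = torch.tensor([[1, 2, 3, 4]])
input_lengths = torch.tensor([2])
target_lengths = torch.tensor([4])

inf_loss = nn.CTCLoss(zero_infinity=False)(log_probs, targets, input_lengths, target_lengths)
safe_loss = nn.CTCLoss(zero_infinity=True)(log_probs, targets, input_lengths, target_lengths)
print(inf_loss.item(), safe_loss.item())  # inf 0.0
```

With HuggingFace, the equivalent switch is passing `ctc_zero_infinity=True` to `Wav2Vec2ForCTC.from_pretrained` (or setting it on the model config).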

Tone-Normalized WER

21.1% of evaluation references contain tone marks that CTC-based ASR cannot produce, inflating WER by ~2.1 pp. We recommend tone-normalized WER as the primary metric:

| Normalization | WER | Correction |
|---|---|---|
| Raw | 28.13% | — |
| Tone-normalized (primary) | ~26.0% | −2.1 pp |
| Fully-normalized | ~25.0% | −3.1 pp |
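Tone normalization strips only the tonal diacritics from both reference and hypothesis before scoring. A minimal sketch (the exact normalization used in the paper may differ):

```python
import unicodedata

# Igbo high tone (acute), low tone (grave) and downstep (macron) are
# combining marks in NFD. Dotted letters like ị/ọ/ụ decompose to a
# dot-below mark (U+0323), which we deliberately keep.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}  # grave, acute, macron

def strip_tone_marks(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)

print(strip_tone_marks("àkwá"))  # -> "akwa"
```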

Usage

```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from safetensors.torch import load_file
import torch

# Load processor and base model
processor = Wav2Vec2Processor.from_pretrained("path/to/asr_checkpoint")
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/mms-1b-all",
    vocab_size=processor.tokenizer.vocab_size,  # 153, NOT len(tokenizer)
    ignore_mismatched_sizes=True,
)

# Overlay adapter weights
adapter_weights = load_file("path/to/adapter.ibo.safetensors")
state = model.state_dict()
state.update(adapter_weights)
model.load_state_dict(state)
model.eval()

# Transcribe (audio_array: 1-D float array sampled at 16 kHz)
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy decoding
pred_ids = torch.argmax(logits, dim=-1)
text = processor.batch_decode(pred_ids)[0]

# Or beam search with KenLM (recommended, −7pp WER)
from pyctcdecode import build_ctcdecoder

# Labels must follow the tokenizer's index order; mapping special tokens
# (blank/pad, word delimiter) to pyctcdecode's conventions may need adjustment.
labels = processor.tokenizer.convert_ids_to_tokens(
    list(range(processor.tokenizer.vocab_size))
)
decoder = build_ctcdecoder(
    labels, kenlm_model_path="path/to/igbo_5gram.bin", alpha=0.5, beta=1.5
)
text_lm = decoder.decode(logits[0].numpy(), beam_width=100)
# See GitHub repo for the full beam search setup
```

API

A FastAPI server is available at api/server.py:

- `POST /transcribe` — upload audio → `{text, text_toned, duration_seconds, inference_time}`
- `POST /diacriticize` — plain text → text with restored Igbo diacritics
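For example (host, port, and multipart field name are assumptions; check `api/server.py` for the actual parameter names):

```shell
# Placeholder host/port and field name -- adjust to your deployment
curl -X POST -F "audio=@sample.wav" http://localhost:8000/transcribe
```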

Weights

Model weights are not hosted on this repository. See the GitHub repo for access instructions.

Checkpoint files:

- `adapter.ibo.safetensors` (9 MB) — adapter weights (overlay on MMS-1B-all base)
- `config.json`, `vocab.json`, `tokenizer_config.json`, `preprocessor_config.json` — processor configs
- `igbo_5gram.bin` (26 MB) — KenLM 5-gram language model

Hard-Won Lessons

- Do NOT use `target_lang="ibo"` when loading — it loads the stock adapter (89-char vocab), which conflicts with our trained adapter (153-char vocab)
- Use `vocab_size=processor.tokenizer.vocab_size` (153), NOT `len(processor.tokenizer)` (155)
- bf16 works fine for CTC — the infinity problem is mathematical, not a matter of numerical precision
- Full fine-tuning adds almost nothing: unfreezing all 960M params yields only +0.67 pp (27.46% WER). The bottleneck is data diversity and LM quality, not model capacity.

License

This model is released under CC-BY-NC-SA 4.0.

- BY: You must give appropriate credit
- NC: Non-commercial use only (upstream MMS-1B-all is CC-BY-NC 4.0)
- SA: Derivatives must use the same license

Citation

```bibtex
@misc{chimezie2026igboasr,
  title={From 48\% to 28\%: Building Igbo ASR with Adapter Fine-Tuning, KenLM Rescoring, and a TTS Feedback Loop},
  author={Chimezie, Emmanuel},
  year={2026},
  url={https://github.com/chimezie90/igbotts}
}
```

Author

Emmanuel Chimezie β€” Mexkoy Labs
