Wolof KenLM Language Model

A 5-gram KenLM language model for Wolof, designed for ASR rescoring with Whisper.

Key Results

  • WER reduction: 5.01 percentage points absolute (about 9.5% relative)
  • Without LM: WER 52.78%, CER 35.23%
  • With LM: WER 47.77%, CER 32.11%
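
The absolute and relative figures follow directly from the two WER values listed above. As a quick check, the arithmetic can be sketched as below; the variable and function names are illustrative and not part of the repository.

```python
# Figures copied from the Key Results list.
wer_without_lm = 52.78
wer_with_lm = 47.77
cer_without_lm = 35.23
cer_with_lm = 32.11


def reduction(before, after):
    """Return (absolute, relative) reduction; relative is in percent of `before`."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return absolute, relative


wer_abs, wer_rel = reduction(wer_without_lm, wer_with_lm)
cer_abs, cer_rel = reduction(cer_without_lm, cer_with_lm)

print(f"WER: {wer_abs:.2f} points absolute, {wer_rel:.1f}% relative")
print(f"CER: {cer_abs:.2f} points absolute, {cer_rel:.1f}% relative")
```

The relative reduction is the absolute reduction divided by the baseline (no-LM) error rate.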

Repository Structure

  • corpus/ - Text corpora (wolof_corpus_v3.txt recommended)
  • models/ - KenLM .bin and .arpa files (wolof_lm_v3.bin recommended)
  • scripts/ - Python scripts for training and inference
  • results/ - Evaluation logs and metrics
  • books/ - Source literary texts

Model Versions

Version  Data                  Sentences  Words  File
V3       Books + translations  86,591     985k   wolof_lm_v3.bin

Usage

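import kenlm
# Load the binary model (adjust the path to where the .bin file is stored)
lm = kenlm.Model("wolof_lm_v3.bin")
# Total log10 probability of the sentence, with <s> and </s> included
score = lm.score("maa ngi ci jamm", bos=True, eos=True)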
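
The score returned by KenLM is a total log10 probability, so longer sentences receive lower scores. To compare sentences of different lengths, it can help to convert the score into a per-token perplexity. The sketch below is a minimal illustration: the helper name `perplexity_from_log10` is hypothetical, and the example score is a made-up value, not an output of wolof_lm_v3.bin.

```python
def perplexity_from_log10(log10_score, n_words, eos=True):
    """Convert a total log10 probability into a per-token perplexity.

    When eos=True the end-of-sentence token </s> is also scored,
    so it counts as one extra token in the normalization.
    """
    n_tokens = n_words + (1 if eos else 0)
    return 10.0 ** (-log10_score / n_tokens)


sentence = "maa ngi ci jamm"
example_score = -12.0  # hypothetical value, NOT produced by the real model

print(perplexity_from_log10(example_score, len(sentence.split())))
```

In practice, `example_score` would be replaced by the value of `lm.score(sentence, bos=True, eos=True)`. Lower perplexity means the model finds the sentence more plausible.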

Inference with Whisper

python whisper_with_lm.py \
    --audio audio.mp3 \
    --model_path whisper-wolof \
    --lm_path wolof_lm_v3.bin \
    --alpha 0.3
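
The internals of whisper_with_lm.py are not shown here, so the following is only a sketch of one common way an LM weight such as --alpha is used: each ASR hypothesis in an n-best list is re-ranked by its ASR log-probability plus alpha times its LM score. The function `rescore`, the stand-in scorer `toy_lm_score`, and all numbers are assumptions for illustration; the actual script may combine scores differently.

```python
from typing import Callable, List, Tuple


def rescore(
    nbest: List[Tuple[str, float]],
    lm_score: Callable[[str], float],
    alpha: float = 0.3,
) -> str:
    """Return the hypothesis that maximizes asr_logprob + alpha * lm_logprob."""
    best_text, _ = max(nbest, key=lambda hyp: hyp[1] + alpha * lm_score(hyp[0]))
    return best_text


def toy_lm_score(text: str) -> float:
    """Hypothetical stand-in for lm.score(text, bos=True, eos=True)."""
    known = {"maa ngi ci jamm": -8.0}
    return known.get(text, -20.0)


# Made-up n-best list: (hypothesis text, ASR log-probability).
nbest = [
    ("maa ngi si jam", -3.0),
    ("maa ngi ci jamm", -3.5),
]

print(rescore(nbest, toy_lm_score, alpha=0.3))
```

With alpha set to 0 the LM is ignored and the top ASR hypothesis wins; larger values give the language model more influence. Note that KenLM scores are in log10 while ASR scores are often natural logs, so a real implementation may need to rescale one of them before combining.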

Training Data

  • Bammeelu Kocc Barma (Boubacar Boris Diop)
  • Doomi Golo (Boubacar Boris Diop)
  • galsenai/centralized_wolof_french_translation_data

License

MIT
