Wolof KenLM Language Model
A 5-gram KenLM language model for Wolof, designed for ASR rescoring with Whisper.
Key Results
- WER improvement: 5.01 percentage points absolute (about 9.5% relative)
- Without LM: WER 52.78%, CER 35.23%
- With LM: WER 47.77%, CER 32.11%
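The absolute and relative figures can be rechecked from the reported error rates. The snippet below is a minimal sketch of that arithmetic; the helper name `error_reduction` is illustrative and not part of this repository.

```python
def error_reduction(baseline: float, with_lm: float) -> tuple[float, float]:
    """Return (absolute, relative) reduction of an error rate, both in percent.

    absolute is in percentage points; relative is a percentage of the baseline.
    """
    absolute = baseline - with_lm
    relative = absolute / baseline * 100.0
    return absolute, relative


# Error rates as reported in Key Results (in percent).
abs_wer, rel_wer = error_reduction(52.78, 47.77)
abs_cer, rel_cer = error_reduction(35.23, 32.11)

print(f"WER: {abs_wer:.2f} points absolute, {rel_wer:.1f}% relative")
print(f"CER: {abs_cer:.2f} points absolute, {rel_cer:.1f}% relative")
```

Because the inputs are already rounded to two decimals, the recomputed relative value can differ from the reported one in the last digit.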
Repository Structure
- corpus/ - Text corpora (wolof_corpus_v3.txt recommended)
- models/ - KenLM .bin and .arpa files (wolof_lm_v3.bin recommended)
- scripts/ - Python scripts for training and inference
- results/ - Evaluation logs and metrics
- books/ - Source literary texts
Model Versions
| Version | Data | Sentences | Words | File |
|---|---|---|---|---|
| V3 | Books + translations | 86,591 | 985k | wolof_lm_v3.bin |
Usage
```python
import kenlm
lm = kenlm.Model("wolof_lm_v3.bin")
score = lm.score("maa ngi ci jamm", bos=True, eos=True)
```
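`lm.score` returns the total log10 probability of the sentence, so longer sentences naturally get lower scores. To compare sentences of different lengths, the score can be turned into a per-word perplexity. The sketch below mirrors the convention of the `kenlm` Python module's `perplexity` method, which counts one extra token for the end-of-sentence marker. The function name and the example score of `-8.0` are placeholders for illustration, not output from this model.

```python
def perplexity_from_log10(log10_score: float, sentence: str) -> float:
    """Convert a sentence-level log10 probability into per-word perplexity.

    The token count is the number of whitespace-separated words plus one
    for the end-of-sentence marker </s>.
    """
    n_tokens = len(sentence.split()) + 1
    return 10.0 ** (-log10_score / n_tokens)


# Hypothetical score for illustration only; in practice, pass the value
# returned by lm.score(sentence, bos=True, eos=True).
example_score = -8.0
ppl = perplexity_from_log10(example_score, "maa ngi ci jamm")
print(f"perplexity: {ppl:.2f}")
```

With a loaded model, `lm.perplexity("maa ngi ci jamm")` should give the same quantity directly.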
Inference with Whisper
```shell
python whisper_with_lm.py \
  --audio audio.mp3 \
  --model_path whisper-wolof \
  --lm_path wolof_lm_v3.bin \
  --alpha 0.3
```
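The `--alpha` flag sets how much weight the language model gets relative to the acoustic model. The internals of `whisper_with_lm.py` are not shown here, so the following is only a sketch of one common approach, n-best rescoring, under the assumption that each hypothesis is ranked by `asr_logprob + alpha * lm_logprob`. The function name, the toy LM scores, and the toy ASR scores are all hypothetical; the actual script may combine scores differently, for example with a length bonus.

```python
import math
from typing import Callable, Sequence, Tuple


def rescore_nbest(
    hypotheses: Sequence[Tuple[str, float]],
    lm_score_log10: Callable[[str], float],
    alpha: float = 0.3,
) -> Tuple[str, float]:
    """Pick the best hypothesis by combined ASR and LM log-probability.

    hypotheses: (text, asr_logprob) pairs, asr_logprob in natural log.
    lm_score_log10: returns a log10 probability, as kenlm's score() does.
    """
    if not hypotheses:
        raise ValueError("hypotheses must not be empty")
    best_text, best_score = "", -math.inf
    for text, asr_logprob in hypotheses:
        # Convert the LM score from log10 to natural log so both terms
        # are on the same scale before weighting.
        lm_logprob = lm_score_log10(text) * math.log(10)
        combined = asr_logprob + alpha * lm_logprob
        if combined > best_score:
            best_text, best_score = text, combined
    return best_text, best_score


# Toy stand-in for kenlm (made-up scores); with the real model, use
# lambda s: lm.score(s, bos=True, eos=True) instead.
toy_lm = {"maa ngi ci jamm": -4.0, "ma ngi si jam": -9.0}
nbest = [("ma ngi si jam", -1.0), ("maa ngi ci jamm", -1.5)]

best, score = rescore_nbest(nbest, lambda s: toy_lm[s], alpha=0.3)
print(best)
```

In this toy case the acoustically preferred hypothesis loses once the LM term is added. With `alpha=0` the LM is ignored and the ranking falls back to the ASR score alone.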
Training Data
- Bammeelu Kocc Barma (Boubacar Boris Diop)
- Doomi Golo (Boubacar Boris Diop)
- galsenai/centralized_wolof_french_translation_data
License
MIT