sokho2/wolof-lm-kenlm

Wolof KenLM Language Model

A 5-gram KenLM language model for Wolof, designed for ASR rescoring with Whisper.

Key Results

WER improvement: 5.01 percentage points absolute (9.50% relative)
Without LM: WER 52.78%, CER 35.23%
With LM: WER 47.77%, CER 32.11%
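
The absolute and relative figures follow directly from the two WER values. A minimal sketch of that arithmetic, using the rounded percentages reported above; the helper name `wer_reduction` is made up for illustration and is not part of this repository. Because the inputs are already rounded to two decimals, the relative figure computed here may differ from the reported 9.50% in the last digit.

```python
from typing import Tuple


def wer_reduction(baseline_wer: float, rescored_wer: float) -> Tuple[float, float]:
    """Return (absolute reduction in percentage points, relative reduction in percent)."""
    absolute = baseline_wer - rescored_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


# WER without and with the language model, as reported in Key Results.
absolute, relative = wer_reduction(52.78, 47.77)
print(f"absolute: {absolute:.2f} points, relative: {relative:.2f}%")
```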

Repository Structure

corpus/ - Text corpora (wolof_corpus_v3.txt recommended)
models/ - KenLM .bin and .arpa files (wolof_lm_v3.bin recommended)
scripts/ - Python scripts for training and inference
results/ - Evaluation logs and metrics
books/ - Source literary texts

Model Versions

Version	Data	Sentences	Words	File
V3	Books + translations	86,591	985k	wolof_lm_v3.bin

Usage

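import kenlm

# Load the recommended binary model (stored under models/)
lm = kenlm.Model("wolof_lm_v3.bin")
# Total log10 probability of the sentence, including <s> and </s>
score = lm.score("maa ngi ci jamm", bos=True, eos=True)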
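
KenLM scores are log10 probabilities summed over the sentence, so longer sentences get lower scores. To compare sentences of different lengths, the score can be normalized into a per-token perplexity. The sketch below is plain Python and does not need the model file; the function name `perplexity_from_log10` and the example score are made up for illustration. The `+ 1` counts the end-of-sentence token, which is included when `eos=True`.

```python
def perplexity_from_log10(log10_score: float, num_words: int) -> float:
    """Convert a sentence-level log10 probability into per-token perplexity.

    num_words is the number of whitespace-separated words; one extra
    token is added for the end-of-sentence marker </s>.
    """
    if num_words < 0:
        raise ValueError("num_words must be non-negative")
    num_tokens = num_words + 1
    return 10.0 ** (-log10_score / num_tokens)


# Illustrative value only, not an actual score from wolof_lm_v3.bin.
sentence = "maa ngi ci jamm"
example_score = -12.0
ppl = perplexity_from_log10(example_score, len(sentence.split()))
print(f"perplexity: {ppl:.2f}")
```

With the real model loaded, `example_score` would be replaced by the value returned from `lm.score(sentence, bos=True, eos=True)`.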

Inference with Whisper

python whisper_with_lm.py \
    --audio audio.mp3 \
    --model_path whisper-wolof \
    --lm_path wolof_lm_v3.bin \
    --alpha 0.3
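
The internals of whisper_with_lm.py are not shown here. One common way to use an LM weight such as --alpha is n-best rescoring: each candidate transcript gets a combined score of its ASR log-probability plus alpha times its LM score, and the highest-scoring candidate wins. The sketch below assumes that scheme; `rescore_nbest`, `toy_lm_score`, and all numbers are hypothetical and are not taken from the script.

```python
from typing import Callable, List, Tuple


def rescore_nbest(
    hypotheses: List[Tuple[str, float]],
    lm_score: Callable[[str], float],
    alpha: float = 0.3,
) -> str:
    """Return the text maximizing asr_score + alpha * lm_score(text).

    hypotheses holds (text, asr_log_probability) pairs. ASR scores are
    usually natural-log and KenLM scores log10; the constant factor
    between the two bases is absorbed into alpha.
    """
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    best_text, _ = max(
        hypotheses,
        key=lambda pair: pair[1] + alpha * lm_score(pair[0]),
    )
    return best_text


# Stand-in LM so the example runs without the model file (made-up penalties).
# With the real model this would be:
#   lm_score = lambda s: lm.score(s, bos=True, eos=True)
def toy_lm_score(text: str) -> float:
    known = {"maa", "ngi", "ci", "jamm"}
    return sum(-1.0 if word in known else -5.0 for word in text.split())


# Two made-up candidates: the ASR slightly prefers the first one.
nbest = [("maa ngi ci jam", -3.0), ("maa ngi ci jamm", -3.5)]
best = rescore_nbest(nbest, toy_lm_score, alpha=0.3)
print(best)
```

Setting alpha to 0 disables the language model and returns the ASR's own top candidate; larger values give the LM more influence over the final choice.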

Training Data

Bammeelu Kocc Barma (Boubacar Boris Diop)
Doomi Golo (Boubacar Boris Diop)
galsenai/centralized_wolof_french_translation_data

License

MIT
