Wolof-Spelling-Corrector

A fine-tuned version of Oolel-Small-v0.1 trained to convert informal, social-media Wolof into standard orthography.

Wolof is primarily an oral language, and most speakers are not formally taught the written standard. As a result, text on platforms like YouTube, WhatsApp, or Facebook is usually spelled phonetically, is heavily influenced by French orthographic conventions, and frequently mixes Wolof with French or English. Raw social-media data is therefore essentially unusable for training or evaluating NLP systems without a normalization step.

This model converts informal Wolof text into formal standard orthography while leaving code-switched French and English segments untouched.

Usage

1. HuggingFace pipeline

from transformers import pipeline
import torch

pipe = pipeline(
    "text-generation",
    model="soynade-research/Oolel-Corrector",
    dtype=torch.bfloat16,
    device_map="auto"
)

# Named `text` rather than `input` to avoid shadowing the Python builtin.
text = "Normalize this text: Kou guem ni biss dina niew nga am dom bou goor toude ko seydina mouhamat rek lalal bouton j'aime"
messages = [{"role": "user", "content": text}]

result = pipe(
    messages,
    max_new_tokens=512,
)

print(result[0]["generated_text"][-1]["content"])
# Ku gëm ni bés dina ñëw nga am doom bu góor tudde ko Seydina Muhamed rekk laalal butoŋ j'aime.

2. With AutoModel for more control

You can use this approach with a system prompt when you want to control the model's behaviour more explicitly, for example to enforce a specific output format or to give language instructions.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "soynade-research/Oolel-Corrector"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

SYSTEM_PROMPT = (
    "Fix the orthography of this Wolof social media text. "
    "Apply standard rules: correct vowels with diacritics, restore geminates, fix French-influenced spellings. "
    "Keep mixed French/English as-is. Reply only with <CORRECTION>corrected text</CORRECTION>."
)

def correct(text, system_prompt=SYSTEM_PROMPT, max_new_tokens=512, temperature=0.1):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text}
    ]
    # Render the chat messages into the prompt string the model expects.
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # device_map="auto" picks the device, so follow the model instead of hard-coding "cuda".
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **inputs,  # passes input_ids together with the attention mask
        max_new_tokens=max_new_tokens,
        do_sample=True,  # temperature only has an effect when sampling is enabled
        temperature=temperature
    )
    # Drop the prompt tokens so that only the newly generated text is decoded.
    generated_ids = [
        output[len(inp):]
        for inp, output in zip(inputs.input_ids, generated_ids)
    ]
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

text = "man mom khawma sakh li loumou done. Niom gneup dem naniou dakar ngir vacances scolaires yi"
print(correct(text))
# <CORRECTION>Man moom xawma sax li lu mu doon. Ñoom ñépp dem nañu Dakar ngir vacances scolaires yi.</CORRECTION>
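
Because the system prompt above asks the model to wrap its answer in <CORRECTION> tags, downstream code usually wants only the text inside them. The following is a minimal sketch of one way to do that; the helper name extract_correction and its fallback behaviour are assumptions for illustration and are not part of the model or of transformers.

```python
import re

# Matches the wrapper requested by the system prompt; DOTALL lets the
# corrected text span several lines.
_CORRECTION_RE = re.compile(r"<CORRECTION>(.*?)</CORRECTION>", re.DOTALL)

def extract_correction(raw_output: str) -> str:
    """Return the text inside <CORRECTION> tags (hypothetical helper)."""
    match = _CORRECTION_RE.search(raw_output)
    if match:
        return match.group(1).strip()
    # Assumed fallback: if the tags are missing or the closing tag was cut
    # off by max_new_tokens, strip any stray tag and keep the rest.
    return re.sub(r"</?CORRECTION>", "", raw_output).strip()

sample = "<CORRECTION>Man moom xawma sax li lu mu doon.</CORRECTION>"
print(extract_correction(sample))
# Man moom xawma sax li lu mu doon.
```

With the correct function defined earlier, this would be called as extract_correction(correct(text)).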

Where you could use Oolel-Corrector:

  • Dataset creation and cleaning. Raw Wolof social-media corpora can be normalized at scale before being used for downstream training, annotation, or evaluation.
  • Preprocessing layer. Any pipeline that operates on Wolof text (sentiment analysis, topic classification, machine translation) will perform more consistently on standardized input. This model can serve as the preprocessing step.
  • Keyboard and writing tools. Integrated into a mobile or web interface, the model can suggest standardized spelling to users writing in Wolof, helping close the gap between informal and standard written usage.
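
For the dataset-cleaning and preprocessing cases, the corrector can be applied line by line over a corpus. The sketch below assumes a corrector callable that maps a string to a string (for example, the correct function shown earlier); normalize_corpus, the record keys, and the stand-in corrector are hypothetical names chosen for illustration. It skips empty lines and caches results so that repeated lines, which are common in social-media data, are sent to the model only once.

```python
from typing import Callable, Dict, Iterable, List

def normalize_corpus(
    lines: Iterable[str],
    correct_fn: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Normalize each non-empty line, reusing results for duplicate lines."""
    cache: Dict[str, str] = {}
    records: List[Dict[str, str]] = []
    for line in lines:
        raw = line.strip()
        if not raw:
            continue  # skip blank lines
        if raw not in cache:
            cache[raw] = correct_fn(raw)  # one model call per distinct line
        # Keep the raw text next to the normalized text for later inspection.
        records.append({"raw": raw, "normalized": cache[raw]})
    return records

# Stand-in corrector so the sketch runs without downloading the model;
# in practice, pass the real correction function instead.
def fake_correct(text: str) -> str:
    return text.capitalize()

for record in normalize_corpus(["man mom", "", "man mom"], fake_correct):
    print(record)
```

Keeping the raw line in each record makes it possible to audit the corrections afterwards, and the cache also guarantees that duplicate lines receive the same output even when generation uses sampling.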

Limitations

  • Correction only. This model normalizes orthography; it does not translate. To convert informal, spoken-style Wolof directly into French or English, pair this model with a translation model: normalize first, then translate.