Wolof-Spelling-Corrector
A fine-tuned version of Oolel-Small-v0.1 trained to convert informal, social-media Wolof into standard orthography.
Wolof is primarily an oral language and most speakers are not formally taught the written standard. Text on platforms like YouTube, WhatsApp, or Facebook is spelled phonetically, heavily influenced by French orthographic conventions, and frequently mixes Wolof with French or English. This means that raw social-media data is essentially unusable for training or evaluating NLP systems without a normalization step.
This model converts informal Wolof text into formal standard orthography while leaving code-switched French and English segments untouched.
Usage
1. HuggingFace pipeline
```python
from transformers import pipeline
import torch

pipe = pipeline(
    "text-generation",
    model="soynade-research/Oolel-Corrector",
    dtype=torch.bfloat16,
    device_map="auto"
)

# "text" avoids shadowing Python's built-in input()
text = "Normalize this text: Kou guem ni biss dina niew nga am dom bou goor toude ko seydina mouhamat rek lalal bouton j'aime"
messages = [{"role": "user", "content": text}]

result = pipe(
    messages,
    max_new_tokens=512,
)
print(result[0]["generated_text"][-1]["content"])
# Ku gëm ni bés dina ñëw nga am doom bu góor tudde ko Seydina Muhamed rekk laalal butoŋ j'aime.
```
2. With AutoModel for more control
You can use a system prompt when you want to control the model's behaviour more explicitly, for example to enforce specific output formatting or language instructions.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "soynade-research/Oolel-Corrector"
device = "cuda"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

SYSTEM_PROMPT = (
    "Fix the orthography of this Wolof social media text. "
    "Apply standard rules: correct vowels with diacritics, restore geminates, fix French-influenced spellings. "
    "Keep mixed French/English as-is. Reply only with <CORRECTION>corrected text</CORRECTION>."
)

def correct(text, system_prompt=SYSTEM_PROMPT, max_new_tokens=512, temperature=0.1):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text}
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    inputs = tokenizer([prompt], return_tensors="pt").to(device)
    generated_ids = model.generate(
        **inputs,  # passes attention_mask along with input_ids
        max_new_tokens=max_new_tokens,
        do_sample=True,  # temperature only takes effect when sampling is enabled
        temperature=temperature
    )
    # Keep only the newly generated tokens, dropping the prompt
    generated_ids = [
        output[len(inp):]
        for inp, output in zip(inputs.input_ids, generated_ids)
    ]
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

text = "man mom khawma sakh li loumou done. Niom gneup dem naniou dakar ngir vacances scolaires yi"
print(correct(text))
# <CORRECTION>Man moom xawma sax li lu mu doon. Ñoom ñépp dem nañu Dakar ngir vacances scolaires yi.</CORRECTION>
```
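Because the system prompt asks the model to wrap its answer in `<CORRECTION>` tags, downstream code will usually want to strip them. A minimal helper for that (a convenience sketch, not part of this release) could look like:

```python
import re

def extract_correction(model_output):
    """Pull the corrected text out of <CORRECTION>...</CORRECTION>;
    fall back to the raw output if the tags are missing."""
    match = re.search(r"<CORRECTION>(.*?)</CORRECTION>", model_output, re.DOTALL)
    return match.group(1).strip() if match else model_output.strip()

print(extract_correction(
    "<CORRECTION>Man moom xawma sax li lu mu doon.</CORRECTION>"
))
# Man moom xawma sax li lu mu doon.
```

The fallback matters in practice: if the model occasionally omits the tags, the pipeline still returns usable text instead of failing.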
Where could you use Oolel-Corrector?
- Dataset creation and cleaning. Raw Wolof social-media corpora can be normalized at scale before being used for downstream training, annotation, or evaluation.
- Preprocessing layer. Any pipeline that operates on Wolof text (sentiment analysis, topic classification, machine translation) will perform more consistently on standardized input. This model can serve as a preprocessing step.
- Keyboard and writing tools. Integrated into a mobile or web interface, the model can suggest standardized spelling to users writing in Wolof, helping close the gap between informal and standard written usage.
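For the dataset-cleaning use case, a thin chunking wrapper around a correction callable keeps large corpora manageable. This is only a sketch: the batch size is arbitrary, and the corrector is passed in as a plain callable (e.g. the `correct()` function above) so the code stays model-agnostic:

```python
def normalize_corpus(texts, corrector, batch_size=32):
    """Apply a correction callable over a corpus in fixed-size chunks,
    preserving the original order of the texts."""
    cleaned = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        cleaned.extend(corrector(t) for t in chunk)
    return cleaned

# Toy stand-in for the corrector; in practice pass correct() from above.
print(normalize_corpus(["kou guem", "niom gneup"], str.upper, batch_size=1))
# ['KOU GUEM', 'NIOM GNEUP']
```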
Limitations
- Correction only. This model normalizes orthography; it does not translate. If you want to convert informal spoken-style Wolof directly into French or English, you need to pair this model with a translation model: normalize first, then translate.
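The normalize-first ordering described above can be expressed as a plain function composition. Here `normalize` and `translate` are placeholders for the corrector and whatever translation model you pair it with; the toy stand-ins only illustrate the ordering:

```python
def normalize_then_translate(text, normalize, translate):
    """Run orthographic normalization before translation, never the reverse:
    translation models expect standard spelling as input."""
    return translate(normalize(text))

# Toy stand-ins to show the ordering; real code would wrap model calls.
fake_normalize = lambda s: s.replace("kou", "ku")
fake_translate = lambda s: "[fr] " + s
print(normalize_then_translate("kou guem", fake_normalize, fake_translate))
# [fr] ku guem
```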