# SENTRA — SMS Fraud Detector (DistilBERT Multilingual)
SENTRA is a fine-tuned DistilBERT multilingual model for SMS fraud detection (smishing). It classifies SMS messages as either LEGITIMATE or FRAUD with high accuracy.
## Model Description
| Property | Value |
|---|---|
| Base model | distilbert-base-multilingual-cased |
| Task | Binary text classification (fraud vs. legitimate) |
| Languages | English, Portuguese, French (+ 101 others via multilingual base) |
| Parameters | 66M |
| Format | SafeTensors |
| License | MIT |
This model was developed as part of the SENTRA ML project to combat SMS fraud (smishing) in Benin and West Africa, where mobile money services are the primary digital payment method.
## Training

### Dataset
| Source | Language | Size |
|---|---|---|
| UCI SMS Spam Collection | English | ~5,600 SMS |
| MOZ-Smishing | Portuguese | ~1,000 SMS |
| Total (after dedup) | Multilingual | 7,663 SMS |
### NLP Preprocessing

Before training, SMS messages were cleaned with:

- SMS abbreviation expansion (~55 abbreviations in EN + FR): `ur` → `your`, `slt` → `salut`
- Currency normalization: `$5000`, `50000 FCFA` → `MONEY_AMOUNT`
- Repeated character normalization: `freeee` → `free`
- URL, email, and phone number removal
### Training Details
| Hyperparameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Epochs | 3 |
| Batch size | 16 |
| Max sequence length | 128 tokens |
| Loss function | Weighted cross-entropy (class weights: legit=0.589, fraud=3.299) |
| Early stopping patience | 2 epochs |
| Optimizer | AdamW |
The weighted loss addresses class imbalance (~85% legitimate, ~15% fraud), penalizing missed frauds 5.6× more than false positives.
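As a sketch, class weights of roughly this magnitude follow from "balanced" weighting (`n_total / (n_classes × n_class)`) applied to the class counts. The counts below are hypothetical, back-computed from the ~85/15 split; the card's exact reported weights are legit=0.589, fraud=3.299.

```python
# Hypothetical class counts for a ~85/15 split of 7,663 SMS;
# the card's exact weights (legit=0.589, fraud=3.299) imply a similar split.
n_legit, n_fraud = 6514, 1149
total = n_legit + n_fraud

# "Balanced" weighting: n_total / (n_classes * n_class)
w_legit = total / (2 * n_legit)   # ~0.59
w_fraud = total / (2 * n_fraud)   # ~3.33

# Relative penalty for missing a fraud vs. raising a false positive
print(round(w_fraud / w_legit, 1))  # ~5.7 with these counts; the card's exact weights give 5.6
```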
## Results

### DistilBERT (this model)
| Metric | Score |
|---|---|
| Accuracy | 99.1% |
| Precision | 95.8% |
| Recall | 98.3% |
| F1-Score | 97.0% |
### Improvement over baseline (no NLP preprocessing)
| Metric | Before | After | Gain (pts) |
|---|---|---|---|
| Recall | 95.7% | 98.3% | +2.6 |
| F1-Score | 96.9% | 97.0% | +0.1 |
Recall is the priority metric: missing a fraud (false negative) is far more dangerous than a false positive. Our 98.3% recall means only 1.7% of frauds slip through.
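Since recall is the priority, one deployment-time option (a sketch, not part of the published model; the threshold and helper below are illustrative) is to lower the FRAUD decision threshold below the default 0.5, trading some precision for fewer missed frauds:

```python
# Sketch: recall-oriented thresholding. THRESHOLD and classify() are
# illustrative, not part of the released model.
THRESHOLD = 0.35  # below 0.5: fewer missed frauds, more false positives

def classify(fraud_prob: float) -> str:
    """Map the model's FRAUD probability to a label."""
    return "FRAUD" if fraud_prob >= THRESHOLD else "LEGITIMATE"

print(classify(0.42))  # FRAUD  (would be LEGITIMATE at the default 0.5)
print(classify(0.10))  # LEGITIMATE
```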
## Usage

### Quick Start (Transformers)
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="VynoDePal/sentra-sms-fraud-detector",
    top_k=None,
)

# Fraudulent SMS
result = classifier("URGENT: Your account has been suspended. Call +229-12345678 NOW")
print(result)
# [[{'label': 'FRAUD', 'score': 0.92}, {'label': 'LEGITIMATE', 'score': 0.08}]]

# Legitimate SMS
result = classifier("Hey! Are we still on for lunch tomorrow at 2pm?")
print(result)
# [[{'label': 'LEGITIMATE', 'score': 0.97}, {'label': 'FRAUD', 'score': 0.03}]]
```
### Manual Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "VynoDePal/sentra-sms-fraud-detector"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Congratulations! You won 1,000,000 FCFA. Send your PIN to claim."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)

probabilities = torch.softmax(outputs.logits, dim=-1)
fraud_prob = probabilities[0][1].item()
print(f"Fraud probability: {fraud_prob:.1%}")
# Fraud probability: 87.3%
```
### With Preprocessing (recommended for best results)

For optimal performance, apply the same preprocessing used during training:
```python
import re

SMS_ABBREVIATIONS = {
    "u": "you", "ur": "your", "pls": "please", "plz": "please",
    "acc": "account", "acct": "account", "asap": "as soon as possible",
    "msg": "message", "txt": "text", "amt": "amount",
    "slt": "salut", "bjr": "bonjour", "stp": "s il te plait",
    "svp": "s il vous plait", "mrc": "merci", "cpte": "compte",
}

CURRENCY_PATTERN = re.compile(
    r'[£$€]\s?\d+[,.]?\d*|\d+[,.]?\d*\s?(?:usd|eur|gbp|fcfa|cfa|xof)',
    re.IGNORECASE,
)
REPEATED_CHARS = re.compile(r'(.)\1{2,}')

def preprocess_sms(text: str) -> str:
    text = text.lower()
    text = CURRENCY_PATTERN.sub("money amount", text)
    text = re.sub(r'http\S+|www\.\S+', '', text)
    words = text.split()
    words = [SMS_ABBREVIATIONS.get(w, w) for w in words]
    text = ' '.join(words)
    text = REPEATED_CHARS.sub(r'\1\1', text)
    return text.strip()

# Usage (reuses the `classifier` pipeline from Quick Start)
raw_sms = "URGENT!!! Ur acc has been SUSPENDED. Call NOW to claim $5000"
clean_sms = preprocess_sms(raw_sms)
result = classifier(clean_sms)
```
## Ensemble Model (SENTRA Production)

In the full SENTRA system, this DistilBERT model is combined with a Random Forest classifier using weighted voting for even more robust predictions:

```
Ensemble = 0.65 × DistilBERT + 0.35 × Random Forest
```

The Random Forest model and the full API are available in the SENTRA ML repository.
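The weighted vote above amounts to soft voting over the two models' FRAUD probabilities; a minimal sketch (the function name and input probabilities are illustrative, not from the SENTRA API):

```python
# Weighted soft voting, per the formula: 0.65 * DistilBERT + 0.35 * Random Forest
DISTILBERT_WEIGHT = 0.65
RF_WEIGHT = 0.35

def ensemble_fraud_prob(distilbert_prob: float, rf_prob: float) -> float:
    """Combine the two models' FRAUD probabilities with a weighted vote."""
    return DISTILBERT_WEIGHT * distilbert_prob + RF_WEIGHT * rf_prob

score = ensemble_fraud_prob(0.92, 0.80)
print(round(score, 3))  # 0.878
```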
## Labels

| Label | ID | Description |
|---|---|---|
| LEGITIMATE | 0 | Normal, safe SMS |
| FRAUD | 1 | Fraudulent / smishing SMS |
## Limitations
- Training data: Primarily English and Portuguese SMS. French is supported via multilingual base model but with fewer training examples.
- Regional focus: Optimized for West African smishing patterns (mobile money, FCFA, MTN/Moov).
- SMS length: Best for messages ≤128 tokens (~200 characters).
- No local languages: Does not cover Fon, Yoruba, or other Beninese languages.
## Citation

```bibtex
@misc{sentra2026,
  title={SENTRA: SMS Fraud Detection with Ensemble DistilBERT and Random Forest},
  author={SENTRA Team},
  year={2026},
  url={https://github.com/VynoDePal/sentra_ml}
}
```