SENTRA — SMS Fraud Detector (DistilBERT Multilingual)

SENTRA is a fine-tuned DistilBERT multilingual model for SMS fraud detection (smishing). It classifies SMS messages as either LEGITIMATE or FRAUD, with a self-reported accuracy of 99.1% and recall of 98.3%.

Model Description

Property	Value
Base model	`distilbert-base-multilingual-cased`
Task	Binary text classification (fraud vs. legitimate)
Languages	English, Portuguese, French (+ 101 others via multilingual base)
Parameters	~135M
Format	SafeTensors
License	MIT

This model was developed as part of the SENTRA ML project to combat SMS fraud (smishing) in Benin and West Africa, where mobile money services are the primary digital payment method.

Training

Dataset

Source	Language	Size
UCI SMS Spam Collection	English	~5,600 SMS
MOZ-Smishing	Portuguese	~1,000 SMS
Total (after dedup)	Multilingual	7,663 SMS
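
The table does not say how duplicates were removed when the sources were merged. One plausible approach, sketched here under that assumption, is to drop messages whose text is identical after lowercasing and whitespace normalization. The function `deduplicate_sms` and its matching rule are hypothetical and not taken from the SENTRA code.

```python
from typing import Iterable, List, Tuple

def deduplicate_sms(rows: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first occurrence of each message, comparing normalized text.

    `rows` yields (text, label) pairs. Two messages count as duplicates when
    they are equal after lowercasing and collapsing whitespace (assumed rule).
    """
    seen = set()
    unique = []
    for text, label in rows:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append((text, label))
    return unique

merged = [
    ("Free entry! Text WIN to 80086", "FRAUD"),
    ("free  entry! text win to 80086", "FRAUD"),  # same text, different case/spacing
    ("See you at 6?", "LEGITIMATE"),
]
print(len(deduplicate_sms(merged)))  # 2
```

Keeping the first occurrence preserves the original casing of the retained message, which matters because the base model is cased.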

NLP Preprocessing

Before training, SMS messages were cleaned with:

SMS abbreviation expansion (~55 abbreviations in EN + FR): ur → your, slt → salut
Currency normalization: $5000, 50000 FCFA → MONEY_AMOUNT
Repeated character normalization: freeee → free
URL, email, and phone number removal
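
The removal step in the last item could look roughly like the sketch below. The three regular expressions and the helper name `strip_contacts` are assumptions made for illustration; the patterns used in the actual training pipeline are not given here and may differ.

```python
import re

# Hypothetical patterns, not taken from the SENTRA code base.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# Assumed phone shape: optional '+', then 8 to 15 digits that may be
# separated by single spaces, dots, or hyphens.
PHONE_RE = re.compile(r"\+?\d(?:[\s.-]?\d){7,14}")

def strip_contacts(text: str) -> str:
    """Remove URLs, email addresses, and phone numbers, then tidy whitespace."""
    for pattern in (URL_RE, EMAIL_RE, PHONE_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())

print(strip_contacts("Call +229-12345678 or visit http://bit.ly/x now"))
# Call or visit now
```

If a step like this is combined with currency normalization, running the currency step first is the safer order, so that long amounts are replaced before the phone pattern can match their digits.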

Training Details

Hyperparameter	Value
Learning rate	2e-5
Epochs	3
Batch size	16
Max sequence length	128 tokens
Loss function	Weighted cross-entropy (class weights: legit=0.589, fraud=3.299)
Early stopping patience	2 epochs
Optimizer	AdamW

The weighted loss addresses class imbalance (~85% legitimate, ~15% fraud). An error on a fraud message is weighted 3.299 / 0.589 ≈ 5.6× as heavily as an error on a legitimate one, so missed frauds cost more than false positives.
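
The weight ratio and the class split can both be reproduced from the two published class weights. The sketch below assumes the weights follow the common "balanced" heuristic w_c = N / (K × n_c), the one scikit-learn uses for `class_weight="balanced"`. The README does not confirm this, and the helper names are hypothetical.

```python
import math

# Class weights as published in the hyperparameter table.
CLASS_WEIGHTS = {"LEGITIMATE": 0.589, "FRAUD": 3.299}

def implied_class_share(weight: float, n_classes: int = 2) -> float:
    """Invert w_c = N / (K * n_c) to recover the class share n_c / N."""
    return 1.0 / (n_classes * weight)

def weighted_nll(prob_true_class: float, label: str) -> float:
    """Weighted cross-entropy contribution of a single example."""
    return -CLASS_WEIGHTS[label] * math.log(prob_true_class)

ratio = CLASS_WEIGHTS["FRAUD"] / CLASS_WEIGHTS["LEGITIMATE"]
print(f"fraud/legit weight ratio: {ratio:.2f}")  # fraud/legit weight ratio: 5.60
for label, weight in CLASS_WEIGHTS.items():
    print(f"{label}: implied share {implied_class_share(weight):.1%}")
# LEGITIMATE: implied share 84.9%
# FRAUD: implied share 15.2%
```

Under that assumption the weights imply roughly an 85/15 split, matching the stated imbalance. For the same predicted probability, `weighted_nll` returns a loss about 5.6 times larger for a fraud example than for a legitimate one.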

Results

DistilBERT (this model)

Metric	Score
Accuracy	99.1%
Precision	95.8%
Recall	98.3%
F1-Score	97.0%

Improvement over baseline (no NLP preprocessing)

Metric	Before	After	Gain
Recall	95.7%	98.3%	+2.6 pp
F1-Score	96.9%	97.0%	+0.1 pp

Recall is the priority metric: a missed fraud (false negative) is far more costly than a false positive. At 98.3% recall, about 1.7% of fraudulent messages in the evaluation data go undetected.
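
The reported figures are mutually consistent, which can be checked with a few lines of arithmetic. This sketch only uses the precision and recall values from the results table; the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.958, 0.983
print(f"F1: {f1_score(precision, recall):.3f}")  # F1: 0.970
print(f"Miss rate: {1 - recall:.1%}")            # Miss rate: 1.7%
```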

Usage

Quick Start (Transformers)

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="VynoDePal/sentra-sms-fraud-detector",
    top_k=None,
)

# Fraudulent SMS
result = classifier("URGENT: Your account has been suspended. Call +229-12345678 NOW")
print(result)
# [[{'label': 'FRAUD', 'score': 0.92}, {'label': 'LEGITIMATE', 'score': 0.08}]]

# Legitimate SMS
result = classifier("Hey! Are we still on for lunch tomorrow at 2pm?")
print(result)
# [[{'label': 'LEGITIMATE', 'score': 0.97}, {'label': 'FRAUD', 'score': 0.03}]]

Manual Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "VynoDePal/sentra-sms-fraud-detector"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Congratulations! You won 1,000,000 FCFA. Send your PIN to claim."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    probabilities = torch.softmax(outputs.logits, dim=-1)
    fraud_prob = probabilities[0][1].item()  # index 1 = FRAUD (see Labels)

print(f"Fraud probability: {fraud_prob:.1%}")
# Fraud probability: 87.3%

With Preprocessing (recommended for best results)

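For best results, preprocess messages the same way the training data was preprocessed. The function below covers lowercasing, currency normalization, URL removal, abbreviation expansion (with a subset of the ~55 abbreviations), and repeated-character normalization; it does not remove email addresses or phone numbers: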

import re

SMS_ABBREVIATIONS = {
    "u": "you", "ur": "your", "pls": "please", "plz": "please",
    "acc": "account", "acct": "account", "asap": "as soon as possible",
    "msg": "message", "txt": "text", "amt": "amount",
    "slt": "salut", "bjr": "bonjour", "stp": "s il te plait",
    "svp": "s il vous plait", "mrc": "merci", "cpte": "compte",
}

CURRENCY_PATTERN = re.compile(
    r'[£$€]\s?\d+[,.]?\d*|\d+[,.]?\d*\s?(?:usd|eur|gbp|fcfa|cfa|xof)',
    re.IGNORECASE,
)
REPEATED_CHARS = re.compile(r'(.)\1{2,}')

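def preprocess_sms(text: str) -> str:
    """Normalize a raw SMS before classification."""
    text = text.lower()
    # Replace currency amounts such as "$5000" or "50000 fcfa" with a placeholder.
    text = CURRENCY_PATTERN.sub("money amount", text)
    # Drop URLs.
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Expand abbreviations (whole-word matches only, so "pls," is left as-is).
    words = text.split()
    words = [SMS_ABBREVIATIONS.get(w, w) for w in words]
    text = ' '.join(words)
    # Collapse runs of three or more identical characters to two ("freeee" -> "free").
    text = REPEATED_CHARS.sub(r'\1\1', text)
    return text.strip()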

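# Usage (`classifier` is the pipeline created in Quick Start)
raw_sms = "URGENT!!! Ur acc has been SUSPENDED. Call NOW to claim $5000"
clean_sms = preprocess_sms(raw_sms)
print(clean_sms)
# urgent!! your account has been suspended. call now to claim money amount
result = classifier(clean_sms)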

Ensemble Model (SENTRA Production)

In the full SENTRA system, this DistilBERT model is combined with a Random Forest classifier using weighted voting for even more robust predictions:

Ensemble = 0.65 × DistilBERT + 0.35 × Random Forest

The Random Forest model and the full API are available in the SENTRA ML repository (https://github.com/VynoDePal/sentra_ml).
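
The weighted vote can be sketched as below, assuming that both models output a fraud probability in [0, 1] and that the formula is applied to those probabilities. The function names and the 0.5 decision threshold are hypothetical; the production system may combine scores or choose its threshold differently.

```python
DISTILBERT_WEIGHT = 0.65
RANDOM_FOREST_WEIGHT = 0.35

def ensemble_fraud_probability(p_distilbert: float, p_random_forest: float) -> float:
    """Weighted average of the two models' fraud probabilities."""
    for p in (p_distilbert, p_random_forest):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return DISTILBERT_WEIGHT * p_distilbert + RANDOM_FOREST_WEIGHT * p_random_forest

def ensemble_label(p_distilbert: float, p_random_forest: float,
                   threshold: float = 0.5) -> str:
    """Map the combined probability to a label (threshold is an assumption)."""
    combined = ensemble_fraud_probability(p_distilbert, p_random_forest)
    return "FRAUD" if combined >= threshold else "LEGITIMATE"

print(ensemble_label(0.9, 0.4))  # FRAUD
print(ensemble_label(0.2, 0.8))  # LEGITIMATE
```

Because DistilBERT carries the larger weight, a confident Random Forest vote alone (0.8 in the second call) is not enough to flag a message when DistilBERT scores it low.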

Labels

Label	ID	Description
`LEGITIMATE`	0	Normal, safe SMS
`FRAUD`	1	Fraudulent / smishing SMS
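
When working with the pipeline output, it can be convenient to pull out the FRAUD score by label name rather than by position. The helper below is a hypothetical sketch. It assumes the output is a list of `{'label', 'score'}` dicts, possibly wrapped in an outer list as in the Quick Start example; the exact nesting may vary with how `top_k` is passed and with the library version.

```python
from typing import Any

# Label mapping from the table above.
ID2LABEL = {0: "LEGITIMATE", 1: "FRAUD"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

def fraud_score(pipeline_output: Any) -> float:
    """Return the FRAUD score from a text-classification result.

    Accepts either a flat list of dicts or the same list wrapped in one
    or more outer lists.
    """
    scores = pipeline_output
    while scores and isinstance(scores[0], list):
        scores = scores[0]  # unwrap one level of nesting
    for entry in scores:
        if entry["label"] == ID2LABEL[1]:
            return float(entry["score"])
    raise ValueError("no FRAUD entry in pipeline output")

nested = [[{"label": "FRAUD", "score": 0.92}, {"label": "LEGITIMATE", "score": 0.08}]]
print(fraud_score(nested))  # 0.92
```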

Limitations

Training data: Primarily English and Portuguese SMS. French is supported through the multilingual base model but has fewer training examples.
Regional focus: Optimized for West African smishing patterns (mobile money, FCFA, MTN/Moov).
SMS length: Best for messages ≤128 tokens (~200 characters).
No local languages: Does not cover Fon, Yoruba, or other Beninese languages.

Citation

@misc{sentra2026,
  title={SENTRA: SMS Fraud Detection with Ensemble DistilBERT and Random Forest},
  author={SENTRA Team},
  year={2026},
  url={https://github.com/VynoDePal/sentra_ml}
}
