# DeBERTa-v3 Smishing & Spam Detector (v0.2.1)
A fine-tuned DeBERTa-v3-base model for detecting smishing (SMS phishing) and spam messages. v0.2.1 is a ground-up revision of the architecture and training pipeline, driven by systematic error analysis of 624 misclassified messages from v0.1.
## Model Performance
| Metric | v0.1 | v0.2.1 (@ 0.50) | v0.2.1 (@ 0.67) |
|---|---|---|---|
| F1 Score | 0.9299 | 0.8781 | 0.9022 |
| Precision | 0.8986 | 0.8226 | 0.9058 |
| Recall | 0.9634 | 0.9417 | 0.8986 |
| AUC-ROC | 0.9906 | 0.9857 | 0.9857 |
| Best Epoch | 6 | 8 | 8 |
| Optimal Threshold | 0.6993 | n/a | 0.6700 |
### Confusion Matrix (test set @ optimised threshold)

| | Predicted Benign | Predicted Spam |
|---|---|---|
| True Benign | 32,136 | 612 |
| True Spam | 664 | 5,886 |
### What the numbers don't show

Headline F1 dipped (0.9022 vs 0.9299), but precision surpassed v0.1 (0.9058 vs 0.8986) and the FP rate dropped from 2.39% to 1.87%. More importantly, the quality of errors changed completely; see the Error Analysis Comparison below.
## What changed from v0.1
v0.1 achieved 0.93 F1 / 0.99 AUC-ROC but had two systematic problems revealed through error analysis of 624 misclassified test samples:
**False positives (341):** The model was confidently wrong on legitimate messages: 49 FPs had ≥0.99 spam probability. Bank transaction alerts, retail sale announcements, telecom notifications, and gaming discussions all triggered false alarms because they shared surface-level features (URLs, urgency words, phone numbers) with real spam.

**False negatives (283):** Short, featureless, or obfuscated spam slipped through. These messages had almost zero signal from the original 15 engineered features: only 1 of 283 FNs had a detected URL, zero had detected phone numbers, and the average urgency score was 0.117. Many were truncated, used unicode evasion ("Y ou've got mail: new messa ge"), or were conversational-style scams.
Every architectural and training change in v0.2.1 targets one of these specific failure modes.
## Architecture
```
Input Text
 └─► DeBERTa-v3-base encoder (gradient checkpointing enabled)
      ├─► [CLS] embedding (768d)
      └─► Attention-weighted pooling (768d)   ← NEW: learned attention over all tokens
                               │
23 Engineered Features        │
 └─► Linear(23→128) + LayerNorm + GELU ──► concat (1664d) ─► residual classifier ─► logits
```
### What's new in the architecture

**Dual pooling: [CLS] + attention-weighted pooling.** v0.1 relied solely on the [CLS] token embedding, which can miss signal in short messages where one suspicious word doesn't dominate the representation. v0.2.1 adds a learned attention pooling layer that computes a weighted sum across all token embeddings. The two representations are concatenated (2×768 = 1536d), giving the classifier both a global summary and a signal-focused view.
**8 new engineered features targeting FN blind spots.** The original 15 features (`char_count`, `has_url`, `urgency_score`, etc.) produced near-zero signal on false negatives. The 8 new features are designed to catch the specific evasion patterns those FNs used:

| Feature | Targets | Description |
|---|---|---|
| `unicode_ratio` | Unicode substitution ("Vérífy yøur àccount") | % of non-ASCII characters |
| `char_entropy` | Short/repetitive spam | Shannon entropy over character distribution |
| `suspicious_spacing` | Spaced-out evasion ("m e s s a g e") | Count of space-separated character sequences |
| `leet_ratio` | Character substitution (l33t speak) | % of characters that map to leet translations |
| `max_digit_run` | Phone numbers, OTPs, account numbers | Longest consecutive digit sequence |
| `repeated_char_ratio` | "!!!!" or "aaaaaa" patterns | Ratio of consecutive repeated characters |
| `vocab_richness` | Template spam (low diversity) | Unique words / total words |
| `has_obfuscated_url` | Broken URLs ("httpscluesjdko") | Regex detection of evasive URL patterns |
**Wider feature projection.** 23 features → 128d (was 15 → 64d), with LayerNorm and GELU activation.

**Residual classifier head.** The 1664d combined representation passes through a bottleneck (→256d) with a residual block, improving gradient flow through the deeper head.

**Multi-sample dropout.** During training, 3 stochastic forward passes through the dropout layer are averaged, acting as a cheap ensemble. This improves probability calibration: v0.1 had 49 FPs at ≥0.99 confidence; v0.2.1 has zero.
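In isolation, multi-sample dropout can be sketched as a small module. This is a hypothetical minimal version; the real v0.2.1 head also includes the residual bottleneck described above:

```python
import torch
import torch.nn as nn

class MultiSampleDropoutHead(nn.Module):
    """Averages logits over several independent dropout masks: a cheap
    ensemble that tends to soften over-confident predictions."""
    def __init__(self, in_dim, num_labels, p=0.1, n_samples=3):
        super().__init__()
        self.dropouts = nn.ModuleList(nn.Dropout(p) for _ in range(n_samples))
        self.fc = nn.Linear(in_dim, num_labels)

    def forward(self, x):
        # Each pass sees a different dropout mask during training;
        # at eval time dropout is a no-op, so all passes agree.
        return torch.stack([self.fc(d(x)) for d in self.dropouts]).mean(dim=0)
```

Because dropout is disabled at inference, the averaging costs nothing at serving time; it only regularises training.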
**Sequence length 256.** Doubled from 128 to catch longer MMS messages and delivery scam templates that were being truncated.
## Training changes
| Change | What | Why (error analysis) |
|---|---|---|
| Focal loss (γ=1) | Replaces CrossEntropyLoss | v0.1's FPs clustered at 0.85–1.0 confidence. Focal loss applies (1−p)^γ modulation that down-weights easy predictions, forcing the model to learn the hard boundary between legitimate promos and real spam. |
| Label smoothing (ε=0.05) | Soft targets (0.025/0.975) | Error analysis found mislabeled examples: phishing messages (SBI YONO, AnPost customs, NHS COVID, Apple Pay) incorrectly labeled as benign. Smoothing prevents the model from memorising noisy labels. |
| Cosine warm restarts | LR restarts every 2 epochs | Gives the model multiple chances to escape local minima during 8 epochs. |
| Threshold optimisation | Sweep 0.30–0.85 on val set | v0.1 used a static 0.6993 threshold. v0.2.1 finds the optimal F1 threshold each epoch. |
| Label audit | 7 high-confidence corrections | Phishing messages confirmed mislabeled as benign were corrected before training. |
| 5:1 undersampling | ~235k messages (was 3:1 / ~150k) | Retains more training data while keeping manageable class imbalance. |
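The loss in the table above can be sketched as follows. This is a minimal reimplementation matching the stated settings (γ=1, ε=0.05 → soft targets 0.025/0.975 for two classes); the actual training code's reduction and class-weighting details are not published here:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=1.0, smoothing=0.05):
    """Focal loss with label smoothing (hedged sketch, not the exact
    training implementation)."""
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Smoothed targets: (1 - eps) + eps/K on the true class, eps/K elsewhere
    soft = torch.full_like(log_probs, smoothing / num_classes)
    soft.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing + smoothing / num_classes)
    # (1 - p)^gamma down-weights easy, confident predictions
    return -(soft * (1.0 - probs) ** gamma * log_probs).sum(dim=-1).mean()
```

With γ=0 and ε=0 this reduces to standard cross-entropy, which makes the modulation easy to sanity-check.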
## Error Analysis Comparison
v0.2.1 was evaluated against the same error categories from v0.1's analysis. The overlap analysis tracks the exact same messages across both versions.
### Targeted failure modes (all fixed)
| Failure Mode | v0.1 | v0.2.1 | Status |
|---|---|---|---|
| High-confidence FPs (prob ≥ 0.99) | 49 | 0 | ✓ Eliminated |
| High-confidence FNs (prob < 0.10) | 68 | 0 | ✓ Eliminated |
| Conversational-style scam FNs | 61 | 0 | ✓ Eliminated |
| Obfuscated text FNs | 14 | 0 | ✓ Eliminated |
| Mislabeled phishing FPs | 3 | 1 | ✓ Nearly eliminated |
### Overlap analysis: how many v0.1 errors are actually fixed?

| | False Positives | False Negatives |
|---|---|---|
| v0.1 errors | 341 | 283 |
| Fixed in v0.2.1 | 330 (96.8%) | 240 (84.8%) |
| Still broken | 11 | 43 |
| New in v0.2.1 | 596 | 491 |
96.8% of v0.1's false positives and 84.8% of v0.1's false negatives are resolved. The remaining v0.2.1 errors are predominantly near-threshold cases (274 FNs scoring 0.40–0.67) and short/truncated messages (288 FNs under 40 characters): diffuse, hard cases rather than systematic failures.
### Calibration improvement
| Metric | v0.1 | v0.2.1 |
|---|---|---|
| FP mean probability | 0.8963 | 0.8009 (less confidently wrong) |
| FN mean probability | 0.3469 | 0.4829 (closer to the decision boundary) |
The model is no longer confidently wrong in either direction. Errors are concentrated near the decision threshold, which is the expected behavior of a well-calibrated classifier.
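The per-epoch threshold optimisation described under Training changes can be sketched as a simple sweep over validation-set probabilities. A minimal version; the 0.01 grid step is an assumption:

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(y_true, probs, lo=0.30, hi=0.85, step=0.01):
    """Pick the threshold in [lo, hi] maximising F1 on a validation set
    (hypothetical helper mirroring the 0.30-0.85 sweep described above)."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.arange(lo, hi + step / 2, step):
        f1 = float(f1_score(y_true, (probs >= t).astype(int)))
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1
```

Selecting the threshold on the validation split (not the test set) keeps the reported test metrics honest.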
## Training details
| Setting | v0.1 | v0.2.1 |
|---|---|---|
| Base model | deberta-v3-base | deberta-v3-base |
| Max sequence length | 128 | 256 |
| Batch size (per GPU) | 16 | 8 |
| Gradient accumulation | 8 | 16 |
| Effective batch size | 512 | 512 |
| Epochs | 6 | 8 |
| Learning rate (encoder) | 2e-5 | 2e-5 |
| Learning rate (head) | 1e-3 | 1e-3 |
| LR schedule | Cosine + warmup | Cosine warm restarts (T₀=2 epochs) |
| Warmup ratio | 0.1 | 0.1 |
| Loss function | CrossEntropyLoss | FocalLoss(γ=1.0, smoothing=0.05) |
| Class weighting | Balanced | Balanced |
| Precision | bfloat16 | bfloat16 |
| Gradient checkpointing | No | Yes |
| Multi-sample dropout | No | 3 passes |
| Engineered features | 15 | 23 |
| Training time | ~45 min | ~189 min |
| Hardware | 4Γ RTX 3090 | 4Γ RTX 3090 |
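The effective batch size of 512 comes from gradient accumulation: 8 per-GPU × 16 accumulation steps × 4 GPUs. A minimal single-device sketch of why accumulation is equivalent to one large batch (each micro-batch loss is scaled by 1/k so the summed gradients match):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()
data, target = torch.randn(16, 4), torch.randn(16, 1)

# One large-batch backward pass
model.zero_grad()
loss_fn(model(data), target).backward()
big_grad = model.weight.grad.clone()

# The same 16 examples as 4 accumulated micro-batches of 4,
# each mean loss scaled by 1/4 before backward()
model.zero_grad()
for x, y in zip(data.chunk(4), target.chunk(4)):
    (loss_fn(model(x), y) / 4).backward()

assert torch.allclose(big_grad, model.weight.grad, atol=1e-5)
```

This is why v0.2.1 could halve the per-GPU batch (to fit 256-token sequences) while keeping the effective batch at 512.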
## Dataset
Trained on a curated English-only dataset compiled from 4+ source datasets (SpamDam, Discord, SmishTank, and others) covering SMS spam, smishing, and phishing messages. The raw dataset contains 700k messages, undersampled to a 5:1 benign-to-spam ratio (235k training samples).
v0.2.1 includes a label audit pass where 7 high-confidence mislabeled examples (phishing messages incorrectly labeled as benign) were corrected before training.
Label mapping: 0 = benign, 1 = spam/smishing
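The 5:1 undersampling step can be sketched as below. This is a hypothetical helper assuming a `label` column with 0 = benign and 1 = spam; the actual pipeline's sampling details may differ:

```python
import pandas as pd

def undersample_benign(df, ratio=5, seed=42):
    """Keep all spam and at most `ratio` benign messages per spam message."""
    spam = df[df["label"] == 1]
    benign = df[df["label"] == 0]
    keep = min(len(benign), ratio * len(spam))
    sampled = benign.sample(n=keep, random_state=seed)
    # Shuffle so classes are interleaved for training
    return pd.concat([spam, sampled]).sample(frac=1, random_state=seed)
```

Relative to v0.1's 3:1 ratio, the looser 5:1 cap discards fewer benign messages while keeping the imbalance manageable.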
## Usage

### Installation

```bash
pip install torch transformers scikit-learn joblib sentencepiece huggingface_hub
```
### Inference

```python
import re
import math
import json
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from collections import Counter
from transformers import AutoTokenizer, AutoModel
from huggingface_hub import hf_hub_download
import joblib

# ── Feature extraction (must match training) ─────────────────────────────────
URGENCY_WORDS = {
    "urgent", "immediately", "expires", "verify", "confirm", "suspended",
    "locked", "alert", "action required", "limited time", "click here",
    "act now", "final notice", "winner", "prize", "claim", "free",
    "blocked", "deactivated", "unusual activity",
}
URL_PATTERN = re.compile(r'(https?://|www\.)\S+|\w+\.(com|net|org|io|co|uk)', re.I)
SHORTENED = {"bit.ly", "tinyurl.com", "goo.gl", "t.co", "ow.ly", "smsg.io", "rb.gy"}
PHONE_PATTERN = re.compile(r'(\+?\d[\d\s\-().]{7,}\d)')
EMAIL_PATTERN = re.compile(r'[\w.+-]+@[\w-]+\.[a-z]{2,}', re.I)
CURRENCY_PATTERN = re.compile(r'[$£€₹¥]|(usd|gbp|eur|inr)', re.I)
LEET_MAP = str.maketrans("013457@!", "oieastai")
OBFUSCATED_URL = re.compile(
    r"(https?(?:clue|[a-z]{4,}[a-z0-9]{2,})\b)"
    r"|(?:h\s*t\s*t\s*p)"
    r"|(?:www\s*\.\s*\w)"
    r"|(?:\w+\s*\.\s*(?:com|net|org|xyz|info|co)\b)", re.I)
SPACED_WORD = re.compile(r"\b(?:\w\s){3,}\w\b")

def extract_features(text):
    """Extract all 23 features for a single message."""
    words = text.split()
    letters = [c for c in text if c.isalpha()]
    chars = list(text)
    n = len(chars)

    # Original 15 features
    original = [
        len(text),
        len(words),
        sum(len(w) for w in words) / max(len(words), 1),
        sum(1 for c in letters if c.isupper()) / max(len(letters), 1),
        sum(1 for c in text if c.isdigit()) / max(len(text), 1),
        sum(1 for c in text if not c.isalnum() and not c.isspace()) / max(len(text), 1),
        text.count('!'),
        text.count('?'),
        int(bool(URL_PATTERN.search(text))),
        len(URL_PATTERN.findall(text)),
        int(any(d in text.lower() for d in SHORTENED)),
        int(bool([m for m in PHONE_PATTERN.findall(text) if len(re.sub(r'\D', '', m)) >= 7])),
        int(bool(EMAIL_PATTERN.search(text))),
        int(bool(CURRENCY_PATTERN.search(text))),
        sum(1 for w in URGENCY_WORDS if w in text.lower()),
    ]

    # 8 new features (v0.2.1)
    non_ascii = sum(1 for c in chars if ord(c) > 127)
    counts = Counter(text.lower())
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0) if n > 0 else 0.0
    translated = text.translate(LEET_MAP)
    leet_changes = sum(1 for a, b in zip(text, translated) if a != b)
    max_drun, cur = 0, 0
    for c in chars:
        if c.isdigit():
            cur += 1
            max_drun = max(max_drun, cur)
        else:
            cur = 0
    repeats = sum(1 for i in range(1, n) if chars[i] == chars[i - 1]) if n > 1 else 0
    new = [
        non_ascii / max(n, 1),                                    # unicode_ratio
        entropy,                                                  # char_entropy
        len(SPACED_WORD.findall(text)),                           # suspicious_spacing
        leet_changes / max(n, 1),                                 # leet_ratio
        max_drun,                                                 # max_digit_run
        repeats / max(n - 1, 1),                                  # repeated_char_ratio
        len(set(w.lower() for w in words)) / max(len(words), 1),  # vocab_richness
        int(bool(OBFUSCATED_URL.search(text))),                   # has_obfuscated_url
    ]
    return original + new

# ── Model definition ─────────────────────────────────────────────────────────
class AttentionPooling(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1, bias=False),
        )

    def forward(self, hidden_states, attention_mask):
        scores = self.attention(hidden_states).squeeze(-1)
        scores = scores.masked_fill(attention_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)
        return (hidden_states * weights).sum(dim=1)

class DeBERTaWithFeaturesV2(nn.Module):
    def __init__(self, model_name, num_extra_features=23, num_labels=2, dropout=0.1):
        super().__init__()
        self.deberta = AutoModel.from_pretrained(model_name)
        H = self.deberta.config.hidden_size
        self.attn_pool = AttentionPooling(H)
        feat_dim = 128
        self.feature_proj = nn.Sequential(
            nn.Linear(num_extra_features, feat_dim),
            nn.LayerNorm(feat_dim), nn.GELU(), nn.Dropout(dropout),
        )
        combined_dim = 2 * H + feat_dim
        bottleneck = 256
        self.fc1 = nn.Linear(combined_dim, bottleneck)
        self.ln1 = nn.LayerNorm(bottleneck)
        self.residual_block = nn.Sequential(
            nn.Linear(bottleneck, bottleneck), nn.LayerNorm(bottleneck),
            nn.GELU(), nn.Dropout(dropout),
            nn.Linear(bottleneck, bottleneck), nn.LayerNorm(bottleneck),
        )
        self.dropout = nn.Dropout(dropout)
        self.output_head = nn.Linear(bottleneck, num_labels)

    def forward(self, input_ids, attention_mask, extra_features):
        out = self.deberta(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state
        cls_emb = hidden[:, 0, :]
        attn_emb = self.attn_pool(hidden, attention_mask)
        feat = self.feature_proj(extra_features)
        combined = torch.cat([cls_emb, attn_emb, feat], dim=1)
        x = F.gelu(self.ln1(self.fc1(combined)))
        x = x + self.residual_block(x)
        return self.output_head(self.dropout(x))

# ── Load model ───────────────────────────────────────────────────────────────
model_id = "notd5a/deberta-v3-malicious-sms-mms-detector"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
scaler = joblib.load(hf_hub_download(model_id, "scaler.pkl"))

model = DeBERTaWithFeaturesV2(model_id)
state = torch.load(hf_hub_download(model_id, "pytorch_model.pt"), map_location=device)
model.load_state_dict(state)
model.to(device).eval()

# Load optimised threshold
with open(hf_hub_download(model_id, "threshold.json")) as f:
    THRESHOLD = json.load(f)["threshold"]

# ── Predict ──────────────────────────────────────────────────────────────────
def predict(texts):
    if isinstance(texts, str):
        texts = [texts]
    enc = tokenizer(texts, max_length=256, padding="max_length",
                    truncation=True, return_tensors="pt")
    raw_feats = np.array([extract_features(t) for t in texts], dtype=np.float32)
    scaled = torch.tensor(scaler.transform(raw_feats), dtype=torch.float32).to(device)
    with torch.no_grad():
        logits = model(enc["input_ids"].to(device), enc["attention_mask"].to(device), scaled)
        probs = torch.softmax(logits, dim=1)[:, 1].cpu().numpy()
    return [{"text": t, "label": int(p >= THRESHOLD),
             "prob_spam": round(float(p), 4),
             "prediction": "spam/smishing" if p >= THRESHOLD else "benign"}
            for t, p in zip(texts, probs)]

# ── Example ──────────────────────────────────────────────────────────────────
results = predict([
    "Your account has been suspended. Verify immediately: http://bit.ly/abc123",
    "Hey, are you free for lunch tomorrow?",
    "Y ou've got mail: new messa ge w7",
    "Flat 30% OFF on all ethnic wear! Shop now at bit.ly/sale2026",
    "click httpscluesjdko to download app",
])
for r in results:
    print(r)
```
## Engineered features reference
All 23 features, computed at inference time from the raw message text:
### Original 15 features (v0.1)

| # | Feature | Type | Description |
|---|---|---|---|
| 1 | `char_count` | int | Total character count |
| 2 | `word_count` | int | Total word count (whitespace split) |
| 3 | `avg_word_length` | float | Mean word length |
| 4 | `uppercase_ratio` | float | Uppercase letters / all letters |
| 5 | `digit_ratio` | float | Digits / total characters |
| 6 | `special_char_ratio` | float | Non-alphanumeric, non-space / total characters |
| 7 | `exclamation_count` | int | Count of `!` |
| 8 | `question_mark_count` | int | Count of `?` |
| 9 | `has_url` | binary | Contains URL pattern |
| 10 | `url_count` | int | Number of URLs detected |
| 11 | `has_shortened_url` | binary | Contains bit.ly, t.co, etc. |
| 12 | `has_phone_number` | binary | Contains phone number (≥7 digits) |
| 13 | `has_email` | binary | Contains email address |
| 14 | `has_currency` | binary | Contains currency symbol or code |
| 15 | `urgency_score` | int | Count of urgency keywords matched |
### New 8 features (v0.2.1)

| # | Feature | Type | Targets | Description |
|---|---|---|---|---|
| 16 | `unicode_ratio` | float | Unicode evasion | Non-ASCII characters / total characters |
| 17 | `char_entropy` | float | Template spam | Shannon entropy over character distribution |
| 18 | `suspicious_spacing` | int | Spaced-out evasion | Count of "m e s s a g e" style patterns |
| 19 | `leet_ratio` | float | L33t speak | Characters that map to leet translations / total |
| 20 | `max_digit_run` | int | Embedded numbers | Longest consecutive digit sequence |
| 21 | `repeated_char_ratio` | float | Exclamation spam | Consecutive repeated chars / (length − 1) |
| 22 | `vocab_richness` | float | Low-diversity spam | Unique words / total words |
| 23 | `has_obfuscated_url` | binary | Broken URLs | Detects evasive URL patterns |
## v0.1 error analysis methodology

The v0.1 → v0.2.1 changes were motivated by a structured error analysis on v0.1's 18,995-sample test set:
| v0.1 test set (18,995) | Predicted Benign | Predicted Spam |
|---|---|---|
| True Benign | 13,905 (TN) | 341 (FP) |
| True Spam | 283 (FN) | 4,466 (TP) |
### False positive breakdown (341 benign messages flagged as spam)

The FPs clustered at high confidence (49 with prob ≥ 0.99) and fell into distinct categories:
- Retail/brand promotions (~60): Legitimate sale announcements sharing surface features with spam (URLs, urgency words, "FLAT 30% OFF")
- Gaming/internet discussion (~30): Discord messages about GTA, Battlefield, CoD; high digit ratios and uppercase triggered false alarms
- Bank transaction alerts (~20): Nigerian bank debit notifications with account numbers, amounts, uppercase text
- Telecom notifications (~15): Carrier data plan alerts, customer service messages
- Mislabeled phishing (~3): Actual phishing (SBI YONO, NHS COVID, Apple Pay) incorrectly labeled benign; the model got these right
### False negative breakdown (283 spam messages that slipped through)

- Near-threshold (120): Spam scoring 0.40–0.70, just below the decision boundary
- Conversational scams (61): Normal-sounding text hiding malicious intent
- Truncated/tiny (23): Messages too short for any model to classify
- Low-signal promo spam (14): Marketing-style spam lacking typical spam indicators
- Obfuscated text (14): "Y ou've got mail: new messa ge w7", "httpscluesjdko"
## Files

| File | Description |
|---|---|
| `pytorch_model.pt` | Model weights (`DeBERTaWithFeaturesV2`) |
| `tokenizer/` | Saved DeBERTa tokenizer |
| `scaler.pkl` | `StandardScaler` fitted on 23 training features |
| `threshold.json` | Optimised classification threshold (0.67) |
| `config.json` | DeBERTa base config |
| `training_history.csv` | Per-epoch metrics for all 8 epochs |
## Limitations
- English-only. Non-English messages in the training data were filtered; the model may misclassify non-English spam or flag non-English legitimate messages.
- Optimised for SMS/MMS (≤256 tokens). Longer content like emails will be truncated.
- Short message weakness. Messages under 40 characters are the dominant remaining failure mode (288 FNs). There is insufficient text for either DeBERTa or the engineered features to provide signal.
- Promotional boundary. Legitimate marketing with aggressive language (flash sales, urgency CTAs) can still trigger false positives, though at lower confidence than v0.1.
- Evasion arms race. Novel obfuscation techniques not in training data will reduce recall over time.
- No sender metadata. The model operates on message text only; sender reputation, short codes, and carrier signals are not available.
## Version history
| Version | Date | F1 | AUC | Key changes |
|---|---|---|---|---|
| v0.1 | 2026-03 | 0.9299 | 0.9906 | Initial release, CLS pooling, 15 features, CrossEntropyLoss |
| v0.2.1 | 2026-03 | 0.9022 | 0.9857 | Attention pooling, 23 features, focal loss (γ=1), label audit, 5:1 undersampling. Eliminated all high-confidence errors and obfuscation/conversational blind spots. |
## License

CC BY-NC 4.0: free for research and non-commercial use. Commercial use requires explicit permission.