## About This Model
- Architecture: Based on cahya/distilbert-base-indonesian
- Task: Binary text classification (Spam vs. Not Spam)
- Language: Indonesian
- Data Sources: Mixed datasets from Indonesian social platforms
- Performance: 98.21% accuracy on both validation and test sets
## Quick Usage Guide
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned spam-detection model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("newreyy/spam-detection-distilbert-v1")
model = AutoModelForSequenceClassification.from_pretrained("newreyy/spam-detection-distilbert-v1")

# "Congratulations! You won a prize of hundreds of millions of rupiah!"
text = "Selamat! Anda memenangkan hadiah ratusan juta rupiah!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=1).squeeze()
predicted_label = torch.argmax(probs).item()

result = "Spam" if predicted_label == 1 else "Not Spam"
confidence = probs[predicted_label].item()

print(f"Input Text: {text}")
print(f"Prediction: {result} (Confidence: {confidence:.2%})")
```
## Input & Output Format
Input: Single string of Indonesian text (up to 512 tokens)
Output:
- `0` = Not Spam
- `1` = Spam
- Probability scores for each class (confidence level)
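The mapping from raw logits to the label and confidence score described above can be sketched in plain Python; the softmax below mirrors what `torch.softmax` computes, and the example logits are made-up values for illustration only:

```python
import math

def logits_to_prediction(logits):
    """Convert a pair of raw logits [not_spam, spam] into a label and confidence."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]   # softmax over the two classes
    label = probs.index(max(probs))     # 0 = Not Spam, 1 = Spam
    name = "Spam" if label == 1 else "Not Spam"
    return name, probs[label]

# Hypothetical logits for a spammy message (illustrative values only)
name, confidence = logits_to_prediction([-2.1, 3.4])
print(f"{name} ({confidence:.2%})")
```

The confidence reported by the model is simply the softmax probability of the winning class, so it always lies between 50% and 100% for a two-class model.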
## Performance Metrics
| Metric | Score |
|---|---|
| Accuracy | 98% |
| Precision | 98% |
| Recall | 98% |
| F1-Score | 98% |
Sample classification report:
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Not Spam | 1.00 | 0.96 | 0.98 | 609 |
| Spam | 0.97 | 1.00 | 0.98 | 674 |
| Overall | 0.98 | 0.98 | 0.98 | 1,283 |
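As a quick sanity check, the per-class F1 scores in the report follow from precision and recall via the harmonic mean, F1 = 2PR / (P + R):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values from the classification report above
print(round(f1(1.00, 0.96), 2))  # Not Spam -> 0.98
print(round(f1(0.97, 1.00), 2))  # Spam     -> 0.98
```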
## Training Dataset
- Total Samples: 6,415 text samples
- Spam: 3,310 samples (51.6%)
- Not Spam: 3,105 samples (48.4%)
Breakdown by Source:
- SMS: 1,143 entries
- Emails: 2,636 entries
- General spam posts: 2,636 entries
- Tweets and Instagram posts
Data was preprocessed to remove empty lines, invalid entries, and normalized using standard NLP techniques.
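A minimal cleaning pass along these lines could look like the following. The card does not specify the exact normalization steps, so lowercasing, whitespace collapsing, and dropping empty or non-string entries are assumptions for illustration:

```python
import re

def clean_samples(samples):
    """Drop empty/invalid entries and apply simple text normalization."""
    cleaned = []
    for text in samples:
        if not isinstance(text, str):
            continue                      # skip invalid (non-string) entries
        text = text.strip().lower()       # assumed normalization: lowercase
        text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
        if text:                          # drop entries that are empty after cleaning
            cleaned.append(text)
    return cleaned

print(clean_samples(["  GRATIS!!  Klik   di sini ", "", None, "halo"]))
```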
## Training Configuration
| Parameter | Value |
|---|---|
| Pre-trained Model | DistilBERT Base Indonesian |
| Epochs | 3 |
| Batch Size | 16 |
| Learning Rate | 2e-5 |
| Optimizer | AdamW |
| Loss Function | CrossEntropyLoss |
| Max Sequence Length | 512 tokens |
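For reference, the hyperparameters in the table map onto a configuration like the one below. This is only a sketch: the original training script is not published with this card, and the `transformers.TrainingArguments` field names in the comments are the standard ones for these settings.

```python
# Hyperparameters from the table above, keyed by the names
# transformers.TrainingArguments would use for them (a sketch;
# the actual training script is not published with this card).
training_config = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "learning_rate": 2e-5,   # AdamW is the Trainer's default optimizer
    "max_seq_length": 512,   # applied at tokenization time, not a TrainingArguments field
}

for key, value in training_config.items():
    print(f"{key}: {value}")
```

CrossEntropyLoss is the default loss that `AutoModelForSequenceClassification` applies for single-label classification, so it needs no explicit configuration.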
## Model Notes
- Tailored for standard Indonesian; accuracy may be reduced for slang, regional dialects, or mixed-language text (code-switching).
- Fine-tuned on everyday user-generated content from social platforms.
- Truncates any input longer than 512 tokens.
## License
This model is distributed under the MIT License and is free to use for both academic and commercial projects.
## How to Cite
If you use this model in your work, please cite it as follows:
```bibtex
@misc{newreyy_distilbert_base_indonesian,
  title={Indonesian Spam Detection DistilBERT v1},
  author={NewReyy},
  year={2025},
  howpublished={\url{https://huggingface.co/newreyy/spam-detection-distilbert-v1/}}
}
```
Please also consider citing the original base model:
```bibtex
@misc{cahya_distilbert_base_indonesian,
  title={DistilBERT Base Indonesian},
  author={Cahya},
  year={2021},
  howpublished={\url{https://huggingface.co/cahya/distilbert-base-indonesian}}
}
```
## Acknowledgements
Special thanks to:
- cahya for the DistilBERT base model.
- Open Indonesian NLP contributors.
- Community datasets from various public platforms.