## About This Model
- Architecture: Based on cahya/distilbert-base-indonesian
- Task: Binary text classification (Spam vs. Not Spam)
- Language: Indonesian
- Data Sources: Mixed datasets from Indonesian social platforms
- Performance: 98.21% accuracy on both validation and test sets
## Quick Usage Guide
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned spam-detection model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("newreyy/spam-detection-distilbert-v1")
model = AutoModelForSequenceClassification.from_pretrained("newreyy/spam-detection-distilbert-v1")

# "Congratulations! You won a prize of hundreds of millions of rupiah!"
text = "Selamat! Anda memenangkan hadiah ratusan juta rupiah!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=1).squeeze()
predicted_label = torch.argmax(probs).item()

result = "Spam" if predicted_label == 1 else "Not Spam"
confidence = probs[predicted_label].item()

print(f"Input Text: {text}")
print(f"Prediction: {result} (Confidence: {confidence:.2%})")
```
## Input & Output Format
Input: Single string of Indonesian text (up to 512 tokens)
Output:
- `0` = Not Spam
- `1` = Spam
- Probability scores for each class (confidence level)
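The mapping from raw logits to the label and confidence score described above can be sketched in plain Python; the softmax below mirrors what `torch.softmax` computes, and the example logits are made-up values for illustration only:

```python
import math

def logits_to_prediction(logits):
    """Convert a pair of raw logits [not_spam, spam] into a label and confidence."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]   # softmax over the two classes
    label = probs.index(max(probs))     # 0 = Not Spam, 1 = Spam
    name = "Spam" if label == 1 else "Not Spam"
    return name, probs[label]

# Hypothetical logits for a spammy message (illustrative values only)
name, confidence = logits_to_prediction([-2.1, 3.4])
print(f"{name} ({confidence:.2%})")
```

The confidence reported by the model is simply the softmax probability of the winning class, so it always lies between 50% and 100% for a two-class model.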
## Performance Metrics
| Metric | Score |
|---|---|
| Accuracy | 98% |
| Precision | 98% |
| Recall | 98% |
| F1-Score | 98% |
Sample classification report:
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Not Spam | 1.00 | 0.96 | 0.98 | 609 |
| Spam | 0.97 | 1.00 | 0.98 | 674 |
| Overall | 0.98 | 0.98 | 0.98 | 1,283 |
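As a quick sanity check, the per-class F1 scores in the report follow from precision and recall via the harmonic mean, F1 = 2PR / (P + R):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values from the classification report above
print(round(f1(1.00, 0.96), 2))  # Not Spam -> 0.98
print(round(f1(0.97, 1.00), 2))  # Spam     -> 0.98
```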
## Training Dataset
- Total Samples: 6,415 text samples
- Spam: 3,310 samples (51.6%)
- Not Spam: 3,105 samples (48.4%)
Breakdown by Source:
- SMS: 1,143 entries
- Emails: 2,636 entries
- General spam posts: 2,636 entries
- Tweets and Instagram posts
Data was preprocessed to remove empty lines, invalid entries, and normalized using standard NLP techniques.
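A minimal cleaning pass along these lines could look like the following. The card does not specify the exact normalization steps, so lowercasing, whitespace collapsing, and dropping empty or non-string entries are assumptions for illustration:

```python
import re

def clean_samples(samples):
    """Drop empty/invalid entries and apply simple text normalization."""
    cleaned = []
    for text in samples:
        if not isinstance(text, str):
            continue                      # skip invalid (non-string) entries
        text = text.strip().lower()       # assumed normalization: lowercase
        text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
        if text:                          # drop entries that are empty after cleaning
            cleaned.append(text)
    return cleaned

print(clean_samples(["  GRATIS!!  Klik   di sini ", "", None, "halo"]))
```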
## Training Configuration
| Parameter | Value |
|---|---|
| Pre-trained Model | DistilBERT Base Indonesian |
| Epochs | 3 |
| Batch Size | 16 |
| Learning Rate | 2e-5 |
| Optimizer | AdamW |
| Loss Function | CrossEntropyLoss |
| Max Sequence Length | 512 tokens |
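For reference, the hyperparameters in the table map onto a configuration like the one below. This is only a sketch: the original training script is not published with this card, and the `transformers.TrainingArguments` field names in the comments are the standard ones for these settings.

```python
# Hyperparameters from the table above, keyed by the names
# transformers.TrainingArguments would use for them (a sketch;
# the actual training script is not published with this card).
training_config = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "learning_rate": 2e-5,   # AdamW is the Trainer's default optimizer
    "max_seq_length": 512,   # applied at tokenization time, not a TrainingArguments field
}

for key, value in training_config.items():
    print(f"{key}: {value}")
```

CrossEntropyLoss is the default loss that `AutoModelForSequenceClassification` applies for single-label classification, so it needs no explicit configuration.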
## Model Notes
- Tailored for standard Indonesian; accuracy may be reduced for slang, regional dialects, or mixed-language text (code-switching).
- Fine-tuned on everyday user-generated content from social platforms.
- Truncates any input longer than 512 tokens.
## License
This model is distributed under the MIT License and is free to use for both academic and commercial projects.
## How to Cite
If you use this model in your work, please cite it as follows:
```bibtex
@misc{newreyy_distilbert_base_indonesian,
  title={Indonesian Spam Detection DistilBERT v1},
  author={NewReyy},
  year={2025},
  howpublished={\url{https://huggingface.co/newreyy/spam-detection-distilbert-v1/}}
}
```
Please also consider citing the original base model:
```bibtex
@misc{cahya_distilbert_base_indonesian,
  title={DistilBERT Base Indonesian},
  author={Cahya},
  year={2021},
  howpublished={\url{https://huggingface.co/cahya/distilbert-base-indonesian}}
}
```
## Acknowledgements
Special thanks to:
- cahya for the DistilBERT base model.
- Open Indonesian NLP contributors.
- Community datasets from various public platforms.