πŸ’‘ About This Model

  • Architecture: Based on cahya/distilbert-base-indonesian
  • Task: Binary text classification β€” Spam vs Non-Spam
  • Language: Indonesian
  • Data Sources: Mixed datasets from Indonesian social platforms
  • Performance: 98.21% accuracy on both validation and test sets

⚑ Quick Usage Guide

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("newreyy/spam-detection-distilbert-v1")
model = AutoModelForSequenceClassification.from_pretrained("newreyy/spam-detection-distilbert-v1")
model.eval()

# "Congratulations! You won a prize of hundreds of millions of rupiah!"
text = "Selamat! Anda memenangkan hadiah ratusan juta rupiah!"

# Tokenize and run inference without gradient tracking
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=1).squeeze()

# Map the higher-probability class to its label
predicted_label = torch.argmax(probs).item()
result = "Spam" if predicted_label == 1 else "Not Spam"
confidence = probs[predicted_label].item()

print(f"Input Text: {text}")
print(f"Prediction: {result} (Confidence: {confidence:.2%})")

πŸ“Œ Input & Output Format

  • Input: Single string of Indonesian text (up to 512 tokens)

  • Output:

    • 0 = Not Spam
    • 1 = Spam
    • Probability scores for each class (confidence level)
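The mapping from raw model outputs to this format can be sketched in plain Python. The `decode_logits` helper below is purely illustrative (it is not part of the model's API), assuming two class logits ordered as [Not Spam, Spam]:

```python
import math

ID2LABEL = {0: "Not Spam", 1: "Spam"}

def decode_logits(logits):
    """Convert raw class logits into the documented output format."""
    # Numerically stable softmax over the two class logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    label_id = max(range(len(probs)), key=probs.__getitem__)
    return {
        "label": label_id,                # 0 = Not Spam, 1 = Spam
        "prediction": ID2LABEL[label_id],
        "scores": {ID2LABEL[i]: p for i, p in enumerate(probs)},
    }

# Dummy logits standing in for a real model output
print(decode_logits([-1.2, 2.3]))
```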

πŸ“ˆ Performance Metrics

Metric     Score
Accuracy   98%
Precision  98%
Recall     98%
F1-Score   98%

Sample classification report:

              precision    recall  f1-score   support
    Not Spam       1.00      0.96      0.98       609
        Spam       0.97      1.00      0.98       674
     Overall       0.98      0.98      0.98      1283
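The per-class numbers in a report like this follow directly from confusion-matrix counts. As a self-contained illustration, the counts below are inferred to be consistent with the Spam row above (support 674, recall 1.00 implies zero false negatives; 24 false positives reproduces the 0.97 precision), but are not published figures:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute per-class metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts consistent with the "Spam" row of the report above
p, r, f1 = precision_recall_f1(tp=674, fp=24, fn=0)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```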

πŸ“‚ Training Dataset

  • Total Samples: 6,415 text samples
  • Spam: 3,310 samples (51.6%)
  • Not Spam: 3,105 samples (48.4%)

Breakdown by Source:

  • πŸ“± SMS: 1,143 entries
  • πŸ“§ Emails: 2,636 entries
  • 🌐 General spam posts: 2,636 entries
  • 🐦 Tweets and πŸ“Έ Instagram posts

Data was preprocessed to remove empty lines and invalid entries, then normalized using standard NLP techniques.
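A minimal sketch of that cleaning step, assuming the standard techniques amount to dropping blank rows, lowercasing, and collapsing whitespace (the author's exact preprocessing pipeline is not published):

```python
import re

def clean_samples(raw_lines):
    """Drop empty entries and apply light text normalization."""
    cleaned = []
    for line in raw_lines:
        line = line.strip()
        if not line:                      # remove empty lines
            continue
        line = line.lower()               # case normalization
        line = re.sub(r"\s+", " ", line)  # collapse repeated whitespace
        cleaned.append(line)
    return cleaned

print(clean_samples(["  GRATIS!!  Klik   sekarang ", "", "halo, apa kabar?"]))
```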


βš™οΈ Training Configuration

Parameter             Value
Pre-trained Model     DistilBERT Base Indonesian
Epochs                3
Batch Size            16
Learning Rate         2e-5
Optimizer             AdamW
Loss Function         CrossEntropyLoss
Max Sequence Length   512 tokens
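The table above maps onto a standard Hugging Face `Trainer` setup roughly as follows. This is a configuration sketch under the listed hyperparameters (the dataset loading and variable names are assumptions), not the author's original training script:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("cahya/distilbert-base-indonesian")
model = AutoModelForSequenceClassification.from_pretrained(
    "cahya/distilbert-base-indonesian", num_labels=2)

args = TrainingArguments(
    output_dir="spam-detection-distilbert-v1",
    num_train_epochs=3,               # Epochs
    per_device_train_batch_size=16,   # Batch Size
    learning_rate=2e-5,               # Learning Rate; AdamW is the default optimizer
)

# CrossEntropyLoss is applied automatically for a 2-label classification head;
# inputs are tokenized with truncation=True, max_length=512.
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```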

⚠️ Model Notes

  • Tailored for standard Indonesian β€” may have reduced accuracy for slang, regional dialects, or mixed language (code-switching).
  • Fine-tuned on everyday user-generated content from social platforms.
  • Truncates any input longer than 512 tokens.

πŸ“œ License

This model is distributed under the MIT License and is free to use for both academic and commercial projects.


πŸ“ How to Cite

If you use this model in your work, please cite it as follows:

@misc{newreyy_distilbert_base_indonesian,
  title={Indonesian Spam Detection DistilBERT v1},
  author={NewReyy},
  year={2025},
  howpublished={\url{https://huggingface.co/newreyy/spam-detection-distilbert-v1/}}
}

And also consider citing the original base model:

@misc{cahya_distilbert_base_indonesian,
  title={DistilBERT Base Indonesian},
  author={Cahya},
  year={2021},
  howpublished={\url{https://huggingface.co/cahya/distilbert-base-indonesian}}
}

❀️ Acknowledgements

Special thanks to:

  • cahya for the DistilBERT base model.
  • Open Indonesian NLP contributors.
  • Community datasets from various public platforms.
