# SES Industry Email Classifier (BERT Japanese)
🎯 A BERT-based model for automatically classifying business emails in Japan's SES (System Engineering Service) industry
This model classifies emails commonly exchanged in Japan's IT staffing industry with high accuracy, enabling automated email routing and workflow optimization.
## 📊 Performance
| Metric | Score |
|---|---|
| Accuracy | 99.51% |
| F1 (weighted) | 0.99 |
| Precision (weighted) | 1.00 |
| Recall (weighted) | 1.00 |
### Per-Class Performance
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 案件 (Project) | 1.00 | 1.00 | 1.00 | 486 |
| 要員 (Talent) | 0.99 | 1.00 | 1.00 | 721 |
| その他 (Other) | 1.00 | 0.60 | 0.75 | 15 |
## 📌 Model Overview
| Item | Description |
|---|---|
| Base Model | tohoku-nlp/bert-base-japanese-v3 |
| Task | 3-class classification |
| Language | Japanese |
| Max Input | 512 tokens (~1000 characters) |
| Training Data | ~4,900 SES industry emails |
| Test Data | ~1,200 emails (held-out) |
## 🏷️ Classification Labels
| ID | Label | Description | Example |
|---|---|---|---|
| 0 | 案件 (Project) | Project/job opportunity postings | "【案件】PM募集 60万〜 新宿 即日" |
| 1 | 要員 (Talent) | Engineer/talent introduction | "【ご紹介】インフラエンジニア 30代" |
| 2 | その他 (Other) | Other business emails | Meeting requests, general correspondence |
## 🚀 Usage

### Pipeline (Recommended)

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="naoki-hosokawa/ses-mail-classifier-bert-japanese"
)

# Project email example
result = classifier("【案件】Java開発 60万〜80万 渋谷 即日〜長期 面談1回")
print(result)
# [{'label': '案件', 'score': 0.98}]

# Talent email example
result = classifier("【ご紹介】Javaエンジニア 40代男性 都内在住 即日稼働可能 希望単価55万")
print(result)
# [{'label': '要員', 'score': 0.95}]
```
### Direct Model Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "naoki-hosokawa/ses-mail-classifier-bert-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "【案件】Python開発 リモート可 50万〜"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.softmax(outputs.logits, dim=-1)
labels = ["案件", "要員", "その他"]
predicted_label = labels[predictions.argmax().item()]
confidence = predictions.max().item()
print(f"Classification: {predicted_label} (Confidence: {confidence:.2%})")
```
### Batch Processing

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="naoki-hosokawa/ses-mail-classifier-bert-japanese",
    device=0  # Use GPU
)

emails = [
    "【案件】AWS構築 フルリモート 55万〜",
    "【ご紹介】クラウドエンジニア 経験5年",
    "明日の会議は15時からでお願いします",
]

results = classifier(emails, batch_size=32)
for email, result in zip(emails, results):
    print(f"{result['label']}: {email[:30]}...")
```
## 🎓 Training Details

### Training Data
- Source: Real business emails from Japan's SES industry
- Preprocessing: Subject + body concatenated, truncated to 1000 characters
- Split: 4,399 train / 489 validation / 1,222 test
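
The preprocessing step described above (subject and body concatenated, then cut to ~1,000 characters to fit the 512-token limit) can be sketched as follows. The exact separator used in training is not documented, so the newline join here is an assumption:

```python
def build_input(subject: str, body: str, max_chars: int = 1000) -> str:
    """Concatenate an email's subject and body, then truncate.

    The newline separator is an assumption; the model card only states
    that subject and body were concatenated and truncated to roughly
    1,000 characters (about the 512-token tokenizer limit).
    """
    text = f"{subject}\n{body}"
    return text[:max_chars]

# A long body is cut to the character budget before tokenization
sample = build_input("【案件】Java開発 60万〜", "詳細は以下の通りです。" * 200)
print(len(sample))  # 1000
```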
### Training Configuration
- Epochs: 5
- Batch Size: 8
- Learning Rate: 5e-5 (default)
- Warmup Steps: 10
- Weight Decay: 0.01
- Best Model Selection: Based on F1 score
### Training Environment
- Google Colab (T4 GPU)
- Training Time: ~38 minutes
## ⚠️ Limitations
- Domain-Specific: Optimized for SES industry emails; may not generalize to other email types
- "Other" Class: Lower recall (60%) due to limited samples; consider LLM fallback for low-confidence predictions
- Input Length: Texts exceeding 512 tokens will be truncated
- Language: Japanese only
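
The low-confidence LLM fallback suggested above can be sketched as a simple routing rule; the helper name and the 0.9 cutoff are illustrative assumptions, not values from this model card:

```python
# Hypothetical routing rule: BERT handles confident predictions,
# anything below the cutoff is escalated to an LLM for review.
# The 0.9 threshold is an illustrative assumption; tune it against
# your own validation data.
def needs_llm_fallback(prediction: dict, threshold: float = 0.9) -> bool:
    """prediction is one item of the text-classification pipeline output."""
    return prediction["score"] < threshold

print(needs_llm_fallback({"label": "案件", "score": 0.98}))    # False
print(needs_llm_fallback({"label": "その他", "score": 0.62}))  # True
```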
## 🔧 Technical Details

### Why BERT?
| Aspect | BERT | LLM (GPT, etc.) |
|---|---|---|
| Inference Speed | ◎ Fast | △ Slow |
| Inference Cost | ◎ Low | × High |
| Classification Accuracy | ○ Good | ◎ Excellent |
| Batch Processing | ◎ Efficient | △ Costly |

(◎ = excellent, ○ = good, △ = fair, × = poor)
For high-volume email processing (100K+ emails/month), using BERT as a pre-filter significantly reduces LLM API costs while maintaining accuracy.
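
As a rough illustration of the pre-filter economics (every number below is an illustrative assumption, not a measured value):

```python
# Back-of-the-envelope cost sketch for a BERT pre-filter.
# All figures are illustrative assumptions; self-hosted BERT
# inference cost is omitted for simplicity.
emails_per_month = 100_000
llm_cost_per_email = 0.002  # assumed LLM API cost per email (USD)
fallback_rate = 0.05        # assumed share routed to the LLM fallback

llm_only = emails_per_month * llm_cost_per_email
with_prefilter = emails_per_month * fallback_rate * llm_cost_per_email

print(f"LLM only:        ${llm_only:,.2f}/month")        # $200.00/month
print(f"BERT pre-filter: ${with_prefilter:,.2f}/month")  # $10.00/month
```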
## 📜 License
Apache License 2.0
This model follows the license of the base model tohoku-nlp/bert-base-japanese-v3.
## 🙏 Acknowledgments
- Tohoku NLP Lab - Japanese BERT model
- Hugging Face - Transformers library
## 📬 Author
- Author: Naoki Hosokawa
- Organization: Apple Seed LLC (合同会社アップルシード)
- Location: K-1 Building 3F, 1-8-18 Ebisu, Shibuya-ku, Tokyo, Japan
⭐ If you find this model useful, please give it a star!