SES Industry Email Classifier (BERT Japanese)

🎯 A BERT-based model for automatically classifying business emails in Japan's SES (System Engineering Service) industry

This model classifies emails commonly exchanged in Japan's IT staffing industry with high accuracy, enabling automated email routing and workflow optimization.

📊 Performance

| Metric | Score |
|---|---|
| Accuracy | 99.51% |
| F1 (weighted) | 0.99 |
| Precision (weighted) | 1.00 |
| Recall (weighted) | 1.00 |

Per-Class Performance

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 案件 (Project) | 1.00 | 1.00 | 1.00 | 486 |
| 要員 (Talent) | 0.99 | 1.00 | 1.00 | 721 |
| その他 (Other) | 1.00 | 0.60 | 0.75 | 15 |
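The weighted averages are dominated by the two large classes, which is why the headline numbers round so high even though その他 recall is only 0.60. A quick check using the per-class figures reported here (note that support-weighted recall equals accuracy in multi-class classification):

```python
# Recompute the weighted recall from the per-class table above.
supports = {"案件": 486, "要員": 721, "その他": 15}
recalls = {"案件": 1.00, "要員": 1.00, "その他": 0.60}

total = sum(supports.values())  # 1,222 test emails
weighted_recall = sum(recalls[c] * supports[c] for c in supports) / total
print(f"{weighted_recall:.4f}")  # 0.9951 — matches the reported accuracy
```

With only 15 その他 samples out of 1,222, that class barely moves the weighted average, so the per-class table is the better guide to real-world behavior.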

📌 Model Overview

| Item | Description |
|---|---|
| Base Model | tohoku-nlp/bert-base-japanese-v3 |
| Task | 3-class classification |
| Language | Japanese |
| Max Input | 512 tokens (~1000 characters) |
| Training Data | ~4,900 SES industry emails |
| Test Data | ~1,200 emails (held-out) |

🏷️ Classification Labels

| ID | Label | Description | Example |
|---|---|---|---|
| 0 | 案件 (Project) | Project/job opportunity postings | "【案件】PM募集 60万〜 新宿 即日" |
| 1 | 要員 (Talent) | Engineer/talent introductions | "【ご紹介】インフラエンジニア 30代" |
| 2 | その他 (Other) | Other business emails | Meeting requests, general correspondence |

🚀 Usage

Pipeline (Recommended)

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="naoki-hosokawa/ses-mail-classifier-bert-japanese"
)

# Project email example
result = classifier("【案件】Java開発 60万〜80万 渋谷 即日〜長期 面談1回")
print(result)
# [{'label': '案件', 'score': 0.98}]

# Talent email example
result = classifier("【ご紹介】Javaエンジニア 40代男性 都内在住 即日稼働可能 希望単価55万")
print(result)
# [{'label': '要員', 'score': 0.95}]
```

Direct Model Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "naoki-hosokawa/ses-mail-classifier-bert-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "【案件】Python開発 リモート可 50万〜"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.softmax(outputs.logits, dim=-1)

labels = ["案件", "要員", "その他"]  # matches the ID order in the label table above
predicted_label = labels[predictions.argmax().item()]
confidence = predictions.max().item()

print(f"Classification: {predicted_label} (Confidence: {confidence:.2%})")
```

Batch Processing

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="naoki-hosokawa/ses-mail-classifier-bert-japanese",
    device=0  # GPU; use device=-1 for CPU
)

emails = [
    "【案件】AWS構築 フルリモート 55万〜",
    "【ご紹介】クラウドエンジニア 経験5年",
    "明日の会議は15時からでお願いします",
]

results = classifier(emails, batch_size=32)
for email, result in zip(emails, results):
    print(f"{result['label']}: {email[:30]}...")
```

🎓 Training Details

Training Data

  • Source: Real business emails from Japan's SES industry
  • Preprocessing: Subject + body concatenated, truncated to 1000 characters
  • Split: 4,399 train / 489 validation / 1,222 test
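The preprocessing step above (subject + body, truncated to 1000 characters) can be sketched as follows. The exact implementation is not published; this is an illustrative version and `preprocess_email` is a hypothetical helper name:

```python
def preprocess_email(subject: str, body: str, max_chars: int = 1000) -> str:
    """Concatenate subject and body, then truncate to max_chars,
    mirroring the preprocessing described above (illustrative sketch)."""
    text = f"{subject} {body}"
    return text[:max_chars]

sample = preprocess_email("【案件】Java開発", "60万〜80万 渋谷 即日〜長期 " * 100)
print(len(sample))  # capped at 1000 characters
```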

Training Configuration

  • Epochs: 5
  • Batch Size: 8
  • Learning Rate: 5e-5 (default)
  • Warmup Steps: 10
  • Weight Decay: 0.01
  • Best Model Selection: Based on F1 score
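To reproduce the run, the configuration above maps onto the standard `transformers.Trainer` API. A sketch of the hyperparameters collected as a dict (the field names follow `TrainingArguments` conventions and are assumptions; the original training script is not published):

```python
# Hyperparameters from the training configuration above, keyed by the
# corresponding transformers.TrainingArguments field names.
training_config = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,
    "learning_rate": 5e-5,
    "warmup_steps": 10,
    "weight_decay": 0.01,
    "load_best_model_at_end": True,   # keep the checkpoint that scores best...
    "metric_for_best_model": "f1",    # ...on F1, per the list above
}
```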

Training Environment

  • Google Colab (T4 GPU)
  • Training Time: ~38 minutes

⚠️ Limitations

  • Domain-Specific: Optimized for SES industry emails; may not generalize to other email types
  • "Other" Class: Lower recall (60%) due to limited samples; consider LLM fallback for low-confidence predictions
  • Input Length: Texts exceeding 512 tokens will be truncated
  • Language: Japanese only
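The LLM-fallback idea for low-confidence predictions can be sketched as a simple confidence gate over the pipeline output. The 0.90 threshold and the routing labels are illustrative choices, not tuned values:

```python
# Confidence-gated routing: trust the BERT label when the score is high,
# otherwise escalate to a slower, more capable reviewer (e.g. an LLM).
def route(result: dict, threshold: float = 0.90) -> str:
    """result is one pipeline output, e.g. {'label': '案件', 'score': 0.98}."""
    if result["score"] >= threshold:
        return result["label"]        # accept the BERT prediction
    return "needs_llm_review"         # low confidence: hand off to the LLM

print(route({"label": "案件", "score": 0.98}))    # 案件
print(route({"label": "その他", "score": 0.55}))  # needs_llm_review
```

Because the その他 class has low recall, a gate like this mostly catches exactly the emails the model is weakest on.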

🔧 Technical Details

Why BERT?

| Aspect | BERT | LLM (GPT, etc.) |
|---|---|---|
| Inference Speed | ◎ Fast | △ Slow |
| Inference Cost | ◎ Low | × High |
| Classification Accuracy | ○ Good | ◎ Excellent |
| Batch Processing | ◎ Efficient | △ Costly |

For high-volume email processing (100K+ emails/month), using BERT as a pre-filter significantly reduces LLM API costs while maintaining accuracy.
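A back-of-envelope illustration of the savings, assuming only low-confidence emails are escalated to the LLM. The per-call price and escalation rate are illustrative assumptions, and BERT's (much smaller) compute cost is ignored:

```python
# Cost comparison: LLM-only vs. BERT pre-filter with LLM fallback.
emails_per_month = 100_000
llm_cost_per_email = 0.002   # assumed $ per LLM classification call
escalation_rate = 0.05       # assumed share of low-confidence emails escalated

llm_only = emails_per_month * llm_cost_per_email
with_prefilter = emails_per_month * escalation_rate * llm_cost_per_email
print(f"LLM only: ${llm_only:.0f}/month; BERT pre-filter: ${with_prefilter:.0f}/month")
# LLM only: $200/month; BERT pre-filter: $10/month
```

Under these assumptions the pre-filter cuts LLM spend by 20×; the real savings depend on your LLM pricing and how often the classifier falls below the confidence threshold.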

📜 License

Apache License 2.0

This model follows the license of the base model tohoku-nlp/bert-base-japanese-v3.

📬 Author

  • Author: Naoki Hosokawa
  • Organization: Apple Seed LLC (合同会社アップルシード)
  • Location: K-1 Building 3F, 1-8-18 Ebisu, Shibuya-ku, Tokyo, Japan

⭐ If you find this model useful, please give it a star!
