SES Industry Email Classifier (BERT Japanese)

🎯 A BERT-based model for automatically classifying business emails in Japan's SES (System Engineering Service) industry

This model classifies emails commonly exchanged in Japan's IT staffing industry with high accuracy, enabling automated email routing and workflow optimization.

📊 Performance

| Metric | Score |
|---|---|
| Accuracy | 99.51% |
| F1 (weighted) | 0.99 |
| Precision (weighted) | 1.00 |
| Recall (weighted) | 1.00 |

Per-Class Performance

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 案件 (Project) | 1.00 | 1.00 | 1.00 | 486 |
| 要員 (Talent) | 0.99 | 1.00 | 1.00 | 721 |
| その他 (Other) | 1.00 | 0.60 | 0.75 | 15 |
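The weighted averages are dominated by the two large classes, which is why the headline numbers round so high even though その他 recall is only 0.60. A quick check using the per-class figures reported here (note that support-weighted recall equals accuracy in multi-class classification):

```python
# Recompute the weighted recall from the per-class table above.
supports = {"案件": 486, "要員": 721, "その他": 15}
recalls = {"案件": 1.00, "要員": 1.00, "その他": 0.60}

total = sum(supports.values())  # 1,222 test emails
weighted_recall = sum(recalls[c] * supports[c] for c in supports) / total
print(f"{weighted_recall:.4f}")  # 0.9951 — matches the reported accuracy
```

With only 15 その他 samples out of 1,222, that class barely moves the weighted average, so the per-class table is the better guide to real-world behavior.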

📌 Model Overview

| Item | Description |
|---|---|
| Base Model | tohoku-nlp/bert-base-japanese-v3 |
| Task | 3-class classification |
| Language | Japanese |
| Max Input | 512 tokens (~1000 characters) |
| Training Data | ~4,900 SES industry emails |
| Test Data | ~1,200 emails (held-out) |

🏷️ Classification Labels

| ID | Label | Description | Example |
|---|---|---|---|
| 0 | 案件 (Project) | Project/job opportunity postings | "【案件】PM募集 60万〜 新宿 即日" |
| 1 | 要員 (Talent) | Engineer/talent introductions | "【ご紹介】インフラエンジニア 30代" |
| 2 | その他 (Other) | Other business emails | Meeting requests, general correspondence |

🚀 Usage

Pipeline (Recommended)

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="naoki-hosokawa/ses-mail-classifier-bert-japanese"
)

# Project email example
result = classifier("【案件】Java開発 60万〜80万 渋谷 即日〜長期 面談1回")
print(result)
# [{'label': '案件', 'score': 0.98}]

# Talent email example
result = classifier("【ご紹介】Javaエンジニア 40代男性 都内在住 即日稼働可能 希望単価55万")
print(result)
# [{'label': '要員', 'score': 0.95}]
```

Direct Model Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "naoki-hosokawa/ses-mail-classifier-bert-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "【案件】Python開発 リモート可 50万〜"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.softmax(outputs.logits, dim=-1)

labels = ["案件", "要員", "その他"]  # matches the ID order in the label table above
predicted_label = labels[predictions.argmax().item()]
confidence = predictions.max().item()

print(f"Classification: {predicted_label} (Confidence: {confidence:.2%})")
```

Batch Processing

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="naoki-hosokawa/ses-mail-classifier-bert-japanese",
    device=0  # GPU; use device=-1 for CPU
)

emails = [
    "【案件】AWS構築 フルリモート 55万〜",
    "【ご紹介】クラウドエンジニア 経験5年",
    "明日の会議は15時からでお願いします",
]

results = classifier(emails, batch_size=32)
for email, result in zip(emails, results):
    print(f"{result['label']}: {email[:30]}...")
```

🎓 Training Details

Training Data

  • Source: Real business emails from Japan's SES industry
  • Preprocessing: Subject + body concatenated, truncated to 1000 characters
  • Split: 4,399 train / 489 validation / 1,222 test
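The preprocessing step above (subject + body, truncated to 1000 characters) can be sketched as follows. The exact implementation is not published; this is an illustrative version and `preprocess_email` is a hypothetical helper name:

```python
def preprocess_email(subject: str, body: str, max_chars: int = 1000) -> str:
    """Concatenate subject and body, then truncate to max_chars,
    mirroring the preprocessing described above (illustrative sketch)."""
    text = f"{subject} {body}"
    return text[:max_chars]

sample = preprocess_email("【案件】Java開発", "60万〜80万 渋谷 即日〜長期 " * 100)
print(len(sample))  # capped at 1000 characters
```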

Training Configuration

  • Epochs: 5
  • Batch Size: 8
  • Learning Rate: 5e-5 (default)
  • Warmup Steps: 10
  • Weight Decay: 0.01
  • Best Model Selection: Based on F1 score
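To reproduce the run, the configuration above maps onto the standard `transformers.Trainer` API. A sketch of the hyperparameters collected as a dict (the field names follow `TrainingArguments` conventions and are assumptions; the original training script is not published):

```python
# Hyperparameters from the training configuration above, keyed by the
# corresponding transformers.TrainingArguments field names.
training_config = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,
    "learning_rate": 5e-5,
    "warmup_steps": 10,
    "weight_decay": 0.01,
    "load_best_model_at_end": True,   # keep the checkpoint that scores best...
    "metric_for_best_model": "f1",    # ...on F1, per the list above
}
```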

Training Environment

  • Google Colab (T4 GPU)
  • Training Time: ~38 minutes

⚠️ Limitations

  • Domain-Specific: Optimized for SES industry emails; may not generalize to other email types
  • "Other" Class: Lower recall (60%) due to limited samples; consider LLM fallback for low-confidence predictions
  • Input Length: Texts exceeding 512 tokens will be truncated
  • Language: Japanese only
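The LLM-fallback idea for low-confidence predictions can be sketched as a simple confidence gate over the pipeline output. The 0.90 threshold and the routing labels are illustrative choices, not tuned values:

```python
# Confidence-gated routing: trust the BERT label when the score is high,
# otherwise escalate to a slower, more capable reviewer (e.g. an LLM).
def route(result: dict, threshold: float = 0.90) -> str:
    """result is one pipeline output, e.g. {'label': '案件', 'score': 0.98}."""
    if result["score"] >= threshold:
        return result["label"]        # accept the BERT prediction
    return "needs_llm_review"         # low confidence: hand off to the LLM

print(route({"label": "案件", "score": 0.98}))    # 案件
print(route({"label": "その他", "score": 0.55}))  # needs_llm_review
```

Because the その他 class has low recall, a gate like this mostly catches exactly the emails the model is weakest on.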

🔧 Technical Details

Why BERT?

| Aspect | BERT | LLM (GPT, etc.) |
|---|---|---|
| Inference Speed | ◎ Fast | △ Slow |
| Inference Cost | ◎ Low | × High |
| Classification Accuracy | ○ Good | ◎ Excellent |
| Batch Processing | ◎ Efficient | △ Costly |

For high-volume email processing (100K+ emails/month), using BERT as a pre-filter significantly reduces LLM API costs while maintaining accuracy.
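A back-of-envelope illustration of the savings, assuming only low-confidence emails are escalated to the LLM. The per-call price and escalation rate are illustrative assumptions, and BERT's (much smaller) compute cost is ignored:

```python
# Cost comparison: LLM-only vs. BERT pre-filter with LLM fallback.
emails_per_month = 100_000
llm_cost_per_email = 0.002   # assumed $ per LLM classification call
escalation_rate = 0.05       # assumed share of low-confidence emails escalated

llm_only = emails_per_month * llm_cost_per_email
with_prefilter = emails_per_month * escalation_rate * llm_cost_per_email
print(f"LLM only: ${llm_only:.0f}/month; BERT pre-filter: ${with_prefilter:.0f}/month")
# LLM only: $200/month; BERT pre-filter: $10/month
```

Under these assumptions the pre-filter cuts LLM spend by 20×; the real savings depend on your LLM pricing and how often the classifier falls below the confidence threshold.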

📜 License

Apache License 2.0

This model follows the license of the base model tohoku-nlp/bert-base-japanese-v3.

📬 Author

  • Author: Naoki Hosokawa
  • Organization: Apple Seed LLC (合同会社アップルシード)
  • Location: K-1 Building 3F, 1-8-18 Ebisu, Shibuya-ku, Tokyo, Japan

⭐ If you find this model useful, please give it a star!
