# Arabic Saudi End-of-Utterance Detection Model

## Model Description
A fine-tuned BERT model for detecting End-of-Utterance (EOU) boundaries in Saudi Arabic conversational text. The model predicts whether a text segment is a complete utterance (EOU) or requires continuation.
- Base Model: faisalq/SaudiBERT
- Dataset: SuperSl6/saudi-eou-dataset (7,150 samples)
- Language: Arabic (Saudi Dialect)
- Task: Binary Classification (EOU Detection)
## Quick Start

### Installation

```bash
pip install transformers torch
```
### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "MohamedSalamaYasen/saudi_eou_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "شكرا جزيلا على المساعدة"

# Tokenize
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Predict
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=1)
prediction = torch.argmax(probs, dim=1).item()

# Results
labels = {0: "CONTINUATION", 1: "EOU"}
confidence = probs[0][prediction].item()
print(f"Text: {text}")
print(f"Prediction: {labels[prediction]}")
print(f"Confidence: {confidence:.3f}")
```
Output:

```text
Text: شكرا جزيلا على المساعدة
Prediction: EOU
Confidence: 0.921
```
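The softmax-and-argmax step above can be illustrated in isolation with hand-made logits (the tensor values below are illustrative placeholders, not real model outputs):

```python
import torch

# Hypothetical logits for one input, shape (1, 2); in practice these
# come from model(**inputs).logits. Index 0 = CONTINUATION, 1 = EOU.
logits = torch.tensor([[-1.2, 2.3]])

probs = torch.softmax(logits, dim=1)            # normalize to probabilities
prediction = torch.argmax(probs, dim=1).item()  # index of the larger probability
confidence = probs[0][prediction].item()

labels = {0: "CONTINUATION", 1: "EOU"}
print(labels[prediction], round(confidence, 3))  # -> EOU 0.971
```

Because the two logits are passed through softmax, the reported confidence is always the probability of the winning class and the two class probabilities sum to 1.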
### Using Pipeline

```python
from transformers import pipeline

# Create classifier
classifier = pipeline(
    "text-classification",
    model="MohamedSalamaYasen/saudi_eou_model"
)

# Predict
result = classifier("مرحبا كيف حالك")
print(result)
# Output: [{'label': 'LABEL_1', 'score': 0.876}]
# LABEL_0 = CONTINUATION, LABEL_1 = EOU
```
### Batch Processing

```python
texts = [
    "مرحبا كيف حالك",       # Complete utterance
    "أنا بخير والحمد لله",  # Complete utterance
    "بس المشكلة انه",       # Incomplete - needs continuation
    "شكرا على وقتك"         # Complete utterance
]

results = classifier(texts)
for text, result in zip(texts, results):
    label = "EOU" if result['label'] == 'LABEL_1' else "CONTINUATION"
    print(f"{text}: {label} ({result['score']:.3f})")
```
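In a downstream application (e.g. deciding when a voice agent should take its turn), a common pattern is to act on the EOU label only above a confidence threshold. A minimal sketch, assuming pipeline-style output dicts; the `should_respond` name and the 0.8 threshold are illustrative choices, not part of the model:

```python
def should_respond(result: dict, threshold: float = 0.8) -> bool:
    """Return True when the classifier output indicates a confident EOU.

    `result` is one element of the pipeline's output,
    e.g. {'label': 'LABEL_1', 'score': 0.876}.
    """
    return result["label"] == "LABEL_1" and result["score"] >= threshold

# Pipeline-style outputs (scores here are made up for illustration)
print(should_respond({"label": "LABEL_1", "score": 0.92}))  # True: confident EOU
print(should_respond({"label": "LABEL_1", "score": 0.55}))  # False: below threshold
print(should_respond({"label": "LABEL_0", "score": 0.99}))  # False: continuation
```

Treating low-confidence EOU predictions as continuations biases the system toward waiting rather than interrupting the speaker, which is usually the safer failure mode.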
## Model Details

### Architecture
- Model Type: BertForSequenceClassification
- Parameters: ~110M
- Max Length: 512 tokens
- Labels:
  - 0: NOT_EOU (continuation needed)
  - 1: EOU (complete utterance)
### Training Details
- Base Model: faisalq/SaudiBERT
- Dataset Size: 7,150 samples
- Training Split: 80% train, 10% validation, 10% test
- Epochs: 3
- Batch Size: 16
- Learning Rate: 2e-5
- Optimizer: AdamW
## Performance
| Metric | Score |
|---|---|
| Accuracy | 98.7% |
| F1 Score | 0.99 |
| Precision | 0.98 |
| Recall | 1.00 |
## Limitations
- Dialect: Optimized for Saudi Arabic; may not work well with other dialects
- Domain: Trained on conversational text; formal/written Arabic may differ
- Length: Limited to 512 tokens per input
- Punctuation: Model removes punctuation during preprocessing
- Short Utterances: Very short texts (1-2 words) may have lower accuracy
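Since the model removes punctuation during preprocessing, mirroring that normalization at inference time may improve consistency. A minimal sketch; the exact punctuation set used in training is not documented, so the character list here is an assumption:

```python
# Arabic and Latin punctuation plus tatweel; this set is assumed, since
# the training-time preprocessing is not published with the model card.
PUNCT = "،؛؟!.,:;\"'()[]{}ـ"
_TABLE = str.maketrans({c: " " for c in PUNCT})

def normalize(text: str) -> str:
    """Strip punctuation and collapse whitespace before inference."""
    return " ".join(text.translate(_TABLE).split())

print(normalize("شكرا جزيلا، على المساعدة!"))  # -> شكرا جزيلا على المساعدة
```

Feeding the normalized string to the tokenizer keeps inference inputs closer to the punctuation-free text the model was trained on.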
## Citation

```bibtex
@misc{saudi_eou_model_2024,
  author = {Mohamed Salama Yasen},
  title = {Arabic End-of-Utterance Detection Model for Saudi Dialect},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/MohamedSalamaYasen/saudi_eou_model}
}
```
## License
Apache 2.0
## Acknowledgments
- Base model: faisalq/SaudiBERT
- Dataset: SuperSl6/saudi-eou-dataset
- Framework: HuggingFace Transformers
- Model Version: 1.0
- Last Updated: December 2024
- Status: Production Ready ✅