Saudi Arabic End-of-Utterance Detection Model

Model Description

Fine-tuned BERT model for detecting End-of-Utterance (EOU) in Saudi Arabic conversational text. The model predicts whether a text segment is complete (EOU) or needs continuation.

Quick Start

Installation

pip install transformers torch

Basic Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "MohamedSalamaYasen/saudi_eou_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "شكرا جزيلا على المساعدة"

# Tokenize
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Predict
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)
    prediction = torch.argmax(probs, dim=1).item()

# Results
labels = {0: "CONTINUATION", 1: "EOU"}
confidence = probs[0][prediction].item()

print(f"Text: {text}")
print(f"Prediction: {labels[prediction]}")
print(f"Confidence: {confidence:.3f}")

Output:

Text: شكرا جزيلا على المساعدة
Prediction: EOU
Confidence: 0.921

Using Pipeline

from transformers import pipeline

# Create classifier
classifier = pipeline(
    "text-classification",
    model="MohamedSalamaYasen/saudi_eou_model"
)

# Predict
result = classifier("مرحبا كيف حالك")
print(result)
# Output: [{'label': 'LABEL_1', 'score': 0.876}]
# LABEL_0 = CONTINUATION, LABEL_1 = EOU
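The pipeline returns the generic LABEL_0/LABEL_1 names; a small helper can remap them to readable ones. The `readable` function and `LABEL_NAMES` dict below are a hypothetical convenience, not part of the model:

```python
# Hypothetical helper mapping the pipeline's generic labels to readable names.
LABEL_NAMES = {"LABEL_0": "CONTINUATION", "LABEL_1": "EOU"}

def readable(result):
    """Turn one pipeline result dict into a (label, score) pair."""
    return LABEL_NAMES[result["label"]], result["score"]

print(readable({"label": "LABEL_1", "score": 0.876}))
# -> ('EOU', 0.876)
```

Alternatively, the mapping can be baked in by setting `model.config.id2label = {0: "CONTINUATION", 1: "EOU"}` on a loaded model before building the pipeline, so results carry readable labels directly.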

Batch Processing

texts = [
    "مرحبا كيف حالك",           # Complete utterance
    "أنا بخير والحمد لله",      # Complete utterance  
    "بس المشكلة انه",          # Incomplete - needs continuation
    "شكرا على وقتك"            # Complete utterance
]

results = classifier(texts)

for text, result in zip(texts, results):
    label = "EOU" if result['label'] == 'LABEL_1' else "CONTINUATION"
    print(f"{text}: {label} ({result['score']:.3f})")
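In a voice-agent setting, a confidence threshold on top of the classifier helps avoid cutting the speaker off on borderline predictions. A minimal sketch; `is_end_of_turn` and the 0.8 threshold are illustrative assumptions, not part of the model:

```python
# Hypothetical post-processing for turn-taking: only treat a segment as a
# finished turn when the classifier is confident it saw an EOU.
def is_end_of_turn(result, threshold=0.8):
    """`result` is one dict from the text-classification pipeline,
    e.g. {'label': 'LABEL_1', 'score': 0.91}. The 0.8 threshold is an
    illustrative default, not a tuned value."""
    return result["label"] == "LABEL_1" and result["score"] >= threshold
```

In practice the threshold would be tuned on held-out conversations to balance interruptions against response latency.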

Model Details

Architecture

  • Model Type: BertForSequenceClassification
  • Parameters: ~110M
  • Max Length: 512 tokens
  • Labels:
    • 0: NOT_EOU / CONTINUATION (continuation needed)
    • 1: EOU (complete utterance)

Training Details

  • Base Model: faisalq/SaudiBERT
  • Dataset Size: 7,150 samples
  • Training Split: 80% train, 10% validation, 10% test
  • Epochs: 3
  • Batch Size: 16
  • Learning Rate: 2e-5
  • Optimizer: AdamW

Performance

Metric      Score
---------   -----
Accuracy    98.7%
F1 Score    0.99
Precision   0.98
Recall      1.00
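These figures come from the 10% test split. To sanity-check them on your own labeled data, the per-class metrics can be computed directly; a self-contained sketch treating EOU (label 1) as the positive class:

```python
# Compute precision/recall/F1 by hand for the positive (EOU = 1) class,
# given parallel lists of gold labels and model predictions (0 or 1).
def prf(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf([1, 1, 0, 0], [1, 0, 0, 0]))
# -> (1.0, 0.5, 0.6666666666666666)
```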

Limitations

  • Dialect: Optimized for Saudi Arabic; may not work well with other dialects
  • Domain: Trained on conversational text; formal/written Arabic may differ
  • Length: Limited to 512 tokens per input
  • Punctuation: Model removes punctuation during preprocessing
  • Short Utterances: Very short texts (1-2 words) may have lower accuracy
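Because punctuation was stripped during training, removing it before inference may better match the training distribution. The exact training-time rules are not published, so the character set below is a best-guess assumption covering common Latin and Arabic marks:

```python
import re

# Best-guess preprocessing (the card only states that punctuation is removed;
# the exact character set used in training is not published).
PUNCT = re.compile(r"[.,!?;:،؛؟«»\"'()\[\]{}\-]")

def strip_punct(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return re.sub(r"\s+", " ", PUNCT.sub(" ", text)).strip()

print(strip_punct("مرحبا، كيف حالك؟"))
# -> مرحبا كيف حالك
```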

Citation

@misc{saudi_eou_model_2024,
  author = {Mohamed Salama Yasen},
  title = {Arabic End-of-Utterance Detection Model for Saudi Dialect},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/MohamedSalamaYasen/saudi_eou_model}
}

License

Apache 2.0

Model Version: 1.0
Last Updated: December 2024
Status: Production Ready ✅
