Saudi Arabic End-of-Utterance Detection Model

Model Description

Fine-tuned BERT model for detecting End-of-Utterance (EOU) in Saudi Arabic conversational text. The model predicts whether a text segment is complete (EOU) or needs continuation.

Quick Start

Installation

pip install transformers torch

Basic Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "MohamedSalamaYasen/saudi_eou_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "شكرا جزيلا على المساعدة"

# Tokenize
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Predict
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)
    prediction = torch.argmax(probs, dim=1).item()

# Results
labels = {0: "CONTINUATION", 1: "EOU"}
confidence = probs[0][prediction].item()

print(f"Text: {text}")
print(f"Prediction: {labels[prediction]}")
print(f"Confidence: {confidence:.3f}")

Output:

Text: شكرا جزيلا على المساعدة
Prediction: EOU
Confidence: 0.921

Using Pipeline

from transformers import pipeline

# Create classifier
classifier = pipeline(
    "text-classification",
    model="MohamedSalamaYasen/saudi_eou_model"
)

# Predict
result = classifier("مرحبا كيف حالك")
print(result)
# Output: [{'label': 'LABEL_1', 'score': 0.876}]
# LABEL_0 = CONTINUATION, LABEL_1 = EOU
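The pipeline returns the generic LABEL_0/LABEL_1 names; a small helper can remap them to readable ones. The `readable` function and `LABEL_NAMES` dict below are a hypothetical convenience, not part of the model:

```python
# Hypothetical helper mapping the pipeline's generic labels to readable names.
LABEL_NAMES = {"LABEL_0": "CONTINUATION", "LABEL_1": "EOU"}

def readable(result):
    """Turn one pipeline result dict into a (label, score) pair."""
    return LABEL_NAMES[result["label"]], result["score"]

print(readable({"label": "LABEL_1", "score": 0.876}))
# -> ('EOU', 0.876)
```

Alternatively, the mapping can be baked in by setting `model.config.id2label = {0: "CONTINUATION", 1: "EOU"}` on a loaded model before building the pipeline, so results carry readable labels directly.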

Batch Processing

texts = [
    "مرحبا كيف حالك",           # Complete utterance
    "أنا بخير والحمد لله",      # Complete utterance  
    "بس المشكلة انه",          # Incomplete - needs continuation
    "شكرا على وقتك"            # Complete utterance
]

results = classifier(texts)

for text, result in zip(texts, results):
    label = "EOU" if result['label'] == 'LABEL_1' else "CONTINUATION"
    print(f"{text}: {label} ({result['score']:.3f})")
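In a voice-agent setting, a confidence threshold on top of the classifier helps avoid cutting the speaker off on borderline predictions. A minimal sketch; `is_end_of_turn` and the 0.8 threshold are illustrative assumptions, not part of the model:

```python
# Hypothetical post-processing for turn-taking: only treat a segment as a
# finished turn when the classifier is confident it saw an EOU.
def is_end_of_turn(result, threshold=0.8):
    """`result` is one dict from the text-classification pipeline,
    e.g. {'label': 'LABEL_1', 'score': 0.91}. The 0.8 threshold is an
    illustrative default, not a tuned value."""
    return result["label"] == "LABEL_1" and result["score"] >= threshold
```

In practice the threshold would be tuned on held-out conversations to balance interruptions against response latency.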

Model Details

Architecture

  • Model Type: BertForSequenceClassification
  • Parameters: ~110M
  • Max Length: 512 tokens
  • Labels:
    • 0: NOT_EOU / CONTINUATION (continuation needed)
    • 1: EOU (complete utterance)

Training Details

  • Base Model: faisalq/SaudiBERT
  • Dataset Size: 7,150 samples
  • Training Split: 80% train, 10% validation, 10% test
  • Epochs: 3
  • Batch Size: 16
  • Learning Rate: 2e-5
  • Optimizer: AdamW

Performance

Metric      Score
---------   -----
Accuracy    98.7%
F1 Score    0.99
Precision   0.98
Recall      1.00
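These figures come from the 10% test split. To sanity-check them on your own labeled data, the per-class metrics can be computed directly; a self-contained sketch treating EOU (label 1) as the positive class:

```python
# Compute precision/recall/F1 by hand for the positive (EOU = 1) class,
# given parallel lists of gold labels and model predictions (0 or 1).
def prf(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf([1, 1, 0, 0], [1, 0, 0, 0]))
# -> (1.0, 0.5, 0.6666666666666666)
```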

Limitations

  • Dialect: Optimized for Saudi Arabic; may not work well with other dialects
  • Domain: Trained on conversational text; formal/written Arabic may differ
  • Length: Limited to 512 tokens per input
  • Punctuation: Model removes punctuation during preprocessing
  • Short Utterances: Very short texts (1-2 words) may have lower accuracy
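Because punctuation was stripped during training, removing it before inference may better match the training distribution. The exact training-time rules are not published, so the character set below is a best-guess assumption covering common Latin and Arabic marks:

```python
import re

# Best-guess preprocessing (the card only states that punctuation is removed;
# the exact character set used in training is not published).
PUNCT = re.compile(r"[.,!?;:،؛؟«»\"'()\[\]{}\-]")

def strip_punct(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return re.sub(r"\s+", " ", PUNCT.sub(" ", text)).strip()

print(strip_punct("مرحبا، كيف حالك؟"))
# -> مرحبا كيف حالك
```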

Citation

@misc{saudi_eou_model_2024,
  author = {Mohamed Salama Yasen},
  title = {Arabic End-of-Utterance Detection Model for Saudi Dialect},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/MohamedSalamaYasen/saudi_eou_model}
}

License

Apache 2.0

Model Version: 1.0
Last Updated: December 2024
Status: Production Ready ✅
