# Model Card: M2M100 Algerian Dialect (ARQ) → Modern Standard Arabic (ARB)
## Model Details

### Model Description

This model is a fine-tuned version of Facebook's M2M100 (418M parameters), adapted for translating Algerian dialect (ARQ) into Modern Standard Arabic (ARB). Fine-tuning used a parallel dataset of 137,000 sentence pairs to improve translation accuracy for this specific language pair.
- Model type: Multilingual Machine Translation (Transformer, Encoder–Decoder)
- Language(s) (NLP): Algerian Dialect (ARQ) → Modern Standard Arabic (ARB)
- Finetuned from model: facebook/m2m100_418M
## Uses

### Direct Use
This model can be used for:
- Translating Algerian dialect (ARQ) text into Modern Standard Arabic (ARB).
- NLP applications focused on Arabic text normalization and understanding.
- Improving Arabic language understanding systems that need to handle Algerian dialect.
### Downstream Use
This model could be used in language translation applications, chatbots, or other NLP systems that require Algerian dialect processing.
## Bias, Risks, and Limitations

- Bias: The model may reflect biases present in the training data, particularly linguistic or cultural biases.
- Risks: Incorrect or misleading translations may occur, especially for highly ambiguous or slang terms.
- Limitations: The model is specific to Algerian dialect (ARQ) and Modern Standard Arabic (ARB) and may not generalize to other dialects, languages, or specialized domains.
### Recommendations

Users (both direct and downstream) should be made aware of these risks, biases, and limitations. In particular, translations of ambiguous or slang expressions should be reviewed by a human before use in sensitive contexts.
## How to Get Started with the Model

Use the code below to get started with the model:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

repo_name = "Aicha-zkr/M2M100-Algerian-Dialect-to-MSA"

model = AutoModelForSeq2SeqLM.from_pretrained(repo_name)
tokenizer = AutoTokenizer.from_pretrained(repo_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# The base M2M100 tokenizer uses the code "ar" for Arabic. Set the source
# language and look up the target-language token id once, outside the loop.
tokenizer.src_lang = "ar"
arabic_lang_id = tokenizer.get_lang_id("ar")

source_sentences = [
    "كي العادة راني نخدم بزاف وما لقيتش وقت نرتاح",
    "اليوم الجو حار، ما قدرتش نخرج",
    "واش راك داير؟ نحتاجو نتلاقاو",
    "أنا عندي مشكلة في الانترنت، ما يشتغلش",
    "راني محتار بين هاد الخيارين",
    "واش رأيك في هاد الفيلم؟ كان مليح",
    "شحال من مرة قلتلك ما تديرهاش؟",
    "أحتاج نروح عند الطبيب بكري",
    "إلى كانت الخدمة صعيبة، خليها",
    "خليت الدار وراحت الرحلة كلها كانت ممتازة",
]

for source_sentence in source_sentences:
    encoded_input = tokenizer(
        source_sentence,
        return_tensors="pt",
        padding=True,
        truncation=True,
    ).to(device)

    # Force the decoder to start with the Arabic language token.
    generated_tokens = model.generate(
        **encoded_input,
        forced_bos_token_id=arabic_lang_id,
    )

    translated_sentence = tokenizer.batch_decode(
        generated_tokens, skip_special_tokens=True
    )[0]

    print(f"Original: {source_sentence}")
    print(f"Translation: {translated_sentence}\n")
```
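The loop above translates one sentence at a time. For larger corpora it is usually faster to pass lists of sentences to the tokenizer and `model.generate` in batches; the chunking logic is plain Python. A minimal, illustrative helper (the batch size of 8 is an arbitrary choice, not a tuned value):

```python
def batched(items, batch_size=8):
    """Yield successive fixed-size chunks from a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Example: ten sentences split into batches of 8 and 2.
batch_sizes = [len(b) for b in batched(list(range(10)), batch_size=8)]
print(batch_sizes)  # -> [8, 2]
```

Each chunk can then be tokenized with `padding=True` and decoded with `tokenizer.batch_decode`, exactly as in the loop above, but with a whole list of sentences per call.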
## Training Details

### Training Data

The model was fine-tuned on 137,000 parallel sentence pairs of Algerian dialect (ARQ) and Modern Standard Arabic (ARB), which allowed it to specialize in this dialect-to-MSA translation direction.
- 137k sentence pairs (ARQ → ARB) in total.
- 14k high-quality human-labeled pairs.
- GPT-4o-generated translations, manually verified.
- Additional manually translated samples.
### Training Hyperparameters

- Max sequence length: 128 tokens
- Batch size: 16
- Learning rate: 5e-5 (linear decay)
- Epochs: 2
- Precision: mixed FP16
- Training time: ~5.5 hours on a Kaggle P100 GPU
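The "linear decay" schedule above matches the default linear schedule in `transformers`, which reduces the learning rate from its base value to zero over the total number of optimizer steps. A standalone sketch of that schedule (assuming no warmup phase; the step count is derived from the figures above):

```python
def linear_lr(step, total_steps, base_lr=5e-5):
    """Linearly decay the learning rate from base_lr at step 0 to 0 at total_steps."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

# 137k pairs, batch size 16, 2 epochs -> roughly this many optimizer steps:
total_steps = (137_000 // 16) * 2  # 17124

print(linear_lr(0, total_steps))            # base rate at the start
print(linear_lr(total_steps, total_steps))  # 0.0 at the end
```

In practice the schedule is handled automatically when `lr_scheduler_type="linear"` is used with the `transformers` `Trainer`; the function above only illustrates the shape of the decay.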
## Model Card Contact

Email: aicha.zenakhri@ensia.edu.dz