
Model Card for M2M100-Algerian-Dialect-to-MSA

Model Details

Model Description

This model is a fine-tuned version of the Facebook M2M100 (418M parameters), specifically adapted for translating Algerian dialect (ARQ) into Modern Standard Arabic (ARB). The fine-tuning process used a parallel dataset of 137,000 sentence pairs to improve the model’s translation accuracy for this specific language pair.

  • Model type: Multilingual Machine Translation (Transformer, Encoder–Decoder)
  • Language(s) (NLP): Algerian Dialect (ARQ) → Modern Standard Arabic (ARB)
  • Finetuned from model: facebook/m2m100_418M

Uses

Direct Use

This model can be used for:

  • Translating Algerian dialect (ARQ) text into Modern Standard Arabic (ARB).
  • NLP applications focused on Arabic text normalization and understanding.
  • Improving Arabic language understanding systems that handle Algerian dialect.

Downstream Use

This model could be used in language translation applications, chatbots, or other NLP systems that require Algerian dialect processing.

Bias, Risks, and Limitations

  • Bias: The model may reflect biases present in the training data, particularly linguistic or cultural biases.
  • Risks: Incorrect or misleading translations may occur, especially for highly ambiguous or slang terms.
  • Limitations: The model is specific to Algerian dialect (ARQ) and Modern Standard Arabic (ARB) and may not generalize to other dialects, languages, or specialized domains.

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Model repository
repo_name = "Aicha-zkr/M2M100-Algerian-Dialect-to-MSA"

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(repo_name)
tokenizer = AutoTokenizer.from_pretrained(repo_name)

# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Example Algerian dialect sentences
source_sentences = [
    "كي العادة راني نخدم بزاف وما لقيتش وقت نرتاح",
    "اليوم الجو حار، ما قدرتش نخرج",
    "واش راك داير؟ نحتاجو نتلاقاو",
    "أنا عندي مشكلة في الانترنت، ما يشتغلش",
    "راني محتار بين هاد الخيارين",
    "واش رأيك في هاد الفيلم؟ كان مليح",
    "شحال من مرة قلتلك ما تديرهاش؟",
    "أحتاج نروح عند الطبيب بكري",
    "إلى كانت الخدمة صعيبة، خليها",
    "خليت الدار وراحت الرحلة كلها كانت ممتازة"
]

# M2M100 uses ISO 639-1 language codes; Arabic (covering MSA output) is "ar"
tokenizer.src_lang = "ar"
arabic_lang_id = tokenizer.get_lang_id("ar")

# Translate each sentence to Modern Standard Arabic
for source_sentence in source_sentences:
    # Tokenize
    encoded_input = tokenizer(
        source_sentence,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128,  # matches the 128-token training limit
    ).to(device)

    # Generate the translation, forcing the MSA target language token
    generated_tokens = model.generate(
        **encoded_input,
        forced_bos_token_id=arabic_lang_id
    )

    # Decode output
    translated_sentence = tokenizer.batch_decode(
        generated_tokens, skip_special_tokens=True
    )[0]

    print(f"Original: {source_sentence}")
    print(f"Translation: {translated_sentence}\n")

Training Details

Training Data

The model was fine-tuned on a dataset of 137,000 sentence pairs containing Algerian dialect (ARQ) and Modern Standard Arabic (ARB). This parallel dataset allowed the model to specialize in translating this specific dialect.

  • 137k sentence pairs (ARQ → ARB) in total.
  • 14k high-quality human-annotated pairs.
  • GPT-4o-generated translations (manually verified).
  • Additional manually translated samples.

Training Hyperparameters

  • Max sequence length: 128 tokens.
  • Batch size: 16.
  • Learning rate: 5e-5 (linear decay).
  • Epochs: 2.
  • Precision: mixed FP16.
  • Training time: ~5.5 hours on a Kaggle P100 GPU.
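The hyperparameters above map onto a 🤗 Transformers fine-tuning configuration roughly as follows. This is a sketch, not the exact training script used for this model; the output directory name is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the fine-tuning configuration described above.
# "m2m100-arq-arb" is a hypothetical output directory.
training_args = Seq2SeqTrainingArguments(
    output_dir="m2m100-arq-arb",
    per_device_train_batch_size=16,  # batch size: 16
    learning_rate=5e-5,              # initial learning rate
    lr_scheduler_type="linear",      # linear decay
    num_train_epochs=2,              # 2 epochs
    fp16=True,                       # mixed FP16 precision
    predict_with_generate=True,      # generate during evaluation
)
```

These arguments would typically be passed to a `Seq2SeqTrainer` together with the tokenized 137k-pair dataset.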

Model Card Contact

Email: aicha.zenakhri@ensia.edu.dz
