Qwen2.5-1.5B-AMIYA-Palestinian

A LoRA fine-tune of Qwen2.5-1.5B-Instruct for Palestinian Arabic dialect generation and translation, prepared for the AMIYA (Arabic Modeling In Your Accent) Shared Task at VarDial 2026.

Model Details

  • Base Model: Qwen/Qwen2.5-1.5B-Instruct
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Target Dialect: Palestinian Arabic (Palestinian DA)
  • Task: Generation and Translation (MSA↔Palestinian, English↔Palestinian)
  • Competition: AMIYA Shared Task @ VarDial 2026

Model Architecture

  • Base Model: Qwen2.5-1.5B-Instruct (1.5B parameters)
  • Adapter: LoRA with rank=16, alpha=32
  • Target Modules: q_proj, k_proj, v_proj, o_proj
  • Trainable Parameters: ~4.2M (0.28% of base model)
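
As a rough reconstruction, the adapter settings above map onto a PEFT LoraConfig like the sketch below; lora_dropout, bias, and task_type are assumptions, not confirmed training settings.

from peft import LoraConfig

# Hypothetical reconstruction of the adapter configuration described above;
# lora_dropout and bias are assumed values, not confirmed settings.
lora_config = LoraConfig(
    r=16,                                                     # LoRA rank
    lora_alpha=32,                                            # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,                                        # assumed
    bias="none",                                              # assumed
    task_type="CAUSAL_LM",
)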

Training Details

Training Data

This model was fine-tuned on Palestinian Arabic dialect data prepared from multiple publicly available sources:

  1. AMIYA Training Dataset (amiya_data/train.jsonl)

    • Source: AMIYA Shared Task training data
    • Examples: 15,000 training examples (sampled from 38,610 available)
    • Task Types: Generation (100%)
    • Format: Qwen2.5 chat-template-formatted instruction-output pairs (a formatting sketch follows this list)
  2. Additional Data Sources Used for Data Preparation:

    • Combined Dialect Dataset: Aggregated Palestinian dialect text examples
    • Maknuune Corpus (v1.0.1): Palestinian Arabic dialect lexicon with translation pairs
    • Shami Dataset: Palestinian dialect corpus
    • Casablanca Dataset: Palestinian dialect speech transcriptions
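
For reference, an instruction-output pair can be rendered into the Qwen2.5 chat format with the tokenizer's built-in template. The record fields and system prompt below are illustrative assumptions, not the exact schema of train.jsonl.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# Hypothetical training record; the actual field names in train.jsonl may differ.
example = {
    "instruction": "Write a greeting in Palestinian dialect.",
    "output": "مرحبا! كيفك؟ شو أخبارك؟",
}

messages = [
    {"role": "system", "content": "You are a helpful assistant that generates text in Palestinian Arabic dialect."},
    {"role": "user", "content": example["instruction"]},
    {"role": "assistant", "content": example["output"]},
]

# Render the pair into the <|im_start|>...<|im_end|> format used for training
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)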

Training Configuration

  • Epochs: 1
  • Batch Size: 16 (per device) × 2 (gradient accumulation) = 32 effective
  • Learning Rate: 2e-4
  • Max Sequence Length: 256 tokens
  • Warmup Steps: 100
  • Optimization: bfloat16 mixed precision, gradient checkpointing
  • Training Time: <8 hours on a single GPU
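
These hyperparameters map onto Hugging Face TrainingArguments roughly as sketched below; the output path is an assumption, and the max sequence length is typically enforced at tokenization time (or via an SFT trainer) rather than in TrainingArguments.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2.5-1.5b-amiya-palestinian",  # assumed path
    num_train_epochs=1,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,   # 16 × 2 = 32 effective batch size
    learning_rate=2e-4,
    warmup_steps=100,
    bf16=True,                       # bfloat16 mixed precision
    gradient_checkpointing=True,
)
# The max sequence length (256 tokens) is applied when tokenizing the data.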

AMIYA Track Classification

Track: Open

This submission uses:

  • ✅ Publicly available base model (Qwen/Qwen2.5-1.5B-Instruct)
  • ✅ Publicly available training data sources (Maknuune, Shami, Casablanca, Combined Dataset)
  • ✅ AMIYA-provided training data

The model was trained on data prepared from publicly available Palestinian dialect corpora, in addition to the AMIYA-provided training set.

Usage

Installation

pip install transformers peft torch accelerate

Loading the Model

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "Khamad/qwen2.5-1.5b-amiya-palestinian")

# Set padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
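
If you prefer to serve the model without a PEFT dependency at inference time, the LoRA weights can be merged into the base model. This is standard PEFT usage rather than a required step; the save path below is an assumption.

# Optional: merge the adapter into the base weights for standalone inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("qwen2.5-1.5b-amiya-palestinian-merged")  # assumed path
tokenizer.save_pretrained("qwen2.5-1.5b-amiya-palestinian-merged")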

Generation Example

# Format prompt using Qwen2.5 chat template
prompt = """<|im_start|>system
You are a helpful assistant that generates text in Palestinian Arabic dialect. Write naturally in Palestinian dialect.<|im_end|>
<|im_start|>user
Write a greeting in Palestinian dialect.<|im_end|>
<|im_start|>assistant
"""

# Tokenize
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens (the prompt is excluded)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)
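
Rather than writing the <|im_start|> markup by hand, the same prompt can be built with the tokenizer's chat template, which is less error-prone:

messages = [
    {"role": "system", "content": "You are a helpful assistant that generates text in Palestinian Arabic dialect. Write naturally in Palestinian dialect."},
    {"role": "user", "content": "Write a greeting in Palestinian dialect."},
]

# add_generation_prompt=True appends the opening <|im_start|>assistant tag
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)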

Translation Example

# MSA to Palestinian translation
prompt = """<|im_start|>system
Translate the following Modern Standard Arabic text to Palestinian Arabic dialect.<|im_end|>
<|im_start|>user
السلام عليكم ورحمة الله وبركاته<|im_end|>
<|im_start|>assistant
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    temperature=0.7,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    pad_token_id=tokenizer.pad_token_id
)
# Decode only the generated translation, excluding the prompt
translation = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(translation)
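
English↔Palestinian translation follows the same pattern; only the system instruction and input change. The instruction wording below is an assumption, written to mirror the MSA example above.

# English to Palestinian translation, built with the chat template
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Translate the following English text to Palestinian Arabic dialect."},
        {"role": "user", "content": "Good morning, how are you today?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)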

Evaluation

This model is evaluated on the AMIYA Shared Task benchmark (AL-QASIDA) with the following metrics:

  • ADI2 Dialect Fidelity Score: Measures dialect authenticity
  • chrF++ Translation Score: Evaluates translation quality (DA↔English, DA↔MSA)
  • Human Evaluation: Fluency and dialect adherence
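
For local sanity checks of the chrF++ metric, sacreBLEU can be used (word_order=2 yields chrF++). The sentences below are placeholders, not data from the shared task.

from sacrebleu.metrics import CHRF

# chrF++ = chrF extended with word n-grams up to order 2
chrf = CHRF(word_order=2)

hypotheses = ["كيفك اليوم؟"]          # model outputs (placeholder)
references = [["كيف حالك اليوم؟"]]    # one reference stream (placeholder)

print(chrf.corpus_score(hypotheses, references))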

Limitations

  • The model is fine-tuned primarily on generation tasks; translation examples were limited during training
  • The small model size (1.5B parameters) may limit performance on complex translation tasks
  • Training data focused on the Palestinian dialect; performance on other Arabic dialects may vary

Citation

If you use this model, please cite:

@misc{qwen2.5-1.5b-amiya-palestinian,
  title={Qwen2.5-1.5B-AMIYA-Palestinian: Fine-tuned Model for Palestinian Arabic Dialect},
  author={Khamad},
  year={2025},
  howpublished={\url{https://huggingface.co/Khamad/qwen2.5-1.5b-amiya-palestinian}},
  note={Submission to AMIYA Shared Task @ VarDial 2026}
}

License

This model is released under the Apache 2.0 license, consistent with the base Qwen2.5-1.5B-Instruct model.

Contact

For questions about this model or the AMIYA submission:


Note: This model was developed for the AMIYA Shared Task evaluation. Results may vary depending on the specific evaluation setup and prompts used.
