train_run-qwen2.5-1.5b-instruct-arabic-diacritization-full-fadel-L40S_full

This model is a fine-tuned version of Qwen/Qwen2.5-1.5B-Instruct on an unknown dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0375

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0002
  • train_batch_size: 32
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 256
  • optimizer: paged_adamw_32bit with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: constant
  • lr_scheduler_warmup_ratio: 0.03
  • num_epochs: 1
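
The total train batch size of 256 follows from the per-device batch size and gradient accumulation. A minimal sketch of the arithmetic (the single-device count is an assumption, suggested by the single L40S in the run name):

```python
# Effective batch size = per-device batch × gradient accumulation steps × device count.
per_device_train_batch_size = 32
gradient_accumulation_steps = 8
num_devices = 1  # assumption: one L40S GPU, as the run name suggests

total_train_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(total_train_batch_size)  # → 256
```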

Training results

Training Loss   Epoch    Step   Validation Loss
0.0662          0.1995   100    0.0664
0.0511          0.3989   200    0.0513
0.0445          0.5984   300    0.0446
0.0409          0.7978   400    0.0413
0.0373          0.9973   500    0.0375

Framework versions

  • PEFT 0.15.2
  • Transformers 4.52.3
  • Pytorch 2.7.0+cu128
  • Datasets 3.6.0
  • Tokenizers 0.21.1

Prompts used during training

SYSTEM_PROMPT = "<|im_start|>system\nYou are an expert Arabic linguist. Your task is to add diacritics (Tashkeel) to the given Arabic text accurately.<|im_end|>"
USER_PROMPT_TEMPLATE = "<|im_start|>user\nDiacritize the following text:\n{undiacritized_text}<|im_end|>"
ASSISTANT_PROMPT_TEMPLATE = "<|im_start|>assistant\n{diacritized_text}<|im_end|>" # Note: includes EOS token
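
A complete training sample is the concatenation of the three templates; the `build_training_example` helper below and the newline separators are assumptions, mirroring how the inference code joins the system and user turns:

```python
SYSTEM_PROMPT = "<|im_start|>system\nYou are an expert Arabic linguist. Your task is to add diacritics (Tashkeel) to the given Arabic text accurately.<|im_end|>"
USER_PROMPT_TEMPLATE = "<|im_start|>user\nDiacritize the following text:\n{undiacritized_text}<|im_end|>"
ASSISTANT_PROMPT_TEMPLATE = "<|im_start|>assistant\n{diacritized_text}<|im_end|>"

def build_training_example(undiacritized: str, diacritized: str) -> str:
    """Assemble one ChatML-style training sample from the three templates."""
    return "\n".join([
        SYSTEM_PROMPT,
        USER_PROMPT_TEMPLATE.format(undiacritized_text=undiacritized),
        ASSISTANT_PROMPT_TEMPLATE.format(diacritized_text=diacritized),
    ])

sample = build_training_example("اكل الولد", "أَكَلَ الوَلَدُ")
```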

Code for testing

import re

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# base_model_id, quantization_config, final_adapter_dir, and tokenizer are
# assumed to already be defined by the training script, e.g.:
# base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
# tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# --- Load Base Model and Adapter for Inference ---
print("Loading model for inference...")

# Load the base model with quantization (same config as training)
base_model_for_inference = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=quantization_config, # Use the same BNB config
    device_map="auto",
    trust_remote_code=True
)

# Load the PEFT adapter
adapter_path = final_adapter_dir # Use the path where the adapter was saved
inference_model = PeftModel.from_pretrained(base_model_for_inference, adapter_path)
inference_model.eval() # Set to evaluation mode

print("Model ready for inference.")

# --- Inference Function ---
def diacritize_text(text, model, tokenizer):
    """Takes undiacritized text and returns the model's diacritized version."""
    # Prepare the prompt
    undiacritized = re.sub(r'\s+', ' ', text).strip() # Basic cleaning
    user_prompt = USER_PROMPT_TEMPLATE.format(undiacritized_text=undiacritized)
    # Construct the prompt that the model expects *up to* the assistant's turn
    prompt = f"{SYSTEM_PROMPT}\n{user_prompt}\n<|im_start|>assistant\n"

    # Tokenize the input
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

    # Generate output
    # Adjust max_new_tokens based on expected output length; add other generation params as needed
    # Using greedy decoding here, consider beam search, top-k, top-p for potentially better results
    outputs = model.generate(
        **inputs,
        max_new_tokens=len(inputs['input_ids'][0])*2, # Allow for diacritics and some buffer
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        top_k=1,
    )

    # Decode the generated tokens
    # Decode only the *newly generated* tokens
    generated_ids = outputs[0, inputs['input_ids'].shape[1]:]
    decoded_output = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    # Sometimes the model might hallucinate extra text after the main diacritization.
    # A simple heuristic: stop at the first newline or excessive length if needed.
    # decoded_output = decoded_output.split('\n')[0] # Simple cleanup if needed

    return decoded_output

# --- Example Usage ---
input_text = "اكل الولد التفاحة في الحديقة"
print(f"\nInput Text: {input_text}")

diacritized_output = diacritize_text(input_text, inference_model, tokenizer)
print(f"Model Output: {diacritized_output}")

input_text_2 = "بسم الله الرحمن الرحيم"
print(f"\nInput Text: {input_text_2}")
diacritized_output_2 = diacritize_text(input_text_2, inference_model, tokenizer)
print(f"Model Output: {diacritized_output_2}")

input_text_3 = "تعتبر اللغة العربية من أصعب اللغات في العالم"
print(f"\nInput Text: {input_text_3}")
diacritized_output_3 = diacritize_text(input_text_3, inference_model, tokenizer)
print(f"Model Output: {diacritized_output_3}")
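
A rough way to sanity-check outputs is to strip the diacritics back off and compare letter skeletons. The helpers below are a sketch, not part of the training code; the Unicode range used for tashkeel (U+064B–U+0652 plus the superscript alef U+0670) is a common simplification rather than an exhaustive list:

```python
import re

# Arabic diacritic marks: fathatan..sukun (U+064B-U+0652) and superscript alef (U+0670).
TASHKEEL_RE = re.compile(r"[\u064B-\u0652\u0670]")

def strip_tashkeel(text: str) -> str:
    """Remove diacritic marks, leaving only the bare letters."""
    return TASHKEEL_RE.sub("", text)

def diacritic_error_rate(reference: str, hypothesis: str) -> float:
    """Crude character-level DER: fraction of letter positions whose attached
    diacritics differ. Assumes both strings share the same bare-letter skeleton."""
    def split(text):
        groups, current = [], None
        for ch in text:
            if TASHKEEL_RE.match(ch):
                if current is not None:
                    current[1] += ch  # attach mark to the preceding letter
            else:
                current = [ch, ""]
                groups.append(current)
        return groups
    ref, hyp = split(reference), split(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("letter skeletons differ")
    errors = sum(1 for r, h in zip(ref, hyp) if r[1] != h[1])
    return errors / max(len(ref), 1)
```

If the model only added marks, `strip_tashkeel(diacritized_output)` should equal the cleaned input text; a mismatch signals dropped or hallucinated letters.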

Training run

https://wandb.ai/bishertello-/uncategorized/runs/y99ptnn7?nw=nwuserbishertello
