train_run-qwen2.5-1.5b-instruct-arabic-diacritization-full-fadel-L40S_full

This model is a fine-tuned version of Qwen/Qwen2.5-1.5B-Instruct on an unknown dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0375

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0002
  • train_batch_size: 32
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 256
  • optimizer: paged_adamw_32bit with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: constant
  • lr_scheduler_warmup_ratio: 0.03
  • num_epochs: 1
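
The total train batch size of 256 follows from the per-device batch size and gradient accumulation. A minimal sketch of the arithmetic (the single-device count is an assumption, suggested by the single L40S in the run name):

```python
# Effective batch size = per-device batch × gradient accumulation steps × device count.
per_device_train_batch_size = 32
gradient_accumulation_steps = 8
num_devices = 1  # assumption: one L40S GPU, as the run name suggests

total_train_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(total_train_batch_size)  # → 256
```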

Training results

Training Loss   Epoch    Step   Validation Loss
0.0662          0.1995   100    0.0664
0.0511          0.3989   200    0.0513
0.0445          0.5984   300    0.0446
0.0409          0.7978   400    0.0413
0.0373          0.9973   500    0.0375

Framework versions

  • PEFT 0.15.2
  • Transformers 4.52.3
  • Pytorch 2.7.0+cu128
  • Datasets 3.6.0
  • Tokenizers 0.21.1

Prompts used during training

SYSTEM_PROMPT = "<|im_start|>system\nYou are an expert Arabic linguist. Your task is to add diacritics (Tashkeel) to the given Arabic text accurately.<|im_end|>"
USER_PROMPT_TEMPLATE = "<|im_start|>user\nDiacritize the following text:\n{undiacritized_text}<|im_end|>"
ASSISTANT_PROMPT_TEMPLATE = "<|im_start|>assistant\n{diacritized_text}<|im_end|>" # Note: includes EOS token
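
A complete training sample is the concatenation of the three templates; the `build_training_example` helper below and the newline separators are assumptions, mirroring how the inference code joins the system and user turns:

```python
SYSTEM_PROMPT = "<|im_start|>system\nYou are an expert Arabic linguist. Your task is to add diacritics (Tashkeel) to the given Arabic text accurately.<|im_end|>"
USER_PROMPT_TEMPLATE = "<|im_start|>user\nDiacritize the following text:\n{undiacritized_text}<|im_end|>"
ASSISTANT_PROMPT_TEMPLATE = "<|im_start|>assistant\n{diacritized_text}<|im_end|>"

def build_training_example(undiacritized: str, diacritized: str) -> str:
    """Assemble one ChatML-style training sample from the three templates."""
    return "\n".join([
        SYSTEM_PROMPT,
        USER_PROMPT_TEMPLATE.format(undiacritized_text=undiacritized),
        ASSISTANT_PROMPT_TEMPLATE.format(diacritized_text=diacritized),
    ])

sample = build_training_example("اكل الولد", "أَكَلَ الوَلَدُ")
```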

Code for testing

import re

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# base_model_id, quantization_config, final_adapter_dir, and tokenizer are
# assumed to already be defined by the training script, e.g.:
# base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
# tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# --- Load Base Model and Adapter for Inference ---
print("Loading model for inference...")

# Load the base model with quantization (same config as training)
base_model_for_inference = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=quantization_config, # Use the same BNB config
    device_map="auto",
    trust_remote_code=True
)

# Load the PEFT adapter
adapter_path = final_adapter_dir # Use the path where the adapter was saved
inference_model = PeftModel.from_pretrained(base_model_for_inference, adapter_path)
inference_model.eval() # Set to evaluation mode

print("Model ready for inference.")

# --- Inference Function ---
def diacritize_text(text, model, tokenizer):
    """Takes undiacritized text and returns the model's diacritized version."""
    # Prepare the prompt
    undiacritized = re.sub(r'\s+', ' ', text).strip() # Basic cleaning
    user_prompt = USER_PROMPT_TEMPLATE.format(undiacritized_text=undiacritized)
    # Construct the prompt that the model expects *up to* the assistant's turn
    prompt = f"{SYSTEM_PROMPT}\n{user_prompt}\n<|im_start|>assistant\n"

    # Tokenize the input
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

    # Generate output
    # Adjust max_new_tokens based on expected output length; add other generation params as needed
    # Using greedy decoding here, consider beam search, top-k, top-p for potentially better results
    outputs = model.generate(
        **inputs,
        max_new_tokens=len(inputs['input_ids'][0])*2, # Allow for diacritics and some buffer
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        top_k=1,
    )

    # Decode the generated tokens
    # Decode only the *newly generated* tokens
    generated_ids = outputs[0, inputs['input_ids'].shape[1]:]
    decoded_output = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    # Sometimes the model might hallucinate extra text after the main diacritization.
    # A simple heuristic: stop at the first newline or excessive length if needed.
    # decoded_output = decoded_output.split('\n')[0] # Simple cleanup if needed

    return decoded_output

# --- Example Usage ---
input_text = "اكل الولد التفاحة في الحديقة"
print(f"\nInput Text: {input_text}")

diacritized_output = diacritize_text(input_text, inference_model, tokenizer)
print(f"Model Output: {diacritized_output}")

input_text_2 = "بسم الله الرحمن الرحيم"
print(f"\nInput Text: {input_text_2}")
diacritized_output_2 = diacritize_text(input_text_2, inference_model, tokenizer)
print(f"Model Output: {diacritized_output_2}")

input_text_3 = "تعتبر اللغة العربية من أصعب اللغات في العالم"
print(f"\nInput Text: {input_text_3}")
diacritized_output_3 = diacritize_text(input_text_3, inference_model, tokenizer)
print(f"Model Output: {diacritized_output_3}")
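
A rough way to sanity-check outputs is to strip the diacritics back off and compare letter skeletons. The helpers below are a sketch, not part of the training code; the Unicode range used for tashkeel (U+064B–U+0652 plus the superscript alef U+0670) is a common simplification rather than an exhaustive list:

```python
import re

# Arabic diacritic marks: fathatan..sukun (U+064B-U+0652) and superscript alef (U+0670).
TASHKEEL_RE = re.compile(r"[\u064B-\u0652\u0670]")

def strip_tashkeel(text: str) -> str:
    """Remove diacritic marks, leaving only the bare letters."""
    return TASHKEEL_RE.sub("", text)

def diacritic_error_rate(reference: str, hypothesis: str) -> float:
    """Crude character-level DER: fraction of letter positions whose attached
    diacritics differ. Assumes both strings share the same bare-letter skeleton."""
    def split(text):
        groups, current = [], None
        for ch in text:
            if TASHKEEL_RE.match(ch):
                if current is not None:
                    current[1] += ch  # attach mark to the preceding letter
            else:
                current = [ch, ""]
                groups.append(current)
        return groups
    ref, hyp = split(reference), split(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("letter skeletons differ")
    errors = sum(1 for r, h in zip(ref, hyp) if r[1] != h[1])
    return errors / max(len(ref), 1)
```

If the model only added marks, `strip_tashkeel(diacritized_output)` should equal the cleaned input text; a mismatch signals dropped or hallucinated letters.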

Training run

https://wandb.ai/bishertello-/uncategorized/runs/y99ptnn7?nw=nwuserbishertello
