# train_run-qwen2.5-1.5b-instruct-arabic-diacritization-full-fadel-L40S_full
This model is a fine-tuned version of [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) on an unknown dataset.
It achieves the following results on the evaluation set:

- Loss: 0.0375
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 256
- optimizer: paged_adamw_32bit with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: constant
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 1
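Note that `total_train_batch_size` is not an independent setting: it is the per-device train batch size multiplied by the gradient accumulation steps, as this quick check shows.

```python
# Effective (total) batch size = per-device batch size x gradient accumulation steps
train_batch_size = 32
gradient_accumulation_steps = 8
total_train_batch_size = train_batch_size * gradient_accumulation_steps
print(total_train_batch_size)  # 256
```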
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 0.0662 | 0.1995 | 100 | 0.0664 |
| 0.0511 | 0.3989 | 200 | 0.0513 |
| 0.0445 | 0.5984 | 300 | 0.0446 |
| 0.0409 | 0.7978 | 400 | 0.0413 |
| 0.0373 | 0.9973 | 500 | 0.0375 |
### Framework versions
- PEFT 0.15.2
- Transformers 4.52.3
- Pytorch 2.7.0+cu128
- Datasets 3.6.0
- Tokenizers 0.21.1
## Prompts used during training

```python
SYSTEM_PROMPT = "<|im_start|>system\nYou are an expert Arabic linguist. Your task is to add diacritics (Tashkeel) to the given Arabic text accurately.<|im_end|>"
USER_PROMPT_TEMPLATE = "<|im_start|>user\nDiacritize the following text:\n{undiacritized_text}<|im_end|>"
ASSISTANT_PROMPT_TEMPLATE = "<|im_start|>assistant\n{diacritized_text}<|im_end|>"  # Note: the closing <|im_end|> serves as the EOS token
```
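As a sketch of how these three templates combine into a single ChatML-formatted training example (the `build_training_example` helper and the newline join between turns are assumptions for illustration; the exact concatenation used during training may differ):

```python
SYSTEM_PROMPT = (
    "<|im_start|>system\nYou are an expert Arabic linguist. "
    "Your task is to add diacritics (Tashkeel) to the given Arabic text accurately.<|im_end|>"
)
USER_PROMPT_TEMPLATE = "<|im_start|>user\nDiacritize the following text:\n{undiacritized_text}<|im_end|>"
ASSISTANT_PROMPT_TEMPLATE = "<|im_start|>assistant\n{diacritized_text}<|im_end|>"

def build_training_example(undiacritized: str, diacritized: str) -> str:
    """Join the three ChatML turns into one training string.

    Hypothetical helper: the newline separator between turns is an assumption.
    """
    return "\n".join([
        SYSTEM_PROMPT,
        USER_PROMPT_TEMPLATE.format(undiacritized_text=undiacritized),
        ASSISTANT_PROMPT_TEMPLATE.format(diacritized_text=diacritized),
    ])

print(build_training_example("اكل الولد", "أَكَلَ الوَلَدُ"))
```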
## Code for testing

The script below assumes the prompt templates above are already defined, and that `final_adapter_dir` points to the directory where the adapter was saved during training.

```python
import re

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# --- Load Base Model and Adapter for Inference ---
print("Loading model for inference...")

base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)

# Load the base model with quantization (use the same BNB config as training)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model_for_inference = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load the PEFT adapter
adapter_path = final_adapter_dir  # the path where the adapter was saved during training
inference_model = PeftModel.from_pretrained(base_model_for_inference, adapter_path)
inference_model.eval()  # set to evaluation mode
print("Model ready for inference.")

# --- Inference Function ---
def diacritize_text(text, model, tokenizer):
    """Takes undiacritized text and returns the model's diacritized version."""
    # Prepare the prompt
    undiacritized = re.sub(r"\s+", " ", text).strip()  # basic whitespace cleaning
    user_prompt = USER_PROMPT_TEMPLATE.format(undiacritized_text=undiacritized)

    # Construct the prompt the model expects, *up to* the assistant's turn
    prompt = f"{SYSTEM_PROMPT}\n{user_prompt}\n<|im_start|>assistant\n"

    # Tokenize the input (the special tokens are already in the prompt string)
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

    # Generate output.
    # Diacritics roughly double the token count, so 2x the input length leaves
    # some buffer; adjust max_new_tokens for longer inputs as needed.
    # Greedy decoding is used here; consider beam search or top-k/top-p sampling
    # for potentially better results.
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=len(inputs["input_ids"][0]) * 2,
            do_sample=False,  # greedy decoding
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )

    # Decode only the *newly generated* tokens
    generated_ids = outputs[0, inputs["input_ids"].shape[1]:]
    decoded_output = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    # The model may occasionally hallucinate extra text after the diacritization.
    # A simple heuristic is to stop at the first newline if needed:
    # decoded_output = decoded_output.split("\n")[0]
    return decoded_output

# --- Example Usage ---
for input_text in [
    "اكل الولد التفاحة في الحديقة",
    "بسم الله الرحمن الرحيم",
    "تعتبر اللغة العربية من أصعب اللغات في العالم",
]:
    print(f"\nInput Text: {input_text}")
    print(f"Model Output: {diacritize_text(input_text, inference_model, tokenizer)}")
```
## Train run

https://wandb.ai/bishertello-/uncategorized/runs/y99ptnn7?nw=nwuserbishertello