Automatic Restoration of Diacritics for Speech Data Sets
This is a transformer-based model for Arabic text diacritization as described here.
Evaluation Results
Evaluation on kssa
DER (Diacritic Error Rate)
| Configuration | With case ending | Without case ending |
|---|---|---|
| Including no diacritic | 10.70% | 7.47% |
| Excluding no diacritic | 12.04% | 7.78% |
WER (Word Error Rate)
| Configuration | With case ending | Without case ending |
|---|---|---|
| Including no diacritic | 36.60% | 21.35% |
| Excluding no diacritic | 34.32% | 18.39% |
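The two table axes can be read as independent switches: whether the last letter of each word (the case ending) is scored, and whether letters that carry no diacritic in the reference are scored. The following is a minimal sketch of DER and WER under those switches, not the repository's `eval.py`. The function names, the choice of the eight marks U+064B to U+0652 as the diacritic set, and the decision to score every non-diacritic character are assumptions made for illustration; the numbers reported above come from the project's own script, which may follow different counting rules.

```python
# Assumed diacritic set: tanween (3), short vowels (3), shadda, sukun.
DIACRITICS = frozenset("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")


def split_word(word):
    """Return [(base_char, marks)] for one word, with marks in a canonical order."""
    units = []
    for ch in word:
        if ch not in DIACRITICS:
            units.append((ch, ""))
        elif units:  # attach the mark to the preceding base character
            base, marks = units[-1]
            units[-1] = (base, marks + ch)
    # Sort marks so "shadda + fatha" and "fatha + shadda" compare equal.
    return [(base, "".join(sorted(marks))) for base, marks in units]


def error_rates(reference, prediction, case_ending=True, count_no_diacritic=True):
    """Return (DER, WER) in percent for two strings with identical base text."""
    ref_words, pred_words = reference.split(), prediction.split()
    if len(ref_words) != len(pred_words):
        raise ValueError("reference and prediction differ in word count")

    chars = char_errors = words = word_errors = 0
    for ref_word, pred_word in zip(ref_words, pred_words):
        ref_units, pred_units = split_word(ref_word), split_word(pred_word)
        if [b for b, _ in ref_units] != [b for b, _ in pred_units]:
            raise ValueError("base characters differ: %r vs %r" % (ref_word, pred_word))
        if not case_ending:
            # Drop the final letter of each word from scoring.
            ref_units, pred_units = ref_units[:-1], pred_units[:-1]

        word_is_wrong = False
        for (_, ref_marks), (_, pred_marks) in zip(ref_units, pred_units):
            if not count_no_diacritic and ref_marks == "":
                continue  # skip letters that are bare in the reference
            chars += 1
            if ref_marks != pred_marks:
                char_errors += 1
                word_is_wrong = True
        words += 1
        word_errors += word_is_wrong

    der = 100 * char_errors / chars if chars else 0.0
    wer = 100 * word_errors / words if words else 0.0
    return der, wer
```

Under these assumptions, `error_rates(ref, pred, case_ending=False)` corresponds to the "Without case ending" column and `count_no_diacritic=False` to the "Excluding no diacritic" row.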
How to Use
Installation
git clone https://github.com/rufaelfekadu/diac.git
cd diac
pip install -e .
Loading the Model
from diac.models import DiacritizationModule
model = DiacritizationModule.from_pretrained(
    "rufaelfekadu/diac-transformer-text-asr-tashkeela-clartts-kssa",
    tokenizer_constants_path="constants/",  # Path to constants directory
)
Running Inference
# Predict diacritization for a text file
model.predict_file(
    input_file="path/to/input.txt",
    output_file="path/to/output.txt",
)
# Or predict for a single text string
diacritized_text = model.predict_text("مرحبا بك")
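When the input is a list of strings already in memory rather than a file, a thin wrapper around a single-string predictor can be convenient. The sketch below is an illustration only: `diacritize_lines` is a hypothetical helper, not part of the `diac` package, and it accepts any callable that maps a string to a string, so `model.predict_text` can be passed in directly. Returning an empty string for blank lines is an assumption about the desired behaviour.

```python
from typing import Callable, Iterable, List


def diacritize_lines(predict: Callable[[str], str], lines: Iterable[str]) -> List[str]:
    """Apply a single-string predictor to each line, keeping line order.

    Blank lines are passed through as empty strings instead of being
    sent to the predictor.
    """
    results = []
    for line in lines:
        text = line.strip()
        results.append(predict(text) if text else "")
    return results


# Usage with the model loaded above (not executed here):
# outputs = diacritize_lines(model.predict_text, ["مرحبا بك", "", "كيف حالك"])
```

Keeping the predictor as a parameter means the same helper can be exercised with a stub function in tests, without loading the checkpoint.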
Running Evaluation
To evaluate the model on your own test set:
- Run inference to generate predictions:
python inference.py \
    --config configs/<model>.yml \
    --opts \
    DATA.TEST_PATH path/to/test.txt \
    INFERENCE.MODEL_PATH <path_to_checkpoint> \
    INFERENCE.OUTPUT_PATH path/to/predictions.txt
- Prepare the reference file (if needed):
python src/diac/utils/prep_ref.py \
    --input_file path/to/test.txt \
    -o path/to/output_dir
- Calculate metrics (DER, WER, SER):
python src/diac/utils/eval.py \
    -ofp path/to/predictions.txt \
    -tfp path/to/reference.txt \
    --style Fadel
The evaluation script reports DER, WER, and SER for each combination of the following settings:
- With/without case ending
- Including/excluding no diacritic