Egyptian Arabic TTS - Chatterbox Fine-tuned
First open-source Egyptian Arabic TTS model based on Chatterbox Multilingual TTS.
Model Description
This model is a partial fine-tune of Chatterbox Multilingual TTS for the Egyptian Arabic dialect. Whereas a typical LoRA fine-tune mainly adapts voice characteristics, this approach retrains selected layers so that the model learns Egyptian dialect features, including:
- Egyptian pronunciation (ق → ء, ج → g)
- Natural Egyptian prosody and rhythm
- Colloquial Egyptian Arabic patterns
- High-quality voice preservation
Training Data
- Duration: 120 hours of clean Egyptian Arabic
- Speaker: Single speaker
- Samples: 43,711 audio segments
- Quality: Filtered and validated
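As a quick sanity check on the figures above, the average segment length follows from the total duration and the segment count. This is a minimal sketch; the variable names are illustrative and the result is only as precise as the rounded 120-hour figure.

```python
# Figures taken from the Training Data list
total_hours = 120
num_segments = 43_711

# Average clip length in seconds
avg_seconds = total_hours * 3600 / num_segments
print(f"average segment length: {avg_seconds:.2f} s")
```

This works out to roughly ten seconds per segment.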
Fine-tuning Method
Partial Fine-Tuning Strategy:
- Frozen: Text encoder, early acoustic layers, S3Gen codec, voice encoder
- Trainable: Prosody predictors, late acoustic decoder layers
- Result: The model learns dialect features while the frozen layers limit catastrophic forgetting
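The freeze/train split above can be sketched in PyTorch by toggling `requires_grad` per parameter. This is an illustration on a toy model, not the training code used for this checkpoint: the helper `freeze_except` and the submodule names (`text_encoder`, `late_layers`, and so on) are hypothetical and do not correspond to Chatterbox's actual module names.

```python
import torch.nn as nn


def freeze_except(model: nn.Module, trainable_prefixes: tuple) -> int:
    """Freeze all parameters except those whose name starts with a given prefix.

    Returns the number of trainable scalar parameters.
    """
    trainable = 0
    for name, param in model.named_parameters():
        # str.startswith accepts a tuple of prefixes
        param.requires_grad = name.startswith(trainable_prefixes)
        if param.requires_grad:
            trainable += param.numel()
    return trainable


# Toy stand-in model; these submodule names are hypothetical.
toy = nn.ModuleDict({
    "text_encoder": nn.Linear(8, 8),
    "early_layers": nn.Linear(8, 8),
    "late_layers": nn.Linear(8, 8),
    "prosody_head": nn.Linear(8, 2),
})

n_trainable = freeze_except(toy, ("late_layers.", "prosody_head."))
n_total = sum(p.numel() for p in toy.parameters())
print(f"trainable: {n_trainable} / {n_total}")
```

In a real run, only the parameters with `requires_grad=True` would be passed to the optimizer, so the frozen components keep their pretrained weights.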
Training Results
- Training Duration: ~14 hours (stopped at 73% of the first epoch)
- Initial Loss: ~6.0
- Final Loss: 4.57
- Checkpoint: step 2,000
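To put these numbers in perspective, the relative loss reduction and the approximate number of segments seen can be worked out directly. This sketch assumes that "73% complete" refers to the fraction of the 43,711-segment dataset processed in the first epoch; that reading is an assumption, as is the use of the approximate initial loss of 6.0.

```python
# Figures from the Training Results and Training Data sections
initial_loss = 6.0      # approximate value
final_loss = 4.57
num_segments = 43_711
epoch_fraction = 0.73   # assumed to mean 73% of one pass over the data

relative_drop = (initial_loss - final_loss) / initial_loss
segments_seen = round(num_segments * epoch_fraction)

print(f"relative loss reduction: {relative_drop:.1%}")
print(f"segments seen (approx.): {segments_seen}")
```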
Usage
Installation
pip install torch torchaudio
pip install chatterbox-tts
pip install safetensors soundfile huggingface_hub
Basic Inference
import soundfile as sf
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# Load base model
model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

# Download and load fine-tuned checkpoint
checkpoint_path = hf_hub_download(
    repo_id="AliAbdallah/egyptian-arabic-tts-chatterbox",
    filename="model.safetensors",
)
checkpoint = load_file(checkpoint_path, device="cuda")
# strict=False ignores checkpoint keys that do not match the T3 module
model.t3.load_state_dict(checkpoint, strict=False)

# Set to evaluation mode
model.t3.eval()
model.s3gen.eval()
model.ve.eval()

# Generate speech
text = "مرحبا كيف حالك اليوم"
wav = model.generate(
    text,
    language_id="ar",
    exaggeration=0.5,
    cfg_weight=0.5,
    temperature=0.8,
)

# Save audio
sf.write("output.wav", wav.squeeze().cpu().numpy(), model.sr)
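Because `load_state_dict(..., strict=False)` does not raise on mismatched keys, it can be worth inspecting its return value to confirm that the checkpoint actually matched the module. The sketch below shows this on a toy `nn.Sequential` rather than on Chatterbox itself; the key names are those of the toy model only.

```python
import torch
import torch.nn as nn

# Toy two-layer model standing in for the real module
toy = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

# A partial checkpoint: covers only the second layer, plus one stray key
partial = {
    "1.weight": torch.zeros(2, 4),
    "1.bias": torch.zeros(2),
    "extra.weight": torch.zeros(1),
}

# With strict=False, mismatches are reported instead of raising
result = toy.load_state_dict(partial, strict=False)
print("missing keys:", result.missing_keys)        # in the model, not in the checkpoint
print("unexpected keys:", result.unexpected_keys)  # in the checkpoint, not in the model
```

Applied to the inference code above, a long list of unexpected keys would suggest that the checkpoint was not loaded into the intended module.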
Model Details
- Base Model: Chatterbox Multilingual TTS (567M parameters)
- Fine-tuned Layers: Prosody predictors + late acoustic decoder
- Languages: Primarily Egyptian Arabic (base model supports 23 languages)
- Sample Rate: 24 kHz
- Architecture: T3 transformer + S3Gen codec
Limitations
- Optimized for Egyptian Arabic dialect
- Single speaker
- Requires GPU for real-time inference
- May not perform well on non-Egyptian Arabic text
Citation
@misc{abdallah2026egyptian,
title={Egyptian Arabic TTS Fine-Tuning with Chatterbox},
author={Ali Abdallah},
year={2026},
url={https://huggingface.co/AliAbdallah/egyptian-arabic-tts-chatterbox}
}
Links
- GitHub Repository: Full code and training pipeline
- Base Model: Chatterbox
- Demo: Try it here
License
Apache 2.0 (following Chatterbox base model license)
Acknowledgments
- Chatterbox team for the excellent base model
- Egyptian Arabic NLP community