🎤 Egyptian Arabic TTS - Chatterbox Fine-tuned

First open-source Egyptian Arabic TTS model based on Chatterbox Multilingual TTS.

Model Description

This model is a partial fine-tune of Chatterbox for the Egyptian Arabic dialect. Unlike standard LoRA fine-tuning, which only adapts voice characteristics, this approach learns Egyptian dialect features in depth, including:

✅ Egyptian pronunciation (ق → ء, ج → g)
✅ Natural Egyptian prosody and rhythm
✅ Colloquial Egyptian Arabic patterns
✅ High-quality voice preservation

Training Data

Duration: 120 hours of clean Egyptian Arabic
Speaker: Single speaker
Samples: 43,711 audio segments
Quality: Filtered and validated
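
As a quick sanity check on these figures, the average segment length can be derived from the stated totals. This is plain arithmetic on the numbers listed above, not a statistic reported by the author, and the variable names are illustrative.

```python
# Figures taken from the Training Data list above.
total_hours = 120
num_segments = 43_711

# Average segment length in seconds.
avg_segment_seconds = total_hours * 3600 / num_segments
print(f"Average segment length: {avg_segment_seconds:.2f} s")
```

The result is a little under 10 seconds per segment, which is a typical utterance length for TTS training data.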

Fine-tuning Method

Partial Fine-Tuning Strategy:

❄️ Frozen: Text encoder, early acoustic layers, S3Gen codec, voice encoder
🔥 Trainable: Prosody predictors, late acoustic decoder layers
📊 Result: Deep dialect learning without catastrophic forgetting
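
The freeze/train split above can be sketched as a filter over parameter names. This is a minimal illustration, not the author's training code: the prefixes in TRAINABLE_PREFIXES are hypothetical, since the card does not list the real module names inside Chatterbox, and ToyModel is a stand-in so the sketch runs without PyTorch. The helper relies only on named_parameters() and requires_grad, so it should also work on a torch.nn.Module.

```python
from types import SimpleNamespace

# Hypothetical name prefixes; the real module names inside Chatterbox's T3
# model are not given in this card and would need to be looked up.
TRAINABLE_PREFIXES = ("prosody.", "decoder.layers.10.", "decoder.layers.11.")


def apply_partial_freeze(module, trainable_prefixes):
    """Enable gradients only for parameters whose name starts with a prefix.

    Works on any object exposing named_parameters(), such as torch.nn.Module.
    Returns the sorted names of the parameters left trainable.
    """
    trainable = []
    for name, param in module.named_parameters():
        param.requires_grad = name.startswith(tuple(trainable_prefixes))
        if param.requires_grad:
            trainable.append(name)
    return sorted(trainable)


class ToyModel:
    """Stand-in for a real network so the sketch runs without PyTorch."""

    def __init__(self, names):
        self._params = {n: SimpleNamespace(requires_grad=True) for n in names}

    def named_parameters(self):
        return self._params.items()


toy = ToyModel([
    "text_encoder.weight",
    "decoder.layers.0.weight",
    "decoder.layers.11.weight",
    "prosody.pitch.weight",
])
print(apply_partial_freeze(toy, TRAINABLE_PREFIXES))
```

Keeping the trailing dot in each prefix avoids accidental matches, for example "decoder.layers.1." matching layer 10 or 11.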

Training Results

Training Duration: ~14 hours (73% of 1 epoch completed)
Initial Loss: ~6.0
Final Loss: 4.57
Checkpoint: 2000 steps
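
These numbers allow a rough back-of-the-envelope estimate of the training schedule. The sketch below assumes that the 2000-step checkpoint corresponds to the 73% mark and that each segment is seen once per epoch; neither is stated explicitly in this card, so treat the outputs as estimates rather than reported values.

```python
num_segments = 43_711      # from Training Data
checkpoint_steps = 2_000   # from Training Results
epoch_fraction = 0.73      # "73% complete"
training_hours = 14        # approximate

# Assumption: the checkpoint was taken at the 73% mark of the epoch.
steps_per_epoch = checkpoint_steps / epoch_fraction
samples_per_step = num_segments / steps_per_epoch
seconds_per_step = training_hours * 3600 / checkpoint_steps

print(f"Estimated steps per epoch: {steps_per_epoch:.0f}")
print(f"Implied samples per step:  {samples_per_step:.1f}")
print(f"Average seconds per step:  {seconds_per_step:.1f}")
```

Under these assumptions the implied samples per step comes out close to 16, which would be consistent with an effective batch size of 16, although the card does not state the batch size.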

Usage

Installation

pip install torch torchaudio
pip install chatterbox-tts
pip install safetensors soundfile huggingface_hub
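
After installing, it can help to confirm that every package is importable before downloading any weights. The sketch below uses only the standard library; find_missing is an illustrative helper, and the import names are taken from the inference example in this card.

```python
import importlib.util

# Import names for the packages installed above. "chatterbox" is the module
# imported by the inference example in this card.
REQUIRED_MODULES = [
    "torch",
    "torchaudio",
    "chatterbox",
    "safetensors",
    "soundfile",
    "huggingface_hub",
]


def find_missing(module_names):
    """Return the top-level module names that cannot be located."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


missing = find_missing(REQUIRED_MODULES)
print("Missing packages:", ", ".join(missing) if missing else "none")
```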

Basic Inference

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# Load base model
model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

# Download and load fine-tuned checkpoint
checkpoint_path = hf_hub_download(
    repo_id="AliAbdallah/egyptian-arabic-tts-chatterbox",
    filename="model.safetensors"
)
checkpoint = load_file(checkpoint_path, device="cuda")
model.t3.load_state_dict(checkpoint, strict=False)

# Set to evaluation mode
model.t3.eval()
model.s3gen.eval()
model.ve.eval()

# Generate speech (the sample text means "Hello, how are you today")
text = "مرحبا كيف حالك اليوم"
wav = model.generate(
    text,
    language_id="ar",
    exaggeration=0.5,
    cfg_weight=0.5,
    temperature=0.8,
)

# Save audio
import soundfile as sf
sf.write("output.wav", wav.squeeze().cpu().numpy(), model.sr)
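
Because the example loads the checkpoint with strict=False, any key names that do not match the model are skipped silently. A minimal way to catch this is to compare the two key sets before loading. compare_keys below is an illustrative helper, not part of Chatterbox or PyTorch, and the key names are toy values.

```python
def compare_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, both sorted.

    missing:    keys the model expects but the checkpoint lacks
    unexpected: keys in the checkpoint that the model does not have
    """
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Toy key names for illustration; real names come from the model itself.
missing, unexpected = compare_keys(
    ["layers.0.weight", "layers.1.weight", "head.bias"],
    ["layers.0.weight", "layers.1.weight", "head.bais"],  # deliberate typo
)
print("missing:", missing)
print("unexpected:", unexpected)
```

With the real objects, the call would be compare_keys(model.t3.state_dict().keys(), checkpoint.keys()). Missing keys may be legitimate if the checkpoint stores only a subset of the weights, but unexpected keys usually point to a naming mismatch. PyTorch's load_state_dict also returns an object with missing_keys and unexpected_keys fields that carry the same information after loading.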

Model Details

Base Model: Chatterbox Multilingual TTS (567M parameters)
Fine-tuned Layers: Prosody predictors + late acoustic decoder
Languages: Primarily Egyptian Arabic (base model supports 23 languages)
Sample Rate: 24 kHz
Architecture: T3 transformer + S3Gen codec

Limitations

Optimized for Egyptian Arabic dialect
Single speaker
Requires GPU for real-time inference
May not perform well on non-Egyptian Arabic text

Citation

@misc{abdallah2026egyptian,
  title={Egyptian Arabic TTS Fine-Tuning with Chatterbox},
  author={Ali Abdallah},
  year={2026},
  url={https://huggingface.co/AliAbdallah/egyptian-arabic-tts-chatterbox}
}

License

Apache 2.0 (following Chatterbox base model license)

Acknowledgments

Chatterbox team for the excellent base model
Egyptian Arabic NLP community
