🎀 Egyptian Arabic TTS - Chatterbox Fine-tuned

First open-source Egyptian Arabic TTS model based on Chatterbox Multilingual TTS.

Model Description

This model is a partial fine-tune of Chatterbox for the Egyptian Arabic dialect. Unlike standard LoRA fine-tuning, which mainly adapts voice characteristics, this approach is designed to learn Egyptian dialect features, including:

  • ✅ Egyptian pronunciation (ق → ء, ج → g)
  • ✅ Natural Egyptian prosody and rhythm
  • ✅ Colloquial Egyptian Arabic patterns
  • ✅ High-quality voice preservation

Training Data

  • Duration: 120 hours of clean Egyptian Arabic
  • Speaker: Single speaker
  • Samples: 43,711 audio segments
  • Quality: Filtered and validated
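
As a quick sanity check on the figures above, the average segment length can be derived from the total duration and the segment count. This is a minimal sketch that assumes the 120 hours are the summed duration of all 43,711 segments; the variable names are illustrative only.

```python
# Figures taken from the training-data list above.
total_hours = 120
num_segments = 43_711

# Assumption: the 120 hours are the summed duration of all segments.
mean_segment_seconds = total_hours * 3600 / num_segments

print(f"Mean segment length: {mean_segment_seconds:.2f} s")
```

Under that assumption the segments average a little under 10 seconds each, which is a typical utterance length for TTS training data.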

Fine-tuning Method

Partial Fine-Tuning Strategy:

  • ❄️ Frozen: Text encoder, early acoustic layers, S3Gen codec, voice encoder
  • 🔥 Trainable: Prosody predictors, late acoustic decoder layers
  • 📊 Result: Deep dialect learning without catastrophic forgetting
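
The frozen/trainable split above can be sketched in PyTorch by toggling `requires_grad` per parameter name. The toy module and its attribute names (`text_encoder`, `early_layers`, `late_layers`, `prosody_head`) are hypothetical stand-ins, not Chatterbox's actual module names, and the training script used for this model is not shown in this card.

```python
import torch
from torch import nn


class ToyT3(nn.Module):
    """Hypothetical stand-in for a model with frozen and trainable parts."""

    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Linear(8, 8)   # would stay frozen
        self.early_layers = nn.Linear(8, 8)   # would stay frozen
        self.late_layers = nn.Linear(8, 8)    # would be trained
        self.prosody_head = nn.Linear(8, 2)   # would be trained


def freeze_except(model, trainable_prefixes):
    """Freeze every parameter whose name does not start with a given prefix.

    Returns the number of parameters left trainable.
    """
    prefixes = tuple(trainable_prefixes)
    trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)
        if param.requires_grad:
            trainable += param.numel()
    return trainable


toy = ToyT3()
n_trainable = freeze_except(toy, ["late_layers", "prosody_head"])
n_total = sum(p.numel() for p in toy.parameters())
print(f"Trainable: {n_trainable} / {n_total} parameters")

# Only the trainable parameters are handed to the optimizer.
# The learning rate here is an arbitrary placeholder.
optimizer = torch.optim.AdamW(
    [p for p in toy.parameters() if p.requires_grad], lr=1e-5
)
```

Selecting by name prefix keeps the split declarative: changing which layers train is a matter of editing the prefix list rather than the model code.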

Training Results

Training Duration: ~14 hours (73% of 1 epoch completed)
Initial Loss: ~6.0
Final Loss: 4.57
Checkpoint: 2000 steps
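
For a rough sense of scale, the numbers above can be combined into per-step figures. This is a back-of-the-envelope sketch, not a reported result: it assumes the 2000-step checkpoint corresponds to the 73% point of the epoch, that the ~14 hours cover exactly those steps, and that the epoch spans all 43,711 segments listed under Training Data.

```python
# Reported figures (approximate where the card says "~").
train_hours = 14
steps = 2000
epoch_fraction = 0.73
num_segments = 43_711
initial_loss = 6.0
final_loss = 4.57

# Assumption: the 14 hours cover exactly the 2000 steps.
seconds_per_step = train_hours * 3600 / steps

# Assumption: step 2000 is the 73% point of one pass over all segments.
segments_per_step = epoch_fraction * num_segments / steps

relative_loss_drop = (initial_loss - final_loss) / initial_loss

print(f"{seconds_per_step:.1f} s/step")
print(f"{segments_per_step:.1f} segments/step")
print(f"{relative_loss_drop:.1%} loss reduction")
```

If these assumptions hold, the implied effective batch is about 16 segments per step, though the actual batch size is not stated here.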

Usage

Installation

pip install torch torchaudio
pip install chatterbox-tts
pip install safetensors soundfile huggingface_hub

Basic Inference

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# Load base model
model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

# Download and load fine-tuned checkpoint
checkpoint_path = hf_hub_download(
    repo_id="AliAbdallah/egyptian-arabic-tts-chatterbox",
    filename="model.safetensors"
)
checkpoint = load_file(checkpoint_path, device="cuda")
model.t3.load_state_dict(checkpoint, strict=False)

# Set to evaluation mode
model.t3.eval()
model.s3gen.eval()
model.ve.eval()

# Generate speech
text = "مرحبا كيف حالك اليوم"
wav = model.generate(
    text,
    language_id="ar",
    exaggeration=0.5,
    cfg_weight=0.5,
    temperature=0.8,
)

# Save audio
import soundfile as sf
sf.write("output.wav", wav.squeeze().cpu().numpy(), model.sr)
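
Because the checkpoint is loaded with `strict=False`, any key mismatch is silently ignored, so it can be worth inspecting what `load_state_dict` actually matched. The sketch below shows one way to do that on a toy module; `load_partial` is a hypothetical helper, and whether `model.safetensors` holds the full T3 weights or only the fine-tuned layers is not stated in this card.

```python
import torch
from torch import nn


def load_partial(module, state_dict):
    """Load a possibly partial state dict and report what was not covered.

    Raises if the checkpoint contains keys the module does not have,
    which usually indicates a naming or version mismatch.
    """
    result = module.load_state_dict(state_dict, strict=False)
    if result.unexpected_keys:
        raise KeyError(f"Keys not found in model: {result.unexpected_keys}")
    # Keys the module has but the checkpoint did not supply.
    return list(result.missing_keys)


# Toy demonstration: a checkpoint that only covers the second layer.
toy = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
partial = {"1.weight": torch.zeros(2, 4), "1.bias": torch.zeros(2)}

missing = load_partial(toy, partial)
print(missing)  # ['0.weight', '0.bias']
```

With the real model, the equivalent call would be `load_partial(model.t3, checkpoint)`; a non-empty result is expected if the checkpoint only stores the fine-tuned layers.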

Model Details

  • Base Model: Chatterbox Multilingual TTS (567M parameters)
  • Fine-tuned Layers: Prosody predictors + late acoustic decoder
  • Languages: Primarily Egyptian Arabic (base model supports 23 languages)
  • Sample Rate: 24 kHz
  • Architecture: T3 transformer + S3Gen codec

Limitations

  • Optimized for Egyptian Arabic dialect
  • Single speaker
  • Requires GPU for real-time inference
  • May not perform well on non-Egyptian Arabic text
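
Whether inference is "real-time" on a given machine can be checked with the real-time factor (RTF): synthesis time divided by the duration of the generated audio, where values below 1.0 mean faster than real time. The helpers below are a hypothetical sketch, not part of Chatterbox; the default of 24,000 Hz comes from the sample rate listed under Model Details.

```python
import time


def real_time_factor(elapsed_seconds, num_samples, sample_rate=24_000):
    """RTF = synthesis time / audio duration. Below 1.0 is faster than real time."""
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds


def measure_rtf(synthesize, text, sample_rate=24_000):
    """Time one call to `synthesize(text)`, which must return a 1-D sample sequence."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples), sample_rate)


# Worked example: 48,000 samples (2 s of audio) produced in 1 s.
rtf = real_time_factor(elapsed_seconds=1.0, num_samples=48_000)
print(rtf)  # 0.5
```

With the model from the Usage section, `synthesize` could be a small wrapper such as `lambda t: model.generate(t, language_id="ar").squeeze().cpu().numpy()`, passing `model.sr` as the sample rate.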

Citation

@misc{abdallah2026egyptian,
  title={Egyptian Arabic TTS Fine-Tuning with Chatterbox},
  author={Ali Abdallah},
  year={2026},
  url={https://huggingface.co/AliAbdallah/egyptian-arabic-tts-chatterbox}
}

License

Apache 2.0 (following Chatterbox base model license)

Acknowledgments

  • Chatterbox team for the excellent base model
  • Egyptian Arabic NLP community