F5-TTS: Egyptian Arabic Fine-Tune (5K Steps)

A fine-tuned F5-TTS checkpoint based on Habibi-TTS EGY Specialized (flow-matching DiT architecture), trained on 16 hours of Egyptian Arabic conversational speech. This checkpoint was saved at the 5,000-step milestone of a planned 22,000-step run and is primarily useful as a research artifact for Arabic TTS experimentation.

Model Details

| Detail | Value |
|---|---|
| Base model | Habibi-TTS EGY Specialized (model_100000.safetensors) |
| Architecture | F5-TTS (DiT: dim=1024, depth=22, heads=16) |
| Method | Full fine-tuning |
| Language | Arabic (Egyptian dialect) |
| Task | Text-to-Speech |
| License | CC BY-NC 4.0 (inherited from Habibi-TTS) |

Training Configuration

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 1e-5 |
| LR schedule | Linear warmup (5,000 steps), then linear decay |
| Gradient clipping | 1.0 |
| Batch size per GPU | 90,000 frames (max 32 clips) |
| Gradient accumulation | 2 |
| Precision | BF16 mixed |
| Hardware | 8x NVIDIA A100-80GB (Modal) |
| VRAM usage | ~54 GB / 80 GB per GPU |
| Steps completed | 5,000 of 22,000 planned |
| Training time | 5.24 hours |
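The nominal warmup/decay shape described above can be sketched as a function of step. This is only an illustration of the stated schedule, not the trainer's actual implementation; the logged learning rate of 7.84e-6 at step 5,000 suggests the real scheduler differs in detail (for example, counting optimizer steps under gradient accumulation rather than raw steps).

```python
def lr_at_step(step, peak_lr=1e-5, warmup_steps=5_000, total_steps=22_000):
    """Linear warmup to peak_lr over warmup_steps, then linear decay to zero
    at total_steps (illustrative sketch of the documented schedule)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Linear decay from peak_lr at end of warmup to 0 at total_steps
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / remaining)

# Under this nominal shape, the peak is reached exactly at the end of warmup
print(f"{lr_at_step(5_000):.1e}")  # 1.0e-05
```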

Dataset

| Property | Value |
|---|---|
| Clips | 12,363 (11,745 train / 618 val) |
| Duration | ~16 hours |
| Language | Egyptian Arabic (conversational speech) |
| Source | Multi-speaker recordings processed through a custom pipeline (source separation, diarization, forced alignment, quality filtering) |
| Audio format | 24,000 Hz mono WAV |
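The separation, diarization, and alignment stages of the pipeline are not reproduced here, but the final format gate is easy to sketch: a stdlib-only check that a clip is 24 kHz mono WAV within a plausible duration range. The duration bounds and function name are assumptions for illustration, not the pipeline's actual thresholds.

```python
import wave

def check_clip(path, expect_sr=24_000, min_s=1.0, max_s=30.0):
    """Return True if the WAV file is mono, at the expected sample rate,
    and within a plausible duration range for TTS training clips."""
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
        return (
            wf.getnchannels() == 1
            and wf.getframerate() == expect_sr
            and min_s <= duration <= max_s
        )
```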

Training Results

| Metric | Step 0 | Step 5,000 |
|---|---|---|
| Train loss | 0.945 | 0.587 |
| Val loss | – | 0.925 |
| Learning rate | 2e-8 | 7.84e-6 |
| Grad norm | – | 0.14 (stable) |

5K Milestone Evaluation

| Metric | Value |
|---|---|
| UTMOS (mean) | 2.63 / 5 |
| WER (Whisper Small, Arabic) | 20.4% |

Full training logs and audio samples are available on Weights & Biases.
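WER is conventionally computed as word-level edit distance divided by the reference word count. A minimal reference implementation follows; the text normalization applied before scoring the reported 20.4% is not specified, so this sketch shows only the metric itself.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("a b c d e", "a b x d"))  # 0.4 (one substitution + one deletion over 5 words)
```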

Usage

Requires the f5-tts package and a reference audio clip for voice cloning.

from f5_tts.infer.utils_infer import load_model, load_vocoder, preprocess_ref_audio_text, infer_process
from hydra.utils import get_class
from omegaconf import OmegaConf
from importlib.resources import files
import soundfile as sf

# Load model config
model_cfg = OmegaConf.load(
    str(files("f5_tts").joinpath("configs/F5TTS_v1_Base.yaml"))
)
model_cls = get_class(f"f5_tts.model.{model_cfg.model.backbone}")
vocoder_name = model_cfg.model.mel_spec.mel_spec_type

# Load fine-tuned checkpoint
model = load_model(
    model_cls, model_cfg.model.arch,
    "model_5000.pt",
    mel_spec_type=vocoder_name,
    vocab_file="vocab.txt",
    device="cuda",
)
vocoder = load_vocoder(vocoder_name=vocoder_name, device="cuda")

# Synthesize
ref_audio, ref_text = preprocess_ref_audio_text("reference.wav", "مرجع الصوت")  # ref transcript: "voice reference"
audio, sr, _ = infer_process(
    ref_audio=ref_audio,
    ref_text=ref_text,
    gen_text="النهارده حنتكلّم عن موضوع مهمّ جدّاً",  # "Today we'll talk about a very important topic"
    model_obj=model,
    vocoder=vocoder,
    mel_spec_type=vocoder_name,
    nfe_step=32,
    cfg_strength=2.0,
    sway_sampling_coef=-1.0,
    speed=1.0,
    device="cuda",
)
sf.write("output.wav", audio, sr)

Files

File Description
model_5000.pt Full model checkpoint at step 5,000 (5.4 GB)
vocab.txt Character vocabulary (Habibi-TTS EGY tokenizer)

Limitations

  • Trained for only 5,000 of a planned 22,000 steps. The run was stopped early at the first milestone evaluation. Loss was still declining and the model had not converged.
  • The base Habibi-TTS EGY model was pretrained on data that is approximately 85% Modern Standard Arabic (MASC, FLEURS, Omnilingual ASR) with only ~13% genuine Egyptian dialect (MGB-3). This MSA-dominant prior is difficult to overcome with fine-tuning alone; the model tends to produce MSA grammar patterns (e.g., MSA سنتكلم instead of Egyptian حنتكلّم for "we will talk").
  • Primarily useful as a research artifact for studying Arabic dialect adaptation in flow-matching TTS, not for production Egyptian Arabic synthesis.
  • UTMOS of 2.63/5 at the 5K checkpoint indicates below-average perceptual quality.
  • Best results are obtained with 6–15 seconds of clean reference audio from the target speaker.
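To keep a reference clip within the 6–15 second guideline above, an overlong recording can be trimmed before inference. A stdlib-only sketch, assuming a 24 kHz mono WAV as produced by the training pipeline; file names are hypothetical.

```python
import wave

def trim_wav(src, dst, max_s=15.0):
    """Copy a WAV file, keeping at most the first max_s seconds."""
    with wave.open(src, "rb") as r:
        params = r.getparams()
        # Number of frames corresponding to max_s seconds (or fewer if shorter)
        keep = min(r.getnframes(), int(max_s * r.getframerate()))
        frames = r.readframes(keep)
    with wave.open(dst, "wb") as w:
        w.setparams(params)  # header frame count is patched on close
        w.writeframes(frames)
```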

Acknowledgments

Built with F5-TTS, Habibi-TTS, and Modal for cloud GPU training. Experiment tracking via Weights & Biases.

Repository: MAdel121/f5-tts-egyptian-arabic (upstream base model: SWivid/F5-TTS)