# F5-TTS – Egyptian Arabic Fine-Tune (5K Steps)
A fine-tuned F5-TTS checkpoint based on Habibi-TTS EGY Specialized (flow-matching DiT architecture), trained on 16 hours of Egyptian Arabic conversational speech. The checkpoint was saved at the 5,000-step milestone of a planned 22,000-step run and is primarily useful as a research artifact for Arabic TTS experimentation.
## Model Details

| | |
|---|---|
| Base model | Habibi-TTS EGY Specialized (`model_100000.safetensors`) |
| Architecture | F5-TTS (DiT: dim=1024, depth=22, heads=16) |
| Method | Full fine-tuning |
| Language | Arabic (Egyptian dialect) |
| Task | Text-to-Speech |
| License | CC BY-NC 4.0 (inherited from Habibi-TTS) |
## Training Configuration

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 1e-5 |
| LR schedule | Linear warmup (5,000 steps) → linear decay |
| Gradient clipping | 1.0 |
| Batch size per GPU | 90,000 frames (max 32 clips) |
| Gradient accumulation | 2 |
| Precision | BF16 mixed |
| Hardware | 8× NVIDIA A100-80GB (Modal) |
| VRAM usage | ~54 GB / 80 GB per GPU |
| Steps completed | 5,000 of 22,000 planned |
| Training time | 5.24 hours |
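The data volume implied by these settings can be sanity-checked with a little arithmetic. The sketch below assumes the standard F5-TTS mel configuration (24 kHz audio, hop length 256, i.e. 93.75 mel frames per second); those constants are assumptions from the F5-TTS defaults, not values stated in this card.

```python
# Audio volume per optimizer step, derived from the table above.
SAMPLE_RATE = 24_000
HOP_LENGTH = 256                                # assumed F5-TTS default
frames_per_second = SAMPLE_RATE / HOP_LENGTH    # 93.75

frames_per_gpu = 90_000                         # batch size per GPU (frames)
num_gpus = 8
grad_accum = 2

# Seconds of audio in one per-GPU batch: 90,000 / 93.75 = 960 s (16 min)
seconds_per_gpu_batch = frames_per_gpu / frames_per_second

# Frames consumed per optimizer step across all GPUs and accumulation steps
frames_per_step = frames_per_gpu * num_gpus * grad_accum    # 1,440,000
hours_per_step = frames_per_step / frames_per_second / 3600

print(f"{seconds_per_gpu_batch:.0f} s per GPU batch, "
      f"{hours_per_step:.2f} h of audio per optimizer step")
```

Under these assumptions each optimizer step sees roughly 4.3 hours of audio, so 5,000 steps correspond to many hundreds of epochs over the 16-hour dataset.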
## Dataset

| | |
|---|---|
| Clips | 12,363 (11,745 train / 618 val) |
| Duration | ~16 hours |
| Language | Egyptian Arabic, conversational speech |
| Source | Multi-speaker recordings processed through a custom pipeline (source separation, diarization, forced alignment, quality filtering) |
| Audio format | 24,000 Hz mono WAV |
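The table's figures are internally consistent; a quick check also shows the average clip is short and the validation split small, which is worth knowing when reading the validation loss below:

```python
# Consistency check on the dataset figures in the table above.
total_clips = 12_363
train_clips = 11_745
val_clips = 618
total_hours = 16

assert train_clips + val_clips == total_clips

avg_clip_seconds = total_hours * 3600 / total_clips   # ~4.66 s per clip
val_fraction = val_clips / total_clips                # ~5% held out

print(f"avg clip: {avg_clip_seconds:.2f} s, val split: {val_fraction:.1%}")
```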
## Training Results

| Metric | Step 0 | Step 5,000 |
|---|---|---|
| Train loss | 0.945 | 0.587 |
| Val loss | – | 0.925 |
| Learning rate | 2e-8 | 7.84e-6 |
| Grad norm | – | 0.14 (stable) |
## 5K Milestone Evaluation

| Metric | Value |
|---|---|
| UTMOS (mean) | 2.63 / 5 |
| WER (Whisper Small, Arabic) | 20.4% |
Full training logs and audio samples available on Weights & Biases.
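WER here is the standard word-level metric: the edit distance between the Whisper transcript and the reference text, divided by the number of reference words. A minimal self-contained implementation is sketched below (the transcription step itself, via Whisper, is omitted):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the processed prefix of ref
    # and hyp[:j]; rolled forward one reference word at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,            # deletion
                cur[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h), # substitution (free if words match)
            )
        prev = cur
    return prev[len(hyp)] / len(ref)

# One substituted word in a three-word reference -> WER of 1/3
print(wer("a b c", "a x c"))
```

A reported 20.4% WER means roughly one word in five of the synthesized speech was mis-transcribed relative to the input text.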
## Usage

Requires the `f5-tts` package and a reference audio clip for voice cloning.
```python
from f5_tts.infer.utils_infer import (
    load_model,
    load_vocoder,
    preprocess_ref_audio_text,
    infer_process,
)
from hydra.utils import get_class
from omegaconf import OmegaConf
from importlib.resources import files
import soundfile as sf

# Load the F5-TTS v1 base architecture config bundled with the package
model_cfg = OmegaConf.load(
    str(files("f5_tts").joinpath("configs/F5TTS_v1_Base.yaml"))
)
model_cls = get_class(f"f5_tts.model.{model_cfg.model.backbone}")
vocoder_name = model_cfg.model.mel_spec.mel_spec_type

# Load the fine-tuned checkpoint with the Habibi-TTS EGY vocabulary
model = load_model(
    model_cls,
    model_cfg.model.arch,
    "model_5000.pt",
    mel_spec_type=vocoder_name,
    vocab_file="vocab.txt",
    device="cuda",
)
vocoder = load_vocoder(vocoder_name=vocoder_name, device="cuda")

# Reference clip plus its transcript ("voice reference") for cloning
ref_audio, ref_text = preprocess_ref_audio_text("reference.wav", "مرجع الصوت")

# gen_text: "Today we're going to talk about a very important topic"
audio, sr, _ = infer_process(
    ref_audio=ref_audio,
    ref_text=ref_text,
    gen_text="النهارده حنتكلم عن موضوع مهم جدًا",
    model_obj=model,
    vocoder=vocoder,
    mel_spec_type=vocoder_name,
    nfe_step=32,
    cfg_strength=2.0,
    sway_sampling_coef=-1.0,
    speed=1.0,
    device="cuda",
)
sf.write("output.wav", audio, sr)
```
## Files

| File | Description |
|---|---|
| `model_5000.pt` | Full model checkpoint at step 5,000 (5.4 GB) |
| `vocab.txt` | Character vocabulary (Habibi-TTS EGY tokenizer) |
## Limitations
- Trained for only 5,000 of a planned 22,000 steps. The run was stopped early at the first milestone evaluation. Loss was still declining and the model had not converged.
- The base Habibi-TTS EGY model was pretrained on data that is approximately 85% Modern Standard Arabic (MASC, FLEURS, Omnilingual ASR) with only ~13% genuine Egyptian dialect (MGB-3). This MSA-dominant prior is difficult to overcome with fine-tuning alone; the model tends to produce MSA grammar patterns (e.g., MSA سنتكلم instead of Egyptian حنتكلم, "we will talk").
- Primarily useful as a research artifact for studying Arabic dialect adaptation in flow-matching TTS, not for production Egyptian Arabic synthesis.
- UTMOS of 2.63/5 at the 5K checkpoint indicates below-average perceptual quality.
- Best results with 6โ15 seconds of clean reference audio from the target speaker.
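Given the 6–15 second reference-audio recommendation above, checking the clip's duration before cloning avoids a wasted inference run. A minimal sketch using only the standard library `wave` module (so it covers plain PCM WAV files only; the function name is illustrative, not part of the `f5-tts` API):

```python
import wave

def reference_duration_ok(path: str, min_s: float = 6.0,
                          max_s: float = 15.0) -> bool:
    """Return True if a PCM WAV reference clip falls in the recommended range."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    return min_s <= duration <= max_s
```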
## Acknowledgments
Built with F5-TTS, Habibi-TTS, and Modal for cloud GPU training. Experiment tracking via Weights & Biases.