# F5-TTS – Egyptian Arabic Fine-Tune (5K Steps)
A fine-tuned F5-TTS checkpoint based on Habibi-TTS EGY Specialized (flow-matching DiT architecture), trained on 16 hours of Egyptian Arabic conversational speech. The checkpoint was saved at the 5,000-step milestone of a planned 22,000-step run and is primarily useful as a research artifact for Arabic TTS experimentation.
## Model Details

| | |
|---|---|
| Base model | Habibi-TTS EGY Specialized (`model_100000.safetensors`) |
| Architecture | F5-TTS (DiT: dim=1024, depth=22, heads=16) |
| Method | Full fine-tuning |
| Language | Arabic (Egyptian dialect) |
| Task | Text-to-Speech |
| License | CC BY-NC 4.0 (inherited from Habibi-TTS) |
## Training Configuration

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 1e-5 |
| LR schedule | Linear warmup (5,000 steps) → linear decay |
| Gradient clipping | 1.0 |
| Batch size per GPU | 90,000 frames (max 32 clips) |
| Gradient accumulation | 2 |
| Precision | BF16 mixed |
| Hardware | 8× NVIDIA A100-80GB (Modal) |
| VRAM usage | ~54 GB / 80 GB per GPU |
| Steps completed | 5,000 of 22,000 planned |
| Training time | 5.24 hours |
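The data volume implied by these settings can be sanity-checked with a little arithmetic. The sketch below assumes the standard F5-TTS mel configuration (24 kHz audio, hop length 256, i.e. 93.75 mel frames per second); those constants are assumptions from the F5-TTS defaults, not values stated in this card.

```python
# Audio volume per optimizer step, derived from the table above.
SAMPLE_RATE = 24_000
HOP_LENGTH = 256                                # assumed F5-TTS default
frames_per_second = SAMPLE_RATE / HOP_LENGTH    # 93.75

frames_per_gpu = 90_000                         # batch size per GPU (frames)
num_gpus = 8
grad_accum = 2

# Seconds of audio in one per-GPU batch: 90,000 / 93.75 = 960 s (16 min)
seconds_per_gpu_batch = frames_per_gpu / frames_per_second

# Frames consumed per optimizer step across all GPUs and accumulation steps
frames_per_step = frames_per_gpu * num_gpus * grad_accum    # 1,440,000
hours_per_step = frames_per_step / frames_per_second / 3600

print(f"{seconds_per_gpu_batch:.0f} s per GPU batch, "
      f"{hours_per_step:.2f} h of audio per optimizer step")
```

Under these assumptions each optimizer step sees roughly 4.3 hours of audio, so 5,000 steps correspond to many hundreds of epochs over the 16-hour dataset.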
## Dataset

| | |
|---|---|
| Clips | 12,363 (11,745 train / 618 val) |
| Duration | ~16 hours |
| Language | Egyptian Arabic, conversational speech |
| Source | Multi-speaker recordings processed through a custom pipeline (source separation, diarization, forced alignment, quality filtering) |
| Audio format | 24,000 Hz mono WAV |
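The table's figures are internally consistent; a quick check also shows the average clip is short and the validation split small, which is worth knowing when reading the validation loss below:

```python
# Consistency check on the dataset figures in the table above.
total_clips = 12_363
train_clips = 11_745
val_clips = 618
total_hours = 16

assert train_clips + val_clips == total_clips

avg_clip_seconds = total_hours * 3600 / total_clips   # ~4.66 s per clip
val_fraction = val_clips / total_clips                # ~5% held out

print(f"avg clip: {avg_clip_seconds:.2f} s, val split: {val_fraction:.1%}")
```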
## Training Results

| Metric | Step 0 | Step 5,000 |
|---|---|---|
| Train loss | 0.945 | 0.587 |
| Val loss | – | 0.925 |
| Learning rate | 2e-8 | 7.84e-6 |
| Grad norm | – | 0.14 (stable) |
## 5K Milestone Evaluation

| Metric | Value |
|---|---|
| UTMOS (mean) | 2.63 / 5 |
| WER (Whisper Small, Arabic) | 20.4% |
Full training logs and audio samples available on Weights & Biases.
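WER here is the standard word-level metric: the edit distance between the Whisper transcript and the reference text, divided by the number of reference words. A minimal self-contained implementation is sketched below (the transcription step itself, via Whisper, is omitted):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the processed prefix of ref
    # and hyp[:j]; rolled forward one reference word at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,            # deletion
                cur[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h), # substitution (free if words match)
            )
        prev = cur
    return prev[len(hyp)] / len(ref)

# One substituted word in a three-word reference -> WER of 1/3
print(wer("a b c", "a x c"))
```

A reported 20.4% WER means roughly one word in five of the synthesized speech was mis-transcribed relative to the input text.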
## Usage

Requires the `f5-tts` package and a reference audio clip for voice cloning.
```python
from f5_tts.infer.utils_infer import (
    load_model,
    load_vocoder,
    preprocess_ref_audio_text,
    infer_process,
)
from hydra.utils import get_class
from omegaconf import OmegaConf
from importlib.resources import files
import soundfile as sf

# Load the F5-TTS v1 base architecture config bundled with the package
model_cfg = OmegaConf.load(
    str(files("f5_tts").joinpath("configs/F5TTS_v1_Base.yaml"))
)
model_cls = get_class(f"f5_tts.model.{model_cfg.model.backbone}")
vocoder_name = model_cfg.model.mel_spec.mel_spec_type

# Load the fine-tuned checkpoint with the Habibi-TTS EGY vocabulary
model = load_model(
    model_cls,
    model_cfg.model.arch,
    "model_5000.pt",
    mel_spec_type=vocoder_name,
    vocab_file="vocab.txt",
    device="cuda",
)
vocoder = load_vocoder(vocoder_name=vocoder_name, device="cuda")

# Reference clip plus its transcript ("voice reference") for cloning
ref_audio, ref_text = preprocess_ref_audio_text("reference.wav", "مرجع الصوت")

# gen_text: "Today we're going to talk about a very important topic"
audio, sr, _ = infer_process(
    ref_audio=ref_audio,
    ref_text=ref_text,
    gen_text="النهارده حنتكلم عن موضوع مهم جدًا",
    model_obj=model,
    vocoder=vocoder,
    mel_spec_type=vocoder_name,
    nfe_step=32,
    cfg_strength=2.0,
    sway_sampling_coef=-1.0,
    speed=1.0,
    device="cuda",
)
sf.write("output.wav", audio, sr)
```
## Files

| File | Description |
|---|---|
| `model_5000.pt` | Full model checkpoint at step 5,000 (5.4 GB) |
| `vocab.txt` | Character vocabulary (Habibi-TTS EGY tokenizer) |
## Limitations
- Trained for only 5,000 of a planned 22,000 steps. The run was stopped early at the first milestone evaluation. Loss was still declining and the model had not converged.
- The base Habibi-TTS EGY model was pretrained on data that is approximately 85% Modern Standard Arabic (MASC, FLEURS, Omnilingual ASR) with only ~13% genuine Egyptian dialect (MGB-3). This MSA-dominant prior is difficult to overcome with fine-tuning alone; the model tends to produce MSA grammar patterns (e.g., MSA سنتكلم instead of Egyptian حنتكلم, "we will talk").
- Primarily useful as a research artifact for studying Arabic dialect adaptation in flow-matching TTS, not for production Egyptian Arabic synthesis.
- UTMOS of 2.63/5 at the 5K checkpoint indicates below-average perceptual quality.
- Best results with 6โ15 seconds of clean reference audio from the target speaker.
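Given the 6–15 second reference-audio recommendation above, checking the clip's duration before cloning avoids a wasted inference run. A minimal sketch using only the standard library `wave` module (so it covers plain PCM WAV files only; the function name is illustrative, not part of the `f5-tts` API):

```python
import wave

def reference_duration_ok(path: str, min_s: float = 6.0,
                          max_s: float = 15.0) -> bool:
    """Return True if a PCM WAV reference clip falls in the recommended range."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    return min_s <= duration <= max_s
```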
## Acknowledgments
Built with F5-TTS, Habibi-TTS, and Modal for cloud GPU training. Experiment tracking via Weights & Biases.