SILMA TTS: A Lightweight Open Bilingual Text to Speech Model

Community Article Published March 15, 2026

Today, SILMA AI is excited to introduce SILMA TTS v1 — a powerful, 150M-parameter bilingual text-to-speech model seamlessly bridging Arabic and English. Built on the innovative F5-TTS diffusion architecture, this model was pre-trained from scratch using tens of thousands of hours of meticulously curated public and proprietary audio. Because we believe in giving back to the open-source community, we are releasing SILMA TTS under a highly permissive license, making state-of-the-art speech synthesis fully accessible for both research and commercial applications.


Model Features

Below are the main features of the model:

- High-Fidelity Audio: high-quality, natural-sounding speech synthesis
- 150M Parameters: lightweight and efficient; runs well in low-resource environments
- Instant Voice Cloning: clone any voice from less than 8 seconds of reference audio
- Ultra-Low Latency: optimized for real-time applications, with an RTF of around 0.12 on an RTX 4090 GPU
- Bilingual Arabic & English Support: native-level fluency in Arabic Fusha/MSA as well as English
- Advanced Arabic Diacritization: full Tashkeel support for precise, context-aware pronunciation
- Text Normalization: built on NeMo Text Processing
- Commercial-Friendly Licensing: fully open source under the Apache 2.0 License
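To make the latency claim concrete: the real-time factor (RTF) is the ratio of synthesis wall-clock time to the duration of the generated audio, so an RTF of 0.12 means about one second of speech is produced in roughly 0.12 seconds. A minimal sketch (the function names here are illustrative, not part of the SILMA TTS API):

```python
# RTF = wall-clock synthesis time / duration of the generated audio.
# RTF < 1.0 means faster than real time; 0.12 is ~8x faster than real time.

def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Compute the real-time factor of a TTS run."""
    return synthesis_seconds / audio_seconds

def estimated_synthesis_time(audio_seconds: float, rtf_value: float = 0.12) -> float:
    """Estimate wall-clock time to synthesize a clip at a given RTF."""
    return audio_seconds * rtf_value
```

For example, at RTF 0.12 a 10-second clip should take roughly 1.2 seconds of wall-clock time to synthesize on the reference GPU.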

Why did we release SILMA TTS

SILMA AI has always been deeply committed to open-source development and community impact. However, the release of this specific model was driven by two key factors:

Filling the gap for a low-resource language: Arabic is spoken by over 400 million people globally, yet it remains unsupported by many TTS models due to the scarcity of high-quality Arabic audio data on the internet.

Improving upon F5-TTS size and licensing constraints: Because the original F5-TTS model was only available in one size and restricted from commercial use, we wanted to give back to the community by providing a more lightweight model (150M parameters) under a highly permissive license (Apache-2.0).

How We Built the Model

Architecture Optimization: First, we reduced the number of parameters in the original F5-TTS architecture from over 300M to approximately 150M.
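Most of a transformer-based model's parameters live in its attention and MLP projections, so shrinking depth and hidden width is the usual lever for this kind of reduction. The sketch below is purely illustrative back-of-the-envelope math; the actual F5-TTS and SILMA configurations are not stated in this post, and the configs used here are hypothetical:

```python
# Rough parameter count for a stack of transformer blocks, counting only
# the dominant terms: attention projections (Q, K, V, output) and the MLP.
# Embeddings, norms, and conditioning layers are ignored for simplicity.

def transformer_params(depth: int, dim: int, mlp_ratio: int = 4) -> int:
    attn = 4 * dim * dim                # Q, K, V, and output projections
    mlp = 2 * dim * (mlp_ratio * dim)   # up-projection and down-projection
    return depth * (attn + mlp)

# Hypothetical configs showing how width/depth changes halve the budget:
base = transformer_params(depth=22, dim=1024)   # ~277M (a 300M-class model)
small = transformer_params(depth=18, dim=768)   # ~127M (a 150M-class model)
```

The point of the sketch is that a modest reduction in width (1024 to 768) compounds quadratically in each projection, which is why a roughly 2x parameter cut does not require halving every dimension.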

Extensive Pretraining: Next, we pretrained the weights from scratch over a full week using 8 GPUs, training on tens of thousands of hours of high-quality data.

Targeted Fine-Tuning: We followed pretraining with a specialized fine-tuning phase, using a smaller, curated dataset with more focus on Arabic.

Inference Enhancements: Finally, we introduced significant optimizations to the inference code to improve Arabic text handling, chunking, text normalization, overall robustness, and final audio quality.
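Chunking matters for diffusion-style TTS because long inputs are best synthesized sentence by sentence and then concatenated. The snippet below is a simplified illustration of the idea, not the actual SILMA inference code: it splits text at sentence boundaries (including the Arabic question mark) and packs sentences into chunks under a length budget.

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text at sentence boundaries and pack into length-bounded chunks.

    Illustrative only; the real SILMA inference code is in its GitHub repo.
    The boundary set includes '؟' (Arabic question mark) for bilingual input.
    """
    sentences = re.split(r"(?<=[.!?؟])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sent in sentences:
        # Start a new chunk if appending would exceed the budget.
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}" if current else sent
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized independently and the audio segments joined, which keeps memory bounded and avoids degrading quality on very long passages.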

Try it out

You can easily try SILMA TTS v1 with just two commands:

pip install silma-tts
silma-tts-app

Here are all the relevant links for you to learn more about the model:

  1. GitHub Repo: https://github.com/SILMA-AI/silma-tts
  2. Hugging Face Repo: https://huggingface.co/silma-ai/silma-tts
  3. Hugging Face Demo Space: https://huggingface.co/spaces/silma-ai/silma-tts-v1-demo

We are excited to see how you will benefit from and build upon SILMA TTS!
