---
language:
- sr
tags:
- text-to-speech
- tts
- f5-tts
- serbian
license: mit
base_model:
- SWivid/F5-TTS
pipeline_tag: text-to-speech
---

# F5-TTS Serbian

A Serbian TTS model based on [F5-TTS](https://github.com/SWivid/F5-TTS), trained from scratch on a Serbian speech dataset.
The model is not production-ready and still hallucinates; treat it as an experiment.

## Model Details

| Property | Value |
|---|---|
| Architecture | F5TTS_v1_Base |
| Tokenizer | char |
| Training | from scratch (not finetuned) |
| Mixed precision | bf16 |
| Dataset | 60,948 samples / 132.05 hours |
| Steps | 430,000 |
| Epochs | 434 |
| GPU | NVIDIA A40 (46GB) |

## Training Config

```yaml
exp_name: F5TTS_v1_Base
tokenizer: char
mixed_precision: bf16
learning_rate: 7.5e-05
batch_size_per_gpu: 20189
batch_size_type: frame
max_samples: 64
grad_accumulation_steps: 1
max_grad_norm: 1
epochs: 434
num_warmup_updates: 3779
save_per_updates: 5000
keep_last_n_checkpoints: 1
last_per_updates: 10000
logger: tensorboard
```
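
Because `batch_size_type` is `frame`, `batch_size_per_gpu` is a budget of mel-spectrogram frames rather than a number of clips. A minimal sketch of what that budget means in seconds of audio, assuming the F5-TTS default mel settings of a 24 kHz sample rate and a hop length of 256 (neither is stated in the config above, so treat both as assumptions):

```python
# Assumed mel settings (F5-TTS defaults; not listed in the config above).
SAMPLE_RATE = 24_000  # audio samples per second
HOP_LENGTH = 256      # audio samples per mel frame


def frames_to_seconds(frames: int,
                      sample_rate: int = SAMPLE_RATE,
                      hop_length: int = HOP_LENGTH) -> float:
    """Convert a count of mel frames into seconds of audio."""
    return frames * hop_length / sample_rate


batch_seconds = frames_to_seconds(20189)  # batch_size_per_gpu from the config
print(f"about {batch_seconds:.1f} s of audio per batch")
```

Under those assumptions each update sees roughly 215 seconds of audio, spread over at most `max_samples: 64` clips.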

## Training Curves

**Loss**
![loss curve](https://i.imgur.com/jgpIUgR.png)

**Learning Rate**
![learning rate](https://i.imgur.com/l2w2Q7x.png)
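
The shape of the learning-rate curve can be approximated with a small function. This is a sketch, assuming a linear warmup to the peak rate followed by a linear decay towards zero. `lr_at` is a hypothetical helper and `total_updates` is a placeholder value, so the function is illustrative rather than the trainer's own scheduler code.

```python
def lr_at(update: int,
          peak_lr: float = 7.5e-05,
          warmup_updates: int = 3779,
          total_updates: int = 430_000) -> float:
    """Approximate learning rate at a given update.

    Assumes linear warmup then linear decay; total_updates is a
    hypothetical placeholder, not a value taken from the training logs.
    """
    if update < warmup_updates:
        # Warmup: ramp linearly from 0 up to the peak rate.
        return peak_lr * update / warmup_updates
    # Decay: fall linearly from the peak rate to 0 at total_updates.
    decay_updates = max(1, total_updates - warmup_updates)
    remaining = max(0, total_updates - update)
    return peak_lr * remaining / decay_updates


for step in (0, 3779, 200_000, 430_000):
    print(step, lr_at(step))
```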

## Checkpoint

The checkpoint contains only the EMA model weights (`ema_model_state_dict`), stripped of optimizer and scheduler states for minimal file size.
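
For reference, reducing a full training checkpoint to this form amounts to keeping a single key. A minimal sketch, assuming the full checkpoint is a dictionary that holds `ema_model_state_dict` next to optimizer and scheduler entries; `strip_checkpoint` is a hypothetical helper, not part of F5-TTS:

```python
def strip_checkpoint(ckpt: dict) -> dict:
    """Return a new dict that keeps only the EMA weights of a checkpoint."""
    if "ema_model_state_dict" not in ckpt:
        raise KeyError("checkpoint has no 'ema_model_state_dict' entry")
    return {"ema_model_state_dict": ckpt["ema_model_state_dict"]}


# Typical use with PyTorch (file names are placeholders):
#   full = torch.load("full_checkpoint.pt", map_location="cpu")
#   torch.save(strip_checkpoint(full), "ema_only.pt")
```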

## Usage

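Load with F5-TTS:

```python
import torch
from f5_tts.model import DiT
from f5_tts.infer.utils_infer import load_checkpoint

# Read the checkpoint on the CPU and pull out the EMA weights.
ckpt = torch.load("model_430000.pt", map_location="cpu")
model_state = ckpt["ema_model_state_dict"]

# `DiT` and `load_checkpoint` are imported for the next step, not shown here:
# building the F5TTS_v1_Base model and loading these weights into it.
```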