---
language:
- sr
tags:
- text-to-speech
- tts
- f5-tts
- serbian
license: mit
base_model:
- SWivid/F5-TTS
pipeline_tag: text-to-speech
---

# F5-TTS Serbian

A Serbian TTS model based on [F5-TTS](https://github.com/SWivid/F5-TTS), trained from scratch on a Serbian speech dataset.

This model is not production-ready: it still hallucinates. It is only a test.

## Model Details

| Property | Value |
|---|---|
| Architecture | F5TTS_v1_Base |
| Tokenizer | char |
| Training | from scratch (not finetuned) |
| Mixed precision | bf16 |
| Dataset | 60,948 samples / 132.05 hours |
| Steps | 430,000 |
| Epochs | 434 |
| GPU | NVIDIA A40 (46GB) |

## Training Config

```yaml
exp_name: F5TTS_v1_Base
tokenizer: char
mixed_precision: bf16
learning_rate: 7.5e-05
batch_size_per_gpu: 20189
batch_size_type: frame
max_samples: 64
grad_accumulation_steps: 1
max_grad_norm: 1
epochs: 434
num_warmup_updates: 3779
save_per_updates: 5000
keep_last_n_checkpoints: 1
last_per_updates: 10000
logger: tensorboard
```

## Training Curves

**Loss**

![loss curve](https://i.imgur.com/jgpIUgR.png)

**Learning Rate**

![learning rate](https://i.imgur.com/l2w2Q7x.png)

## Checkpoint

The checkpoint contains only the EMA model weights (`ema_model_state_dict`). Optimizer and scheduler states have been stripped to keep the file small.

## Usage

Load with F5-TTS:

```python
import torch
from f5_tts.model import DiT
from f5_tts.infer.utils_infer import load_checkpoint

# Load the checkpoint on CPU; it holds only the EMA weights.
ckpt = torch.load("model_430000.pt", map_location="cpu")
model_state = ckpt["ema_model_state_dict"]
```
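
Because the checkpoint holds only `ema_model_state_dict`, the weights may need a little cleanup before they fit a plain model. Below is a minimal sketch of that step. The helper name `extract_ema_weights` is hypothetical, and the `ema_model.` prefix and the `initted` / `step` bookkeeping entries are assumptions about how the EMA wrapper stores its state; check them against the actual keys in your file.

```python
from typing import Any, Mapping

# Assumption: bookkeeping entries the EMA wrapper stores next to the weights.
EMA_BOOKKEEPING_KEYS = ("initted", "step")
# Assumption: prefix the EMA wrapper adds to parameter names.
EMA_PREFIX = "ema_model."


def extract_ema_weights(ckpt: Mapping[str, Any]) -> dict:
    """Return a plain state dict from a checkpoint that holds only EMA weights.

    Hypothetical helper: works on any mapping, so it can be tried without torch.
    """
    if "ema_model_state_dict" not in ckpt:
        raise KeyError(
            "expected 'ema_model_state_dict' in checkpoint, found: "
            + ", ".join(sorted(ckpt))
        )

    cleaned = {}
    for name, value in ckpt["ema_model_state_dict"].items():
        if name in EMA_BOOKKEEPING_KEYS:
            continue  # EMA counters, not model parameters
        # Strip the wrapper prefix only when it is actually present.
        if name.startswith(EMA_PREFIX):
            name = name[len(EMA_PREFIX):]
        cleaned[name] = value
    return cleaned
```

Continuing from the snippet above, `extract_ema_weights(ckpt)` returns a dictionary intended for a `load_state_dict` call on a model built with the matching F5TTS_v1_Base configuration. Building that model is outside this sketch.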