JALAK - Indonesian Text-to-Speech Model
JALAK is a VITS model fine-tuned for Indonesian text-to-speech (TTS) on the X-lord/Dataset-Text-To-Speech-Indonesia dataset.
Model Details
- Base Model: Wikidepia VITS
- Dataset: X-lord/Dataset-Text-To-Speech-Indonesia (4,531 samples, 16.38 hours)
- Training: 1000 epochs
- Sample Rate: 22050 Hz
- Language: Indonesian
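As a quick sanity check on the dataset figures listed above, the average clip length can be estimated from the sample count and total duration. This is only a rough sketch: the 16.38-hour figure is rounded, and the variable names are illustrative rather than part of the model or dataset.

```python
# Figures taken from the Model Details list; names are illustrative.
NUM_SAMPLES = 4531
TOTAL_HOURS = 16.38   # rounded in the source, so results are approximate
SAMPLE_RATE = 22050   # Hz

total_seconds = TOTAL_HOURS * 3600
avg_clip_seconds = total_seconds / NUM_SAMPLES
avg_clip_frames = avg_clip_seconds * SAMPLE_RATE

print(f"Average clip length: {avg_clip_seconds:.1f} s")
print(f"Average audio samples per clip: {avg_clip_frames:,.0f}")
```

This works out to roughly 13 seconds per clip on average.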
Usage
import torch
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.models.vits import Vits
import soundfile as sf
# Load config
config = VitsConfig()
config.load_json('config.json')
config.speakers_file = None
config.use_speaker_embedding = False
config.num_speakers = 0
# Load model
model = Vits.init_from_config(config)
model.load_checkpoint(config, 'best_model.pth')
model.eval()
# Move the model to the GPU when one is available
if torch.cuda.is_available():
    model = model.cuda()
# Generate speech
text = "Selamat pagi, bagaimana kabar Anda?"
token_ids = model.tokenizer.text_to_ids(text)
token_ids = torch.LongTensor(token_ids).unsqueeze(0)
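if torch.cuda.is_available():
    token_ids = token_ids.cuda()
# Run inference without tracking gradients
with torch.no_grad():
    outputs = model.inference(token_ids)
# Drop the batch and channel dimensions and convert to a NumPy array
waveform = outputs['model_outputs'].squeeze().cpu().numpy()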
sf.write('output.wav', waveform, 22050)
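If `soundfile` is not installed, the waveform could instead be written with Python's standard library. The helper below is a minimal sketch, not part of Coqui TTS: `write_wav_16bit` is a hypothetical name, and it assumes a mono waveform of floats roughly in the range [-1, 1], which it clips and converts to 16-bit PCM.

```python
import array
import sys
import wave


def write_wav_16bit(path, samples, sample_rate=22050):
    """Write mono float samples as a 16-bit PCM WAV file.

    Hypothetical helper: `samples` is any iterable of floats, assumed to
    lie roughly in [-1, 1]. Returns the audio duration in seconds.
    """
    pcm = array.array("h")  # signed 16-bit integers
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip out-of-range values
        pcm.append(int(round(s * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
    return len(pcm) / sample_rate
```

With the `waveform` array from the usage code, `write_wav_16bit('output.wav', waveform)` would write the file and return its duration in seconds.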
Training Metrics
| Metric | Value |
|---|---|
| Loss disc | 2.56 |
| Loss gen | 2.15 |
| Loss mel | 15.22 |
| Epochs | 1000 |
License
MIT License
Credits
- Base model: Wikidepia Indonesian TTS
- Dataset: X-lord/Dataset-Text-To-Speech-Indonesia
- Framework: Coqui TTS