OmniVoice Georgian — Fine-Tuned
A fine-tuned variant of k2-fsa/OmniVoice for Georgian (ქართული) text-to-speech with zero-shot voice cloning support. Trained on a Georgian speech corpus (~150 hours, 29 speakers) using a low learning rate (2e-5) to preserve the multilingual capabilities of the base model.
📊 Benchmark report: github.com/NMikaa/TTS_pipelines 🛠️ Training recipe: pipelines/omnivoice/README.md
Why use this model?
Compared to the pretrained OmniVoice baseline, this fine-tuned variant offers:
Robust voice generation across genders. Pretrained OmniVoice tends to collapse toward overrepresented Common Voice speakers: given a male reference, it sometimes generates female output. This model produces both male and female voices reliably from any Georgian reference.
No language leakage in cross-lingual cloning. Verified 0% Georgian Unicode leakage in English ASR transcripts when using a Georgian reference for English text generation.
Subjective Georgian quality. Native Georgian listener feedback indicates clearly more authentic phonation, vowel quality, and intonation compared to the pretrained model.
Matches pretrained intelligibility. Within 0.03% CER on FLEURS Georgian (1.61% vs 1.64%) — statistical noise — while delivering the qualitative improvements above.
Installation
OmniVoice is not on PyPI yet — you must install it from source first. Then this model can be loaded directly via from_pretrained.
# Install OmniVoice from source (requires Python 3.10+, torch 2.8+ recommended)
pip install git+https://github.com/k2-fsa/OmniVoice.git
# Optional: install datasets if you want to evaluate
pip install datasets soundfile torchaudio
GPU recommended. Inference works on CPU but is much slower; ~3 GB of VRAM is enough for inference in fp16.

Note on torchaudio 2.11+: torchaudio 2.11 removed all built-in backends except torchcodec. If you hit ImportError: TorchCodec is required, either install torchcodec (pip install torchcodec) or downgrade to torchaudio==2.8.0.
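To decide which path applies before anything breaks at load time, a small version check is enough. The helper below is a hypothetical convenience (not part of OmniVoice or torchaudio) that parses a torchaudio version string:

```python
def needs_torchcodec(torchaudio_version: str) -> bool:
    """True if this torchaudio release dropped built-in audio backends (>= 2.11)."""
    major, minor = (int(part) for part in torchaudio_version.split(".")[:2])
    return (major, minor) >= (2, 11)

print(needs_torchcodec("2.8.0"))   # False, built-in backends still present
print(needs_torchcodec("2.11.0"))  # True, install torchcodec
```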
Quick Start
import torch
from omnivoice import OmniVoice
from omnivoice.models.omnivoice import OmniVoiceGenerationConfig
import torchaudio
model = OmniVoice.from_pretrained(
    "NMikka/omnivoice-finetuned-ka",
    device_map="cuda:0",
    dtype=torch.float16,
    load_asr=True,  # auto-transcribe reference audio
)

# Voice cloning with reference
prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text=None,  # auto-transcribed if None
)

result = model.generate(
    text="გამარჯობა, ეს არის ქართული ტექსტი.",
    language="Georgian",
    voice_clone_prompt=prompt,
    generation_config=OmniVoiceGenerationConfig(
        num_step=32,
        guidance_scale=2.0,
    ),
)

torchaudio.save("output.wav", result[0].cpu(), 24000)
For voice design (without reference audio), use instruct= instead of voice_clone_prompt=:
result = model.generate(
    text="გამარჯობა, ეს არის ქართული ტექსტი.",
    language="Georgian",
    instruct="female, young adult",
    generation_config=OmniVoiceGenerationConfig(num_step=32, guidance_scale=2.0),
)
Benchmark Results (FLEURS Georgian, 979 samples, normalized text)
| Model | FL-CER ↓ | FL-MOS ↑ | TTSDS Pitch ↑ | TTSDS SR ↑ | Voice Cloning |
|---|---|---|---|---|---|
| This model (099v2_ckpt480) | 1.61% | 2.920 | 82.07 | 75.51 | ✅ Robust both genders |
| OmniVoice pretrained | 1.64% | 2.749 | 85.64 | 76.34 | ⚠️ Speaker collapse |
See the full benchmark report for TTSDS prosody analysis, per-checkpoint ablations, and discussion of the metrics-vs-listening gap for low-resource languages.
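The FL-CER numbers above come from a round-trip evaluation: synthesize each FLEURS sentence, transcribe it with ASR, and compare the normalized transcript against the input text. The exact normalization lives in the linked report; the core metric itself is just corpus-level character error rate, sketched here self-contained:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[-1] + 1,               # insertion
                            prev[j - 1] + (ca != cb)))  # substitution / match
        prev = curr
    return prev[-1]

def corpus_cer(pairs):
    """Corpus-level CER: total character edits / total reference characters."""
    edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    return edits / sum(len(ref) for ref, _ in pairs)

print(corpus_cer([("გამარჯობა", "გამარჯობა")]))  # 0.0, perfect round trip
```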
Training Details
| Parameter | Value |
|---|---|
| Base model | k2-fsa/OmniVoice (Qwen3-0.6B + HiggsAudioV2 codec, 600M params) |
| Training data | ~100 hours of Georgian speech data |
| Quality threshold | 0.99 text-audio match ratio |
| Learning rate | 2e-5 (lower than official 5e-5 to preserve cross-lingual capabilities) |
| Warmup ratio | 0.01 |
| Steps | 480 (~2 epochs) |
| Batch size | 4096 tokens × 2 GPUs × 4 grad accum = ~32k effective |
| Mixed precision | bf16 |
| Hardware | 2 × A6000 (48GB) |
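As a sanity check on the batch-size and warmup rows of the table, the arithmetic works out as follows:

```python
# Effective batch size: tokens per GPU x GPUs x gradient accumulation steps.
tokens_per_gpu, gpus, grad_accum = 4096, 2, 4
effective_tokens = tokens_per_gpu * gpus * grad_accum
print(effective_tokens)  # 32768, the "~32k" in the table

# Warmup length: warmup_ratio 0.01 over 480 total steps.
print(round(480 * 0.01))  # 5 warmup steps
```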
Critical findings:
- lr=2e-5 is the right choice — the official 5e-5 caused too much catastrophic forgetting of cross-lingual capability
- Best checkpoint is at ~2 epochs (step 480). Beyond epoch 4, English regression starts to appear
- 0% language leakage in cross-lingual cloning (verified on FLEURS English with Georgian reference)
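The leakage check is simple to reproduce: scan the English ASR transcripts for any character from the Unicode Georgian block (U+10A0 to U+10FF). A minimal sketch, with function name and inputs illustrative rather than taken from the benchmark code:

```python
GEORGIAN_BLOCK = range(0x10A0, 0x1100)  # Unicode Georgian block, U+10A0..U+10FF

def leakage_ratio(transcripts):
    """Fraction of ASR transcripts containing at least one Georgian character."""
    leaked = sum(any(ord(ch) in GEORGIAN_BLOCK for ch in t) for t in transcripts)
    return leaked / len(transcripts)

print(leakage_ratio(["hello world", "good morning"]))  # 0.0, no leakage
print(leakage_ratio(["hello გამარჯობა"]))              # 1.0
```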
Limitations
- No native human MOS evaluation — the gold standard for TTS quality. We rely on automatic metrics + native listener feedback (anecdotal). A formal MOS study would be the next step.
- Subjective improvements are not fully captured by automatic metrics — TTSDS Pitch and CER scores are slightly worse than pretrained, but Georgian listeners report clearly better quality. This is documented in detail in the benchmark report.
- Speaker similarity decreased slightly vs pretrained (0.683 vs 0.717 on CV-GEO with WavLM-ECAPA) — typical artifact of fine-tuning a multilingual model on a narrower distribution.
- English quality unchanged or slightly improved but cross-lingual prosody quality not formally verified beyond CER.
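The speaker-similarity numbers quoted above are cosine similarities between speaker embeddings (WavLM-ECAPA embeddings in the benchmark). The metric itself is a one-liner; here is a plain-Python sketch with toy vectors standing in for real embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

# Toy 3-d vectors standing in for real speaker embeddings.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0, identical
```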
Intended Use
- Georgian text-to-speech with voice cloning
- Multilingual applications requiring Georgian + English synthesis
- Research on fine-tuning multilingual TTS models for low-resource languages
Out of scope:
- Highest possible Georgian intelligibility — for that, see MagPIE TTS Georgian
- Languages other than Georgian and English. The base model supports 646 languages, but fine-tuning was Georgian-only; quality in other languages may degrade slightly (not formally evaluated).
Citation
@misc{omnivoice-georgian-2026,
  author    = {Mikaberidze, Nika},
  title     = {OmniVoice Georgian: Fine-tuning and Benchmark for Georgian Text-to-Speech},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/NMikaa/TTS_pipelines}
}
Please also cite the base OmniVoice paper:
@article{omnivoice2026,
  title  = {OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
  author = {Zhu, Han and others},
  year   = {2026},
  url    = {https://arxiv.org/abs/2604.00688}
}
License
Apache 2.0, inherited from the base k2-fsa/OmniVoice.
Acknowledgments
- k2-fsa/OmniVoice — base model and training framework
- Meta Omnilingual ASR — used for round-trip evaluation
- TTSDS — distribution-based prosody evaluation
- Mozilla Common Voice and Google FLEURS — public Georgian datasets used for evaluation