OmniVoice Georgian — Fine-Tuned
A fine-tuned variant of k2-fsa/OmniVoice for Georgian (ქართული) text-to-speech with zero-shot voice cloning support. Trained on a Georgian speech corpus (~150 hours, 29 speakers) using a low learning rate (2e-5) to preserve the multilingual capabilities of the base model.
📊 Benchmark report: github.com/NMikaa/TTS_pipelines 🛠️ Training recipe: pipelines/omnivoice/README.md
Why use this model?
Compared to the pretrained OmniVoice baseline, this fine-tuned variant offers:
Robust voice generation across genders. Pretrained OmniVoice tends to collapse toward overrepresented Common Voice speakers: given a male reference, it sometimes generates female output. This model produces both male and female voices reliably from any Georgian reference.
No language leakage in cross-lingual cloning. Verified 0% Georgian Unicode leakage in English ASR transcripts when using a Georgian reference for English text generation.
Subjective Georgian quality. Native Georgian listener feedback indicates clearly more authentic phonation, vowel quality, and intonation compared to the pretrained model.
Matches pretrained intelligibility. Within 0.03% CER on FLEURS Georgian (1.61% vs 1.64%) — statistical noise — while delivering the qualitative improvements above.
Installation
OmniVoice is not on PyPI yet — you must install it from source first. Then this model can be loaded directly via from_pretrained.
# Install OmniVoice from source (requires Python 3.10+, torch 2.8+ recommended)
pip install git+https://github.com/k2-fsa/OmniVoice.git
# Optional: install datasets if you want to evaluate
pip install datasets soundfile torchaudio
GPU recommended. Inference works on CPU but is much slower; ~3 GB of VRAM is enough for inference in fp16.

Note on torchaudio 2.11+: torchaudio 2.11 removed all built-in backends except torchcodec. If you hit ImportError: TorchCodec is required, either install torchcodec (pip install torchcodec) or downgrade to torchaudio==2.8.0.
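To decide which path applies before anything breaks at load time, a small version check is enough. The helper below is a hypothetical convenience (not part of OmniVoice or torchaudio) that parses a torchaudio version string:

```python
def needs_torchcodec(torchaudio_version: str) -> bool:
    """True if this torchaudio release dropped built-in audio backends (>= 2.11)."""
    major, minor = (int(part) for part in torchaudio_version.split(".")[:2])
    return (major, minor) >= (2, 11)

print(needs_torchcodec("2.8.0"))   # False, built-in backends still present
print(needs_torchcodec("2.11.0"))  # True, install torchcodec
```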
Quick Start
import torch
from omnivoice import OmniVoice
from omnivoice.models.omnivoice import OmniVoiceGenerationConfig
import torchaudio
model = OmniVoice.from_pretrained(
    "NMikka/omnivoice-finetuned-ka",
    device_map="cuda:0",
    dtype=torch.float16,
    load_asr=True,  # auto-transcribe reference audio
)

# Voice cloning with reference
prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text=None,  # auto-transcribed if None
)

result = model.generate(
    text="გამარჯობა, ეს არის ქართული ტექსტი.",
    language="Georgian",
    voice_clone_prompt=prompt,
    generation_config=OmniVoiceGenerationConfig(
        num_step=32,
        guidance_scale=2.0,
    ),
)

torchaudio.save("output.wav", result[0].cpu(), 24000)
For voice design (without reference audio), use instruct= instead of voice_clone_prompt=:
result = model.generate(
    text="გამარჯობა, ეს არის ქართული ტექსტი.",
    language="Georgian",
    instruct="female, young adult",
    generation_config=OmniVoiceGenerationConfig(num_step=32, guidance_scale=2.0),
)
Benchmark Results (FLEURS Georgian, 979 samples, normalized text)
| Model | FL-CER ↓ | FL-MOS ↑ | TTSDS Pitch ↑ | TTSDS SR ↑ | Voice Cloning |
|---|---|---|---|---|---|
| This model (099v2_ckpt480) | 1.61% | 2.920 | 82.07 | 75.51 | ✅ Robust both genders |
| OmniVoice pretrained | 1.64% | 2.749 | 85.64 | 76.34 | ⚠️ Speaker collapse |
See the full benchmark report for TTSDS prosody analysis, per-checkpoint ablations, and discussion of the metrics-vs-listening gap for low-resource languages.
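The FL-CER numbers above come from a round-trip evaluation: synthesize each FLEURS sentence, transcribe it with ASR, and compare the normalized transcript against the input text. The exact normalization lives in the linked report; the core metric itself is just corpus-level character error rate, sketched here self-contained:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[-1] + 1,               # insertion
                            prev[j - 1] + (ca != cb)))  # substitution / match
        prev = curr
    return prev[-1]

def corpus_cer(pairs):
    """Corpus-level CER: total character edits / total reference characters."""
    edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    return edits / sum(len(ref) for ref, _ in pairs)

print(corpus_cer([("გამარჯობა", "გამარჯობა")]))  # 0.0, perfect round trip
```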
Training Details
| Parameter | Value |
|---|---|
| Base model | k2-fsa/OmniVoice (Qwen3-0.6B + HiggsAudioV2 codec, 600M params) |
| Training data | ~100 hours of Georgian speech data |
| Quality threshold | 0.99 text-audio match ratio |
| Learning rate | 2e-5 (lower than official 5e-5 to preserve cross-lingual capabilities) |
| Warmup ratio | 0.01 |
| Steps | 480 (~2 epochs) |
| Batch size | 4096 tokens × 2 GPUs × 4 grad accum = ~32k effective |
| Mixed precision | bf16 |
| Hardware | 2 × A6000 (48GB) |
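As a sanity check on the batch-size and warmup rows of the table, the arithmetic works out as follows:

```python
# Effective batch size: tokens per GPU x GPUs x gradient accumulation steps.
tokens_per_gpu, gpus, grad_accum = 4096, 2, 4
effective_tokens = tokens_per_gpu * gpus * grad_accum
print(effective_tokens)  # 32768, the "~32k" in the table

# Warmup length: warmup_ratio 0.01 over 480 total steps.
print(round(480 * 0.01))  # 5 warmup steps
```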
Critical findings:
- lr=2e-5 is the right choice — the official 5e-5 caused too much catastrophic forgetting of cross-lingual capability
- Best checkpoint is at ~2 epochs (step 480). Beyond epoch 4, English regression starts to appear
- 0% language leakage in cross-lingual cloning (verified on FLEURS English with Georgian reference)
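The leakage check is simple to reproduce: scan the English ASR transcripts for any character from the Unicode Georgian block (U+10A0 to U+10FF). A minimal sketch, with function name and inputs illustrative rather than taken from the benchmark code:

```python
GEORGIAN_BLOCK = range(0x10A0, 0x1100)  # Unicode Georgian block, U+10A0..U+10FF

def leakage_ratio(transcripts):
    """Fraction of ASR transcripts containing at least one Georgian character."""
    leaked = sum(any(ord(ch) in GEORGIAN_BLOCK for ch in t) for t in transcripts)
    return leaked / len(transcripts)

print(leakage_ratio(["hello world", "good morning"]))  # 0.0, no leakage
print(leakage_ratio(["hello გამარჯობა"]))              # 1.0
```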
Limitations
- No native human MOS evaluation — the gold standard for TTS quality. We rely on automatic metrics + native listener feedback (anecdotal). A formal MOS study would be the next step.
- Subjective improvements are not fully captured by automatic metrics — TTSDS Pitch and CER scores are slightly worse than pretrained, but Georgian listeners report clearly better quality. This is documented in detail in the benchmark report.
- Speaker similarity decreased slightly vs pretrained (0.683 vs 0.717 on CV-GEO with WavLM-ECAPA) — typical artifact of fine-tuning a multilingual model on a narrower distribution.
- English quality unchanged or slightly improved but cross-lingual prosody quality not formally verified beyond CER.
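The speaker-similarity numbers quoted above are cosine similarities between speaker embeddings (WavLM-ECAPA embeddings in the benchmark). The metric itself is a one-liner; here is a plain-Python sketch with toy vectors standing in for real embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

# Toy 3-d vectors standing in for real speaker embeddings.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0, identical
```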
Intended Use
- Georgian text-to-speech with voice cloning
- Multilingual applications requiring Georgian + English synthesis
- Research on fine-tuning multilingual TTS models for low-resource languages
Out of scope:
- Highest possible Georgian intelligibility — for that, see MagPIE TTS Georgian
- Languages other than Georgian and English. The base model supports 646 languages, but fine-tuning was Georgian-only; quality in other languages may degrade slightly (not formally evaluated).
Citation
@misc{omnivoice-georgian-2026,
  author    = {Mikaberidze, Nika},
  title     = {OmniVoice Georgian: Fine-tuning and Benchmark for Georgian Text-to-Speech},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/NMikaa/TTS_pipelines}
}
Please also cite the base OmniVoice paper:
@article{omnivoice2026,
  title  = {OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
  author = {Zhu, Han and others},
  year   = {2026},
  url    = {https://arxiv.org/abs/2604.00688}
}
License
Apache 2.0, inherited from the base k2-fsa/OmniVoice.
Acknowledgments
- k2-fsa/OmniVoice — base model and training framework
- Meta Omnilingual ASR — used for round-trip evaluation
- TTSDS — distribution-based prosody evaluation
- Mozilla Common Voice and Google FLEURS — public Georgian datasets used for evaluation