OmniVoice Georgian — Fine-Tuned

A fine-tuned variant of k2-fsa/OmniVoice for Georgian (ქართული) text-to-speech with zero-shot voice cloning support. Trained on a Georgian speech corpus (~150 hours, 29 speakers) using a low learning rate (2e-5) to preserve the multilingual capabilities of the base model.

📊 Benchmark report: github.com/NMikaa/TTS_pipelines
🛠️ Training recipe: pipelines/omnivoice/README.md

Why use this model?

Compared to the pretrained OmniVoice baseline, this fine-tuned variant offers:

  1. Robust voice generation across genders. Pretrained OmniVoice tends to collapse toward overrepresented Common Voice speakers: when given a male reference, it sometimes generates female output. This model produces both male and female voices reliably from any Georgian reference.

  2. No language leakage in cross-lingual cloning. Verified 0% Georgian Unicode leakage in English ASR transcripts when using a Georgian reference for English text generation.

  3. Subjective Georgian quality. Native Georgian listener feedback indicates clearly more authentic phonation, vowel quality, and intonation compared to the pretrained model.

  4. Matches pretrained intelligibility. CER on FLEURS Georgian is within 0.03 percentage points (1.61% vs 1.64%), which is statistical noise, while delivering the qualitative improvements above.
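The language-leakage check mentioned above can be sketched with a simple Unicode-range test over the English ASR transcripts. This is a minimal illustration; the function names are ours, not part of the benchmark pipeline:

```python
# Georgian Unicode blocks: Asomtavruli/Mkhedruli (U+10A0-U+10FF),
# Mtavruli (U+1C90-U+1CBF), and Nuskhuri (U+2D00-U+2D2F).
GEORGIAN_RANGES = [(0x10A0, 0x10FF), (0x1C90, 0x1CBF), (0x2D00, 0x2D2F)]

def has_georgian(text: str) -> bool:
    """True if any character falls in a Georgian Unicode block."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in GEORGIAN_RANGES)

def leakage_rate(transcripts: list[str]) -> float:
    """Fraction of English ASR transcripts containing Georgian script."""
    if not transcripts:
        return 0.0
    return sum(has_georgian(t) for t in transcripts) / len(transcripts)
```

A 0% leakage result corresponds to leakage_rate returning 0.0 over all English transcripts generated from a Georgian reference.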

Installation

OmniVoice is not on PyPI yet — you must install it from source first. Then this model can be loaded directly via from_pretrained.

# Install OmniVoice from source (requires Python 3.10+, torch 2.8+ recommended)
pip install git+https://github.com/k2-fsa/OmniVoice.git

# Optional: install datasets if you want to evaluate
pip install datasets soundfile torchaudio

A GPU is recommended; inference works on CPU but is much slower. About 3 GB of VRAM is enough for inference in fp16. Note on torchaudio 2.11+: torchaudio 2.11 removed all audio backends except torchcodec. If you hit ImportError: TorchCodec is required, either install torchcodec (pip install torchcodec) or downgrade to torchaudio==2.8.0.
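The torchaudio version check above can be encoded as a small helper, based on the fact that the backend removal landed in 2.11. This is a hedged sketch; the function name is illustrative:

```python
def needs_torchcodec(torchaudio_version: str) -> bool:
    """True for torchaudio >= 2.11, where torchcodec is the only backend.

    Compares only the (major, minor) components of the version string.
    """
    major, minor = (int(p) for p in torchaudio_version.split(".")[:2])
    return (major, minor) >= (2, 11)
```

In practice you would pass torchaudio.__version__ and install torchcodec when the helper returns True.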

Quick Start

import torch
from omnivoice import OmniVoice
from omnivoice.models.omnivoice import OmniVoiceGenerationConfig
import torchaudio

model = OmniVoice.from_pretrained(
    "NMikka/omnivoice-finetuned-ka",
    device_map="cuda:0",
    dtype=torch.float16,
    load_asr=True,  # auto-transcribe reference audio
)

# Voice cloning with reference
prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text=None,  # auto-transcribed if None
)

result = model.generate(
    text="გამარჯობა, ეს არის ქართული ტექსტი.",
    language="Georgian",
    voice_clone_prompt=prompt,
    generation_config=OmniVoiceGenerationConfig(
        num_step=32,
        guidance_scale=2.0,
    ),
)

torchaudio.save("output.wav", result[0].cpu(), 24000)

For voice design (without reference audio), use instruct= instead of voice_clone_prompt=:

result = model.generate(
    text="გამარჯობა, ეს არის ქართული ტექსტი.",
    language="Georgian",
    instruct="female, young adult",
    generation_config=OmniVoiceGenerationConfig(num_step=32, guidance_scale=2.0),
)

Benchmark Results (FLEURS Georgian, 979 samples, normalized text)

| Model | FL-CER ↓ | FL-MOS ↑ | TTSDS Pitch ↑ | TTSDS SR ↑ | Voice Cloning |
|---|---|---|---|---|---|
| This model (099v2_ckpt480) | 1.61% | 2.920 | 82.07 | 75.51 | ✅ Robust both genders |
| OmniVoice pretrained | 1.64% | 2.749 | 85.64 | 76.34 | ⚠️ Speaker collapse |

See the full benchmark report for TTSDS prosody analysis, per-checkpoint ablations, and discussion of the metrics-vs-listening gap for low-resource languages.
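For readers unfamiliar with the FL-CER metric, character error rate is Levenshtein edit distance over characters divided by the reference length. A minimal sketch (our own helper functions, not the benchmark's implementation):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            # Deletion, insertion, or substitution (free if chars match).
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edits needed per reference character."""
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

The reported numbers are this quantity averaged over the 979 FLEURS Georgian samples after text normalization.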

Training Details

| Parameter | Value |
|---|---|
| Base model | k2-fsa/OmniVoice (Qwen3-0.6B + HiggsAudioV2 codec, 600M params) |
| Training data | ~100 hours of Georgian Speech Data |
| Quality threshold | 0.99 text-audio match ratio |
| Learning rate | 2e-5 (lower than official 5e-5 to preserve cross-lingual capabilities) |
| Warmup ratio | 0.01 |
| Steps | 480 (~2 epochs) |
| Batch size | 4096 tokens × 2 GPUs × 4 grad accum = ~32k effective |
| Mixed precision | bf16 |
| Hardware | 2 × A6000 (48GB) |
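The effective batch size in the table is per-GPU token budget times GPU count times gradient-accumulation steps:

```python
tokens_per_gpu = 4096   # per-step token budget on each GPU
num_gpus = 2
grad_accum = 4          # gradient-accumulation steps

# 4096 * 2 * 4 = 32768 tokens per optimizer update (~32k effective).
effective_tokens = tokens_per_gpu * num_gpus * grad_accum
```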

Critical findings:

  • lr=2e-5 is the right choice — the official 5e-5 caused too much catastrophic forgetting of cross-lingual capability
  • Best checkpoint is at ~2 epochs (step 480). Beyond epoch 4, English regression starts to appear
  • 0% language leakage in cross-lingual cloning (verified on FLEURS English with Georgian reference)

Limitations

  1. No native human MOS evaluation — the gold standard for TTS quality. We rely on automatic metrics + native listener feedback (anecdotal). A formal MOS study would be the next step.
  2. Subjective improvements are not fully captured by automatic metrics — TTSDS Pitch and CER scores are slightly worse than pretrained, but Georgian listeners report clearly better quality. This is documented in detail in the benchmark report.
  3. Speaker similarity decreased slightly vs pretrained (0.683 vs 0.717 on CV-GEO with WavLM-ECAPA) — typical artifact of fine-tuning a multilingual model on a narrower distribution.
  4. English quality is unchanged or slightly improved, but cross-lingual prosody quality has not been formally verified beyond CER.

Intended Use

  • Georgian text-to-speech with voice cloning
  • Multilingual applications requiring Georgian + English synthesis
  • Research on fine-tuning multilingual TTS models for low-resource languages

Out of scope:

  • Highest possible Georgian intelligibility: for that, see MagPIE TTS Georgian
  • Languages other than Georgian and English: the base model supports 646 languages, but fine-tuning was only on Georgian, so other languages may degrade slightly

Citation

@misc{omnivoice-georgian-2026,
  author = {Mikaberidze, Nika},
  title = {OmniVoice Georgian: Fine-tuning and Benchmark for Georgian Text-to-Speech},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/NMikaa/TTS_pipelines}
}

Please also cite the base OmniVoice paper:

@article{omnivoice2026,
  title = {OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
  author = {Zhu, Han and others},
  year = {2026},
  url = {https://arxiv.org/abs/2604.00688}
}

License

Apache 2.0, inherited from the base k2-fsa/OmniVoice.
