This model is released under the Creative Commons Attribution-NonCommercial 4.0 International License, also known as CC BY-NC 4.0.
Access is limited to non-commercial use. Users must provide appropriate attribution when sharing or adapting the model where required by the license. Commercial use, including use in paid products, paid services, or commercial deployment, is not permitted under this license.
OmniVoice Hakka Community 1
formospeech/omnivoice-hakka-community-1 is a Taiwanese Hakka fine-tuned checkpoint based on k2-fsa/OmniVoice.
Summary
- Base model: k2-fsa/OmniVoice
- Model type: OmniVoice
- Primary use case: Taiwanese Hakka text-to-speech
- Released checkpoint: trained for 16000 steps
- Supported generation modes:
  - zero-shot voice cloning with ref_audio + ref_text
  - voice design with instruct
Usage
To get started, install the Hakka-enabled OmniVoice fork:
FormoSpeech/OmniVoice-hakka
This fork is required because the upstream k2-fsa/OmniVoice runtime does not support Hakka dialect labels such as 客語四縣腔 in instruct.
For this Hakka checkpoint, the target dialect must be specified through instruct if you want to control the generated Hakka dialect. This differs from the base OmniVoice usage because Hakka dialect prompting is part of the intended inference setup for this release.
Access and Authentication
This model is hosted as a gated Hugging Face repository. Before using it:
- Visit the model page and request access.
- Log in with the same Hugging Face account that has been granted access.
- Authenticate your local environment with a Hugging Face access token.
A read token is sufficient for inference.
pip install -U huggingface_hub
hf auth login
Alternatively, you can provide the token through the HF_TOKEN environment variable:
export HF_TOKEN=hf_xxx
Do not hard-code your Hugging Face token in scripts, notebooks, or public repositories.
If you see an error such as Cannot access gated repo, make sure that:
- your Hugging Face account has been granted access to this model;
- hf auth whoami shows the expected account;
- HF_HUB_DISABLE_IMPLICIT_TOKEN is not set.
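The checklist above can also be run from Python. The following is a minimal sketch and not part of the original project: the helper names are hypothetical, the set of truthy values for HF_HUB_DISABLE_IMPLICIT_TOKEN is an assumption about how huggingface_hub parses that variable, and whoami / auth_check are assumed to be available in the installed huggingface_hub version.

```python
import os

REPO_ID = "formospeech/omnivoice-hakka-community-1"

# Assumption: huggingface_hub treats these (case-insensitive) values as "true".
_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}


def implicit_token_disabled(env=None):
    """Return True if HF_HUB_DISABLE_IMPLICIT_TOKEN is set to a truthy value."""
    env = os.environ if env is None else env
    return env.get("HF_HUB_DISABLE_IMPLICIT_TOKEN", "").upper() in _TRUE_VALUES


def check_gated_access(repo_id=REPO_ID):
    """Print the logged-in account and verify that it can access the repo.

    Needs network access and a token from `hf auth login` or HF_TOKEN.
    """
    # Imported lazily so the environment check above works without the package.
    from huggingface_hub import auth_check, whoami

    if implicit_token_disabled():
        print("Warning: HF_HUB_DISABLE_IMPLICIT_TOKEN is set; "
              "the stored token is not sent automatically.")
    print("Logged in as:", whoami()["name"])
    auth_check(repo_id)  # raises an error if access has not been granted
    print("Access to", repo_id, "confirmed.")
```

Calling check_gated_access() before loading the model separates authentication problems from model-loading problems.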
Installation
Step 1: Install PyTorch
NVIDIA GPU
# Install pytorch with your CUDA version, e.g.
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
See the official PyTorch site for installation instructions for other versions.
Apple Silicon
pip install torch==2.8.0 torchaudio==2.8.0
Step 2: Install OmniVoice
pip install git+https://github.com/FormoSpeech/OmniVoice-hakka.git
Python API
import soundfile as sf
import torch
from omnivoice import OmniVoice
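# Load the gated checkpoint (requires the authentication described above).
model = OmniVoice.from_pretrained(
    "formospeech/omnivoice-hakka-community-1",
    device_map="cuda:0",
    dtype=torch.float16,
)
# Zero-shot voice cloning; instruct selects the target Hakka dialect.
audio = model.generate(
    text="客語語音合成測試。",  # "Hakka speech synthesis test."
    ref_audio="ref.wav",
    ref_text="這係參考語音。",  # transcript of ref.wav ("This is the reference audio.")
    instruct="客語四縣腔",  # Sixian dialect label
)
# Write the first generated waveform at a 24 kHz sample rate.
sf.write("out.wav", audio[0], 24000)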
Supported Hakka Dialect Labels
Supported Hakka dialect labels in this project are:
- 客語四縣腔 (Sixian)
- 客語海陸腔 (Hailu)
- 客語大埔腔 (Dapu)
- 客語饒平腔 (Raoping)
- 客語詔安腔 (Zhao'an)
- 客語南四縣腔 (Southern Sixian)
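Because the dialect is passed as a free-form string, a typo in the label is easy to miss. The sketch below validates the label before calling model.generate. build_generate_kwargs is a hypothetical helper, not part of the OmniVoice API. Requiring ref_audio and ref_text together is a choice of this sketch, based on the cloning mode described in the Summary. The English glosses are informal.

```python
# Dialect labels accepted by instruct, with informal English glosses.
HAKKA_DIALECT_LABELS = {
    "客語四縣腔": "Sixian",
    "客語海陸腔": "Hailu",
    "客語大埔腔": "Dapu",
    "客語饒平腔": "Raoping",
    "客語詔安腔": "Zhao'an",
    "客語南四縣腔": "Southern Sixian",
}


def build_generate_kwargs(text, dialect, ref_audio=None, ref_text=None):
    """Hypothetical helper: assemble keyword arguments for model.generate()."""
    if dialect not in HAKKA_DIALECT_LABELS:
        raise ValueError(
            f"Unknown dialect label {dialect!r}; "
            f"expected one of {list(HAKKA_DIALECT_LABELS)}"
        )
    # Assumption of this sketch: cloning needs both the audio and its transcript.
    if (ref_audio is None) != (ref_text is None):
        raise ValueError("ref_audio and ref_text must be provided together")
    kwargs = {"text": text, "instruct": dialect}
    if ref_audio is not None:
        kwargs["ref_audio"] = ref_audio
        kwargs["ref_text"] = ref_text
    return kwargs


# Usage with a loaded model (not executed here):
# audio = model.generate(**build_generate_kwargs("客語語音合成測試。", "客語海陸腔"))
```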
Training Data
Training uses a mixed FormoSpeech Hakka setup.
Original datasets:
- formospeech/hakkaradio_news_clean
- formospeech/hat_tts_hailu_clean
- formospeech/hat_tts_sixian_clean
- formospeech/hakka_elearning_example_clean
Denoised -R datasets:
- formospeech/hat_asr_sixian_reading_clean_r
- formospeech/hat_asr_hailu_reading_clean_r
- formospeech/hat_asr_nansixian_reading_clean_r
- formospeech/hat_asr_sixian_broadcast_clean_r
Text Processing
Training uses cleaned hanzi_cln / pinyin_cln text with a restored-punctuation pipeline.
The restored pipeline applies the following steps:
- Samples containing <UNK> or <spn> are filtered out.
- <SIL> / <sil> markers are removed from the cleaned text.
- The raw text is split into text units and the punctuation attached to each unit is recorded.
  - For Hanzi, the units are individual Chinese characters plus inline ASCII word spans.
  - For pinyin, the units are whitespace-delimited pinyin tokens.
- The cleaned text is converted into the same kind of units after removing silence markers.
- The raw-unit sequence and cleaned-unit sequence are aligned with partial matching.
- Punctuation is restored for the parts of the cleaned text that still align with the raw text.
  - Small insertions, deletions, or replacements do not prevent punctuation recovery for surrounding matched regions.
  - Punctuation is copied back only for aligned regions and is not forced into mismatched regions.
- After local punctuation transfer, the text is normalized:
  - leading punctuation is removed
  - missing or weak sentence-final punctuation is normalized
  - in text_pinyin, full-width punctuation is kept and extra spaces around punctuation are removed
In practice, the pipeline is conservative where the raw and cleaned text disagree locally, while still recovering punctuation around those mismatches. Short or fragmentary utterances may still end up with a sentence-final 。 after normalization.
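The restoration steps above can be sketched for the Hanzi case as follows. This is an illustration under stated assumptions, not the project's actual code: the function names and the punctuation set are hypothetical, and difflib.SequenceMatcher stands in for whatever partial-matching aligner the real pipeline uses. Pinyin handling and the <UNK> / <spn> filtering are left out.

```python
import re
from difflib import SequenceMatcher

PUNCT = "，。、？！；：,.?!;:"   # assumed punctuation inventory
FINAL = "。？！?!"              # treated as valid sentence-final marks
# One unit = an ASCII word span, or a single non-space, non-punctuation character.
UNIT_RE = re.compile(r"[A-Za-z0-9]+|[^\s" + re.escape(PUNCT) + r"]")
SILENCE_RE = re.compile(r"<sil>", re.IGNORECASE)


def split_units(text):
    """Split text into units and record the punctuation that follows each unit.

    Punctuation before the first unit is dropped (leading punctuation removal).
    """
    units, trailing, pos = [], [], 0
    for m in UNIT_RE.finditer(text):
        if units:
            trailing[-1] = "".join(c for c in text[pos:m.start()] if c in PUNCT)
        units.append(m.group())
        trailing.append("")
        pos = m.end()
    if units:
        trailing[-1] = "".join(c for c in text[pos:] if c in PUNCT)
    return units, trailing


def restore_punctuation(raw, cleaned):
    """Copy punctuation from raw onto cleaned, only where the units align."""
    raw_units, raw_punct = split_units(raw)
    clean_units, _ = split_units(SILENCE_RE.sub("", cleaned))
    restored = [""] * len(clean_units)
    matcher = SequenceMatcher(a=raw_units, b=clean_units, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            restored[block.b + k] = raw_punct[block.a + k]
    pieces = []
    for i, (unit, punct) in enumerate(zip(clean_units, restored)):
        # Keep a space between two adjacent ASCII word spans.
        if i and unit.isascii() and clean_units[i - 1].isascii() and not restored[i - 1]:
            pieces.append(" ")
        pieces.append(unit + punct)
    text = "".join(pieces)
    # Normalize missing or weak sentence-final punctuation.
    if text and text[-1] not in FINAL:
        text = text.rstrip(PUNCT) + "。"
    return text


print(restore_punctuation("今晡日，天氣當好！", "今晡日<sil>天時當好"))
```

In this example the cleaned text replaces one character (氣 with 時), yet the comma and the final ！ are still recovered because the surrounding units align.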
Training Config
This release was trained with:
- cleaned cln restored manifests
- instruct_ratio = 1.0
- only_instruct_ratio = 0.3
- use_pinyin_ratio = 0.3
- total training steps: 16000
Evaluation
Evaluation uses a custom Taiwanese Hakka test list built from the test split of:
- formospeech/hakkaradio_news_clean
- Hakka_Hailu
- Hakka_Sixian
Construction rules:
- target utterances are selected with duration between 3 and 30 seconds
- reference utterances are selected with duration between 3 and 15 seconds
- each ref_audio comes from the same speaker as the target
- ref_id != id
- references are assigned to increase diversity per speaker
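One way to read these rules is sketched below. The record keys (id, speaker, duration), the inclusive duration bounds, and the round-robin strategy for spreading references across a speaker's utterances are all assumptions of this sketch. The actual test-list builder is not described in more detail here.

```python
from collections import defaultdict


def build_test_list(utterances, target_range=(3.0, 30.0), ref_range=(3.0, 15.0)):
    """Pair each eligible target with a same-speaker reference.

    utterances: list of dicts with hypothetical keys "id", "speaker", "duration".
    Returns a list of {"id": ..., "ref_id": ...} dicts.
    """
    # Collect reference candidates per speaker (3 to 15 seconds).
    refs_by_speaker = defaultdict(list)
    for utt in utterances:
        if ref_range[0] <= utt["duration"] <= ref_range[1]:
            refs_by_speaker[utt["speaker"]].append(utt)

    next_ref = defaultdict(int)  # per-speaker round-robin cursor
    pairs = []
    for target in utterances:
        if not target_range[0] <= target["duration"] <= target_range[1]:
            continue
        speaker = target["speaker"]
        pool = refs_by_speaker[speaker]
        # Try each candidate once, starting at the cursor; skip the target itself.
        for offset in range(len(pool)):
            idx = (next_ref[speaker] + offset) % len(pool)
            ref = pool[idx]
            if ref["id"] != target["id"]:
                pairs.append({"id": target["id"], "ref_id": ref["id"]})
                next_ref[speaker] = idx + 1
                break
    return pairs
```

Targets whose speaker has no other eligible utterance are dropped, since no valid reference exists for them.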
Evaluation metrics and models:
- CER: computed with the Taiwanese Hakka ASR model formospeech/whisper-large-v2-taiwanese-hakka-v1
- SIM-o: computed with a WavLM-based speaker verification model distributed through k2-fsa/TTS_eval_models
- UTMOS: computed with the UTMOS predictor distributed through k2-fsa/TTS_eval_models
Released checkpoint result:
| Metric | Score |
|---|---|
| Hakka CER (Avg of sample CERs) | 2.61% |
| Hakka CER (Weighted) | 2.29% |
| SIM-o | 0.801 |
| UTMOS | 3.65 |
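The table reports two CER variants. The sketch below shows one common way to compute both, assuming that "Weighted" means total character errors divided by total reference characters. The model card does not define the term, so this is an interpretation. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer_scores(pairs):
    """Return (average of per-sample CERs, length-weighted CER).

    pairs: list of (reference, hypothesis) strings with non-empty references.
    """
    errors = [edit_distance(ref, hyp) for ref, hyp in pairs]
    lengths = [len(ref) for ref, _ in pairs]
    avg = sum(e / n for e, n in zip(errors, lengths)) / len(pairs)
    weighted = sum(errors) / sum(lengths)
    return avg, weighted
```

Under this reading, an error in a short utterance raises the per-sample average more than the weighted score, which is consistent with the average being the higher of the two figures.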
Comparison against the current F5-TTS baseline formospeech/f5-tts-hita-finetune-v1 on the same Hakka evaluation setup:
- g2p: inference converts Hanzi to pinyin with formog2p.hakka.g2p
- gt_pinyin: inference uses the existing pinyin annotations from the manifest / test list
| Model | CER avg | CER weighted | SIM-o | UTMOS |
|---|---|---|---|---|
| OmniVoice Taiwanese Hakka | 2.61% | 2.29% | 0.801 | 3.65 |
| F5-TTS g2p | 6.74% | 6.43% | 0.794 | 3.29 |
| F5-TTS gt_pinyin | 6.20% | 5.87% | 0.795 | 3.26 |
Notes
This release is optimized for Taiwanese Hakka TTS and keeps the general OmniVoice interface from the base model.
The original OmniVoice features for Voice Design and Fine-grained Control may be weaker in this checkpoint. The finetuning data used for this release does not provide the full annotation coverage needed to preserve those capabilities. This is especially true for Voice Design, because finetuning actively uses instruct, so the Hakka-dialect prompting setup can interfere more strongly with the broader attribute-control behavior from the base model.
Evaluation numbers reported here come from the project-side Hakka benchmark setup, not from a standardized public benchmark leaderboard.