VocoLoco: OmniVoice ONNX Models
ONNX exports of k2-fsa/OmniVoice for browser-based text-to-speech inference via ONNX Runtime Web.
Models
| File | Size | Description |
|---|---|---|
| `omnivoice-main-split.onnx` + `_data_00`-`_04` | 2.3 GB | Main TTS model (FP32, sharded) |
| `omnivoice-main-int8.onnx` | 586 MB | Main TTS model (INT8 quantized, for mobile/low-memory) |
| `omnivoice-decoder.onnx` | 83 MB | Audio token decoder (tokens to waveform) |
| `omnivoice-encoder-fixed.onnx` | 624 MB | Audio encoder for voice cloning |
| `tokenizer.json` | 11 MB | Qwen2 BPE text tokenizer |
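
To get a feel for how much a given setup has to download, the sizes in the table can be added up. The sketch below is illustrative only: `downloadSizeMB` and `Setup` are hypothetical names, 2.3 GB is treated as 2300 MB, and it assumes the decoder and tokenizer are always needed while the encoder is fetched only when voice cloning is used.

```typescript
// Sizes in MB, copied from the table above.
// Assumption: 2.3 GB is treated as 2300 MB.
const SIZES_MB = {
  mainFp32: 2300,
  mainInt8: 586,
  decoder: 83,
  encoder: 624,
  tokenizer: 11,
} as const;

// Hypothetical description of a deployment choice.
interface Setup {
  precision: "fp32" | "int8";
  voiceCloning: boolean;
}

// Total download in MB for a setup.
// Assumption: the encoder is only needed when voice cloning is enabled.
function downloadSizeMB(setup: Setup): number {
  const main =
    setup.precision === "fp32" ? SIZES_MB.mainFp32 : SIZES_MB.mainInt8;
  const encoder = setup.voiceCloning ? SIZES_MB.encoder : 0;
  return main + SIZES_MB.decoder + SIZES_MB.tokenizer + encoder;
}
```

Under these assumptions, `downloadSizeMB({ precision: "int8", voiceCloning: false })` gives 680 MB, and the same setup with voice cloning gives 1304 MB.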
Usage
These models are designed to run in the browser via VocoLoco, a fully client-side TTS application. No server required.
Architecture
- Backbone: Qwen3-0.6B (28 transformer layers)
- Audio codec: HiggsAudioV2 (8 codebooks, 24 kHz output)
- Generation: Iterative masked diffusion (configurable 8-32 steps)
- Voice cloning: Zero-shot via reference audio encoding
- Voice design: Text-based control (gender, pitch, accent)
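
To illustrate what a configurable step count means for iterative masked generation, here is a minimal sketch of an unmasking schedule. It is not taken from OmniVoice: the cosine schedule is the one popularized by MaskGIT-style decoders and is an assumption here, and `clampSteps` and `unmaskSchedule` are hypothetical names. Only the 8-32 step range comes from the list above.

```typescript
// Hypothetical helper: keep the step count inside the 8-32 range listed above.
function clampSteps(steps: number): number {
  return Math.min(32, Math.max(8, Math.round(steps)));
}

// Returns how many token positions are unmasked at each step.
// Assumption: a cosine schedule (MaskGIT-style); the schedule OmniVoice
// actually uses is not stated in this document.
function unmaskSchedule(totalTokens: number, steps: number): number[] {
  const T = clampSteps(steps);
  const counts: number[] = [];
  let masked = totalTokens; // everything starts masked
  for (let t = 1; t <= T; t++) {
    // Number of positions that should still be masked after step t.
    const remaining =
      t === T
        ? 0
        : Math.floor(totalTokens * Math.cos((Math.PI / 2) * (t / T)));
    const next = Math.min(masked, remaining);
    counts.push(masked - next);
    masked = next;
  }
  return counts;
}
```

With this schedule, few positions are committed in the early steps and more in the later ones, and the per-step counts always sum to `totalTokens`. More steps mean more passes through the backbone, so the step count trades speed against quality.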
License
Apache 2.0, the same license as the original OmniVoice model.
Attribution
Based on OmniVoice by Xiaomi Corp (k2-fsa).