VocoLoco – OmniVoice ONNX Models

ONNX exports of k2-fsa/OmniVoice for browser-based text-to-speech inference via ONNX Runtime Web.

Models

| File | Size | Description |
|---|---|---|
| omnivoice-main-split.onnx + _data_00-_04 | 2.3 GB | Main TTS model (FP32, sharded) |
| omnivoice-main-int8.onnx | 586 MB | Main TTS model (INT8 quantized, for mobile/low-memory) |
| omnivoice-decoder.onnx | 83 MB | Audio token decoder (tokens to waveform) |
| omnivoice-encoder-fixed.onnx | 624 MB | Audio encoder for voice cloning |
| tokenizer.json | 11 MB | Qwen2 BPE text tokenizer |

Usage

These models are designed to run in the browser via VocoLoco, a fully client-side TTS application. No server required.
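As a rough sketch of what browser-side loading involves, the sharded FP32 model could be opened with ONNX Runtime Web roughly as follows. This is illustrative, not VocoLoco's actual code: the `shardNames` and `loadOmniVoice` helpers are hypothetical, and the `externalData` handling assumes an ONNX Runtime Web version that supports external-data files in `InferenceSession.create`.

```javascript
// Illustrative sketch: load the sharded main model in the browser with
// onnxruntime-web. Everything except the model file names is hypothetical.

// Build the external-data shard file names for the sharded FP32 export,
// e.g. omnivoice-main-split.onnx_data_00 ... omnivoice-main-split.onnx_data_04.
function shardNames(count) {
  const names = [];
  for (let i = 0; i < count; i++) {
    names.push(`omnivoice-main-split.onnx_data_${String(i).padStart(2, "0")}`);
  }
  return names;
}

// Hypothetical loader: fetch the shards, then create an inference session
// on the WASM backend, pointing ONNX Runtime Web at the external data.
async function loadOmniVoice(baseUrl) {
  const ort = await import("onnxruntime-web"); // dynamic import keeps the sketch self-contained
  const externalData = await Promise.all(
    shardNames(5).map(async (name) => ({
      path: name,
      data: new Uint8Array(await (await fetch(`${baseUrl}/${name}`)).arrayBuffer()),
    }))
  );
  return ort.InferenceSession.create(`${baseUrl}/omnivoice-main-split.onnx`, {
    executionProviders: ["wasm"],
    externalData,
  });
}
```

A real application would also pick between the FP32 and INT8 variants based on device memory, and cache the fetched shards (e.g. in the Cache Storage API) to avoid re-downloading 2.3 GB per visit.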

Architecture

  • Backbone: Qwen3-0.6B (28 transformer layers)
  • Audio codec: HiggsAudioV2 (8 codebooks, 24 kHz output)
  • Generation: Iterative masked diffusion (configurable 8-32 steps)
  • Voice cloning: Zero-shot via reference audio encoding
  • Voice design: Text-based control (gender, pitch, accent)
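To show the shape of the iterative masked-diffusion generation loop, here is a minimal, hypothetical sketch: every audio-token position starts masked, and at each step the most confident model predictions are committed until all positions are filled. The `predict` stub, the linear unmasking schedule, and all names are illustrative assumptions, not taken from the OmniVoice implementation.

```javascript
const MASK = -1; // sentinel for a not-yet-generated audio token

// Illustrative masked-diffusion decoding loop. `predict(tokens)` stands in
// for a model call: it returns, per position, a candidate token and a
// confidence score. Masked positions are committed in order of confidence.
function maskedDiffusionDecode(predict, length, steps) {
  const tokens = new Array(length).fill(MASK);
  for (let step = 0; step < steps; step++) {
    const preds = predict(tokens);
    const masked = [];
    for (let i = 0; i < length; i++) {
      if (tokens[i] === MASK) masked.push({ i, ...preds[i] });
    }
    if (masked.length === 0) break;
    // Unmask the most confident fraction so all positions fill within `steps` steps.
    masked.sort((a, b) => b.confidence - a.confidence);
    const toCommit = Math.ceil(masked.length / (steps - step));
    for (const { i, token } of masked.slice(0, toCommit)) {
      tokens[i] = token;
    }
  }
  return tokens;
}
```

Real samplers use learned confidences and non-linear (e.g. cosine-style) unmasking schedules; this linear schedule only illustrates why more steps trade latency for quality in the configurable 8-32 range.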

License

Apache 2.0, the same license as the original OmniVoice model.

Attribution

Based on OmniVoice by Xiaomi Corp (k2-fsa).

Model tree for Gigsu/vocoloco-onnx

  • Base model: Qwen/Qwen3-0.6B
  • Finetuned: k2-fsa/OmniVoice
  • Quantized: this model