VibeVoice-Realtime-0.5B → ONNX FP16
Microsoft's VibeVoice-Realtime-0.5B exported to ONNX format in FP16 precision with KV-cache support.
Zero PyTorch dependency at inference time. Only requires: onnxruntime, numpy, soundfile, tokenizers.
Architecture
5 ONNX models forming the VibeVoice TTS pipeline:
| Model | Size | Description |
|---|---|---|
| `text_lm_kv.onnx` | 374 MB | 4-layer Qwen2 text encoder with KV-cache |
| `tts_lm_kv.onnx` | 572 MB | 20-layer Qwen2 TTS LM with KV-cache + EOS classifier |
| `diffusion_head.onnx` | 81 MB | Latent denoiser (5-step DPM-Solver++) |
| `vocoder.onnx` | 656 MB | Acoustic decoder (latents → 24 kHz audio) |
| `acoustic_connector.onnx` | 1.7 MB | Speech feedback projection |
Plus speaker voice presets (.npz files) and the inference script.
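The voice presets are plain NumPy `.npz` archives, so they can be inspected without any model code. A minimal sketch, assuming a hypothetical preset file and array name (`voice_embedding` and the (1, 64) shape are illustrative assumptions, not the actual preset schema — check a real file's `.files` attribute):

```python
import numpy as np

# Create a stand-in preset for illustration; real presets ship with the repo.
# The array name and shape here are assumptions, not the actual schema.
np.savez("Carter.npz", voice_embedding=np.zeros((1, 64), dtype=np.float16))

preset = np.load("Carter.npz")
print(preset.files)          # names of the arrays stored in the preset
emb = preset["voice_embedding"]
print(emb.dtype, emb.shape)  # FP16, as converted from the original BF16
```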
Usage
```
pip install onnxruntime numpy soundfile tokenizers

python vibevoice_full_onnx.py --text "Hello, this is a test." --speaker Carter
```
Options
```
python vibevoice_full_onnx.py \
  --text "Your text here" \
  --speaker Carter \
  --output output.wav \
  --cfg_scale 1.5
```
Available speakers: Carter, Frank, Emma, Grace, Davis, Mike, Wayne, and multilingual (de-Spk0, fr-Spk1, sp-Spk0, etc.)
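`--cfg_scale` controls classifier-free guidance: each denoising step blends an unconditional and a text-conditional prediction. A generic CFG combine, sketched with dummy arrays standing in for the diffusion head's two outputs (the real conditioning path lives inside the inference script):

```python
import numpy as np

def cfg_combine(uncond, cond, cfg_scale=1.5):
    """Classifier-free guidance: push the prediction past the
    unconditional output, in the conditional direction, by cfg_scale."""
    return uncond + cfg_scale * (cond - uncond)

# Dummy denoiser outputs for one 64-dim speech latent.
uncond = np.zeros((1, 64), dtype=np.float16)
cond = np.ones((1, 64), dtype=np.float16)

guided = cfg_combine(uncond, cond, cfg_scale=1.5)
print(guided[0, :4])  # scale > 1 overshoots the conditional prediction
```

A scale of 1.0 reproduces the conditional prediction exactly; larger values trade naturalness for stronger adherence to the text.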
Pipeline Flow
```
Text → [text_lm_kv] → hidden states
              ↓
[tts_lm_kv] → [acoustic_connector] → speech latent
              ↓
[diffusion_head] × 5 steps with CFG
              ↓
speech latent (64-dim)
              ↓
[vocoder] → audio (24 kHz)
```
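The stages above can be sketched as pure-NumPy stubs to show how tensors flow between the five models. Everything except the 64-dim latent and the 24 kHz rate (both stated above) is an illustrative assumption — hidden size, frame hop, and the placeholder denoising update are not the real model's values:

```python
import numpy as np

HIDDEN, LATENT_DIM, SR = 896, 64, 24000  # hidden size is an assumption

def text_lm(token_ids):            # text_lm_kv.onnx stand-in
    return np.zeros((1, len(token_ids), HIDDEN), np.float16)

def tts_lm(hidden):                # tts_lm_kv.onnx stand-in
    return np.zeros((1, HIDDEN), np.float16)

def diffusion_head(cond, steps=5): # 5-step DPM-Solver++ stand-in
    latent = np.random.randn(1, LATENT_DIM).astype(np.float16)
    for _ in range(steps):
        latent = latent * np.float16(0.5)  # placeholder denoising update
    return latent

def vocoder(latents):              # vocoder.onnx stand-in: latents -> waveform
    n_frames = latents.shape[0]
    return np.zeros(n_frames * 3200, np.float32)  # frame hop is an assumption

hidden = text_lm([1, 2, 3])
latent = diffusion_head(tts_lm(hidden))
audio = vocoder(latent)
print(latent.shape, audio.shape)  # one 64-dim latent, one frame of audio
```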
Export Details
- Exported from `microsoft/VibeVoice-Realtime-0.5B` in FP16
- Voice presets converted from BF16 → FP16
- DPM-Solver++ scheduler implemented in pure NumPy
- KV-cache passed as explicit inputs/outputs for stateful generation
- ONNX opset 18
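With the KV-cache exposed as explicit inputs/outputs, the generation loop feeds each step's returned cache back in as the next step's past cache. A toy stand-in for an ONNX Runtime session call illustrating the pattern (the layer/head/dim sizes and the past/present naming are assumptions for illustration, not the export's actual I/O schema):

```python
import numpy as np

N_LAYERS, N_HEADS, HEAD_DIM = 4, 2, 8  # toy sizes, not the real model's

def run_step(token_emb, past_kv):
    """Stand-in for session.run(): consume one token plus the past cache,
    return fake logits and the cache extended by one position."""
    new_kv = np.broadcast_to(token_emb, (N_LAYERS, 2, N_HEADS, 1, HEAD_DIM))
    present_kv = np.concatenate([past_kv, new_kv], axis=3)  # grow seq axis
    logits = present_kv.sum(axis=(0, 1, 2, 4))  # fake logits, one per position
    return logits, present_kv

# Empty cache: zero positions along the sequence axis.
kv = np.zeros((N_LAYERS, 2, N_HEADS, 0, HEAD_DIM), np.float16)
for step in range(3):
    emb = np.full((1, 1, 1, 1, HEAD_DIM), step + 1, np.float16)
    logits, kv = run_step(emb, kv)

print(kv.shape)  # sequence axis grew to 3: one cached position per step
```

Keeping the cache outside the graph like this is what makes the exported models stateless and lets the caller own the generation loop.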
License
MIT (same as original VibeVoice model)