VibeVoice-Realtime-0.5B — ONNX FP16

Microsoft's VibeVoice-Realtime-0.5B exported to ONNX format in FP16 precision with KV-cache support.

No PyTorch dependency at inference time; the only requirements are onnxruntime, numpy, soundfile, and tokenizers.

Architecture

Five ONNX models form the VibeVoice TTS pipeline:

Model                    Size     Description
text_lm_kv.onnx          374 MB   4-layer Qwen2 text encoder with KV-cache
tts_lm_kv.onnx           572 MB   20-layer Qwen2 TTS LM with KV-cache + EOS classifier
diffusion_head.onnx      81 MB    Latent denoiser (5-step DPM-Solver++)
vocoder.onnx             656 MB   Acoustic decoder (latents → 24 kHz audio)
acoustic_connector.onnx  1.7 MB   Speech feedback projection

Plus speaker voice presets (.npz files) and the inference script.
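Assuming all five files sit in one directory, creating the onnxruntime sessions is straightforward. This is a sketch, not the repo's actual loader: the function name, dict layout, and CPU provider choice are mine; the file names come from the table above.

```python
import os

# The five ONNX graphs listed in the table above.
MODEL_FILES = [
    "text_lm_kv.onnx",
    "tts_lm_kv.onnx",
    "diffusion_head.onnx",
    "vocoder.onnx",
    "acoustic_connector.onnx",
]

def load_sessions(model_dir):
    """Create one onnxruntime InferenceSession per graph, keyed by model name.
    Import is deferred so this module stays importable without onnxruntime."""
    import onnxruntime as ort
    return {
        os.path.splitext(name)[0]: ort.InferenceSession(
            os.path.join(model_dir, name),
            providers=["CPUExecutionProvider"],
        )
        for name in MODEL_FILES
    }
```

Swap in `CUDAExecutionProvider` if a GPU build of onnxruntime is installed.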

Usage

pip install onnxruntime numpy soundfile tokenizers

python vibevoice_full_onnx.py --text "Hello, this is a test." --speaker Carter

Options

python vibevoice_full_onnx.py \
  --text "Your text here" \
  --speaker Carter \
  --output output.wav \
  --cfg_scale 1.5

Available speakers: Carter, Frank, Emma, Grace, Davis, Mike, Wayne, and multilingual (de-Spk0, fr-Spk1, sp-Spk0, etc.)
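Each speaker preset is a .npz archive of NumPy arrays. A quick way to see what a preset contains is to list its arrays; the demo below builds a synthetic file because the real key name (`speaker_latent`) and shape are assumptions, not the repo's documented layout.

```python
import os
import tempfile

import numpy as np

def inspect_preset(path):
    """Return {array_name: (shape, dtype)} for a voice-preset .npz file."""
    preset = np.load(path)
    return {k: (preset[k].shape, str(preset[k].dtype)) for k in preset.files}

# Demo with a synthetic preset; the key name "speaker_latent" and the
# (1, 64) float16 shape are illustrative assumptions.
with tempfile.TemporaryDirectory() as d:
    demo = os.path.join(d, "Carter.npz")
    np.savez(demo, speaker_latent=np.zeros((1, 64), dtype=np.float16))
    print(inspect_preset(demo))
```

Running `inspect_preset` on one of the shipped presets (e.g. Carter.npz) shows the actual arrays the pipeline feeds into the acoustic connector.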

Pipeline Flow

Text → [text_lm_kv] → hidden states
                          ↓
                    [tts_lm_kv] ← [acoustic_connector] ← speech latent
                          ↓
                    [diffusion_head] × 5 steps with CFG
                          ↓
                    speech latent (64-dim)
                          ↓
                    [vocoder] → audio (24 kHz)
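The "× 5 steps with CFG" stage runs the diffusion head twice per step, once with the text condition and once without, and blends the two predictions. The blend is the standard classifier-free guidance formula; this NumPy sketch uses my own variable names, and `--cfg_scale 1.5` corresponds to `cfg_scale=1.5` here.

```python
import numpy as np

def apply_cfg(pred_cond, pred_uncond, cfg_scale):
    """Classifier-free guidance: push the conditional prediction away from
    the unconditional one by a factor of (cfg_scale - 1)."""
    return pred_uncond + cfg_scale * (pred_cond - pred_uncond)

rng = np.random.default_rng(0)
cond = rng.standard_normal((1, 64)).astype(np.float32)    # text-conditioned output
uncond = rng.standard_normal((1, 64)).astype(np.float32)  # unconditional output
guided = apply_cfg(cond, uncond, 1.5)
```

With `cfg_scale=1.0` the formula reduces to the conditional prediction alone; values above 1.0 strengthen adherence to the text.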

Export Details

  • Exported from microsoft/VibeVoice-Realtime-0.5B in FP16
  • Voice presets converted from BF16 → FP16
  • DPM-Solver++ scheduler implemented in pure NumPy
  • KV-cache passed as explicit inputs/outputs for stateful generation
  • ONNX opset 18
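Because the KV-cache is carried as explicit graph inputs and outputs, the caller owns the cache between steps; the bookkeeping amounts to concatenating each step's new key/value tensors onto the running cache along the sequence axis. A minimal NumPy sketch with illustrative shapes (the real head counts and dims differ):

```python
import numpy as np

def extend_kv(past_kv, new_kv, seq_axis=2):
    """Append this step's key/value tensor to the running cache.
    Layout here is illustrative: (batch, heads, seq, head_dim)."""
    if past_kv is None:  # first step: no cache yet
        return new_kv
    return np.concatenate([past_kv, new_kv], axis=seq_axis)

cache = None
for step in range(3):  # pretend each decode step emits one new position
    new = np.zeros((1, 2, 1, 4), dtype=np.float16)
    cache = extend_kv(cache, new)
print(cache.shape)  # (1, 2, 3, 4)
```

In the real pipeline the cache arrays are fed back into text_lm_kv / tts_lm_kv as inputs on the next call, which is what makes stateful generation possible without any framework-side session state.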

License

MIT (same as original VibeVoice model)
