VibeVoice-Realtime-0.5B → ONNX FP16
Microsoft's VibeVoice-Realtime-0.5B exported to ONNX format in FP16 precision with KV-cache support.
Zero PyTorch dependency at inference time. Only requires: onnxruntime, numpy, soundfile, tokenizers.
Architecture
5 ONNX models forming the VibeVoice TTS pipeline:
| Model | Size | Description |
|---|---|---|
| `text_lm_kv.onnx` | 374 MB | 4-layer Qwen2 text encoder with KV-cache |
| `tts_lm_kv.onnx` | 572 MB | 20-layer Qwen2 TTS LM with KV-cache + EOS classifier |
| `diffusion_head.onnx` | 81 MB | Latent denoiser (5-step DPM-Solver++) |
| `vocoder.onnx` | 656 MB | Acoustic decoder (latents → 24 kHz audio) |
| `acoustic_connector.onnx` | 1.7 MB | Speech feedback projection |
Plus speaker voice presets (.npz files) and the inference script.
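The voice presets are plain NumPy `.npz` archives, so they can be inspected without any model code. A minimal sketch, assuming a hypothetical preset file and array name (`voice_embedding` and the (1, 64) shape are illustrative assumptions, not the actual preset schema — check a real file's `.files` attribute):

```python
import numpy as np

# Create a stand-in preset for illustration; real presets ship with the repo.
# The array name and shape here are assumptions, not the actual schema.
np.savez("Carter.npz", voice_embedding=np.zeros((1, 64), dtype=np.float16))

preset = np.load("Carter.npz")
print(preset.files)          # names of the arrays stored in the preset
emb = preset["voice_embedding"]
print(emb.dtype, emb.shape)  # FP16, as converted from the original BF16
```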
Usage
```
pip install onnxruntime numpy soundfile tokenizers

python vibevoice_full_onnx.py --text "Hello, this is a test." --speaker Carter
```
Options
```
python vibevoice_full_onnx.py \
  --text "Your text here" \
  --speaker Carter \
  --output output.wav \
  --cfg_scale 1.5
```
Available speakers: Carter, Frank, Emma, Grace, Davis, Mike, Wayne, and multilingual (de-Spk0, fr-Spk1, sp-Spk0, etc.)
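`--cfg_scale` controls classifier-free guidance: each denoising step blends an unconditional and a text-conditional prediction. A generic CFG combine, sketched with dummy arrays standing in for the diffusion head's two outputs (the real conditioning path lives inside the inference script):

```python
import numpy as np

def cfg_combine(uncond, cond, cfg_scale=1.5):
    """Classifier-free guidance: push the prediction past the
    unconditional output, in the conditional direction, by cfg_scale."""
    return uncond + cfg_scale * (cond - uncond)

# Dummy denoiser outputs for one 64-dim speech latent.
uncond = np.zeros((1, 64), dtype=np.float16)
cond = np.ones((1, 64), dtype=np.float16)

guided = cfg_combine(uncond, cond, cfg_scale=1.5)
print(guided[0, :4])  # scale > 1 overshoots the conditional prediction
```

A scale of 1.0 reproduces the conditional prediction exactly; larger values trade naturalness for stronger adherence to the text.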
Pipeline Flow
```
Text → [text_lm_kv] → hidden states
              ↓
[tts_lm_kv] → [acoustic_connector] → speech latent
              ↓
[diffusion_head] × 5 steps with CFG
              ↓
speech latent (64-dim)
              ↓
[vocoder] → audio (24 kHz)
```
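The stages above can be sketched as pure-NumPy stubs to show how tensors flow between the five models. Everything except the 64-dim latent and the 24 kHz rate (both stated above) is an illustrative assumption — hidden size, frame hop, and the placeholder denoising update are not the real model's values:

```python
import numpy as np

HIDDEN, LATENT_DIM, SR = 896, 64, 24000  # hidden size is an assumption

def text_lm(token_ids):            # text_lm_kv.onnx stand-in
    return np.zeros((1, len(token_ids), HIDDEN), np.float16)

def tts_lm(hidden):                # tts_lm_kv.onnx stand-in
    return np.zeros((1, HIDDEN), np.float16)

def diffusion_head(cond, steps=5): # 5-step DPM-Solver++ stand-in
    latent = np.random.randn(1, LATENT_DIM).astype(np.float16)
    for _ in range(steps):
        latent = latent * np.float16(0.5)  # placeholder denoising update
    return latent

def vocoder(latents):              # vocoder.onnx stand-in: latents -> waveform
    n_frames = latents.shape[0]
    return np.zeros(n_frames * 3200, np.float32)  # frame hop is an assumption

hidden = text_lm([1, 2, 3])
latent = diffusion_head(tts_lm(hidden))
audio = vocoder(latent)
print(latent.shape, audio.shape)  # one 64-dim latent, one frame of audio
```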
Export Details
- Exported from `microsoft/VibeVoice-Realtime-0.5B` in FP16
- Voice presets converted from BF16 → FP16
- DPM-Solver++ scheduler implemented in pure NumPy
- KV-cache passed as explicit inputs/outputs for stateful generation
- ONNX opset 18
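With the KV-cache exposed as explicit inputs/outputs, the generation loop feeds each step's returned cache back in as the next step's past cache. A toy stand-in for an ONNX Runtime session call illustrating the pattern (the layer/head/dim sizes and the past/present naming are assumptions for illustration, not the export's actual I/O schema):

```python
import numpy as np

N_LAYERS, N_HEADS, HEAD_DIM = 4, 2, 8  # toy sizes, not the real model's

def run_step(token_emb, past_kv):
    """Stand-in for session.run(): consume one token plus the past cache,
    return fake logits and the cache extended by one position."""
    new_kv = np.broadcast_to(token_emb, (N_LAYERS, 2, N_HEADS, 1, HEAD_DIM))
    present_kv = np.concatenate([past_kv, new_kv], axis=3)  # grow seq axis
    logits = present_kv.sum(axis=(0, 1, 2, 4))  # fake logits, one per position
    return logits, present_kv

# Empty cache: zero positions along the sequence axis.
kv = np.zeros((N_LAYERS, 2, N_HEADS, 0, HEAD_DIM), np.float16)
for step in range(3):
    emb = np.full((1, 1, 1, 1, HEAD_DIM), step + 1, np.float16)
    logits, kv = run_step(emb, kv)

print(kv.shape)  # sequence axis grew to 3: one cached position per step
```

Keeping the cache outside the graph like this is what makes the exported models stateless and lets the caller own the generation loop.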
License
MIT (same as original VibeVoice model)