VibeVoice - MLX

VibeVoice Large converted and quantized for native MLX inference on Apple Silicon. Hybrid LLM + diffusion architecture for long-form speech, multi-speaker dialogue, and voice cloning.

Variants

Path        Precision
mlx-int8/   int8 quantized weights

How to Get Started

Single speaker:

python scripts/generate/vibevoice.py \
  --text "Hello from VibeVoice." \
  --output outputs/vibevoice.wav

Multi-speaker dialogue (speaker labels are 0-based):

python scripts/generate/vibevoice.py \
  --text "Speaker 0: Have you tried VibeVoice?
Speaker 1: Not yet. Does it need PyTorch?
Speaker 0: No. Pure MLX, runs locally on Apple Silicon.
Speaker 1: That is impressive." \
  --output outputs/dialogue.wav
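
For longer dialogues it can be easier to assemble the --text value programmatically. The following is a minimal sketch, assuming the script expects exactly the "Speaker N: ..." line format shown above. The names format_dialogue and MAX_SPEAKERS are hypothetical helpers for illustration and are not part of mlx-speech.

```python
MAX_SPEAKERS = 4  # the model card states that up to 4 speakers are supported


def format_dialogue(turns):
    """Join (speaker_index, utterance) pairs into 'Speaker N: text' lines.

    Hypothetical helper: it only mirrors the text format shown in the
    multi-speaker example and does not call the model.
    """
    lines = []
    for speaker, utterance in turns:
        # Speaker labels are 0-based, so valid indices are 0..MAX_SPEAKERS - 1.
        if not 0 <= speaker < MAX_SPEAKERS:
            raise ValueError(
                f"speaker index must be 0..{MAX_SPEAKERS - 1}, got {speaker}"
            )
        lines.append(f"Speaker {speaker}: {utterance.strip()}")
    return "\n".join(lines)


script = format_dialogue([
    (0, "Have you tried VibeVoice?"),
    (1, "Not yet. Does it need PyTorch?"),
])
print(script)
```

The resulting string can be passed as the --text argument in place of the hand-written multi-line value.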

Voice cloning (one reference WAV per speaker):

python scripts/generate/vibevoice.py \
  --text "Speaker 0: This is cloned from the reference." \
  --reference-audio-speaker0 ref_speaker0.wav \
  --output outputs/clone.wav

Up to 4 speakers are supported, one flag per speaker: --reference-audio-speaker0 through --reference-audio-speaker3.
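
When several reference files are involved, the command line can be built from a mapping of speaker index to WAV path. This is a sketch under stated assumptions: it uses only the flags documented above (--text, --output, --reference-audio-speakerN), and build_command is a hypothetical helper, not something shipped with the repository.

```python
import shlex


def build_command(text, output, references=None):
    """Return an argv list for scripts/generate/vibevoice.py.

    `references` maps a 0-based speaker index (0..3) to a reference WAV path.
    Hypothetical helper: it only assembles the flags documented in this card.
    """
    cmd = [
        "python", "scripts/generate/vibevoice.py",
        "--text", text,
        "--output", output,
    ]
    for speaker in sorted(references or {}):
        if speaker not in range(4):
            raise ValueError(f"speaker index must be 0..3, got {speaker!r}")
        cmd += [f"--reference-audio-speaker{speaker}", str(references[speaker])]
    return cmd


cmd = build_command(
    "Speaker 0: This is cloned from the reference.",
    "outputs/clone.wav",
    references={0: "ref_speaker0.wav"},
)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))
```

To actually run it, the list can be handed to subprocess.run(cmd, check=True) from the repository root.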

Default generation settings (matching upstream):

  • Greedy decoding (deterministic)
  • Seed: 42
  • Diffusion steps: 20

Add --no-greedy to enable temperature + top-p sampling.
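
To illustrate the difference between the two modes, here is a generic, standard-library sketch of greedy selection versus temperature + top-p (nucleus) sampling over a list of logits. It is not the script's implementation: the function names, the example logits, and the temperature and top_p values are assumptions chosen for illustration.

```python
import math
import random


def greedy_pick(logits):
    """Return the index of the largest logit (always the same for the same input)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_top_p(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample an index from the smallest set of tokens whose mass reaches top_p.

    Generic illustration only; temperature must be greater than 0.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for a numerically stable softmax
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices renormalizes the kept weights before drawing.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]
print(greedy_pick(logits))  # index of the largest logit
print(sample_top_p(logits, temperature=0.8, top_p=0.9, rng=random.Random(42)))
```

Greedy selection ignores the random seed entirely, while the sampled variant depends on it, which is why a fixed seed matters mainly once sampling is enabled.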

Model Details

VibeVoice uses a 9B-parameter hybrid architecture that combines a Qwen2 language model backbone with a continuous diffusion acoustic decoder. The weights are converted to MLX with explicit weight remapping, so PyTorch is not required at inference time.

See mlx-speech for the full runtime and conversion code.

License

Apache 2.0.
