# VibeVoice MLX
VibeVoice Large converted and quantized for native MLX inference on Apple Silicon. Hybrid LLM + diffusion architecture for long-form speech, multi-speaker dialogue, and voice cloning.
## Variants

| Path | Precision |
|---|---|
| `mlx-int8/` | int8 quantized weights |
## How to Get Started

Single speaker:

```bash
python scripts/generate/vibevoice.py \
  --text "Hello from VibeVoice." \
  --output outputs/vibevoice.wav
```
Multi-speaker dialogue (speaker labels are 0-based):

```bash
python scripts/generate/vibevoice.py \
  --text "Speaker 0: Have you tried VibeVoice?
Speaker 1: Not yet. Does it need PyTorch?
Speaker 0: No. Pure MLX, runs locally on Apple Silicon.
Speaker 1: That is impressive." \
  --output outputs/dialogue.wav
```
Voice cloning (one reference WAV per speaker):

```bash
python scripts/generate/vibevoice.py \
  --text "Speaker 0: This is cloned from the reference." \
  --reference-audio-speaker0 ref_speaker0.wav \
  --output outputs/clone.wav
```
Up to 4 speakers are supported: `--reference-audio-speaker0` through `--reference-audio-speaker3`.
Default generation settings (matching upstream):
- Greedy decoding (deterministic)
- Seed: 42
- Diffusion steps: 20
Add `--no-greedy` to enable temperature + top-p sampling.
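Greedy decoding always picks the argmax token, which is why the default output is deterministic; `--no-greedy` switches to temperature-scaled top-p (nucleus) sampling. A generic sketch of the two strategies, assuming the standard formulation (the function names and parameter values here are illustrative, not the runtime's actual API):

```python
import math
import random

def greedy(logits: list[float]) -> int:
    # Deterministic: always return the highest-scoring token.
    return max(range(len(logits)), key=lambda i: logits[i])

def top_p_sample(logits: list[float], temperature=0.9, top_p=0.95, rng=None) -> int:
    rng = rng or random.Random(42)
    # Temperature-scaled softmax over the logits.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of tokens whose cumulative probability >= top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the kept tokens and draw one.
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With a fixed seed (the default is 42), sampling is still reproducible run to run.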
## Model Details
VibeVoice uses a 9B-parameter hybrid architecture combining a Qwen2 language model backbone with a continuous diffusion acoustic decoder. Converted to MLX with explicit weight remapping; no PyTorch at inference time.
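At a high level, a hybrid pipeline of this kind has the language model produce continuous acoustic conditioning, which a diffusion decoder then refines from noise over a fixed number of steps (20 by default, per the settings above). A toy sketch of just the iterative-refinement idea, with a stand-in interpolation in place of the learned denoiser (everything here is illustrative, not the model's real decoder):

```python
import random

def toy_diffusion_decode(latent: list[float], steps: int = 20, rng=None) -> list[float]:
    """Toy stand-in for a diffusion acoustic decoder: start from noise and
    iteratively refine toward the conditioning latent over `steps` steps."""
    rng = rng or random.Random(42)
    # Start from pure Gaussian noise the same size as the conditioning latent.
    x = [rng.gauss(0.0, 1.0) for _ in latent]
    for t in range(steps):
        # A real decoder predicts noise with a learned network conditioned on
        # the LLM's hidden states; here we just interpolate toward the latent.
        alpha = (t + 1) / steps
        x = [(1 - alpha) * xi + alpha * li for xi, li in zip(x, latent)]
    return x
```

More diffusion steps trade speed for refinement; the real decoder maps the final latents to a waveform rather than returning them directly.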
See `mlx-speech` for the full runtime and conversion code.
## License
Apache 2.0.