Step-Audio-EditX: MLX 8-bit
This repository contains a self-contained pure-MLX int8 conversion of
Step-Audio-EditX for local voice cloning and expressive audio editing on
Apple Silicon. All pipeline components are stored as .safetensors; no
PyTorch, ONNX, or NumPy files are required at inference time.
Model Details
- Developed by: AppAutomaton
- Upstream model: stepfun-ai/Step-Audio-EditX
- Task: zero-shot voice cloning, expressive audio editing
- Runtime: MLX on Apple Silicon
- Precision: int8 for Step1 LM, Flow model, and VQ02 tokenizer; bf16 for the rest
- Total size: ~4.1 GB (down from ~7.7 GB upstream)
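The two totals quoted above work out to a reduction of roughly 47%. A minimal check of that arithmetic, using only the approximate figures from this card (the function name is illustrative):

```python
# Approximate totals quoted in Model Details, in GB.
UPSTREAM_GB = 7.7
MLX_INT8_GB = 4.1


def size_reduction(before_gb: float, after_gb: float) -> float:
    """Return the fraction of the original size that was saved."""
    return 1.0 - after_gb / before_gb


if __name__ == "__main__":
    saved = size_reduction(UPSTREAM_GB, MLX_INT8_GB)
    print(f"Bundle is about {saved:.0%} smaller than upstream")
```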
Bundle Contents
This bundle is self-contained: all weights are packaged in one repository.
| File | Component | Format | Size |
|---|---|---|---|
| model.safetensors | Step1 LM (3.5B params) | int8 | 3.5 GB |
| flow-model.safetensors | Flow model (DiT + conformer) | int8 | 181 MB |
| vq02.safetensors | VQ02 audio tokenizer | int8 | 162 MB |
| vq06.safetensors | VQ06 audio tokenizer | bf16 | 249 MB |
| hift.safetensors | HiFT vocoder | bf16 | 40 MB |
| campplus.safetensors | CampPlus speaker embedding | bf16 | 13 MB |
| flow-conditioner.safetensors | Flow conditioner | bf16 | 2.5 MB |
| config.json | Step1 LM config + quantization | JSON | - |
| flow-model-config.json | Flow model config | JSON | - |
| vq02-config.json, vq06-config.json | Tokenizer configs | JSON | - |
| hift-config.json, campplus-config.json, flow-conditioner-config.json | Component configs | JSON | - |
| tokenizer.json, tokenizer.model, tokenizer_config.json | Step1 tokenizer | JSON | - |
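After downloading, it can be useful to confirm that every file listed under Bundle Contents is actually present. The sketch below is not part of the repository's tooling; `missing_files` is a hypothetical helper, and the directory in the example is the `--local-dir` used in the download command in this card.

```python
from pathlib import Path

# File names copied from the Bundle Contents listing in this card.
EXPECTED_FILES = [
    "model.safetensors",
    "flow-model.safetensors",
    "vq02.safetensors",
    "vq06.safetensors",
    "hift.safetensors",
    "campplus.safetensors",
    "flow-conditioner.safetensors",
    "config.json",
    "flow-model-config.json",
    "vq02-config.json",
    "vq06-config.json",
    "hift-config.json",
    "campplus-config.json",
    "flow-conditioner-config.json",
    "tokenizer.json",
    "tokenizer.model",
    "tokenizer_config.json",
]


def missing_files(bundle_dir: str) -> list[str]:
    """Return the expected file names that are not present in bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("models/stepfun/step_audio_editx/mlx-int8")
    print("bundle complete" if not missing else f"missing: {missing}")
```

The check only looks at file names; it does not validate sizes or contents.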
How to Get Started
Download the bundle:
hf download appautomaton/step-audio-editx-8bit-mlx \
  --local-dir models/stepfun/step_audio_editx/mlx-int8
Voice cloning:
python scripts/generate/step_audio_editx.py \
  --prompt-audio reference.wav \
  --prompt-text "Transcript of reference audio." \
  -o cloned.wav \
  clone --target-text "New speech in the cloned voice."
Audio editing (change emotion):
python scripts/generate/step_audio_editx.py \
  --prompt-audio input.wav \
  --prompt-text "Transcript of input audio." \
  -o happy.wav \
  edit --edit-type emotion --edit-info happy
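When driving the script from Python (for example, to batch several clips), the two commands above can be assembled as argument lists. This is a sketch only: `clone_command` and `edit_command` are hypothetical helpers, and they use no flags beyond those shown in the commands above.

```python
import shlex

# Script path as used in the commands in this card.
SCRIPT = "scripts/generate/step_audio_editx.py"


def clone_command(prompt_audio, prompt_text, target_text, output):
    """Build the argv list for the voice-cloning invocation."""
    return [
        "python", SCRIPT,
        "--prompt-audio", prompt_audio,
        "--prompt-text", prompt_text,
        "-o", output,
        "clone", "--target-text", target_text,
    ]


def edit_command(prompt_audio, prompt_text, edit_type, edit_info, output):
    """Build the argv list for the audio-editing invocation."""
    return [
        "python", SCRIPT,
        "--prompt-audio", prompt_audio,
        "--prompt-text", prompt_text,
        "-o", output,
        "edit", "--edit-type", edit_type, "--edit-info", edit_info,
    ]


if __name__ == "__main__":
    cmd = edit_command("input.wav", "Transcript of input audio.",
                       "emotion", "happy", "happy.wav")
    # Print a shell-safe version of the command instead of running it.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` from the repository root would execute it; building a list rather than a string avoids shell-quoting problems with transcripts that contain quotes.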
Supported Edit Types
| Edit type | Description | --edit-info examples |
|---|---|---|
| emotion | Change the emotion of speech | happy, sad, angry, surprised |
| style | Change speaking style | whispering, broadcasting, formal |
| speed | Change speaking speed | fast, slow |
| denoise | Remove noise from audio | not used |
| vad | Remove silences from audio | not used |
| paralinguistic | Add non-verbal sounds | requires --target-text |
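The edit types differ in which extra argument they need, which is easy to get wrong when scripting. The sketch below encodes the listing above as a lookup and checks a request before the script is called. It assumes that emotion, style, and speed each require `--edit-info` (the card only lists example values for them), and `check_edit_request` is a hypothetical helper, not part of the repository.

```python
# Extra argument each edit type needs, per the Supported Edit Types listing.
# "edit_info" means --edit-info, "target_text" means --target-text, None means neither.
# Assumption: emotion/style/speed require --edit-info; the card only shows examples.
EDIT_TYPE_REQUIREMENT = {
    "emotion": "edit_info",
    "style": "edit_info",
    "speed": "edit_info",
    "denoise": None,
    "vad": None,
    "paralinguistic": "target_text",
}


def check_edit_request(edit_type, edit_info=None, target_text=None):
    """Return a list of problems; an empty list means the request looks consistent."""
    if edit_type not in EDIT_TYPE_REQUIREMENT:
        return [f"unknown edit type: {edit_type!r}"]
    need = EDIT_TYPE_REQUIREMENT[edit_type]
    problems = []
    if need == "edit_info" and not edit_info:
        problems.append(f"{edit_type} needs --edit-info")
    if need == "target_text" and not target_text:
        problems.append(f"{edit_type} needs --target-text")
    return problems


if __name__ == "__main__":
    print(check_edit_request("emotion", edit_info="happy"))
    print(check_edit_request("paralinguistic"))
```

The values passed to `--edit-info` are deliberately not validated, since the card gives examples rather than an exhaustive list.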
Architecture
Five-stage pipeline, all running pure MLX with bf16 activations:
- Step1 LM (3.5B params, int8): autoregressive dual-codebook token generation
- CampPlus (bf16): speaker embedding extraction from reference audio
- Flow conditioner (bf16): conditions generation on speaker embedding
- Flow model (int8): flow-matching mel spectrogram generation
- HiFT vocoder (bf16): mel spectrogram to waveform
The VQ02 and VQ06 tokenizers encode reference audio into dual codebook tokens consumed by Step1.
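For reference, the components and their weight precision can be summarized as plain data. This is a descriptive sketch of what the card states, not the runtime's actual classes; the `Component` type and `names_with_precision` helper are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    precision: str  # weight precision as listed in this card
    role: str


# The five pipeline stages, in the order listed under Architecture.
PIPELINE = [
    Component("Step1 LM", "int8", "autoregressive dual-codebook token generation"),
    Component("CampPlus", "bf16", "speaker embedding extraction from reference audio"),
    Component("Flow conditioner", "bf16", "conditions generation on speaker embedding"),
    Component("Flow model", "int8", "flow-matching mel spectrogram generation"),
    Component("HiFT vocoder", "bf16", "mel spectrogram to waveform"),
]

# Tokenizers that encode reference audio into the tokens consumed by Step1.
TOKENIZERS = [
    Component("VQ02 tokenizer", "int8", "reference audio to codebook tokens"),
    Component("VQ06 tokenizer", "bf16", "reference audio to codebook tokens"),
]


def names_with_precision(precision):
    """Return the names of all components stored at the given precision."""
    return [c.name for c in PIPELINE + TOKENIZERS if c.precision == precision]


if __name__ == "__main__":
    print("int8:", names_with_precision("int8"))
    print("bf16:", names_with_precision("bf16"))
```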
Performance
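On Apple Silicon with int8 weights and bf16 activations, the real-time factor (RTF) is approximately 1.46x for voice cloning, which is faster than real time. The figure is evidently a speed multiple (audio duration relative to generation time), so values above 1x mean generation outpaces playback.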
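To reproduce a comparable number on your own hardware, one approach is to divide the duration of the generated audio by the wall-clock time spent generating it. The sketch assumes this is the ratio the card reports (some sources define RTF as the inverse), and the numbers in the example are made up for illustration, not measurements.

```python
def speed_multiple(audio_seconds: float, wall_seconds: float) -> float:
    """Audio duration divided by generation time; above 1.0 is faster than real time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


if __name__ == "__main__":
    # Hypothetical run: 14.6 s of audio generated in 10.0 s of wall-clock time.
    ratio = speed_multiple(14.6, 10.0)
    print(f"{ratio:.2f}x real time")
```

In practice, `wall_seconds` would come from `time.perf_counter()` around the generation call, and `audio_seconds` from the output sample count divided by the sample rate.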
Links
- Source code: mlx-speech
- Upstream model: stepfun-ai/Step-Audio-EditX
- Technical report: arXiv:2511.03601
- More examples: AppAutomaton
License
Apache 2.0, following the upstream license published with
stepfun-ai/Step-Audio-EditX.