# VoxCPM 1.5 → CoreML
CoreML conversion of VoxCPM 1.5 (800M params, 44.1kHz, bilingual EN/ZH text-to-speech with voice cloning).
Available in two precisions:
| Variant | Folder | Total Size | Notes |
|---|---|---|---|
| FP16 | `fp16/` | 1.6 GB | Full precision, highest quality |
| INT8 | `int8/` | 783 MB | Linear symmetric quantization, no measurable quality loss. Includes both `.mlpackage` and pre-compiled `.mlmodelc` |
Constants (embeddings, projections) are shared by both variants: 299 MB in `constants/`.
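The constants are plain `.npy` arrays, so they can be consumed without PyTorch. A minimal sketch of the lookup-plus-projection path (small random stand-ins replace `np.load(...)`, and the projection shapes are an assumption: `[1024, 1024]` weight plus `[1024]` bias):

```python
import numpy as np

# Sketch only: random stand-ins replace np.load("constants/...") so this
# runs without the 287 MB embedding file. Real vocab size is 73448.
rng = np.random.default_rng(0)
vocab, hidden = 100, 1024
embed_tokens = rng.standard_normal((vocab, hidden)).astype(np.float32)
enc_to_lm_w = rng.standard_normal((hidden, hidden)).astype(np.float32)
enc_to_lm_b = np.zeros(hidden, dtype=np.float32)

token_ids = np.array([1, 2, 3])
text_emb = embed_tokens[token_ids]                # [3, 1024] row lookup
feat_lm = text_emb @ enc_to_lm_w.T + enc_to_lm_b  # linear projection
print(feat_lm.shape)
```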
## Source

| | |
|---|---|
| Original model | openbmb/VoxCPM1.5 |
| Paper | arxiv.org/abs/2509.24650 |
| Code | github.com/OpenBMB/VoxCPM |
| License | Apache 2.0 |
| Converted by | FluidInference |
## File Structure

```
├── constants/                      # Shared (299 MB)
│   ├── config.json
│   ├── embed_tokens.npy            # [73448, 1024] text embeddings (287 MB)
│   ├── enc_to_lm_proj_{w,b}.npy
│   ├── lm_to_dit_proj_{w,b}.npy
│   └── res_to_dit_proj_{w,b}.npy
├── fp16/                           # Float16 models (1.6 GB)
│   ├── audio_vae_encoder.mlpackage
│   ├── audio_vae_decoder.mlpackage
│   ├── feat_encoder.mlpackage
│   ├── base_lm_step.mlpackage
│   ├── residual_lm_step.mlpackage
│   └── locdit_step.mlpackage
└── int8/                           # INT8 models (783 MB each format)
    ├── audio_vae_encoder.mlpackage
    ├── audio_vae_encoder.mlmodelc
    ├── audio_vae_decoder.mlpackage
    ├── audio_vae_decoder.mlmodelc
    ├── feat_encoder.mlpackage
    ├── feat_encoder.mlmodelc
    ├── base_lm_step.mlpackage
    ├── base_lm_step.mlmodelc
    ├── residual_lm_step.mlpackage
    ├── residual_lm_step.mlmodelc
    ├── locdit_step.mlpackage
    └── locdit_step.mlmodelc
```
## Models

Six CoreML models (`.mlpackage`) split for step-by-step autoregressive generation:
| Model | FP16 | INT8 | Purpose | Input | Output |
|---|---|---|---|---|---|
| `audio_vae_encoder` | 82 MB | 41 MB | Encode prompt audio to latents | `[1, 1, 220500]` (5 s @ 44.1 kHz) | `[1, 64, T]` |
| `audio_vae_decoder` | 82 MB | 41 MB | Decode latents to 44.1 kHz audio | `[1, 64, T]` (flexible) | `[1, 1, T*1764]` |
| `feat_encoder` | 228 MB | 114 MB | Encode latent patches to LM embeddings | `[1, 1, 4, 64]` | `[1, 1, 1024]` |
| `base_lm_step` | 696 MB | 349 MB | Single AR step (24-layer LM + FSQ + stop) | embed + pos + 48 KV caches | lm_hidden + fsq + stop + 48 caches |
| `residual_lm_step` | 236 MB | 119 MB | Single AR step (8-layer residual LM) | embed + pos + 16 KV caches | res_hidden + 16 caches |
| `locdit_step` | 237 MB | 119 MB | Single Euler diffusion step | x, mu, t, cond, dt (batch=2 for CFG) | velocity `[2, 64, 4]` |
## Architecture

VoxCPM 1.5 is a tokenizer-free, diffusion-autoregressive TTS model built on MiniCPM-4 (a 0.5B LM backbone). It generates 44.1 kHz audio at a 6.25 Hz token rate using flow-matching diffusion.
| Component | Layers | Hidden | Params |
|---|---|---|---|
| base_lm (MiniCPM4) | 24 | 1024 | ~450M |
| residual_lm | 8 | 1024 | ~80M |
| feat_encoder (LocEnc) | 8 | 1024 | ~80M |
| feat_decoder (LocDiT) | 8 | 1024 | ~80M |
| AudioVAE | enc [2,3,6,7,7] / dec [7,7,6,3,2] | 64→2048 | ~130M |

Total: ~800M parameters, 44.1 kHz output, 6.25 Hz token rate (patch_size=4)
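These numbers are self-consistent: the encoder strides multiply to the per-frame hop (which matches the decoder's `T*1764` output shape), and `patch_size=4` groups four latent frames per LM token. A quick arithmetic check:

```python
# Derive the token rate from the VAE strides and patch size listed above.
hop = 2 * 3 * 6 * 7 * 7                    # samples per latent frame = 1764
patch_size = 4                             # latent frames per LM token
sr = 44_100                                # output sample rate (Hz)

samples_per_token = hop * patch_size       # 7056 samples
token_rate = sr / samples_per_token        # 6.25 Hz
token_ms = 1000 * samples_per_token / sr   # 160.0 ms per token
print(token_rate, token_ms)
```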
## Conversion Validation

### Per-component correlation (CoreML vs PyTorch FP32)
| Model | FP16 | INT8 | Notes |
|---|---|---|---|
| audio_vae_encoder | 0.999989 | 0.999989 | Fixed 5s input, Snake activations patched |
| audio_vae_decoder | 0.999999 | 0.999999 | Flexible latent length via RangeDim |
| feat_encoder | 1.000000 | 1.000000 | 8-layer non-causal transformer |
| base_lm_step | 0.999998 | 0.999998 | 24 layers, GQA patched, scatter-based KV cache |
| residual_lm_step | 1.000000 | 1.000000 | 8 layers, same GQA/cache pattern |
| locdit_step | 0.999999 | 0.999999 | Flow matching estimator, cond_len=4 |
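A correlation like those in the table can be computed by flattening both outputs and taking the Pearson coefficient. This is a sketch of the idea, not the conversion repo's actual validation script:

```python
import numpy as np

# Pearson correlation between two flattened output tensors.
def correlation(a, b):
    a = np.ravel(a).astype(np.float64) - np.ravel(a).mean()
    b = np.ravel(b).astype(np.float64) - np.ravel(b).mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic reference plus a tiny FP16-style perturbation.
ref = np.linspace(-1.0, 1.0, 1000)
test = ref + 1e-4 * np.sin(np.arange(1000))
print(correlation(ref, test))
```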
### End-to-end verification (both variants produce identical results)
| Language | Input | ASR output | Match |
|---|---|---|---|
| English | "Hello, this is a test of the voice cloning system." | "Hello, this is a test of the voice cloning system." | Exact |
| Chinese | "δ½ ε₯½οΌθΏζ―δΈδΈͺθ―ι³ε ιη³»η»ηζ΅θ―γ" | "δ½ ε₯½θΏζ―δΈδΈͺθ―ι³ε ιη³»η»ηζ΅θ―" | Exact (minus punctuation) |
## Generation Pipeline

1. Encode prompt audio: `latent = audio_vae_encoder(pad_to_5s(prompt))`
2. Reshape into patches: `[1, 64, T]` → `[1, n_patches, 4, 64]`
3. Encode patches: `feat_emb = feat_encoder(patch)` for each patch
4. Project: `feat_lm = enc_to_lm_proj(feat_emb)`
5. Embed text: `text_emb = embed_tokens[token_ids] * scale_emb`
6. Combine: `[text_emb, audio_start_token, feat_lm]` → `[1, seq_len, 1024]`
7. Prefill: step through all tokens via `base_lm_step` + `residual_lm_step`
8. Loop (autoregressive):
   - a. `dit_hidden = lm_to_dit_proj(lm_hidden_fsq) + res_to_dit_proj(res_hidden)`
   - b. `noise = randn(1, 64, 4)`
   - c. For `t` in 10 Euler steps (1.0 → 0.001): `vel = locdit_step(noise, dit_hidden, prefix_cond, t)` (batch=2 for CFG), then `noise = noise - vel * dt`
   - d. `pred_feat = noise` (after all steps)
   - e. If `stop_head` predicts stop and `step > min_len`: break
   - f. `prefix_cond = pred_feat`
   - g. `next_emb = enc_to_lm_proj(feat_encoder(pred_feat))`
   - h. `lm_hidden, fsq, stop = base_lm_step(next_emb, pos, caches)`
   - i. `res_hidden = residual_lm_step(fsq + next_emb, pos, caches)`
9. Decode: `audio = audio_vae_decoder(concat(all pred_feats))`
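The inner Euler loop (step 8c) can be sketched in plain numpy. Here a toy linear velocity field stands in for `locdit_step` (which needs the CoreML model) and CFG batching is omitted; the 10-step schedule from 1.0 down to 0.001 follows the list above:

```python
import numpy as np

# Toy sketch of the backward Euler integration in step 8c.
rng = np.random.default_rng(0)
timesteps = np.linspace(1.0, 0.001, 11)      # 10 Euler intervals
x = rng.standard_normal((1, 64, 4)).astype(np.float32)
target = np.zeros_like(x)                    # stand-in "clean" latent

def velocity(x, t):
    # Toy straight-line flow field (v = x / t toward target), standing in
    # for the locdit_step CoreML call.
    return (x - target) / max(float(t), 1e-3)

for t, t_next in zip(timesteps[:-1], timesteps[1:]):
    dt = t - t_next
    x = x - dt * velocity(x, t)              # backward direction (1 -> 0)

# x has been scaled down to ~0.001 * initial noise, i.e. near the target
print(float(np.abs(x).max()))
```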
## Performance

Measured on Apple Silicon (macOS, `CPU_AND_GPU` compute units, INT8):
| Metric | Value |
|---|---|
| Prefill throughput | ~27 tok/s |
| Generation throughput | ~4.5 steps/s |
| Peak RAM | ~3.8 GB |
| Output sample rate | 44,100 Hz |
| Token rate | 6.25 Hz (160 ms per token) |
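Assuming one audio token per generation step, the table implies generation runs slightly slower than real time:

```python
# Back-of-envelope real-time factor (assumes one token per gen step).
steps_per_s = 4.5          # generation throughput from the table
token_s = 0.160            # seconds of audio per token (6.25 Hz)
audio_per_wall_s = steps_per_s * token_s
print(round(audio_per_wall_s, 2))  # ~0.72 s of audio per wall-clock second
```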
## Usage

### Requirements

- macOS 14+ on Apple Silicon
- Python 3.10+ with `coremltools` and `numpy`
### Quick start

```python
import coremltools as ct
import numpy as np

# Choose precision: "fp16" or "int8"
precision = "int8"

# Load models
vae_enc = ct.models.MLModel(f"{precision}/audio_vae_encoder.mlpackage")
vae_dec = ct.models.MLModel(f"{precision}/audio_vae_decoder.mlpackage")
feat_enc = ct.models.MLModel(f"{precision}/feat_encoder.mlpackage")
base_lm = ct.models.MLModel(f"{precision}/base_lm_step.mlpackage")
res_lm = ct.models.MLModel(f"{precision}/residual_lm_step.mlpackage")
locdit = ct.models.MLModel(f"{precision}/locdit_step.mlpackage")

# Load shared constants
embed_tokens = np.load("constants/embed_tokens.npy")

# ... see generate_coreml.py for full pipeline
```
### Full generation script

See `generate_coreml.py` in the conversion repo for a complete zero-PyTorch generation pipeline:

```bash
python generate_coreml.py \
    --text "Hello, this is a test." \
    --prompt prompt.wav \
    --prompt-text "This is the prompt transcript." \
    --output output.wav
```
## Conversion Details

Converted using `coremltools` with Float16 compute precision (`compute_units=CPU_AND_GPU`). The INT8 variant was post-quantized via `ct.optimize.coreml.linear_quantize_weights()` (linear symmetric).
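What linear symmetric quantization does, illustrated in plain numpy (a sketch of the math, not the `coremltools` implementation): each weight tensor gets a single scale with zero-point fixed at 0, so dequantization is just `q * scale`:

```python
import numpy as np

# Per-tensor linear symmetric INT8 quantization, sketched by hand.
def quantize_symmetric(w, bits=8):
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8
    scale = np.abs(w).max() / qmax                  # one scale, zero-point 0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_symmetric(w)
max_err = np.abs(w - q.astype(np.float32) * scale).max()
print(max_err <= scale / 2 + 1e-6)  # round-to-nearest bounds the error
```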
Key conversion challenges solved:
- GQA attention (16 query heads, 2 KV heads) → manual `repeat_interleave` expansion
- Snake activations → replaced `@torch.jit.script` with a simple module
- In-place KV cache → functional `scatter` replacement
- Euler solver direction → backward (1→0) with `x = x - dt * v`
- Chinese tokenizer → `mask_multichar_chinese_tokens` wrapper for character splitting
See `TRIALS.md` for the full conversion log.
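To illustrate the functional `scatter` replacement named above (a sketch of the technique with toy shapes, not the repo's exact code): the in-place cache write becomes a masked blend of the old cache with the new K/V row:

```python
import numpy as np

# Functional KV-cache update: build a one-hot mask over positions and
# blend, instead of writing into the cache tensor in place.
max_len, heads, dim = 8, 2, 4
cache = np.zeros((1, heads, max_len, dim), dtype=np.float32)
new_kv = np.ones((1, heads, 1, dim), dtype=np.float32)
pos = 3                                       # current decode position

mask = (np.arange(max_len) == pos).astype(np.float32)[None, None, :, None]
cache = cache * (1 - mask) + new_kv * mask    # pure functional "scatter"
print(cache[0, 0, :, 0])                      # only the row at pos is 1.0
```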