VoxCPM 1.5 - CoreML

CoreML conversion of VoxCPM 1.5 (800M params, 44.1kHz, bilingual EN/ZH text-to-speech with voice cloning).

Available in two precisions:

| Variant | Folder | Total size | Notes |
|---------|--------|------------|-------|
| FP16 | `fp16/` | 1.6 GB | Full precision, highest quality |
| INT8 | `int8/` | 783 MB | Linear symmetric quantization, no measurable quality loss; includes both `.mlpackage` and pre-compiled `.mlmodelc` |

Constants (embeddings, projections) are shared between both variants: 299 MB in constants/.

Source

  • Original model: openbmb/VoxCPM1.5
  • Paper: arxiv.org/abs/2509.24650
  • Code: github.com/OpenBMB/VoxCPM
  • License: Apache 2.0
  • Converted by: FluidInference

File Structure

├── constants/                         # Shared (299 MB)
│   ├── config.json
│   ├── embed_tokens.npy               # [73448, 1024] text embeddings (287 MB)
│   ├── enc_to_lm_proj_{w,b}.npy
│   ├── lm_to_dit_proj_{w,b}.npy
│   └── res_to_dit_proj_{w,b}.npy
├── fp16/                              # Float16 models (1.6 GB)
│   ├── audio_vae_encoder.mlpackage
│   ├── audio_vae_decoder.mlpackage
│   ├── feat_encoder.mlpackage
│   ├── base_lm_step.mlpackage
│   ├── residual_lm_step.mlpackage
│   └── locdit_step.mlpackage
└── int8/                              # INT8 models (783 MB per format)
    ├── audio_vae_encoder.mlpackage
    ├── audio_vae_encoder.mlmodelc
    ├── audio_vae_decoder.mlpackage
    ├── audio_vae_decoder.mlmodelc
    ├── feat_encoder.mlpackage
    ├── feat_encoder.mlmodelc
    ├── base_lm_step.mlpackage
    ├── base_lm_step.mlmodelc
    ├── residual_lm_step.mlpackage
    ├── residual_lm_step.mlmodelc
    ├── locdit_step.mlpackage
    └── locdit_step.mlmodelc
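
The `*_proj_{w,b}.npy` pairs are plain affine maps, so they can be applied with NumPy outside CoreML. A minimal sketch with random stand-ins for the real files (the `[out, in]` weight layout is an assumption):

```python
import numpy as np

# Random stand-ins for constants/enc_to_lm_proj_{w,b}.npy; in real use
# you would load w = np.load("constants/enc_to_lm_proj_w.npy"), etc.
w = np.random.randn(1024, 1024).astype(np.float32)  # assumed [out, in] layout
b = np.zeros(1024, dtype=np.float32)

feat_emb = np.random.randn(1, 1, 1024).astype(np.float32)  # feat_encoder output
feat_lm = feat_emb @ w.T + b  # affine projection into the base LM space
```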

Models

6 CoreML models (.mlpackage) split for step-by-step autoregressive generation:

| Model | FP16 | INT8 | Purpose | Input | Output |
|-------|------|------|---------|-------|--------|
| audio_vae_encoder | 82 MB | 41 MB | Encode prompt audio to latents | `[1, 1, 220500]` (5 s @ 44.1 kHz) | `[1, 64, T]` |
| audio_vae_decoder | 82 MB | 41 MB | Decode latents to 44.1 kHz audio | `[1, 64, T]` (flexible) | `[1, 1, T*1764]` |
| feat_encoder | 228 MB | 114 MB | Encode latent patches to LM embeddings | `[1, 1, 4, 64]` | `[1, 1, 1024]` |
| base_lm_step | 696 MB | 349 MB | Single AR step (24-layer LM + FSQ + stop) | embed + pos + 48 KV caches | lm_hidden + fsq + stop + 48 caches |
| residual_lm_step | 236 MB | 119 MB | Single AR step (8-layer residual LM) | embed + pos + 16 KV caches | res_hidden + 16 caches |
| locdit_step | 237 MB | 119 MB | Single Euler diffusion step | x, mu, t, cond, dt (batch=2 for CFG) | velocity `[2, 64, 4]` |
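
feat_encoder consumes one `[1, 1, 4, 64]` patch at a time, so `[1, 64, T]` VAE latents have to be sliced into groups of 4 frames first. A sketch of that slicing (assuming T is a multiple of the patch size):

```python
import numpy as np

latent = np.random.randn(1, 64, 12).astype(np.float32)  # [1, 64, T] with T = 12

# [1, 64, T] -> list of [1, 1, 4, 64] patches: move time to the front,
# then group every 4 consecutive latent frames into one patch.
t = latent.shape[2]
frames = latent.transpose(0, 2, 1).reshape(t // 4, 4, 64)
patches = [p[np.newaxis, np.newaxis] for p in frames]  # each [1, 1, 4, 64]
```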

Architecture

VoxCPM 1.5 is a tokenizer-free, diffusion-autoregressive TTS model built on MiniCPM-4 (0.5B LM backbone). It generates 44.1 kHz audio at a 6.25 Hz token rate using flow-matching diffusion.

| Component | Layers | Hidden | Params |
|-----------|--------|--------|--------|
| base_lm (MiniCPM4) | 24 | 1024 | ~450M |
| residual_lm | 8 | 1024 | ~80M |
| feat_encoder (LocEnc) | 8 | 1024 | ~80M |
| feat_decoder (LocDiT) | 8 | 1024 | ~80M |
| AudioVAE | enc [2,3,6,7,7] / dec [7,7,6,3,2] | 64→2048 | ~130M |

Total: ~800M parameters, 44.1kHz output, 6.25 Hz token rate (patch_size=4)
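
The 6.25 Hz token rate follows directly from the shapes in the tables above:

```python
SAMPLE_RATE = 44_100        # Hz, output sample rate
SAMPLES_PER_LATENT = 1_764  # AudioVAE decoder maps [1, 64, T] -> [1, 1, T*1764]
PATCH_SIZE = 4              # latent frames per LM token

latent_rate = SAMPLE_RATE / SAMPLES_PER_LATENT  # 25.0 Hz latent frame rate
token_rate = latent_rate / PATCH_SIZE           # 6.25 Hz
ms_per_token = 1000 / token_rate                # 160.0 ms of audio per token
```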

Conversion Validation

Per-component correlation (CoreML vs PyTorch FP32)

| Model | FP16 | INT8 | Notes |
|-------|------|------|-------|
| audio_vae_encoder | 0.999989 | 0.999989 | Fixed 5 s input, Snake activations patched |
| audio_vae_decoder | 0.999999 | 0.999999 | Flexible latent length via RangeDim |
| feat_encoder | 1.000000 | 1.000000 | 8-layer non-causal transformer |
| base_lm_step | 0.999998 | 0.999998 | 24 layers, GQA patched, scatter-based KV cache |
| residual_lm_step | 1.000000 | 1.000000 | 8 layers, same GQA/cache pattern |
| locdit_step | 0.999999 | 0.999999 | Flow matching estimator, cond_len=4 |

End-to-end verification (both variants produce identical results)

| Language | Input | ASR output | Match |
|----------|-------|------------|-------|
| English | "Hello, this is a test of the voice cloning system." | "Hello, this is a test of the voice cloning system." | Exact |
| Chinese | "你好，这是一个语音克隆系统的测试。" | "你好这是一个语音克隆系统的测试" | Exact (minus punctuation) |

Generation Pipeline

1. Encode prompt audio: latent = audio_vae_encoder(pad_to_5s(prompt))
2. Reshape into patches: [1, 64, T] → [1, n_patches, 4, 64]
3. Encode patches: feat_emb = feat_encoder(each patch)
4. Project: feat_lm = enc_to_lm_proj(feat_emb)
5. Embed text: text_emb = embed_tokens[token_ids] * scale_emb
6. Combine: [text_emb, audio_start_token, feat_lm] → [1, seq_len, 1024]
7. Prefill: step through all tokens via base_lm_step + residual_lm_step
8. Loop (autoregressive):
   a. dit_hidden = lm_to_dit_proj(lm_hidden_fsq) + res_to_dit_proj(res_hidden)
   b. noise = randn(1, 64, 4)
   c. For t in 10 Euler steps (1.0 → 0.001):
        vel = locdit_step(noise, dit_hidden, prefix_cond, t)  # batch=2 for CFG
        noise = noise - vel * dt
   d. pred_feat = noise (after all steps)
   e. If stop_head predicts stop and step > min_len: break
   f. prefix_cond = pred_feat
   g. next_emb = enc_to_lm_proj(feat_encoder(pred_feat))
   h. lm_hidden, fsq, stop = base_lm_step(next_emb, pos, caches)
   i. res_hidden = residual_lm_step(fsq + next_emb, pos, caches)
9. Decode: audio = audio_vae_decoder(concat(all pred_feats))
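
The inner Euler loop (step 8c) can be sketched with a stand-in for the CoreML call; `locdit_step_stub` is hypothetical, while the schedule (10 uniform steps from 1.0 down to 0.001, with `x = x - dt * v`) follows the pipeline description:

```python
import numpy as np

def locdit_step_stub(x, t):
    # Hypothetical stand-in for the CoreML locdit_step model (which also
    # takes mu, cond, dt and runs batch=2 for CFG); returns a velocity
    # with the same shape as x.
    return 0.1 * x * t

def euler_solve(x, n_steps=10, t_start=1.0, t_end=0.001):
    # Backward schedule: t runs 1.0 -> 0.001; each step subtracts dt * v.
    ts = np.linspace(t_start, t_end, n_steps + 1)
    for i in range(n_steps):
        dt = ts[i] - ts[i + 1]
        v = locdit_step_stub(x, ts[i])
        x = x - dt * v
    return x

noise = np.random.default_rng(0).standard_normal((1, 64, 4))
pred_feat = euler_solve(noise)
```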

Performance

Measured on Apple Silicon (macOS, CPU_AND_GPU compute units, INT8):

| Metric | Value |
|--------|-------|
| Prefill throughput | ~27 tok/s |
| Generation throughput | ~4.5 steps/s |
| Peak RAM | ~3.8 GB |
| Output sample rate | 44,100 Hz |
| Token rate | 6.25 Hz (160 ms per token) |
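
As a back-of-envelope check on these numbers (assuming each generation step emits one 160 ms token; the table reports steps/s, not tokens/s):

```python
steps_per_sec = 4.5                     # measured generation throughput
audio_sec_per_step = 1 / 6.25           # 160 ms of audio per token
audio_per_wall_sec = steps_per_sec * audio_sec_per_step  # ~0.72
rtf = 1 / audio_per_wall_sec            # ~1.39x wall time per second of audio
```

So on this configuration generation runs somewhat slower than real time.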

Usage

Requirements

  • macOS 14+ on Apple Silicon
  • Python 3.10+ with coremltools and numpy

Quick start

```python
import coremltools as ct
import numpy as np

# Choose precision: "fp16" or "int8"
precision = "int8"

# Load models
vae_enc = ct.models.MLModel(f"{precision}/audio_vae_encoder.mlpackage")
vae_dec = ct.models.MLModel(f"{precision}/audio_vae_decoder.mlpackage")
feat_enc = ct.models.MLModel(f"{precision}/feat_encoder.mlpackage")
base_lm = ct.models.MLModel(f"{precision}/base_lm_step.mlpackage")
res_lm = ct.models.MLModel(f"{precision}/residual_lm_step.mlpackage")
locdit = ct.models.MLModel(f"{precision}/locdit_step.mlpackage")

# Load shared constants
embed_tokens = np.load("constants/embed_tokens.npy")
# ... see generate_coreml.py for full pipeline
```
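
Before calling the VAE encoder, the prompt has to be fitted to its fixed `[1, 1, 220500]` input (5 s @ 44.1 kHz). A minimal NumPy helper; zero-padding short prompts and truncating long ones is an assumed policy, only the fixed input shape comes from the model card:

```python
import numpy as np

def pad_to_5s(wav, sample_rate=44_100, seconds=5):
    # Fit a mono waveform into the encoder's fixed [1, 1, 220500] window:
    # truncate anything past 5 s, zero-pad anything shorter (assumption).
    n = sample_rate * seconds
    wav = np.asarray(wav, dtype=np.float32)[:n]
    out = np.zeros((1, 1, n), dtype=np.float32)
    out[0, 0, : wav.shape[0]] = wav
    return out

padded = pad_to_5s(np.random.randn(2 * 44_100))  # a 2 s prompt
```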

Full generation script

See generate_coreml.py in the conversion repo for a complete zero-PyTorch generation pipeline:

```bash
python generate_coreml.py \
  --text "Hello, this is a test." \
  --prompt prompt.wav \
  --prompt-text "This is the prompt transcript." \
  --output output.wav
```

Conversion Details

Converted using coremltools with Float16 compute precision (compute_units=CPU_AND_GPU). INT8 variant post-quantized via ct.optimize.coreml.linear_quantize_weights() (linear symmetric).

Key conversion challenges solved:

  1. GQA attention (16 query heads, 2 KV heads): manual repeat_interleave expansion
  2. Snake activations: replaced @torch.jit.script with a simple module
  3. In-place KV cache: functional scatter replacement
  4. Euler solver direction: backward (1→0) with x = x - dt * v
  5. Chinese tokenizer: mask_multichar_chinese_tokens wrapper for character-level splitting
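
The GQA expansion in (1) can be reproduced with NumPy: each of the 2 KV heads is repeated 16 // 2 = 8 times so that standard multi-head attention sees matching query and KV head counts.

```python
import numpy as np

# Shapes from the card: 16 query heads, 2 KV heads; seq length and
# head dim here are illustrative.
n_q_heads, n_kv_heads, seq, head_dim = 16, 2, 5, 64
k = np.random.randn(1, n_kv_heads, seq, head_dim)

# NumPy equivalent of torch.repeat_interleave along the head axis:
# expand KV heads before the usual attention matmuls.
k_expanded = np.repeat(k, n_q_heads // n_kv_heads, axis=1)
```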

See TRIALS.md for the full conversion log.
