---
license: apache-2.0
language:
  - en
  - zh
  - id
  - ja
  - ko
  - multilingual
tags:
  - text-to-speech
  - tts
  - voice-cloning
  - voice-design
  - diffusion
  - litert
  - tflite
  - on-device
  - soniqo
  - speech-cloud
  - speech-core
base_model: openbmb/VoxCPM2
library_name: litert
pipeline_tag: text-to-speech
---

# VoxCPM2 - LiteRT (INT8)

2 B-parameter multilingual TTS with voice cloning and voice design. 48 kHz output.

> Part of the [**soniqo.audio**](https://soniqo.audio) speech toolkit,
> an open, runtime-portable stack for speech AI. This bundle is the
> **LiteRT** export, designed to plug into the abstract interfaces in
> [`speech-core`](https://github.com/soniqo/speech-core) (C++ voice-agent
> orchestration library). Browse all LiteRT bundles in the
> [**soniqo LiteRT collection**](https://huggingface.co/collections/soniqo/litert-6a08268e11d5a47d7aacc02b).

## Use cases on soniqo.audio

- [Speech generation](https://soniqo.audio/speech-generation/)
- [Voice cloning](https://soniqo.audio/voice-cloning/)
- [Long-form speech](https://soniqo.audio/long-form-speech/)

LiteRT export of [openbmb/VoxCPM2](https://huggingface.co/openbmb/VoxCPM2),
a 2 B-parameter diffusion-autoregressive TTS with 48 kHz
studio-quality output, reference-audio voice cloning, and
natural-language voice design. Designed for server-side
synthesis workers and on-device TTS through the
[`speech-core`](https://github.com/soniqo/speech-core)
`TTSInterface`.

## Why split graphs

VoxCPM2 is not a single feed-forward model. The runtime loop is:

```
text + optional instruction ──► text-prefill
                                      │
                                      ▼
                              repeated token-step
                                      │
                                      ▼
                              audio-decoder ──► 48 kHz PCM
```

The host owns the loop and the KV cache; LiteRT owns the
static tensor programs. The same split is used for Parakeet and
Nemotron in this collection: LiteRT for the math, the host for
the control flow.
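
A minimal sketch of that host-owned loop, assuming the three graphs are wrapped as plain Python callables. The names `prefill`, `token_step`, and `decode`, their argument shapes, and the convention that the stop step emits no acoustic token are all hypothetical; the real tensor signatures come from `config.json` and are not reproduced here.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical wrappers around the LiteRT graphs (not the bundle's real API).
Prefill = Callable[[List[int]], Any]                # token ids -> KV cache
TokenStep = Callable[[Any], Tuple[Any, Any, bool]]  # cache -> (cache, token, stop)
Decode = Callable[[List[Any]], List[float]]         # acoustic tokens -> PCM


def synthesize(
    token_ids: List[int],
    prefill: Prefill,
    token_step: TokenStep,
    decode: Decode,
    max_steps: int = 30,
) -> List[float]:
    """Host-owned generation loop: graphs do the math, Python does control flow."""
    cache = prefill(token_ids)        # single call that builds the KV cache
    acoustic: List[Any] = []
    for _ in range(max_steps):        # the host enforces the step ceiling
        cache, token, stop = token_step(cache)
        if stop:                      # stop token ends generation early
            break
        acoustic.append(token)
    return decode(acoustic)           # single call: acoustic tokens -> PCM
```

Keeping the loop in the host means the step ceiling and the stop condition can change without re-exporting any graph.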

## Files

| File | Size | Description |
|---|---:|---|
| `voxcpm2-text-prefill.tflite` | 7.7 GB | FP32 text + instruction prefill (MiniCPM-4 KV-cache producer) |
| `voxcpm2-token-step.tflite`   | 2.0 GB | **INT8** weight-only autoregressive step (MiniCPM-4 + residual LM) |
| `voxcpm2-audio-encoder.tflite` | 184 MB | FP32 reference-audio encoder (16 kHz → conditioning) |
| `voxcpm2-audio-decoder.tflite` | 175 MB | FP32 AudioVAE decoder (acoustic tokens → 48 kHz PCM) |
| `tokenizer.json` / `tokenizer_config.json` / `special_tokens_map.json` | - | HF tokenizer bundle |
| `generation_config.json` / `tokenization_voxcpm2.py` | - | Generation defaults + tokenizer module |
| `config.json`                 | - | Tensor shapes, sample rates, files manifest |
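
Before loading anything, a host can check that a downloaded bundle is complete. The sketch below hard-codes the file names from the table above; `missing_files` and `EXPECTED_FILES` are illustrative names, and the manifest inside `config.json` (whose schema is not shown here) may be the better source of truth.

```python
from pathlib import Path
from typing import List

# File names copied from the table above.
EXPECTED_FILES = [
    "voxcpm2-text-prefill.tflite",
    "voxcpm2-token-step.tflite",
    "voxcpm2-audio-encoder.tflite",
    "voxcpm2-audio-decoder.tflite",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "generation_config.json",
    "tokenization_voxcpm2.py",
    "config.json",
]


def missing_files(bundle_dir: str) -> List[str]:
    """Return the expected files that are absent from bundle_dir, in table order."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```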

## Quantization

- **token-step**: INT8 weight-only. It is the only graph that runs in
  the inner generation loop, so quantizing it gives the biggest win.
- **text-prefill / audio-encoder / audio-decoder**: stay FP32.
  Quantizing prefill caused semantic drift in the roundtrip test, and
  INT8 risks audible artifacts in the AudioVAE decoder.
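
For readers unfamiliar with the term, weight-only INT8 stores each weight as an 8-bit integer plus a float scale and expands it back to float at compute time, while activations stay in floating point. The sketch below shows generic symmetric per-tensor quantization in plain Python. It is an illustration of the idea only: this card does not say whether the export uses per-tensor or per-channel scales, and the function names are made up.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor INT8: map [-max|w|, +max|w|] onto [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid division by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Expand stored integers back to float weights at compute time."""
    return [q * scale for q in quantized]
```

The round-trip error per weight is bounded by half a quantization step (`scale / 2`), which is the kind of perturbation the FP32 graphs above are kept free of.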

## Smoke result

30-step English roundtrip (`"hello world from soniqo dot audio"`,
instruction `"clear neutral delivery"`):

- Stop token fired naturally at step 18 (decoder halted before
  the 30-step ceiling)
- 138 240 samples ÷ 48 000 Hz (mono) = 2.88 s
- RMS 0.033, peak 0.44: no clipping, and a signal level well above silence
- Output written to `voxcpm2-litert-hello-world.wav`
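
The duration figure is just sample count divided by sample rate, and RMS and peak can be recomputed from any decoded buffer. A small sketch with illustrative helper names, assuming float samples normalized to [-1, 1]:

```python
import math
from typing import Sequence, Tuple


def duration_seconds(num_samples: int, sample_rate_hz: int) -> float:
    """Duration of a mono buffer: samples divided by samples-per-second."""
    return num_samples / sample_rate_hz


def rms_and_peak(samples: Sequence[float]) -> Tuple[float, float]:
    """Root-mean-square level and absolute peak of a non-empty buffer."""
    if not samples:
        raise ValueError("empty buffer")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    return rms, peak


# Reproduces the duration reported above.
assert abs(duration_seconds(138_240, 48_000) - 2.88) < 1e-9
```

A peak below 1.0 means no sample hit full scale, which is what "no clipping" refers to.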

## Modes

Mirrors the [speech-swift `VoxCPM2TTS`](https://github.com/soniqo/speech-swift)
mode matrix:

| Mode | Inputs |
|---|---|
| Zero-shot | text |
| Voice design | text + style instruction |
| Controllable cloning | text + reference audio |
| Ultimate cloning | text + reference audio + prompt audio + prompt text |
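
The matrix can be read as a dispatch on which optional inputs are present. The sketch below is one way a host might encode it; `TTSRequest`, `select_mode`, and the mode strings are hypothetical and not part of `speech-core` or `speech-swift`. The table does not say what happens when both an instruction and reference audio are given, so this sketch assumes reference audio takes precedence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TTSRequest:
    """Hypothetical request type; field names are illustrative."""
    text: str
    instruction: Optional[str] = None
    reference_audio: Optional[bytes] = None
    prompt_audio: Optional[bytes] = None
    prompt_text: Optional[str] = None


def select_mode(req: TTSRequest) -> str:
    """Map the inputs that are present onto a row of the mode matrix."""
    if req.prompt_audio is not None or req.prompt_text is not None:
        # Ultimate cloning needs all three extra inputs together.
        if None in (req.reference_audio, req.prompt_audio, req.prompt_text):
            raise ValueError("ultimate cloning needs reference audio, "
                             "prompt audio, and prompt text")
        return "ultimate-cloning"
    if req.reference_audio is not None:
        return "controllable-cloning"  # assumed to win over an instruction
    if req.instruction is not None:
        return "voice-design"
    return "zero-shot"
```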

For Apple Silicon, prefer the MLX bundles
([bf16](https://huggingface.co/aufklarer/VoxCPM2-MLX-bf16) /
 [int8](https://huggingface.co/aufklarer/VoxCPM2-MLX-int8) /
 [int4](https://huggingface.co/aufklarer/VoxCPM2-MLX-int4))
consumed by `speech-swift`.

## Source

Exported from [openbmb/VoxCPM2](https://huggingface.co/openbmb/VoxCPM2)
via a graph-split LiteRT conversion, run in a pinned Docker
environment because LiteRT / Torch / TorchAO versions are
tightly coupled.

## Responsible use

Voice cloning is included. Users are responsible for obtaining
consent for any voice that is cloned and for not using the model
to impersonate individuals without permission, generate
disinformation, or commit fraud.

## Ecosystem

- [**soniqo.audio**](https://soniqo.audio): use-case explorer (transcription, voice cloning, live ASR, voice agents).
- [**speech-core**](https://github.com/soniqo/speech-core): C++ orchestration library for voice agents. Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces.
- [**speech-swift**](https://github.com/soniqo/speech-swift): Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
- [**speech-android**](https://github.com/soniqo/speech-android): Android SDK consuming on-device LiteRT bundles.

## Other LiteRT models in this collection

**ASR / Transcription**

- [Parakeet TDT 0.6B v3 - LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8)
- [Nemotron Speech Streaming 0.6B - LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT)
- [Omnilingual ASR CTC 300M - LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT)
- [Omnilingual ASR CTC 300M - LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8)
- [Qwen3 ASR 0.6B Encoder - LiteRT (INT8)](https://huggingface.co/soniqo/Qwen3-ASR-0.6B-Encoder-LiteRT-INT8)

**VAD / Diarization**

- [Silero VAD v5 - LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT)
- [Pyannote Segmentation 3.0 - LiteRT](https://huggingface.co/soniqo/Pyannote-Segmentation-LiteRT)
- [WeSpeaker ResNet34-LM - LiteRT](https://huggingface.co/soniqo/WeSpeaker-ResNet34-LM-LiteRT)

## License

This bundle inherits the upstream model license (**apache-2.0**). See the
linked `base_model` repository for the full terms.