+---
+license: apache-2.0
+language:
+  - en
+  - zh
+  - id
+  - ja
+  - ko
+  - multilingual
+tags:
+  - text-to-speech
+  - tts
+  - voice-cloning
+  - voice-design
+  - diffusion
+  - litert
+  - tflite
+  - on-device
+  - soniqo
+  - speech-cloud
+  - speech-core
+base_model: openbmb/VoxCPM2
+library_name: litert
+pipeline_tag: text-to-speech
+---
+# VoxCPM2 — LiteRT (INT8)
+2 B-parameter multilingual TTS with voice cloning and voice design. 48 kHz output.
+> Part of the [**soniqo.audio**](https://soniqo.audio) speech toolkit —
+> an open, runtime-portable stack for speech AI. This bundle is the
+> **LiteRT** export; served from cloud by
+> [`speech-cloud`](https://github.com/soniqo/speech-cloud) and embeddable
+> on-device through [`speech-core`](https://github.com/soniqo/speech-core).
+> Browse all LiteRT bundles in the
+> [**soniqo LiteRT collection**](https://huggingface.co/collections/soniqo/litert).
+## Use cases on soniqo.audio
+- [Speech generation](https://soniqo.audio/speech-generation/)
+- [Voice cloning](https://soniqo.audio/voice-cloning/)
+- [Long-form speech](https://soniqo.audio/long-form-speech/)
+LiteRT export of [openbmb/VoxCPM2](https://huggingface.co/openbmb/VoxCPM2)
+— a 2 B-parameter diffusion-autoregressive TTS with 48 kHz
+studio-quality output, reference-audio voice cloning, and
+natural-language voice design. Consumed by the
+[`speech-cloud`](https://github.com/soniqo/speech-cloud)
+synthesis worker (`--mode=synthesize-worker`).
+## Why split graphs
+VoxCPM2 is not a single feed-forward model. The runtime loop is:
+```
+text + optional instruction ──► text-prefill
+                                      │
+                                      ▼
+                              repeated token-step
+                                      │
+                                      ▼
+                              audio-decoder ──► 48 kHz PCM
+```
+The C++ worker owns the loop and the KV cache; LiteRT owns the
+static tensor programs. This is the same split that `speech-cloud`
+uses for Parakeet and Nemotron: LiteRT for the math, C++ for the
+control flow.
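
The control flow described above can be sketched in a few lines. This is a minimal illustration, not the worker's code: the real loop is C++, and the names `synthesize`, `prefill`, `token_step`, and `decode` are hypothetical stand-ins for the three graphs. It also assumes the step that raises the stop flag contributes no audio frame, which the card does not state.

```python
def synthesize(text, instruction, prefill, token_step, decode, max_steps=30):
    """Run prefill once, step until the stop flag or the ceiling, then decode.

    prefill, token_step and decode are hypothetical callables standing in
    for the text-prefill, token-step and audio-decoder graphs.
    """
    cache = prefill(text, instruction)       # builds the initial KV cache
    frames = []
    for _ in range(max_steps):
        cache, frame, stop = token_step(cache)
        if stop:                             # assumption: stop step emits no frame
            break
        frames.append(frame)
    return decode(frames)                    # acoustic frames -> PCM samples


# Toy stand-ins so the loop can be exercised without any model files.
def fake_prefill(text, instruction):
    return 0                                 # toy "cache": a step counter


def fake_step(cache):
    step = cache + 1
    return step, [float(step)], step > 3     # raises stop on the fourth call


def fake_decode(frames):
    return [x for frame in frames for x in frame]


pcm = synthesize("hello", "", fake_prefill, fake_step, fake_decode)
```

Keeping the loop outside the graphs means each `.tflite` file stays a fixed-shape program, while the stop condition and step ceiling remain ordinary host-side logic.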
+## Files
+| File | Size | Description |
+|---|---:|---|
+| `voxcpm2-text-prefill.tflite` | 7.7 GB | FP32 text + instruction prefill (MiniCPM-4 KV-cache producer) |
+| `voxcpm2-token-step.tflite`   | 2.0 GB | **INT8** weight-only autoregressive step (MiniCPM-4 + residual LM) |
+| `voxcpm2-audio-encoder.tflite` | 184 MB | FP32 reference-audio encoder (16 kHz → conditioning) |
+| `voxcpm2-audio-decoder.tflite` | 175 MB | FP32 AudioVAE decoder (acoustic tokens → 48 kHz PCM) |
+| `tokenizer.json` / `tokenizer_config.json` / `special_tokens_map.json` | — | HF tokenizer bundle |
+| `generation_config.json` / `tokenization_voxcpm2.py` | — | Generation defaults + tokenizer module |
+| `config.json`                 | — | Tensor shapes, sample rates, files manifest |
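
Because the graphs are large, it may be worth checking that a download is complete before starting a worker. The sketch below only tests for the file names listed in the table; `missing_files` is a hypothetical helper, not part of `speech-cloud`, and it does not read the manifest inside `config.json`, whose schema is not described here.

```python
from pathlib import Path

# File names copied from the table above.
EXPECTED_FILES = [
    "voxcpm2-text-prefill.tflite",
    "voxcpm2-token-step.tflite",
    "voxcpm2-audio-encoder.tflite",
    "voxcpm2-audio-decoder.tflite",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "generation_config.json",
    "tokenization_voxcpm2.py",
    "config.json",
]


def missing_files(bundle_dir):
    """Return the expected file names that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from `missing_files("path/to/bundle")` means every listed file is present; it says nothing about file integrity.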
+## Quantization
+- **token-step**: INT8 weight-only (the only graph that runs in
+  the inner generation loop — quantizing here is the biggest win).
+- **text-prefill / audio-encoder / audio-decoder**: stay FP32.
+  Quantizing prefill caused semantic drift in the roundtrip test, and
+  INT8 in the AudioVAE decoder risks audible artifacts.
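
For readers unfamiliar with the term, weight-only INT8 stores the weights as 8-bit integers plus a scale and keeps activations in floating point. The sketch below shows one common variant, symmetric per-tensor quantization, in plain Python. It is an assumption for illustration: the card does not say whether the exporter uses per-tensor or per-channel scales, and `quantize_int8` / `dequantize` are hypothetical names.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: returns (ints in [-127, 127], scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


weights = [0.50, -0.25, 0.125, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer bounds the per-weight error by scale / 2.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error is small per weight, but it accumulates across layers, which is consistent with the card's choice to leave the drift-sensitive graphs in FP32.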
+## Smoke result
+30-step English roundtrip (`"hello world from soniqo dot audio"`,
+instruction `"clear neutral delivery"`):
+- Stop token fired naturally at step 18 (generation halted before
+  the 30-step ceiling)
+- 138 240 samples at 48 kHz mono = 2.88 s
+- RMS 0.033, peak 0.44: no clipping, and the output is not silent
+- Output written to `voxcpm2-litert-hello-world.wav`
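
The figures above follow from simple arithmetic: duration is the sample count divided by the sample rate, and RMS and peak are computed over the sample values. The helpers below are a hypothetical sketch that assumes samples are floats normalized to [-1, 1]; the card does not say how its RMS and peak were measured.

```python
import math


def duration_seconds(num_samples, sample_rate_hz):
    """Length of a mono clip in seconds."""
    return num_samples / sample_rate_hz


def rms(samples):
    """Root-mean-square level of the samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def peak(samples):
    """Largest absolute sample value; values near 1.0 indicate clipping risk."""
    return max(abs(s) for s in samples)


print(duration_seconds(138_240, 48_000))  # 2.88
```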
+## Modes
+Mirrors the [speech-swift `VoxCPM2TTS`](https://github.com/soniqo/speech-swift)
+mode matrix:
+| Mode | Inputs |
+|---|---|
+| Zero-shot | text |
+| Voice design | text + style instruction |
+| Controllable cloning | text + reference audio |
+| Ultimate cloning | text + reference audio + prompt audio + prompt text |
+For Apple Silicon, prefer the MLX bundles
+([bf16](https://huggingface.co/aufklarer/VoxCPM2-MLX-bf16) /
+ [int8](https://huggingface.co/aufklarer/VoxCPM2-MLX-int8) /
+ [int4](https://huggingface.co/aufklarer/VoxCPM2-MLX-int4))
+consumed by `speech-swift`.
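
The mode matrix above can be read as a dispatch on which inputs are present. The function below is a hypothetical sketch of that reading, not an API of `speech-cloud` or `speech-swift`. The mode strings are made up for illustration, and the handling of combinations the table does not list (for example, an instruction together with reference audio) is an assumption.

```python
def select_mode(text, instruction=None, reference_audio=None,
                prompt_audio=None, prompt_text=None):
    """Map the supplied inputs to a row of the mode table."""
    if not text:
        raise ValueError("text is required in every mode")
    if prompt_audio is not None or prompt_text is not None:
        # Ultimate cloning needs all three audio/text conditioning inputs.
        if reference_audio is None or prompt_audio is None or prompt_text is None:
            raise ValueError(
                "ultimate cloning needs reference audio, prompt audio, and prompt text"
            )
        return "ultimate-cloning"
    if reference_audio is not None:
        # Assumption: reference audio takes precedence over an instruction.
        return "controllable-cloning"
    if instruction:
        return "voice-design"
    return "zero-shot"
```

For example, `select_mode("hello", instruction="clear neutral delivery")` selects voice design, matching the smoke-test inputs.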
+## Source
+Exporter: `models/voxcpm2/export/convert_litert.py` in
+[speech-models](https://github.com/soniqo/speech-models),
+run in the pinned `Dockerfile.litert` environment.
+## Responsible use
+This bundle supports voice cloning. Users are responsible for obtaining
+consent for any voice that is cloned and for not using the model
+to impersonate individuals without permission, generate
+disinformation, or commit fraud.
+## Ecosystem
+- [**soniqo.audio**](https://soniqo.audio) — use-case explorer (transcription, voice cloning, live ASR, voice agents).
+- [**speech-cloud**](https://github.com/soniqo/speech-cloud) — C++ cloud API server. Runs LiteRT models behind `/v1/transcribe`, `/v1/realtime`, and (planned) `/v1/audio/speech`.
+- [**speech-core**](https://github.com/soniqo/speech-core) — C++ orchestration library for voice agents. Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces.
+- [**speech-models**](https://github.com/soniqo/speech-models) — the exporters that produced this bundle.
+- [**speech-swift**](https://github.com/soniqo/speech-swift) — Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
+## Other LiteRT models in this collection
+**ASR / Transcription**
+- [Parakeet TDT 0.6B v3 — LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8)
+- [Nemotron Speech Streaming 0.6B — LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT)
+- [Omnilingual ASR CTC 300M — LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT)
+- [Omnilingual ASR CTC 300M — LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8)
+- [Qwen3 ASR 0.6B Encoder — LiteRT (INT8)](https://huggingface.co/soniqo/Qwen3-ASR-0.6B-Encoder-LiteRT-INT8)
+**VAD / Diarization**
+- [Silero VAD v5 — LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT)
+- [Pyannote Segmentation 3.0 — LiteRT](https://huggingface.co/soniqo/Pyannote-Segmentation-LiteRT)
+- [WeSpeaker ResNet34-LM — LiteRT](https://huggingface.co/soniqo/WeSpeaker-ResNet34-LM-LiteRT)
+## License
+This bundle inherits the upstream model license (**apache-2.0**). See the
+linked `base_model` repository for the full terms.