| license: apache-2.0 | |
| language: | |
| - en | |
| - zh | |
| - id | |
| - ja | |
| - ko | |
| - multilingual | |
| tags: | |
| - text-to-speech | |
| - tts | |
| - voice-cloning | |
| - voice-design | |
| - diffusion | |
| - litert | |
| - tflite | |
| - on-device | |
| - soniqo | |
| - speech-cloud | |
| - speech-core | |
| base_model: openbmb/VoxCPM2 | |
| library_name: litert | |
| pipeline_tag: text-to-speech | |
| # VoxCPM2 - LiteRT (INT8) | |
| 2 B-parameter multilingual TTS with voice cloning and voice design. 48 kHz output. | |
| > Part of the [**soniqo.audio**](https://soniqo.audio) speech toolkit, | |
| > an open, runtime-portable stack for speech AI. This bundle is the | |
| > **LiteRT** export, designed to plug into the abstract interfaces in | |
| > [`speech-core`](https://github.com/soniqo/speech-core) (C++ voice-agent | |
| > orchestration library). Browse all LiteRT bundles in the | |
| > [**soniqo LiteRT collection**](https://huggingface.co/collections/soniqo/litert-6a08268e11d5a47d7aacc02b). | |
| ## Use cases on soniqo.audio | |
| - [Speech generation](https://soniqo.audio/speech-generation/) | |
| - [Voice cloning](https://soniqo.audio/voice-cloning/) | |
| - [Long-form speech](https://soniqo.audio/long-form-speech/) | |
| LiteRT export of [openbmb/VoxCPM2](https://huggingface.co/openbmb/VoxCPM2), | |
| a 2 B-parameter diffusion-autoregressive TTS with 48 kHz | |
| studio-quality output, reference-audio voice cloning, and | |
| natural-language voice design. Designed for server-side | |
| synthesis workers and on-device TTS through the | |
| [`speech-core`](https://github.com/soniqo/speech-core) | |
| `TTSInterface`. | |
| ## Why split graphs | |
| VoxCPM2 is not a single feed-forward model. The runtime loop is: | |
| ``` | |
| text + optional instruction ──► text-prefill | |
|                                      │ | |
|                                      ▼ | |
|                             repeated token-step | |
|                                      │ | |
|                                      ▼ | |
|                               audio-decoder ──► 48 kHz PCM | |
| ``` | |
| The host owns the loop and the KV cache; LiteRT owns the | |
| static tensor programs. The same split is used for Parakeet and | |
| Nemotron in this collection: LiteRT for the math, host for | |
| the control flow. | |
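The division of labour described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the `speech-core` implementation: `prefill`, `token_step`, and `decode` are hypothetical callables standing in for the three LiteRT graphs, and the real tensor names, shapes, and stop signal come from the bundle's `config.json`.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical stand-ins for the LiteRT graphs (assumed signatures, not a real API).
Prefill = Callable[[List[int]], Any]                     # token ids -> KV cache
TokenStep = Callable[[Any, int], Tuple[Any, Any, bool]]  # (cache, step) -> (cache, latent, stop)
Decode = Callable[[List[Any]], List[float]]              # latents -> PCM samples


def synthesize(
    token_ids: List[int],
    prefill: Prefill,
    token_step: TokenStep,
    decode: Decode,
    max_steps: int = 30,
) -> List[float]:
    """Host-side control flow: prefill once, step until stop, decode once."""
    cache = prefill(token_ids)              # single call into the prefill graph
    latents: List[Any] = []
    for step in range(max_steps):           # the host owns the loop ...
        cache, latent, stop = token_step(cache, step)  # ... and carries the cache
        if stop:                            # stop signal ends generation early
            break
        latents.append(latent)
    return decode(latents)                  # single call into the decoder graph
```

Each graph stays a static tensor program with fixed shapes, while branching, early exit, and cache hand-off live in ordinary host code.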
| ## Files | |
| | File | Size | Description | | |
| |---|---:|---| | |
| | `voxcpm2-text-prefill.tflite` | 7.7 GB | FP32 text + instruction prefill (MiniCPM-4 KV-cache producer) | | |
| | `voxcpm2-token-step.tflite` | 2.0 GB | **INT8** weight-only autoregressive step (MiniCPM-4 + residual LM) | | |
| `voxcpm2-audio-encoder.tflite` | 184 MB | FP32 reference-audio encoder (16 kHz → conditioning) | |
| `voxcpm2-audio-decoder.tflite` | 175 MB | FP32 AudioVAE decoder (acoustic tokens → 48 kHz PCM) | |
| `tokenizer.json` / `tokenizer_config.json` / `special_tokens_map.json` | - | HF tokenizer bundle | |
| `generation_config.json` / `tokenization_voxcpm2.py` | - | Generation defaults + tokenizer module | |
| `config.json` | - | Tensor shapes, sample rates, files manifest | |
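Because the bundle is split across several large files, it can help to confirm that a local copy is complete before loading anything. The sketch below uses only the file names from the table above; `missing_files` is a hypothetical helper, and it deliberately does not parse `config.json`, whose schema is not documented here.

```python
from pathlib import Path
from typing import List

# File names copied from the Files table.
EXPECTED_FILES = [
    "voxcpm2-text-prefill.tflite",
    "voxcpm2-token-step.tflite",
    "voxcpm2-audio-encoder.tflite",
    "voxcpm2-audio-decoder.tflite",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "generation_config.json",
    "tokenization_voxcpm2.py",
    "config.json",
]


def missing_files(bundle_dir: str) -> List[str]:
    """Return the expected bundle files that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means every file named in the table is present; it says nothing about whether the files are intact.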
| ## Quantization | |
| - **token-step**: INT8 weight-only (the only graph that runs in | |
| the inner generation loop, so quantizing here is the biggest win). | |
| - **text-prefill / audio-encoder / audio-decoder**: stay FP32. | |
| Quantizing prefill caused semantic drift in roundtrip; the | |
| AudioVAE decoder risks audible artifacts under INT8. | |
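For readers unfamiliar with the term, "weight-only" means the weights are stored as 8-bit integers with a float scale while activations stay in floating point. The sketch below shows one common variant, symmetric per-row quantization, in plain Python. It is a generic illustration: the exact scheme used by the export (granularity, rounding, zero points) is not stated in this card, and both function names are hypothetical.

```python
from typing import List, Tuple


def quantize_rows_int8(weights: List[List[float]]) -> Tuple[List[List[int]], List[float]]:
    """Symmetric per-row INT8 quantization: returns (int8 rows, per-row scales)."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid dividing by zero
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def matvec_weight_only(q_rows: List[List[int]], scales: List[float], x: List[float]) -> List[float]:
    """Matrix-vector product with INT8 weights and float activations."""
    return [
        scale * sum(q * xi for q, xi in zip(row, x))  # dequantize via the row scale
        for row, scale in zip(q_rows, scales)
    ]
```

Each stored weight differs from the original by at most half a scale step, which is the kind of small perturbation that can accumulate into the drift noted for the prefill graph.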
| ## Smoke result | |
| 30-step English roundtrip (`"hello world from soniqo dot audio"`, | |
| instruction `"clear neutral delivery"`): | |
| - Stop token fired naturally at step 18 (generation halted before | |
| the 30-step ceiling) | |
| - 138 240 samples at 48 kHz mono = 2.88 s | |
| - RMS 0.033, peak 0.44: no clipping, real signal level | |
| - Output written to `voxcpm2-litert-hello-world.wav` | |
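The duration figure follows directly from the sample count and sample rate, and the level figures are standard RMS and peak measurements. A minimal sketch of both calculations, assuming samples are floats normalized to the range [-1, 1] (the card does not state the sample format, and the helper names are hypothetical):

```python
import math
from typing import List, Tuple


def duration_seconds(num_samples: int, sample_rate_hz: int) -> float:
    """Duration of a mono clip: sample count divided by sample rate."""
    return num_samples / sample_rate_hz


def rms_and_peak(samples: List[float]) -> Tuple[float, float]:
    """Root-mean-square level and absolute peak of a non-empty sample list."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    return rms, peak
```

With the numbers above, `duration_seconds(138240, 48000)` gives 2.88. A peak below 1.0 on normalized samples is what "no clipping" refers to.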
| ## Modes | |
| Mirrors the [speech-swift `VoxCPM2TTS`](https://github.com/soniqo/speech-swift) | |
| mode matrix: | |
| | Mode | Inputs | | |
| |---|---| | |
| | Zero-shot | text | | |
| | Voice design | text + style instruction | | |
| | Controllable cloning | text + reference audio | | |
| | Ultimate cloning | text + reference audio + prompt audio + prompt text | | |
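The mode table can be read as a dispatch on which optional inputs are present. The sketch below is one way to express that; `TTSRequest` and `select_mode` are hypothetical names and are not part of `speech-core` or `speech-swift`. The table does not say what happens when a style instruction and reference audio are both supplied, so giving cloning precedence here is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TTSRequest:
    text: str
    instruction: Optional[str] = None        # natural-language style instruction
    reference_audio: Optional[bytes] = None  # audio of the voice to clone
    prompt_audio: Optional[bytes] = None     # prompt audio for ultimate cloning
    prompt_text: Optional[str] = None        # text paired with the prompt audio


def select_mode(req: TTSRequest) -> str:
    """Map the inputs present on a request to a row of the mode table."""
    if req.prompt_audio is not None or req.prompt_text is not None:
        # Ultimate cloning needs all three: reference audio, prompt audio, prompt text.
        if None in (req.reference_audio, req.prompt_audio, req.prompt_text):
            raise ValueError("ultimate cloning needs reference audio, prompt audio, and prompt text")
        return "ultimate-cloning"
    if req.reference_audio is not None:
        return "controllable-cloning"  # assumption: cloning wins over an instruction
    if req.instruction is not None:
        return "voice-design"
    return "zero-shot"
```

Rejecting a partial prompt pair up front keeps an incomplete request from silently falling back to a simpler mode.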
| For Apple Silicon, prefer the MLX bundles | |
| ([bf16](https://huggingface.co/aufklarer/VoxCPM2-MLX-bf16) / | |
| [int8](https://huggingface.co/aufklarer/VoxCPM2-MLX-int8) / | |
| [int4](https://huggingface.co/aufklarer/VoxCPM2-MLX-int4)) | |
| consumed by `speech-swift`. | |
| ## Source | |
| Exported from [openbmb/VoxCPM2](https://huggingface.co/openbmb/VoxCPM2) | |
| via a graph-split LiteRT conversion, run in a pinned Docker | |
| environment because LiteRT / Torch / TorchAO versions are | |
| tightly coupled. | |
| ## Responsible use | |
| Voice cloning is included. Users are responsible for obtaining | |
| consent for any voice that is cloned and for not using the model | |
| to impersonate individuals without permission, generate | |
| disinformation, or commit fraud. | |
| ## Ecosystem | |
| - [**soniqo.audio**](https://soniqo.audio): use-case explorer (transcription, voice cloning, live ASR, voice agents). | |
| - [**speech-core**](https://github.com/soniqo/speech-core): C++ orchestration library for voice agents. Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces. | |
| - [**speech-swift**](https://github.com/soniqo/speech-swift): Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable). | |
| - [**speech-android**](https://github.com/soniqo/speech-android): Android SDK consuming on-device LiteRT bundles. | |
| ## Other LiteRT models in this collection | |
| **ASR / Transcription** | |
| - [Parakeet TDT 0.6B v3 - LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8) | |
| - [Nemotron Speech Streaming 0.6B - LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT) | |
| - [Omnilingual ASR CTC 300M - LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT) | |
| - [Omnilingual ASR CTC 300M - LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8) | |
| - [Qwen3 ASR 0.6B Encoder - LiteRT (INT8)](https://huggingface.co/soniqo/Qwen3-ASR-0.6B-Encoder-LiteRT-INT8) | |
| **VAD / Diarization** | |
| - [Silero VAD v5 - LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT) | |
| - [Pyannote Segmentation 3.0 - LiteRT](https://huggingface.co/soniqo/Pyannote-Segmentation-LiteRT) | |
| - [WeSpeaker ResNet34-LM - LiteRT](https://huggingface.co/soniqo/WeSpeaker-ResNet34-LM-LiteRT) | |
| ## License | |
| This bundle inherits the upstream model license (**apache-2.0**). See the | |
| linked `base_model` repository for the full terms. | |