aufklarer committed on
Commit
ecc3f03
·
verified ·
1 Parent(s): 3c9bcbf

card: unified LiteRT model card with soniqo.audio + ecosystem links

  1. README.md +160 -0
README.md ADDED
@@ -0,0 +1,160 @@
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ - zh
6
+ - id
7
+ - ja
8
+ - ko
9
+ - multilingual
10
+ tags:
11
+ - text-to-speech
12
+ - tts
13
+ - voice-cloning
14
+ - voice-design
15
+ - diffusion
16
+ - litert
17
+ - tflite
18
+ - on-device
19
+ - soniqo
20
+ - speech-cloud
21
+ - speech-core
22
+ base_model: openbmb/VoxCPM2
23
+ library_name: litert
24
+ pipeline_tag: text-to-speech
25
+ ---
26
+
27
+ # VoxCPM2 - LiteRT (INT8)
28
+
29
+ 2 B-parameter multilingual TTS with voice cloning and voice design. 48 kHz output.
30
+
31
+ > Part of the [**soniqo.audio**](https://soniqo.audio) speech toolkit,
32
+ > an open, runtime-portable stack for speech AI. This bundle is the
33
+ > **LiteRT** export; it is served from the cloud by
34
+ > [`speech-cloud`](https://github.com/soniqo/speech-cloud) and is embeddable
35
+ > on-device through [`speech-core`](https://github.com/soniqo/speech-core).
36
+ > Browse all LiteRT bundles in the
37
+ > [**soniqo LiteRT collection**](https://huggingface.co/collections/soniqo/litert).
38
+
39
+ ## Use cases on soniqo.audio
40
+
41
+ - [Speech generation](https://soniqo.audio/speech-generation/)
42
+ - [Voice cloning](https://soniqo.audio/voice-cloning/)
43
+ - [Long-form speech](https://soniqo.audio/long-form-speech/)
44
+
45
+ LiteRT export of [openbmb/VoxCPM2](https://huggingface.co/openbmb/VoxCPM2),
46
+ a 2 B-parameter diffusion-autoregressive TTS with 48 kHz
47
+ studio-quality output, reference-audio voice cloning, and
48
+ natural-language voice design. Consumed by the
49
+ [`speech-cloud`](https://github.com/soniqo/speech-cloud)
50
+ synthesis worker (`--mode=synthesize-worker`).
51
+
52
+ ## Why split graphs
53
+
54
+ VoxCPM2 is not a single feed-forward model. The runtime loop is:
55
+
56
+ ```
57
+ text + optional instruction ──► text-prefill
58
+                                      │
59
+                                      ▼
60
+                             repeated token-step
61
+                                      │
62
+                                      ▼
63
+                                audio-decoder ──► 48 kHz PCM
64
+ ```
65
+
66
+ The C++ worker owns the loop and the KV cache; LiteRT owns the
67
+ static tensor programs. This is the same split `speech-cloud` uses for
68
+ Parakeet and Nemotron: LiteRT for the math, C++ for the control
69
+ flow.
70
+
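The division of labour described above can be sketched in a few lines. This is a minimal Python illustration, not the C++ worker's code: `text_prefill`, `token_step`, `audio_decoder`, `KVCache`, `STOP_AFTER`, and `SAMPLES_PER_TOKEN` are all hypothetical stand-ins, and the stubs return dummy values where the real graphs would run LiteRT tensor programs.

```python
from dataclasses import dataclass, field

# Hypothetical constants for the stubs below; not taken from the model.
STOP_AFTER = 5          # stub emits its stop flag on the 5th step
SAMPLES_PER_TOKEN = 4   # stub decoder emits 4 samples per acoustic token


@dataclass
class KVCache:
    """Stand-in for the KV cache that the host loop owns between steps."""
    entries: list = field(default_factory=list)


def text_prefill(text, instruction=None):
    # Stub for the prefill graph: consumes text (+ optional instruction)
    # once and produces the initial cache.
    prompt = text if instruction is None else f"{instruction}\n{text}"
    return KVCache(entries=[prompt])


def token_step(cache, step):
    # Stub for the step graph: updates the cache, returns one token
    # and a stop flag.
    cache.entries.append(step)
    return float(step), (step + 1) >= STOP_AFTER


def audio_decoder(tokens):
    # Stub for the decoder graph: maps tokens to (silent) PCM samples.
    return [0.0] * (len(tokens) * SAMPLES_PER_TOKEN)


def synthesize(text, instruction=None, max_steps=30):
    """Host-side control flow: prefill once, step until stop, decode once."""
    cache = text_prefill(text, instruction)
    tokens = []
    for step in range(max_steps):
        token, stop = token_step(cache, step)
        tokens.append(token)
        if stop:  # stop token ends the loop before the step ceiling
            break
    return audio_decoder(tokens), len(tokens)
```

The point of the sketch is that only `synthesize` contains branching and a variable-length loop; each of the three stubs is a fixed-shape function call, which is what makes them exportable as static graphs.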
71
+ ## Files
72
+
73
+ | File | Size | Description |
74
+ |---|---:|---|
75
+ | `voxcpm2-text-prefill.tflite` | 7.7 GB | FP32 text + instruction prefill (MiniCPM-4 KV-cache producer) |
76
+ | `voxcpm2-token-step.tflite` | 2.0 GB | **INT8** weight-only autoregressive step (MiniCPM-4 + residual LM) |
77
+ | `voxcpm2-audio-encoder.tflite` | 184 MB | FP32 reference-audio encoder (16 kHz → conditioning) |
78
+ | `voxcpm2-audio-decoder.tflite` | 175 MB | FP32 AudioVAE decoder (acoustic tokens → 48 kHz PCM) |
79
+ | `tokenizer.json` / `tokenizer_config.json` / `special_tokens_map.json` | - | HF tokenizer bundle |
80
+ | `generation_config.json` / `tokenization_voxcpm2.py` | - | Generation defaults + tokenizer module |
81
+ | `config.json` | - | Tensor shapes, sample rates, files manifest |
82
+
83
+ ## Quantization
84
+
85
+ - **token-step**: INT8 weight-only (the only graph that runs in
86
+ the inner generation loop, so quantizing it has the largest payoff).
87
+ - **text-prefill / audio-encoder / audio-decoder**: stay FP32.
88
+ Quantizing prefill caused semantic drift in roundtrip tests; the
89
+ AudioVAE decoder risks audible artifacts under INT8.
90
+
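For readers unfamiliar with the term, "weight-only INT8" means the weights are stored as 8-bit integers with a floating-point scale, while activations stay in floating point. The sketch below shows one common variant (symmetric, per-row scales) in plain Python. It is an illustration of the general technique under that assumption, not the scheme used by the exporter; the function names are hypothetical.

```python
def quantize_row(row):
    """Symmetric INT8 quantization of one weight row: w is approx. scale * q."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate float weights from INT8 values and the row scale."""
    return [scale * v for v in q]


def matvec_weight_only(q_rows, scales, x):
    """Matrix-vector product with INT8 weights and float activations."""
    return [
        scale * sum(qv * xv for qv, xv in zip(q, x))
        for q, scale in zip(q_rows, scales)
    ]


weights = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
x = [1.0, 2.0, 3.0]
quantized = [quantize_row(row) for row in weights]
q_rows = [q for q, _ in quantized]
scales = [s for _, s in quantized]
y_int8 = matvec_weight_only(q_rows, scales, x)
y_fp32 = [sum(w * xv for w, xv in zip(row, x)) for row in weights]
```

Each stored weight is off by at most half a quantization step (`scale / 2`), so `y_int8` tracks `y_fp32` closely here. Whether that error is tolerable depends on the graph, which is consistent with the per-graph choices listed above.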
91
+ ## Smoke result
92
+
93
+ 30-step English roundtrip (`"hello world from soniqo dot audio"`,
94
+ instruction `"clear neutral delivery"`):
95
+
96
+ - Stop token fired naturally at step 18 (decoder halted before
97
+ the 30-step ceiling)
98
+ - 138 240 samples at 48 kHz mono = 2.88 s
99
+ - RMS 0.033, peak 0.44: no clipping, and a non-silent signal level
100
+ - Output written to `voxcpm2-litert-hello-world.wav`
101
+
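The figures in the list above can be checked with a few lines of arithmetic. The sketch below computes clip duration from sample count and sample rate, and shows how RMS and peak are typically measured on float PCM. The helper names are hypothetical, and the test tone is synthetic; it is not the model's output.

```python
import math


def duration_seconds(num_samples, sample_rate_hz):
    """Duration of a mono clip: samples divided by samples-per-second."""
    return num_samples / sample_rate_hz


def rms_and_peak(samples):
    """Root-mean-square level and absolute peak of float PCM samples."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    return rms, peak


def is_clipped(samples, full_scale=1.0):
    """True if any sample reaches or exceeds full scale."""
    return any(abs(s) >= full_scale for s in samples)


# Duration reported in the smoke result: 138 240 samples at 48 kHz.
smoke_duration = duration_seconds(138_240, 48_000)

# Synthetic 440 Hz tone, 1 s at 48 kHz, amplitude 0.44 (illustration only).
tone = [0.44 * math.sin(2 * math.pi * 440 * n / 48_000) for n in range(48_000)]
tone_rms, tone_peak = rms_and_peak(tone)
```

A peak below 1.0 on float PCM means no sample hit full scale, which is what "no clipping" refers to; an RMS well above zero rules out a silent or near-silent file.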
102
+ ## Modes
103
+
104
+ Mirrors the [speech-swift `VoxCPM2TTS`](https://github.com/soniqo/speech-swift)
105
+ mode matrix:
106
+
107
+ | Mode | Inputs |
108
+ |---|---|
109
+ | Zero-shot | text |
110
+ | Voice design | text + style instruction |
111
+ | Controllable cloning | text + reference audio |
112
+ | Ultimate cloning | text + reference audio + prompt audio + prompt text |
113
+
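The mode matrix above amounts to a dispatch on which optional inputs are present. A minimal sketch follows, assuming hypothetical mode labels and a hypothetical `select_mode` helper; the table does not say how a style instruction combines with reference audio, so this sketch simply lets reference audio take precedence.

```python
def select_mode(text, instruction=None, reference_audio=None,
                prompt_audio=None, prompt_text=None):
    """Pick a synthesis mode from the inputs supplied (labels are illustrative)."""
    if not text:
        raise ValueError("text is required in every mode")
    has_prompt_audio = prompt_audio is not None
    has_prompt_text = prompt_text is not None
    if has_prompt_audio != has_prompt_text:
        # The table lists prompt audio and prompt text only as a pair.
        raise ValueError("prompt audio and prompt text must be given together")
    if has_prompt_audio:
        if reference_audio is None:
            raise ValueError("prompt audio/text also require reference audio")
        return "ultimate-cloning"
    if reference_audio is not None:
        # Assumption: reference audio wins over a style instruction.
        return "controllable-cloning"
    if instruction is not None:
        return "voice-design"
    return "zero-shot"
```

For example, `select_mode("hello", instruction="clear neutral delivery")` resolves to the voice-design row, matching the inputs used in the smoke test.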
114
+ For Apple Silicon, prefer the MLX bundles
115
+ ([bf16](https://huggingface.co/aufklarer/VoxCPM2-MLX-bf16) /
116
+ [int8](https://huggingface.co/aufklarer/VoxCPM2-MLX-int8) /
117
+ [int4](https://huggingface.co/aufklarer/VoxCPM2-MLX-int4))
118
+ consumed by `speech-swift`.
119
+
120
+ ## Source
121
+
122
+ Exporter: `models/voxcpm2/export/convert_litert.py` in
123
+ [speech-models](https://github.com/soniqo/speech-models),
124
+ run in the pinned `Dockerfile.litert` environment.
125
+
126
+ ## Responsible use
127
+
128
+ Voice cloning is included. Users are responsible for obtaining
129
+ consent for any voice that is cloned and for not using the model
130
+ to impersonate individuals without permission, generate
131
+ disinformation, or commit fraud.
132
+
133
+ ## Ecosystem
134
+
135
+ - [**soniqo.audio**](https://soniqo.audio): use-case explorer (transcription, voice cloning, live ASR, voice agents).
136
+ - [**speech-cloud**](https://github.com/soniqo/speech-cloud): C++ cloud API server. Runs LiteRT models behind `/v1/transcribe`, `/v1/realtime`, and (planned) `/v1/audio/speech`.
137
+ - [**speech-core**](https://github.com/soniqo/speech-core): C++ orchestration library for voice agents. Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces.
138
+ - [**speech-models**](https://github.com/soniqo/speech-models): the exporters that produced this bundle.
139
+ - [**speech-swift**](https://github.com/soniqo/speech-swift): Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
140
+
141
+ ## Other LiteRT models in this collection
142
+
143
+ **ASR / Transcription**
144
+
145
+ - [Parakeet TDT 0.6B v3 - LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8)
146
+ - [Nemotron Speech Streaming 0.6B - LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT)
147
+ - [Omnilingual ASR CTC 300M - LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT)
148
+ - [Omnilingual ASR CTC 300M - LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8)
149
+ - [Qwen3 ASR 0.6B Encoder - LiteRT (INT8)](https://huggingface.co/soniqo/Qwen3-ASR-0.6B-Encoder-LiteRT-INT8)
150
+
151
+ **VAD / Diarization**
152
+
153
+ - [Silero VAD v5 - LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT)
154
+ - [Pyannote Segmentation 3.0 - LiteRT](https://huggingface.co/soniqo/Pyannote-Segmentation-LiteRT)
155
+ - [WeSpeaker ResNet34-LM - LiteRT](https://huggingface.co/soniqo/WeSpeaker-ResNet34-LM-LiteRT)
156
+
157
+ ## License
158
+
159
+ This bundle inherits the upstream model license (**apache-2.0**). See the
160
+ linked `base_model` repository for the full terms.