# Speech Android Models

Mobile ONNX models for the speech-android SDK.
Kokoro-82M converted to a single end-to-end ONNX model for on-device TTS inference on Android.
Fuses BERT → duration prediction → alignment → prosody → iSTFTNet decoder into one model with fixed-shape I/O.
## Model details

| Property | Value |
|---|---|
| Parameters | 82M |
| Architecture | Non-autoregressive, single-pass |
| Format | ONNX (opset 18) |
| File size | 310 MB (FP32) |
| Sample rate | 24 kHz |
| Max phonemes | 128 |
| Max audio | 5s (120,000 samples) |
| Voices | 50+ (256-dim embeddings) |
| Languages | 8 (en, fr, es, ja, zh, ko, hi, pt) |
## Files

| File | Size | Description |
|---|---|---|
| `kokoro-e2e.onnx` | 1.0 MB | Model graph |
| `kokoro-e2e.onnx.data` | 310 MB | Model weights (external data) |
| `vocab_index.json` | 2 KB | IPA phoneme → token ID mapping |
| `us_gold.json` | 1.5 MB | Primary pronunciation dictionary |
| `us_silver.json` | 3.2 MB | Fallback pronunciation dictionary |
| `voices/*.bin` | 1 KB each | Voice embeddings (256 × float32) |
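The phoneme mapping and voice files above feed directly into the model's fixed-shape inputs. A minimal sketch of the padding step: the helper `phonemes_to_ids` and the stand-in `toy_vocab` values are illustrative (in practice, load the real mapping with `json.load(open("vocab_index.json"))` and a voice with `np.fromfile("voices/af_heart.bin", dtype=np.float32).reshape(1, 256)`):

```python
import numpy as np

def phonemes_to_ids(phonemes, vocab, max_len=128):
    """Map IPA phonemes to zero-padded token IDs plus an attention mask."""
    ids = np.zeros((1, max_len), dtype=np.int64)
    mask = np.zeros((1, max_len), dtype=np.int64)
    tokens = [vocab[p] for p in phonemes if p in vocab]
    ids[0, :len(tokens)] = tokens
    mask[0, :len(tokens)] = 1
    return ids, mask

# Stand-in vocabulary for illustration only; real IDs come from vocab_index.json.
toy_vocab = {"h": 20, "ə": 25, "l": 24}
ids, mask = phonemes_to_ids(["h", "ə", "l"], toy_vocab)
```

Unknown phonemes are silently dropped here; a production tokenizer would instead fall back to the pronunciation dictionaries.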
## Inputs

| Name | Shape | Type | Description |
|---|---|---|---|
| `input_ids` | [1, 128] | int64 | Phoneme token IDs (zero-padded) |
| `attention_mask` | [1, 128] | int64 | 1 for real tokens, 0 for padding |
| `ref_s` | [1, 256] | float32 | Voice style embedding |
| `speed` | [1] | float32 | Speed factor (1.0 = normal) |
| `random_phases` | [1, 9] | float32 | Initial harmonic phases (uniform [0, 1)) |
## Outputs

| Name | Shape | Type | Description |
|---|---|---|---|
| `audio` | [1, 1, 120000] | float32 | Raw PCM waveform (24 kHz) |
| `audio_length_samples` | [1] | int64 | Valid sample count (trim audio to this) |
| `pred_dur` | [1, 128] | float32 | Predicted phoneme durations |
## Usage

```python
import numpy as np
import onnxruntime as ort

# kokoro-e2e.onnx.data must sit next to the graph file.
sess = ort.InferenceSession("kokoro-e2e.onnx")

# Prepare inputs (phoneme IDs from vocab_index.json)
input_ids = np.zeros((1, 128), dtype=np.int64)
input_ids[0, :5] = [0, 60, 46, 79, 0]  # example phonemes
attention_mask = np.zeros((1, 128), dtype=np.int64)
attention_mask[0, :5] = 1

# Load voice embedding (256 floats from .bin file)
voice = np.fromfile("voices/af_heart.bin", dtype=np.float32).reshape(1, 256)

output = sess.run(None, {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "ref_s": voice,
    "speed": np.ones(1, dtype=np.float32),
    "random_phases": np.random.rand(1, 9).astype(np.float32),
})
audio = output[0].flatten()[:int(output[1][0])]  # trim to valid length
```
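To listen to the result, the trimmed float32 waveform can be written as a 16-bit WAV with the standard library. A minimal sketch: `write_wav`, the output filename, and the synthetic tone standing in for model output are all illustrative:

```python
import wave
import numpy as np

def write_wav(path, audio, sample_rate=24000):
    """Write mono float32 PCM in [-1, 1] as a 16-bit WAV file."""
    pcm16 = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)           # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm16.tobytes())

# Demo with a 1-second 440 Hz tone in place of the model's `audio` output.
t = np.arange(24000) / 24000.0
tone = (0.3 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
write_wav("out.wav", tone)
```

The int16 conversion clips first so that out-of-range samples saturate instead of wrapping around.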
Converted from hexgrad/Kokoro-82M using the E2E export pipeline, with a FluidInference-aligned SineGen (segmented cumsum + phase wrapping).