Kokoro-82M ONNX

Kokoro-82M converted to a single end-to-end ONNX model for on-device TTS inference on Android.

Fuses BERT → duration prediction → alignment → prosody → iSTFTNet decoder into one model with fixed-shape I/O.

Model

| Property | Value |
|---|---|
| Parameters | 82M |
| Architecture | Non-autoregressive, single-pass |
| Format | ONNX (opset 18) |
| File size | 310 MB (FP32) |
| Sample rate | 24 kHz |
| Max phonemes | 128 |
| Max audio | 5 s (120,000 samples) |
| Voices | 50+ (256-dim embeddings) |
| Languages | 8 (en, fr, es, ja, zh, ko, hi, pt) |

Files

| File | Size | Description |
|---|---|---|
| kokoro-e2e.onnx | 1.0 MB | Model graph |
| kokoro-e2e.onnx.data | 310 MB | Model weights (external data) |
| vocab_index.json | 2 KB | IPA phoneme → token ID mapping |
| us_gold.json | 1.5 MB | Primary pronunciation dictionary |
| us_silver.json | 3.2 MB | Fallback pronunciation dictionary |
| voices/*.bin | 1 KB each | Voice embeddings (256 × float32) |
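
The two dictionaries form a simple lookup chain: try the gold file first, fall back to silver. The sketch below illustrates that chain with made-up stand-in data; the actual JSON schemas, IPA strings, and token IDs are assumptions (in practice, load the shipped files with `json.load`):

```python
import json  # real use: vocab = json.load(open("vocab_index.json")), etc.

# Illustrative stand-ins for the shipped files; contents and IDs are invented.
# Assumed schemas: word -> IPA string (dictionaries), IPA symbol -> token ID (vocab).
vocab = {"h": 60, "ə": 46, "l": 79, "o": 50, "ʊ": 51}
gold = {"hello": "həloʊ"}    # primary dictionary
silver = {"lo": "loʊ"}       # fallback dictionary

def word_to_ids(word):
    """Resolve a word to token IDs: gold dictionary first, then silver."""
    ipa = gold.get(word) or silver.get(word)
    if ipa is None:
        raise KeyError(f"no pronunciation for {word!r}")
    # Assumes one token per IPA character; the real vocab may use
    # multi-character symbols, which would need a proper tokenizer.
    return [vocab[sym] for sym in ipa]
```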

Tensors

Input

| Name | Shape | Type | Description |
|---|---|---|---|
| input_ids | [1, 128] | int64 | Phoneme token IDs (zero-padded) |
| attention_mask | [1, 128] | int64 | 1 for real tokens, 0 for padding |
| ref_s | [1, 256] | float32 | Voice style embedding |
| speed | [1] | float32 | Speed factor (1.0 = normal) |
| random_phases | [1, 9] | float32 | Initial harmonic phases, uniform in [0, 1) |
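
Because every shape and dtype above is fixed, a small helper that assembles the full feed dict avoids shape mismatches at run time. This is a sketch; the function name and defaults are mine, not part of the model's API:

```python
import numpy as np

MAX_PHONEMES = 128

def prepare_inputs(token_ids, voice_embedding, speed=1.0, rng=None):
    """Build the five input tensors with the fixed shapes/dtypes above.

    token_ids: sequence of at most 128 phoneme token IDs
    voice_embedding: 256 float32 values (e.g. from a voices/*.bin file)
    """
    if rng is None:
        rng = np.random.default_rng()
    n = len(token_ids)
    assert n <= MAX_PHONEMES, "model accepts at most 128 phonemes"
    input_ids = np.zeros((1, MAX_PHONEMES), dtype=np.int64)
    input_ids[0, :n] = token_ids
    attention_mask = np.zeros((1, MAX_PHONEMES), dtype=np.int64)
    attention_mask[0, :n] = 1  # mark real tokens; the rest stays padding
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "ref_s": np.asarray(voice_embedding, dtype=np.float32).reshape(1, 256),
        "speed": np.full((1,), speed, dtype=np.float32),
        "random_phases": rng.random((1, 9)).astype(np.float32),
    }
```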

Output

| Name | Shape | Type | Description |
|---|---|---|---|
| audio | [1, 1, 120000] | float32 | Raw PCM waveform (24 kHz) |
| audio_length_samples | [1] | int64 | Valid sample count (trim audio to this) |
| pred_dur | [1, 128] | float32 | Predicted phoneme durations |
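
Since the audio buffer is always 120,000 samples, everything past audio_length_samples is padding. A small illustrative helper trims it and reports the clip length in seconds:

```python
import numpy as np

SAMPLE_RATE = 24_000  # fixed output rate of this model

def trim_audio(audio, audio_length_samples):
    """Drop the padded tail of the fixed-size output buffer.

    audio: the [1, 1, 120000] float32 tensor
    audio_length_samples: the [1] int64 tensor
    Returns (1-D PCM array, duration in seconds).
    """
    n = int(audio_length_samples[0])
    pcm = audio.reshape(-1)[:n]
    return pcm, n / SAMPLE_RATE
```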

Usage

```python
import numpy as np
import onnxruntime as ort

# kokoro-e2e.onnx.data (external weights) must sit next to the graph file
sess = ort.InferenceSession("kokoro-e2e.onnx")

# Prepare inputs (phoneme IDs from vocab_index.json)
input_ids = np.zeros((1, 128), dtype=np.int64)
input_ids[0, :5] = [0, 60, 46, 79, 0]  # example phonemes
attention_mask = np.zeros((1, 128), dtype=np.int64)
attention_mask[0, :5] = 1

# Load voice embedding (256 floats from a .bin file)
voice = np.fromfile("voices/af_heart.bin", dtype=np.float32).reshape(1, 256)

audio, audio_length_samples, pred_dur = sess.run(None, {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "ref_s": voice,
    "speed": np.ones(1, dtype=np.float32),
    "random_phases": np.random.rand(1, 9).astype(np.float32),
})

# Trim the fixed-size buffer to the valid sample count
audio = audio.flatten()[:int(audio_length_samples[0])]
```
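
To play or inspect the result, the float PCM can be written as a 16-bit mono WAV with the standard library. This sketch assumes the waveform sits roughly in [-1, 1] and clips anything outside that range:

```python
import wave

import numpy as np

def write_wav(path, pcm, sample_rate=24_000):
    """Save float32 PCM (assumed roughly in [-1, 1]) as a 16-bit mono WAV."""
    samples = np.clip(pcm, -1.0, 1.0)
    ints = (samples * 32767).astype(np.int16)  # scale to int16 range
    with wave.open(path, "wb") as w:
        w.setnchannels(1)      # mono
        w.setsampwidth(2)      # 16-bit
        w.setframerate(sample_rate)
        w.writeframes(ints.tobytes())
```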

Source

Converted from hexgrad/Kokoro-82M using the E2E pipeline with FluidInference-aligned SineGen (segmented cumsum + phase wrapping).
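
Rough intuition for the segmented-cumsum trick, as an illustrative numpy sketch (not the converter's actual code): a sine generator accumulates per-sample phase with a cumulative sum, and over long buffers that running sum grows until float precision degrades. Wrapping the phase modulo 1.0 after each segment keeps the accumulator bounded without changing sin(2π·phase):

```python
import numpy as np

def wrapped_phase_cumsum(freq, segment=1024):
    """Cumulative phase of per-sample normalized frequency (cycles/sample),
    accumulated segment by segment with modulo-1 wrapping so the running
    sum stays in [0, 1). Equivalent mod 1 to np.cumsum(freq) % 1.0."""
    phase = np.empty_like(freq)
    carry = 0.0
    for start in range(0, len(freq), segment):
        acc = carry + np.cumsum(freq[start:start + segment])
        phase[start:start + segment] = acc % 1.0
        carry = acc[-1] % 1.0  # keep only the fractional part across segments
    return phase

# e.g. one harmonic at 24 kHz: np.sin(2 * np.pi * wrapped_phase_cumsum(f0_hz / 24_000))
```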
