# Speech Android Models

Mobile ONNX models for the speech-android SDK.
Kokoro-82M converted to a single end-to-end ONNX model for on-device TTS inference on Android.
Fuses BERT → duration prediction → alignment → prosody → iSTFTNet decoder into one model with fixed-shape I/O.
## Model details

| Property | Value |
|---|---|
| Parameters | 82M |
| Architecture | Non-autoregressive, single-pass |
| Format | ONNX (opset 18) |
| File size | 310 MB (FP32) |
| Sample rate | 24 kHz |
| Max phonemes | 128 |
| Max audio | 5s (120,000 samples) |
| Voices | 50+ (256-dim embeddings) |
| Languages | 8 (en, fr, es, ja, zh, ko, hi, pt) |
## Files

| File | Size | Description |
|---|---|---|
| `kokoro-e2e.onnx` | 1.0 MB | Model graph |
| `kokoro-e2e.onnx.data` | 310 MB | Model weights (external data) |
| `vocab_index.json` | 2 KB | IPA phoneme → token ID mapping |
| `us_gold.json` | 1.5 MB | Primary pronunciation dictionary |
| `us_silver.json` | 3.2 MB | Fallback pronunciation dictionary |
| `voices/*.bin` | 1 KB each | Voice embeddings (256 × float32) |
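The phoneme mapping and voice files above feed directly into the model's fixed-shape inputs. A minimal sketch of the padding step: the helper `phonemes_to_ids` and the stand-in `toy_vocab` values are illustrative (in practice, load the real mapping with `json.load(open("vocab_index.json"))` and a voice with `np.fromfile("voices/af_heart.bin", dtype=np.float32).reshape(1, 256)`):

```python
import numpy as np

def phonemes_to_ids(phonemes, vocab, max_len=128):
    """Map IPA phonemes to zero-padded token IDs plus an attention mask."""
    ids = np.zeros((1, max_len), dtype=np.int64)
    mask = np.zeros((1, max_len), dtype=np.int64)
    tokens = [vocab[p] for p in phonemes if p in vocab]
    ids[0, :len(tokens)] = tokens
    mask[0, :len(tokens)] = 1
    return ids, mask

# Stand-in vocabulary for illustration only; real IDs come from vocab_index.json.
toy_vocab = {"h": 20, "ə": 25, "l": 24}
ids, mask = phonemes_to_ids(["h", "ə", "l"], toy_vocab)
```

Unknown phonemes are silently dropped here; a production tokenizer would instead fall back to the pronunciation dictionaries.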
## Inputs

| Name | Shape | Type | Description |
|---|---|---|---|
| `input_ids` | [1, 128] | int64 | Phoneme token IDs (zero-padded) |
| `attention_mask` | [1, 128] | int64 | 1 for real tokens, 0 for padding |
| `ref_s` | [1, 256] | float32 | Voice style embedding |
| `speed` | [1] | float32 | Speed factor (1.0 = normal) |
| `random_phases` | [1, 9] | float32 | Initial harmonic phases (uniform [0, 1)) |
## Outputs

| Name | Shape | Type | Description |
|---|---|---|---|
| `audio` | [1, 1, 120000] | float32 | Raw PCM waveform (24 kHz) |
| `audio_length_samples` | [1] | int64 | Valid sample count (trim audio to this) |
| `pred_dur` | [1, 128] | float32 | Predicted phoneme durations |
## Usage

```python
import numpy as np
import onnxruntime as ort

# kokoro-e2e.onnx.data must sit next to the graph file.
sess = ort.InferenceSession("kokoro-e2e.onnx")

# Prepare inputs (phoneme IDs from vocab_index.json)
input_ids = np.zeros((1, 128), dtype=np.int64)
input_ids[0, :5] = [0, 60, 46, 79, 0]  # example phonemes
attention_mask = np.zeros((1, 128), dtype=np.int64)
attention_mask[0, :5] = 1

# Load voice embedding (256 floats from .bin file)
voice = np.fromfile("voices/af_heart.bin", dtype=np.float32).reshape(1, 256)

output = sess.run(None, {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "ref_s": voice,
    "speed": np.ones(1, dtype=np.float32),
    "random_phases": np.random.rand(1, 9).astype(np.float32),
})
audio = output[0].flatten()[:int(output[1][0])]  # trim to valid length
```
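To listen to the result, the trimmed float32 waveform can be written as a 16-bit WAV with the standard library. A minimal sketch: `write_wav`, the output filename, and the synthetic tone standing in for model output are all illustrative:

```python
import wave
import numpy as np

def write_wav(path, audio, sample_rate=24000):
    """Write mono float32 PCM in [-1, 1] as a 16-bit WAV file."""
    pcm16 = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)           # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm16.tobytes())

# Demo with a 1-second 440 Hz tone in place of the model's `audio` output.
t = np.arange(24000) / 24000.0
tone = (0.3 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
write_wav("out.wav", tone)
```

The int16 conversion clips first so that out-of-range samples saturate instead of wrapping around.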
Converted from hexgrad/Kokoro-82M using the E2E export pipeline, with a FluidInference-aligned SineGen (segmented cumsum + phase wrapping).