Qwen3-TTS 12Hz 0.6B CustomVoice - ONNX
ONNX export of Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice for local inference with C# / ONNX Runtime.
Files
| File | Description | Size |
|---|---|---|
| `talker_prefill.onnx` + `.data` | Talker LM prefill (28 layers) | ~1.7 GB |
| `talker_decode.onnx` + `.data` | Talker LM single-step decode | ~1.7 GB |
| `code_predictor.onnx` | Code Predictor (5 layers, 15 groups) | ~420 MB |
| `vocoder.onnx` + `.data` | Vocoder decoder (24 kHz output) | ~437 MB |
| `embeddings/` | Text/codec embeddings as `.npy` + config | ~1.4 GB |
| `tokenizer/` | BPE tokenizer (`vocab.json`, `merges.txt`) | ~4 MB |
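After downloading, it can help to confirm that the entries listed in the table are actually present before starting inference. The following is a minimal sketch, not part of the repo: `missing_model_files` is a hypothetical helper, and it assumes the files sit directly under the model directory with the tokenizer files inside `tokenizer/`. The external `.data` weight files are not checked because their exact names are not spelled out here.

```python
from pathlib import Path

# File names taken from the table above. The external ".data" weight files are
# left out because their exact names are not given in this README.
EXPECTED_FILES = [
    "talker_prefill.onnx",
    "talker_decode.onnx",
    "code_predictor.onnx",
    "vocoder.onnx",
    "tokenizer/vocab.json",
    "tokenizer/merges.txt",
]
EXPECTED_DIRS = ["embeddings", "tokenizer"]


def missing_model_files(model_dir):
    """Return expected entries that are absent from model_dir.

    Hypothetical helper; ASSUMES a flat layout under model_dir.
    """
    root = Path(model_dir)
    # Files that should exist as regular files.
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # Directories are reported with a trailing slash to tell them apart.
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


if __name__ == "__main__":
    # "python/onnx_runtime" is the --model-dir used in the run command.
    for entry in missing_model_files("python/onnx_runtime"):
        print("missing:", entry)
```

An empty result means every listed entry was found; it says nothing about whether the files are complete or uncorrupted.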
Usage with C#
# Clone the app repo
git clone https://github.com/elbruno/qwen-labs-cs.git
cd qwen-labs-cs
# Download models
python python/download_onnx_models.py --repo-id elbruno/Qwen3-TTS-12Hz-0.6B-CustomVoice-ONNX
# Run
dotnet run --project src/QwenTTS -- --model-dir python/onnx_runtime --text "Hello world" --speaker ryan --language english
Architecture
- Talker: 28 transformer layers, 16 attn heads, 8 KV heads, hidden=1024
- Code Predictor: 5 layers, generates codebook groups 1-15
- Vocoder: RVQ dequantize → transformer → BigVGAN decoder, 12Hz → 24kHz (1920× upsample)
- KV Cache: Decode uses stacked format (num_layers, B, num_kv_heads, T, head_dim)
- Speakers: serena, vivian, uncle_fu, ryan, aiden, ono_anna, sohee, eric, dylan
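The numbers in the list above fix most of the shape arithmetic a caller needs. The sketch below shows how codec frames map to audio samples and what the stacked KV cache shape looks like for a given sequence length. The helper names are hypothetical, and `HEAD_DIM = 128` and float32 storage are assumptions that this README does not state; check the actual ONNX input shapes before relying on them.

```python
import math

SAMPLE_RATE_HZ = 24_000    # vocoder output rate (from the list above)
SAMPLES_PER_FRAME = 1920   # upsample factor (from the list above)
NUM_LAYERS = 28            # Talker transformer layers
NUM_KV_HEADS = 8           # Talker KV heads
HEAD_DIM = 128             # ASSUMPTION: not stated in this README


def frames_to_samples(num_frames):
    """Audio samples produced by the vocoder for num_frames codec frames."""
    return num_frames * SAMPLES_PER_FRAME


def frames_to_seconds(num_frames):
    """Audio duration in seconds for num_frames codec frames."""
    return frames_to_samples(num_frames) / SAMPLE_RATE_HZ


def kv_cache_shape(batch, seq_len, head_dim=HEAD_DIM):
    """Stacked cache shape: (num_layers, B, num_kv_heads, T, head_dim)."""
    return (NUM_LAYERS, batch, NUM_KV_HEADS, seq_len, head_dim)


def tensor_bytes(shape, bytes_per_element=4):
    """Size of one tensor of this shape; float32 (4 bytes) is an ASSUMPTION."""
    return math.prod(shape) * bytes_per_element


if __name__ == "__main__":
    print(frames_to_samples(25), frames_to_seconds(25))  # 48000 2.0
    shape = kv_cache_shape(batch=1, seq_len=100)
    print(shape, tensor_bytes(shape))  # (28, 1, 8, 100, 128) 11468800
```

Note that 24000 / 1920 = 12.5 frames per second, so the "12Hz" in the model name appears to be a rounded figure. Each decode step extends the cache by one position along the `T` axis.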
License
Apache-2.0 (same as base model)