Qwen3-TTS 12Hz 0.6B Base → ONNX (Voice Cloning)
ONNX export of Qwen/Qwen3-TTS-12Hz-0.6B-Base for local inference with C# / ONNX Runtime. Includes an ECAPA-TDNN speaker encoder for voice cloning from a 3-second reference clip.
Files
| File | Description | Size |
|---|---|---|
| speaker_encoder.onnx + .data | ECAPA-TDNN speaker encoder | ~34 MB |
| talker_prefill.onnx + .data | Talker LM prefill (28 layers) | ~1.7 GB |
| talker_decode.onnx + .data | Talker LM single-step decode | ~1.7 GB |
| code_predictor.onnx | Code Predictor (5 layers, 15 groups) | ~440 MB |
| vocoder.onnx | Vocoder decoder (24 kHz output) | ~2.7 MB |
| embeddings/ | Text/codec embeddings as .npy + config | ~1.4 GB |
| tokenizer/ | BPE tokenizer (vocab.json, merges.txt) | ~4 MB |
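If the files are fetched manually, a quick presence check can catch an incomplete download before inference starts. The sketch below is one possible way to do that and is not part of the published package: `ModelFiles`, `FindMissing`, and the `./models` directory are hypothetical names, the list mirrors the Files table, and it assumes vocab.json and merges.txt sit inside tokenizer/. The `.data` companions are left out because the table does not spell out their exact file names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ModelFiles
{
    // Entries taken from the Files table; a trailing '/' marks a directory.
    // Assumption: the tokenizer files live inside tokenizer/.
    public static readonly string[] Expected =
    {
        "speaker_encoder.onnx",
        "talker_prefill.onnx",
        "talker_decode.onnx",
        "code_predictor.onnx",
        "vocoder.onnx",
        "embeddings/",
        "tokenizer/vocab.json",
        "tokenizer/merges.txt",
    };

    // Returns the expected entries that are absent under modelDir.
    public static List<string> FindMissing(string modelDir)
    {
        return Expected.Where(rel =>
        {
            string full = Path.Combine(modelDir, rel.TrimEnd('/'));
            return rel.EndsWith('/') ? !Directory.Exists(full) : !File.Exists(full);
        }).ToList();
    }
}
```

Calling `ModelFiles.FindMissing("./models")` returns an empty list when every listed entry is present, and the names of the missing entries otherwise.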
Usage with C#
dotnet add package ElBruno.QwenTTS.VoiceCloning
using ElBruno.QwenTTS.VoiceCloning.Pipeline;
var cloner = await VoiceClonePipeline.CreateAsync();
await cloner.SynthesizeAsync("Hello world!", "reference.wav", "output.wav", "english");
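For lower-level work outside the pipeline package, the exported graphs can also be opened directly with ONNX Runtime to see which tensors each one expects. The following is a minimal sketch, assuming the `Microsoft.ML.OnnxRuntime` NuGet package is installed and the model file is available locally; `OnnxInspector` and its methods are hypothetical helper names, and nothing here reflects how `VoiceClonePipeline` is implemented internally.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.OnnxRuntime;

public static class OnnxInspector
{
    // Renders a shape such as [-1, 128] as "[?, 128]".
    // ONNX Runtime reports dynamic (symbolic) dimensions as -1.
    public static string FormatShape(int[] dims) =>
        "[" + string.Join(", ", Array.ConvertAll(dims, d => d < 0 ? "?" : d.ToString())) + "]";

    // Prints the name, element type, and shape of every graph input and output.
    public static void PrintSignature(string modelPath)
    {
        using var session = new InferenceSession(modelPath);
        Print("input", session.InputMetadata);
        Print("output", session.OutputMetadata);
    }

    private static void Print(string kind, IReadOnlyDictionary<string, NodeMetadata> nodes)
    {
        foreach (var node in nodes)
        {
            string type = node.Value.ElementType?.Name ?? "unknown";
            Console.WriteLine($"{kind} {node.Key}: {type} {FormatShape(node.Value.Dimensions)}");
        }
    }
}
```

For example, `OnnxInspector.PrintSignature("speaker_encoder.onnx")` lists the encoder's tensors. For graphs that ship with a `.data` companion, that file needs to stay next to the `.onnx` file so ONNX Runtime can resolve the external weights.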
Architecture
- Speaker Encoder: ECAPA-TDNN, 128 mel bins input, 1024-dim output
- Talker: 28 transformer layers, 16 attention heads, 8 KV heads, hidden=1024
- Code Predictor: 5 layers, generates codebook groups 1-15
- Vocoder: RVQ dequantize → transformer → BigVGAN decoder, 12 Hz → 24 kHz
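The figures in this list imply a few quantities that are handy when sizing buffers. The sketch below works them out under stated assumptions: `TalkerMath` is a hypothetical helper, group 0 is assumed to come from the Talker (the list only says the Code Predictor covers groups 1-15), and the codec frame rate is passed in by the caller because the "12Hz" in the model name may be a nominal value rather than the exact rate.

```csharp
using System;

public static class TalkerMath
{
    // Values copied from the Architecture list.
    public const int AttentionHeads = 16;
    public const int KvHeads = 8;
    public const int PredictedGroups = 15; // groups 1-15 from the Code Predictor

    // Grouped-query attention: how many query heads share one KV head.
    public static int QueryHeadsPerKvHead() => AttentionHeads / KvHeads;

    // Assumption: group 0 is produced by the Talker, so each frame carries
    // one Talker code plus the 15 predicted groups.
    public static int CodesPerFrame() => 1 + PredictedGroups;

    // Codec frames needed to cover a clip of the given length.
    // frameRateHz is supplied by the caller; check the model config for the exact rate.
    public static int FramesFor(double seconds, double frameRateHz) =>
        (int)Math.Ceiling(seconds * frameRateHz);
}
```

If the nominal 12 Hz is taken at face value, `TalkerMath.FramesFor(3.0, 12.0)` gives 36 frames, which at 16 codes per frame is 576 codes for a 3-second clip.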
License
Apache-2.0 (same as base model)