Qwen3-TTS 12Hz 0.6B Base - ONNX (Voice Cloning)

ONNX export of Qwen/Qwen3-TTS-12Hz-0.6B-Base for local inference with C# and ONNX Runtime. It includes an ECAPA-TDNN speaker encoder for voice cloning from roughly 3 seconds of reference audio.

Files

File                          Description                               Size
speaker_encoder.onnx + .data  ECAPA-TDNN speaker encoder                ~34 MB
talker_prefill.onnx + .data   Talker LM prefill (28 layers)             ~1.7 GB
talker_decode.onnx + .data    Talker LM single-step decode              ~1.7 GB
code_predictor.onnx           Code Predictor (5 layers, 15 groups)      ~440 MB
vocoder.onnx                  Vocoder decoder (24 kHz output)           ~2.7 MB
embeddings/                   Text/codec embeddings as .npy + config    ~1.4 GB
tokenizer/                    BPE tokenizer (vocab.json, merges.txt)    ~4 MB
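
To plan disk space before downloading, the approximate sizes in the table can be summed and a local copy checked for completeness. The following is a minimal sketch, not part of the package: the folder name `models` and the flat layout (every entry directly under that folder) are assumptions, and the sizes are the rounded figures from the table, with 1 GB counted as 1000 MB.

```csharp
using System;
using System.IO;
using System.Linq;

// Approximate sizes copied from the table above, in MB (1 GB counted as 1000 MB).
// For entries listed as ".onnx + .data", the size covers both files together.
var files = new (string Name, double ApproxMb)[]
{
    ("speaker_encoder.onnx", 34),
    ("talker_prefill.onnx", 1700),
    ("talker_decode.onnx", 1700),
    ("code_predictor.onnx", 440),
    ("vocoder.onnx", 2.7),
    ("embeddings", 1400),
    ("tokenizer", 4),
};

double totalMb = files.Sum(f => f.ApproxMb);
Console.WriteLine($"Approximate download size: {totalMb / 1000:F1} GB");

// Hypothetical local folder; pass the real location as the first argument.
string modelDir = args.Length > 0 ? args[0] : "models";

// An entry counts as present if it exists as either a file or a directory.
// Only the .onnx file is checked; the companion .data files are not.
string[] missing = files
    .Select(f => f.Name)
    .Where(name =>
    {
        string path = Path.Combine(modelDir, name);
        return !File.Exists(path) && !Directory.Exists(path);
    })
    .ToArray();

Console.WriteLine(missing.Length == 0
    ? "All listed entries were found."
    : "Missing: " + string.Join(", ", missing));
```

With the rounded figures above, the total comes to a little over 5 GB, most of it in the two Talker graphs and the embeddings.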

Usage with C#

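Install the NuGet package:

dotnet add package ElBruno.QwenTTS.VoiceCloning

Then synthesize speech in the voice of a reference recording:

using ElBruno.QwenTTS.VoiceCloning.Pipeline;

// Create the pipeline, then speak the text in the voice from reference.wav and write the result to output.wav.
var cloner = await VoiceClonePipeline.CreateAsync();
await cloner.SynthesizeAsync("Hello world!", "reference.wav", "output.wav", "english");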
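
Because cloning depends on the reference recording, it can help to check the clip's length before running the pipeline. The sketch below is an illustration and not part of the package: it reads the duration of an uncompressed WAV file from its RIFF header using only the .NET base class library. The helper names and the 3-second threshold are assumptions taken from the "3-second" figure in the description, not a documented limit.

```csharp
using System;
using System.IO;
using System.Text;

// Reads a 4-byte ASCII chunk tag such as "RIFF" or "data".
string ReadTag(BinaryReader reader) => Encoding.ASCII.GetString(reader.ReadBytes(4));

// Duration in seconds of an uncompressed WAV stream: data bytes / bytes per second.
// Streamed files that leave the data size unset are not handled by this sketch.
double GetWavDurationSeconds(Stream stream)
{
    using var reader = new BinaryReader(stream, Encoding.ASCII, leaveOpen: true);
    if (ReadTag(reader) != "RIFF") throw new InvalidDataException("Missing RIFF header.");
    reader.ReadUInt32(); // total RIFF size, not needed here
    if (ReadTag(reader) != "WAVE") throw new InvalidDataException("Not a WAVE file.");

    uint byteRate = 0;
    while (stream.Position + 8 <= stream.Length)
    {
        string tag = ReadTag(reader);
        uint size = reader.ReadUInt32();
        long next = stream.Position + size + (size & 1); // chunks are padded to even length

        if (tag == "fmt ")
        {
            reader.ReadUInt16();            // audio format
            reader.ReadUInt16();            // channel count
            reader.ReadUInt32();            // sample rate
            byteRate = reader.ReadUInt32(); // bytes per second across all channels
        }
        else if (tag == "data")
        {
            if (byteRate == 0) throw new InvalidDataException("No fmt chunk before data.");
            return (double)size / byteRate;
        }
        stream.Position = next; // skip to the next chunk
    }
    throw new InvalidDataException("No data chunk found.");
}

// Hypothetical threshold based on the "3-second" figure; not a documented limit.
const double MinReferenceSeconds = 3.0;
string referencePath = args.Length > 0 ? args[0] : "reference.wav";

if (!File.Exists(referencePath))
{
    Console.WriteLine($"Reference clip not found: {referencePath}");
}
else
{
    using var file = File.OpenRead(referencePath);
    double seconds = GetWavDurationSeconds(file);
    Console.WriteLine(seconds >= MinReferenceSeconds
        ? $"Reference clip is {seconds:F2} s long."
        : $"Reference clip is only {seconds:F2} s long; a longer clip may clone better.");
}
```

The check only reads the header and chunk sizes, so it is cheap to run before the much heavier model load.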

Architecture

  • Speaker Encoder: ECAPA-TDNN, 128 mel bins input, 1024-dim output
  • Talker: 28 transformer layers, 16 attn heads, 8 KV heads, hidden=1024
  • Code Predictor: 5 layers, generates codebook groups 1-15
  • Vocoder: RVQ dequantize → transformer → BigVGAN decoder, 12 Hz → 24 kHz
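
The figures in this list imply a few useful ratios, worked through in the sketch below. It rests on three assumptions: the frame rate is exactly the nominal 12 Hz from the model name (the real codec rate may differ slightly), codebook group 0 is produced by the Talker so that each frame carries 16 groups in total, and query heads share KV heads in contiguous groups, as is usual for grouped-query attention. The helper `KvHeadFor` is illustrative only.

```csharp
using System;

// Figures quoted in the Architecture list.
const int AttentionHeads = 16;
const int KvHeads = 8;
const int PredictorGroups = 15;        // groups 1-15 from the Code Predictor
const int NominalFrameRateHz = 12;     // taken from the "12Hz" label; may not be exact
const int OutputSampleRateHz = 24000;

// Assumption: group 0 comes from the Talker, giving 16 groups per frame.
const int GroupsPerFrame = 1 + PredictorGroups;

// Grouped-query attention: several query heads share one KV head.
int queryHeadsPerKvHead = AttentionHeads / KvHeads;

// Assumed contiguous grouping: query heads 0-1 use KV head 0, 2-3 use KV head 1, and so on.
int KvHeadFor(int queryHead, int heads, int kvHeads) => queryHead / (heads / kvHeads);

// Audio samples the vocoder must produce for each codec frame at the nominal rate.
int samplesPerFrame = OutputSampleRateHz / NominalFrameRateHz;

// Rough workload for a clip of a given length.
double clipSeconds = 10.0;
int frames = (int)Math.Round(clipSeconds * NominalFrameRateHz);
int codes = frames * GroupsPerFrame;
long samples = (long)frames * samplesPerFrame;

Console.WriteLine($"Query heads per KV head: {queryHeadsPerKvHead}");
Console.WriteLine($"Query head 5 reads KV head {KvHeadFor(5, AttentionHeads, KvHeads)}");
Console.WriteLine($"Samples per codec frame: {samplesPerFrame}");
Console.WriteLine($"{clipSeconds} s of audio: {frames} frames, {codes} codes, {samples} samples");
```

Under these assumptions the Talker advances one frame per decode step, so the frame count is also a rough estimate of how many decode steps a clip needs.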

License

Apache-2.0 (same as base model)
