WaveKat TTS

Qwen3-TTS 1.7B VoiceDesign (ONNX)

ONNX export of Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign for inference with ONNX Runtime. No PyTorch required at inference time.

Both FP32 and INT4 variants are included. The INT4 models use weight-only round-to-nearest (RTN) quantization.
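As a rough illustration of what weight-only RTN quantization does, the sketch below quantizes a flat list of weights to signed 4-bit codes with one scale per group. The symmetric scheme, the `group_size` value, and the function names are assumptions made for the example; this section does not state the block size or whether the export uses zero points.

```python
def rtn_quantize_int4(weights, group_size=4):
    """Symmetric round-to-nearest INT4 quantization with one scale per group.

    Illustrative only: group_size and the symmetric scheme are assumptions,
    not the settings used for the published INT4 models.
    """
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to 7, the top of the INT4 range.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        # Round each weight to the nearest level, then clamp to [-8, 7].
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize(codes, scales, group_size=4):
    """Recover approximate float weights from INT4 codes and group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]
```

Only the weights are stored in 4 bits in a weight-only scheme; activations stay in floating point, which is why no calibration data is needed for RTN.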

Exported and maintained by WaveKat as part of the wavekat-tts voice pipeline.

Quick Start

pip install -r requirements.txt

# FP32
python generate_onnx.py --text "Give every small business the voice of a big one." \
  --instruct "Speak in a warm and friendly female voice" \
  -o output_fp32.wav

# INT4 (talker models ~4x smaller than FP32, faster)
python generate_onnx.py --variant int4 \
  --text "Give every small business the voice of a big one." \
  --instruct "Speak in a warm and friendly female voice" \
  -o output_int4.wav

# Chinese
python generate_onnx.py --variant int4 --lang chinese \
  --text "让每一家小企业，都拥有大企业的声音。" \
  --instruct "Speak in a warm and professional female voice" \
  -o output_zh.wav
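To call the reference script from Python, for example to synthesize several lines in a loop, the commands above can be assembled programmatically. This is a minimal sketch: `build_command` and `synthesize` are hypothetical helpers, and they use only the flags shown above (`--variant`, `--lang`, `--text`, `--instruct`, `-o`). Flags left as `None` are omitted, on the assumption that the script then falls back to its defaults as in the FP32 example.

```python
import subprocess
import sys


def build_command(text, instruct, output, variant=None, lang=None):
    """Build the argv list for generate_onnx.py (hypothetical helper)."""
    cmd = [sys.executable, "generate_onnx.py"]
    if variant is not None:
        cmd += ["--variant", variant]
    if lang is not None:
        cmd += ["--lang", lang]
    cmd += ["--text", text, "--instruct", instruct, "-o", output]
    return cmd


def synthesize(text, instruct, output, variant=None, lang=None):
    """Run the reference script and raise if it exits with an error."""
    subprocess.run(build_command(text, instruct, output, variant, lang), check=True)
```

Passing the arguments as a list avoids shell quoting problems with text that contains quotes or non-ASCII characters.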

Model Architecture

Qwen3-TTS is a three-stage autoregressive pipeline:

Text --> [Tokenizer + Embedding Construction] --> inputs_embeds
             |
             v
     [Talker LM]           28 layers, 2048 hidden
     predicts codebook group 0
             |
             v
     [Code Predictor]      5 layers, 1024 hidden
     predicts groups 1-15
             |
             v
     [Vocoder]             single forward pass
     16 codebook groups --> 24kHz waveform

The pipeline is split into 4 ONNX models:

Model                Description                                 FP32 Size  INT4 Size
talker_prefill.onnx  Full sequence prefill with KV cache output  5.3 GB     1.4 GB
talker_decode.onnx   Single-step decode with KV cache            5.3 GB     1.4 GB
code_predictor.onnx  Predict codebook groups 1-15                440 MB     322 MB
vocoder.onnx         Codes to 24kHz waveform                     876 MB     558 MB
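One plausible reading of the diagram and the prefill/decode split is the control flow sketched below. It is not the logic of generate_onnx.py: the callables `prefill`, `decode_step`, and `predict_rest`, the opaque `state` (standing in for the KV cache and any hidden states), and the `eos_id` and `max_frames` parameters are all hypothetical stand-ins for the four ONNX sessions and their real inputs.

```python
def generate_frames(prefill, decode_step, predict_rest, inputs_embeds,
                    eos_id, max_frames):
    """Collect frames of 16 codebook indices until EOS or max_frames.

    prefill(inputs_embeds)     -> (group-0 code, state)   # talker_prefill
    predict_rest(code0, state) -> 15 codes (groups 1-15)  # code_predictor
    decode_step(frame, state)  -> (group-0 code, state)   # talker_decode
    All three signatures are assumptions made for this sketch.
    """
    frames = []
    # Run the whole prompt once and keep the resulting cache in `state`.
    code0, state = prefill(inputs_embeds)
    while code0 != eos_id and len(frames) < max_frames:
        # Fill in the remaining 15 codebook groups for this frame.
        frame = [code0] + list(predict_rest(code0, state))
        frames.append(frame)
        # Advance the talker by one step, reusing the cache.
        code0, state = decode_step(frame, state)
    return frames
```

The collected frames would then go to the vocoder in a single call, matching the "single forward pass" note in the diagram.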

Repository Structure

.
├── config.json              # Model config (dimensions, token IDs, language map)
├── tokenizer/               # Text tokenizer (vocab, merges, config)
├── embeddings/              # Pre-extracted embedding weights (.npy)
├── fp32/                    # FP32 ONNX models
│   ├── talker_prefill.onnx
│   ├── talker_decode.onnx
│   ├── code_predictor.onnx
│   └── vocoder.onnx
├── int4/                    # INT4 weight-only quantized models
│   ├── talker_prefill.onnx
│   ├── talker_decode.onnx
│   ├── code_predictor.onnx
│   └── vocoder.onnx
├── generate_onnx.py         # Reference ONNX-only inference script
└── requirements.txt         # Inference dependencies
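Given this layout, a caller can resolve the four model files for a variant and check that the download is complete before loading anything. The helpers below are a hypothetical sketch based only on the directory and file names listed above.

```python
from pathlib import Path

# File stems taken from the repository layout above.
MODEL_NAMES = ("talker_prefill", "talker_decode", "code_predictor", "vocoder")
VARIANTS = ("fp32", "int4")


def model_paths(repo_root, variant="fp32"):
    """Map each model name to its .onnx path for the chosen variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}, expected one of {VARIANTS}")
    base = Path(repo_root) / variant
    return {name: base / f"{name}.onnx" for name in MODEL_NAMES}


def missing_files(repo_root, variant="fp32"):
    """Return the model paths that do not exist on disk."""
    return [p for p in model_paths(repo_root, variant).values() if not p.is_file()]
```

Each resolved path can then be handed to ONNX Runtime, for example `onnxruntime.InferenceSession(str(path), providers=["CPUExecutionProvider"])`. Which execution provider suits a given machine is left to the caller.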

Supported Languages

English, Chinese, Japanese, Korean, German, French, Spanish, Italian, Portuguese, Russian.

Reproducing the Export

The export scripts are in the wavekat-tts repository:

cd tools/qwen3-tts-onnx
pip install -r requirements.txt

# Export FP32, validate, and quantize INT4
make all

About WaveKat

WaveKat builds open-source voice pipeline components in Rust. This ONNX export is maintained as part of wavekat-tts, which provides unified TTS inference across multiple backends.

Acknowledgements

  • Qwen3-TTS by the Qwen team at Alibaba Cloud