Qwen3-TTS ONNX

ONNX-exported version of Qwen3-TTS for portable, framework-free speech synthesis. Supports Voice Clone (clone a voice from reference audio) and Voice Design (create a voice from a text description).

Note: This is a community ONNX conversion of the official Qwen3-TTS models. For the original PyTorch models, training details, benchmarks, and academic citations, please refer to the official Qwen3-TTS repository.

Features

  • Pure ONNX inference: no PyTorch dependency required at runtime
  • Voice Clone: clone any voice from a short (3s+) reference audio clip
  • Voice Design: create a voice from natural-language descriptions (e.g., "A warm, gentle young female voice")
  • Voice Design then Clone: design a voice via description, then save it as a reusable speaker profile
  • Pre-computed global cache: constant embeddings are pre-computed and shipped, reducing cold-start time
  • FP16 and FP32 models included
  • 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian

Repository Structure

.
├── README.md                       # This file
├── onnx_inference.py               # Core inference engine (library)
├── generate_cache.py               # Step 1: Generate global embedding cache
├── create_speaker.py               # Step 2: Create speaker profile
├── synthesize.py                   # Step 3: Synthesize speech
├── requirements.txt                # Python dependencies
├── voice_clone_config.json         # Model config (voice clone)
├── voice_design_config.json        # Model config (voice design)
├── dual_model_config.json          # Combined model config
├── tokenizer/                      # Tokenizer files
│   ├── tokenizer.json              # Rust tokenizer (recommended, fast)
│   ├── vocab.json                  # BPE vocabulary
│   ├── merges.txt                  # BPE merge rules
│   └── tokenizer_config.json       # Tokenizer config
├── fp16/                           # FP16 ONNX models (recommended)
│   ├── shared/                     # Shared models (speaker encoder, speech tokenizer)
│   │   ├── speaker_encoder.onnx
│   │   ├── speech_tokenizer_encoder.onnx
│   │   └── speech_tokenizer_decoder.onnx
│   ├── voice_clone/                # Voice Clone talker models + cache
│   │   ├── talker_decode.onnx
│   │   ├── code_predictor.onnx
│   │   ├── code_predictor_kv.onnx
│   │   ├── text_embedding.onnx
│   │   ├── codec_embedding.onnx
│   │   ├── code_predictor_embed_g*.onnx
│   │   ├── model_cache.npz         # Pre-computed global cache
│   │   └── *.weight                # External data files for ONNX models
│   └── voice_design/               # Voice Design talker models + cache
│       ├── talker_decode.onnx
│       ├── code_predictor.onnx
│       ├── code_predictor_kv.onnx
│       ├── text_embedding.onnx
│       ├── codec_embedding.onnx
│       ├── code_predictor_embed_g*.onnx
│       ├── model_cache.npz
│       └── *.weight
└── onnx/                           # FP32 ONNX models (alternative)
    └── (same structure as fp16/)

Model Size

| Variant          | Approx. size | Note                             |
|------------------|--------------|----------------------------------|
| FP16 only        | ~X GB        | Recommended; smaller and faster  |
| FP32 only        | ~X GB        | Higher precision; larger         |
| Both (full repo) | ~X GB        | Includes both variants           |

Tip: If you only need FP16 (recommended), you can selectively download:

huggingface-cli download <REPO_ID> --include "fp16/**" "tokenizer/**" "*.json" "*.py" "*.txt" --local-dir ./model

Quick Start

1. Install Dependencies

pip install onnxruntime numpy librosa soundfile tokenizers

For GPU acceleration:

pip install onnxruntime-gpu numpy librosa soundfile tokenizers

2. Download the Model

# Using huggingface-cli
pip install -U "huggingface_hub[cli]"
huggingface-cli download <YOUR_HF_REPO_ID> --local-dir ./model

# Or using git lfs
git lfs install
git clone https://huggingface.co/<YOUR_HF_REPO_ID> ./model

3. Three-Step Pipeline

The inference pipeline consists of three independent scripts. Steps 1 and 2 only need to be run once; Step 3 is the actual synthesis.

Step 1: Generate Global Cache (one-time)

Pre-compute constant embeddings shared across all speakers and texts. The cache files (model_cache.npz) are already included in this repository, so you can skip this step unless you re-export the models.

python generate_cache.py --model_dir ./model

Step 2: Create Speaker Profile (once per voice)

Create a reusable speaker profile from either a reference audio or a voice description.

Option A (Voice Clone, from reference audio):

python create_speaker.py \
    --model_dir ./model \
    --ref_audio reference.wav \
    --ref_text "Transcript of the reference audio" \
    --language english \
    --output ./speakers/my_voice.npz

Option B (Voice Design then Clone, from text description):

python create_speaker.py \
    --model_dir ./model \
    --instruct "A warm, gentle young female voice" \
    --design_text "Hello, this is a sample of my voice." \
    --language chinese \
    --output ./speakers/designed_voice.npz

The speaker profile (.npz) contains pre-computed speaker embeddings and codec features, making subsequent synthesis fast.
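As a rough illustration, a speaker profile is an ordinary NumPy `.npz` archive, so it can be inspected and reused with plain `numpy`. The key names and array shapes below are hypothetical placeholders; the actual contents are internal to create_speaker.py:

```python
import numpy as np

# Hypothetical speaker profile. The real create_speaker.py output uses its own
# internal key names; "speaker_embedding" and "codec_features" are placeholders
# standing in for the pre-computed speaker embeddings and codec features.
profile = {
    "speaker_embedding": np.random.rand(1, 192).astype(np.float32),
    "codec_features": np.random.rand(1, 50, 16).astype(np.float32),
}
np.savez("my_voice.npz", **profile)

# Loading the archive back is cheap, which is why reusing a profile makes
# subsequent synthesis fast: the heavy encoder models never run again.
loaded = np.load("my_voice.npz")
print(sorted(loaded.keys()))
print(loaded["speaker_embedding"].shape)
```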

Step 3: Synthesize Speech

Using a speaker profile (Voice Clone):

python synthesize.py \
    --model_dir ./model \
    --speaker ./speakers/my_voice.npz \
    --text "The weather is wonderful today." \
    --output output.wav

Direct Voice Design (no speaker profile needed):

python synthesize.py \
    --model_dir ./model \
    --instruct "Speak in a cheerful, energetic young male voice" \
    --text "The weather is wonderful today." \
    --output output.wav
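When driving synthesis from another Python program, the two invocation styles above can be assembled programmatically. The helper below is a hypothetical wrapper (not part of the repository) that only uses the flags documented in this README:

```python
import subprocess

def build_synthesize_cmd(model_dir, text, output, speaker=None, instruct=None,
                         language="chinese", precision="fp16", seed=None):
    """Assemble a synthesize.py command line from the documented flags.

    Illustrative helper, not part of the repository. Exactly one of
    `speaker` (Voice Clone) or `instruct` (Voice Design) must be given,
    mirroring the two invocation styles shown above.
    """
    if (speaker is None) == (instruct is None):
        raise ValueError("pass exactly one of speaker= or instruct=")
    cmd = ["python", "synthesize.py",
           "--model_dir", model_dir,
           "--text", text,
           "--output", output,
           "--language", language,
           "--precision", precision]
    if speaker is not None:
        cmd += ["--speaker", speaker]
    else:
        cmd += ["--instruct", instruct]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd

# Voice Clone request; run with subprocess.run(cmd, check=True) when ready.
cmd = build_synthesize_cmd("./model", "The weather is wonderful today.",
                           "output.wav", speaker="./speakers/my_voice.npz")
print(" ".join(cmd))
```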

Generation Parameters

All synthesis commands support these optional parameters:

| Parameter            | Default | Description                                        |
|----------------------|---------|----------------------------------------------------|
| --language           | chinese | Language: chinese, english, japanese, korean, auto |
| --precision          | fp16    | Model precision: fp16, fp32                        |
| --temperature        | 0.9     | Sampling temperature (higher = more varied)        |
| --top_k              | 50      | Top-K sampling                                     |
| --top_p              | 1.0     | Top-P (nucleus) sampling                           |
| --repetition_penalty | 1.05    | Repetition penalty (>1.0 to reduce repeats)        |
| --max_tokens         | 2048    | Maximum number of generated tokens                 |
| --seed               | None    | Random seed for reproducibility                    |
| --use_gpu            | False   | Use GPU acceleration                               |
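To make the sampling knobs concrete, here is a generic NumPy sketch of how temperature, top-k, top-p, and repetition penalty typically combine during autoregressive decoding. This is an illustration of the standard technique, not the exact order of operations inside onnx_inference.py:

```python
import numpy as np

def sample_next_token(logits, generated, temperature=0.9, top_k=50,
                      top_p=1.0, repetition_penalty=1.05, rng=None):
    """Generic temperature / top-k / top-p sampling with repetition penalty.

    Illustrative only; the repository's decoder may order these steps
    differently or apply them to a restricted token range.
    """
    rng = rng or np.random.default_rng()
    logits = logits.astype(np.float64).copy()

    # Repetition penalty: dampen logits of tokens already emitted.
    for tok in set(generated):
        logits[tok] = (logits[tok] / repetition_penalty if logits[tok] > 0
                       else logits[tok] * repetition_penalty)

    logits /= temperature  # higher temperature flattens the distribution

    # Top-k: keep only the k highest-scoring tokens.
    if top_k > 0:
        kth = np.sort(logits)[-min(top_k, len(logits))]
        logits[logits < kth] = -np.inf

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p (nucleus): keep the smallest set whose cumulative mass >= top_p.
    if top_p < 1.0:
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        cutoff = order[cum > top_p][1:]  # keep the first token crossing top_p
        probs[cutoff] = 0.0
        probs /= probs.sum()

    return int(rng.choice(len(probs), p=probs))
```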

Pipeline Overview

┌─────────────────────────────────────────────────────┐
│  Step 1: generate_cache.py  (one-time setup)        │
│  Computes model-level constant embeddings           │
│  → model_cache.npz                                  │
└─────────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────┐
│  Step 2: create_speaker.py  (once per voice)        │
│  Mode A: ref_audio + ref_text → speaker.npz         │
│  Mode B: instruct → voice design → speaker.npz      │
└─────────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────┐
│  Step 3: synthesize.py  (each synthesis request)    │
│  speaker.npz + text → audio.wav  (Voice Clone)      │
│  instruct + text → audio.wav     (Voice Design)     │
└─────────────────────────────────────────────────────┘

Model Details

This repository contains ONNX exports of the following Qwen3-TTS components:

| Component             | Description                                                  |
|-----------------------|--------------------------------------------------------------|
| Speaker Encoder       | Extracts a speaker embedding (x-vector) from reference audio |
| Speech Tokenizer      | Encoder/decoder between audio and discrete codec codes       |
| Text Embedding        | Maps text token IDs to embedding vectors                     |
| Codec Embedding       | Maps codec token IDs to embedding vectors                    |
| Talker (Voice Clone)  | Autoregressive transformer for voice clone synthesis         |
| Talker (Voice Design) | Autoregressive transformer for voice design synthesis        |
| Code Predictor        | Predicts multi-codebook codes from talker hidden states      |
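Every component above is loaded the same way in onnxruntime. The helper below is a hypothetical sketch that maps the repository layout described earlier (fp16/ for FP16, onnx/ for FP32) and the --use_gpu flag onto an ONNX file path and an execution-provider list:

```python
import os

def session_config(model_dir, component, precision="fp16", use_gpu=False):
    """Resolve the ONNX file path and execution providers for one component.

    Hypothetical helper mirroring the documented repository layout:
    fp16/ holds FP16 models, onnx/ holds FP32 models, and each contains
    shared/, voice_clone/, and voice_design/ subdirectories.
    """
    subdir = "fp16" if precision == "fp16" else "onnx"
    path = os.path.join(model_dir, subdir, component)
    providers = (["CUDAExecutionProvider", "CPUExecutionProvider"]
                 if use_gpu else ["CPUExecutionProvider"])
    return path, providers

# The pair feeds straight into onnxruntime, e.g.:
#   import onnxruntime as ort
#   sess = ort.InferenceSession(path, providers=providers)
path, providers = session_config("./model", "shared/speaker_encoder.onnx")
print(path, providers)
```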

Supported Languages

Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian

Requirements

  • Python 3.9+
  • onnxruntime >= 1.16 (or onnxruntime-gpu for GPU)
  • numpy >= 1.24
  • librosa >= 0.10
  • soundfile >= 0.12
  • tokenizers >= 0.15 (recommended, pure Rust, fast) or tiktoken (fallback)

Credits

This is an ONNX conversion of Qwen3-TTS by the Qwen team at Alibaba. All credit for the model architecture, training, and research goes to the original authors.

Citation

If you use this model, please cite the original paper:

@article{Qwen3-TTS,
  title={Qwen3-TTS Technical Report},
  author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}

License

Please refer to the original Qwen3-TTS license for usage terms.
