Qwen3-TTS ONNX
ONNX-exported version of Qwen3-TTS for portable, framework-free speech synthesis. Supports Voice Clone (clone a voice from reference audio) and Voice Design (create a voice from a text description).
Note: This is a community ONNX conversion of the official Qwen3-TTS models. For the original PyTorch models, training details, benchmarks, and academic citations, please refer to the official Qwen3-TTS repository.
Features
- Pure ONNX inference: no PyTorch dependency required at runtime
- Voice Clone: clone any voice from a short (3s+) reference audio clip
- Voice Design: create a voice from natural-language descriptions (e.g., "A warm, gentle young female voice")
- Voice Design then Clone: design a voice via description, then save it as a reusable speaker profile
- Pre-computed global cache: constant embeddings are pre-computed and shipped, reducing cold-start time
- FP16 and FP32 models included
- 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
Repository Structure
.
├── README.md                    # This file
├── onnx_inference.py            # Core inference engine (library)
├── generate_cache.py            # Step 1: Generate global embedding cache
├── create_speaker.py            # Step 2: Create speaker profile
├── synthesize.py                # Step 3: Synthesize speech
├── requirements.txt             # Python dependencies
├── voice_clone_config.json      # Model config (voice clone)
├── voice_design_config.json     # Model config (voice design)
├── dual_model_config.json       # Combined model config
├── tokenizer/                   # Tokenizer files
│   ├── tokenizer.json           # Rust tokenizer (recommended, fast)
│   ├── vocab.json               # BPE vocabulary
│   ├── merges.txt               # BPE merge rules
│   └── tokenizer_config.json    # Tokenizer config
├── fp16/                        # FP16 ONNX models (recommended)
│   ├── shared/                  # Shared models (speaker encoder, speech tokenizer)
│   │   ├── speaker_encoder.onnx
│   │   ├── speech_tokenizer_encoder.onnx
│   │   └── speech_tokenizer_decoder.onnx
│   ├── voice_clone/             # Voice Clone talker models + cache
│   │   ├── talker_decode.onnx
│   │   ├── code_predictor.onnx
│   │   ├── code_predictor_kv.onnx
│   │   ├── text_embedding.onnx
│   │   ├── codec_embedding.onnx
│   │   ├── code_predictor_embed_g*.onnx
│   │   ├── model_cache.npz      # Pre-computed global cache
│   │   └── *.weight             # External data files for ONNX models
│   └── voice_design/            # Voice Design talker models + cache
│       ├── talker_decode.onnx
│       ├── code_predictor.onnx
│       ├── code_predictor_kv.onnx
│       ├── text_embedding.onnx
│       ├── codec_embedding.onnx
│       ├── code_predictor_embed_g*.onnx
│       ├── model_cache.npz
│       └── *.weight
└── onnx/                        # FP32 ONNX models (alternative)
    └── (same structure as fp16/)
Model Size
| Variant | Approx. Size | Note |
|---|---|---|
| FP16 only | ~X GB | Recommended, smaller and faster |
| FP32 only | ~X GB | Higher precision, larger |
| Both (full repo) | ~X GB | Includes both variants |
Tip: If you only need FP16 (recommended), you can selectively download:
huggingface-cli download <REPO_ID> --include "fp16/**" "tokenizer/**" "*.json" "*.py" "*.txt" --local-dir ./model
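The same filtered download can also be done from Python with huggingface_hub; a minimal sketch, where the repo ID is a placeholder just as above:
from huggingface_hub import snapshot_download

# Download only the FP16 models plus tokenizer, configs, and scripts
snapshot_download(
    repo_id="<REPO_ID>",
    allow_patterns=["fp16/**", "tokenizer/**", "*.json", "*.py", "*.txt"],
    local_dir="./model",
)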
Quick Start
1. Install Dependencies
pip install onnxruntime numpy librosa soundfile tokenizers
For GPU acceleration:
pip install onnxruntime-gpu numpy librosa soundfile tokenizers
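To confirm the GPU build is picked up, check that CUDAExecutionProvider appears in the list of available execution providers:
import onnxruntime as ort

# onnxruntime-gpu exposes CUDAExecutionProvider; the CPU-only build does not
print(ort.get_available_providers())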
2. Download the Model
# Using huggingface-cli
pip install -U "huggingface_hub[cli]"
huggingface-cli download <YOUR_HF_REPO_ID> --local-dir ./model
# Or using git lfs
git lfs install
git clone https://huggingface.co/<YOUR_HF_REPO_ID> ./model
3. Three-Step Pipeline
The inference pipeline consists of three independent scripts. Steps 1 and 2 only need to be run once; Step 3 is the actual synthesis.
Step 1: Generate Global Cache (one-time)
Pre-compute constant embeddings shared across all speakers and texts. The cache files (model_cache.npz) are already included in this repository, so you can skip this step unless you re-export the models.
python generate_cache.py --model_dir ./model
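If you do regenerate the cache, you can sanity-check the resulting archive with NumPy; the array names depend on the export, so they are listed rather than assumed:
import numpy as np

# model_cache.npz is a standard NumPy archive of pre-computed constant embeddings
cache = np.load("./model/fp16/voice_clone/model_cache.npz")
for name in cache.files:
    print(name, cache[name].shape, cache[name].dtype)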
Step 2: Create Speaker Profile (once per voice)
Create a reusable speaker profile from either a reference audio or a voice description.
Option A - Voice Clone (from reference audio):
python create_speaker.py \
--model_dir ./model \
--ref_audio reference.wav \
--ref_text "Transcript of the reference audio" \
--language english \
--output ./speakers/my_voice.npz
Option B - Voice Design then Clone (from text description):
python create_speaker.py \
--model_dir ./model \
--instruct "A warm, gentle young female voice" \
--design_text "Hello, this is a sample of my voice." \
--language chinese \
--output ./speakers/designed_voice.npz
The speaker profile (.npz) contains pre-computed speaker embeddings and codec features, making subsequent synthesis fast.
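Because the heavy speaker computation happens only once, the same profile can be reused for any number of Step 3 calls. A minimal batch sketch that simply invokes the documented CLI (file names and texts are illustrative):
import subprocess

# Re-use one pre-computed speaker profile for several synthesis requests
texts = ["First sentence to speak.", "Second sentence to speak."]
for i, text in enumerate(texts):
    subprocess.run(
        [
            "python", "synthesize.py",
            "--model_dir", "./model",
            "--speaker", "./speakers/my_voice.npz",
            "--text", text,
            "--output", f"output_{i}.wav",
        ],
        check=True,
    )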
Step 3: Synthesize Speech
Using a speaker profile (Voice Clone):
python synthesize.py \
--model_dir ./model \
--speaker ./speakers/my_voice.npz \
--text "The weather is wonderful today." \
--output output.wav
Direct Voice Design (no speaker profile needed):
python synthesize.py \
--model_dir ./model \
--instruct "Speak in a cheerful, energetic young male voice" \
--text "The weather is wonderful today." \
--output output.wav
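Either call writes a standard WAV file, which you can sanity-check with soundfile:
import soundfile as sf

# Print the duration and sample rate of the generated audio
audio, sr = sf.read("output.wav")
print(f"{len(audio) / sr:.2f} s at {sr} Hz")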
Generation Parameters
All synthesis commands support these optional parameters:
| Parameter | Default | Description |
|---|---|---|
| --language | chinese | Language: chinese, english, japanese, korean, auto |
| --precision | fp16 | Model precision: fp16, fp32 |
| --temperature | 0.9 | Sampling temperature (higher = more varied) |
| --top_k | 50 | Top-K sampling |
| --top_p | 1.0 | Top-P (nucleus) sampling |
| --repetition_penalty | 1.05 | Repetition penalty (>1.0 to reduce repeats) |
| --max_tokens | 2048 | Maximum generated tokens |
| --seed | None | Random seed for reproducibility |
| --use_gpu | False | Use GPU acceleration |
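For example, a Voice Clone call combining several of these flags (values are illustrative):
python synthesize.py \
--model_dir ./model \
--speaker ./speakers/my_voice.npz \
--text "The weather is wonderful today." \
--output output.wav \
--language english \
--temperature 0.8 \
--top_p 0.95 \
--seed 42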
Pipeline Overview
┌───────────────────────────────────────────────────────┐
│ Step 1: generate_cache.py (one-time setup)             │
│   Computes model-level constant embeddings             │
│   → model_cache.npz                                     │
└───────────────────────────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────┐
│ Step 2: create_speaker.py (once per voice)              │
│   Mode A: ref_audio + ref_text → speaker.npz             │
│   Mode B: instruct → voice design → speaker.npz           │
└───────────────────────────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────┐
│ Step 3: synthesize.py (each synthesis request)           │
│   speaker.npz + text → audio.wav (Voice Clone)            │
│   instruct + text → audio.wav (Voice Design)              │
└───────────────────────────────────────────────────────┘
Model Details
This repository contains ONNX exports of the following Qwen3-TTS components:
| Component | Description |
|---|---|
| Speaker Encoder | Extracts speaker embedding (x-vector) from reference audio |
| Speech Tokenizer | Encoder/decoder for audio ↔ discrete codec codes |
| Text Embedding | Maps text token IDs to embedding vectors |
| Codec Embedding | Maps codec token IDs to embedding vectors |
| Talker (Voice Clone) | Autoregressive transformer for voice clone synthesis |
| Talker (Voice Design) | Autoregressive transformer for voice design synthesis |
| Code Predictor | Predicts multi-codebook codes from talker hidden states |
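Each component is a standalone ONNX graph, so it can also be inspected directly with onnxruntime. A minimal sketch, assuming the fp16/ layout shown above (input names vary by export, so query them rather than hard-coding):
import onnxruntime as ort

# Load the speaker encoder on CPU; list CUDAExecutionProvider first to use a GPU
session = ort.InferenceSession(
    "./model/fp16/shared/speaker_encoder.onnx",
    providers=["CPUExecutionProvider"],
)
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)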
Supported Languages
Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
Requirements
- Python 3.9+
- onnxruntime >= 1.16 (or onnxruntime-gpu for GPU)
- numpy >= 1.24
- librosa >= 0.10
- soundfile >= 0.12
- tokenizers >= 0.15 (recommended, pure Rust, fast) or tiktoken (fallback)
Credits
This is an ONNX conversion of Qwen3-TTS by the Qwen team at Alibaba. All credit for the model architecture, training, and research goes to the original authors.
Citation
If you use this model, please cite the original paper:
@article{Qwen3-TTS,
title={Qwen3-TTS Technical Report},
author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
journal={arXiv preprint arXiv:2601.15621},
year={2026}
}
License
Please refer to the original Qwen3-TTS license for usage terms.