Qwen3-TTS ONNX

ONNX-exported version of Qwen3-TTS for portable, framework-free speech synthesis. Supports Voice Clone (clone a voice from reference audio) and Voice Design (create a voice from a text description).

Note: This is a community ONNX conversion of the official Qwen3-TTS models. For the original PyTorch models, training details, benchmarks, and academic citations, please refer to the official Qwen3-TTS repository.

Features

  • Pure ONNX inference: no PyTorch dependency required at runtime
  • Voice Clone: clone any voice from a short (3s+) reference audio clip
  • Voice Design: create a voice from natural-language descriptions (e.g., "A warm, gentle young female voice")
  • Voice Design then Clone: design a voice via description, then save it as a reusable speaker profile
  • Pre-computed global cache: constant embeddings are pre-computed and shipped, reducing cold-start time
  • FP16 and FP32 models included
  • 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian

Repository Structure

.
├── README.md                       # This file
├── onnx_inference.py               # Core inference engine (library)
├── generate_cache.py               # Step 1: Generate global embedding cache
├── create_speaker.py               # Step 2: Create speaker profile
├── synthesize.py                   # Step 3: Synthesize speech
├── requirements.txt                # Python dependencies
├── voice_clone_config.json         # Model config (voice clone)
├── voice_design_config.json        # Model config (voice design)
├── dual_model_config.json          # Combined model config
├── tokenizer/                      # Tokenizer files
│   ├── tokenizer.json              # Rust tokenizer (recommended, fast)
│   ├── vocab.json                  # BPE vocabulary
│   ├── merges.txt                  # BPE merge rules
│   └── tokenizer_config.json       # Tokenizer config
├── fp16/                           # FP16 ONNX models (recommended)
│   ├── shared/                     # Shared models (speaker encoder, speech tokenizer)
│   │   ├── speaker_encoder.onnx
│   │   ├── speech_tokenizer_encoder.onnx
│   │   └── speech_tokenizer_decoder.onnx
│   ├── voice_clone/                # Voice Clone talker models + cache
│   │   ├── talker_decode.onnx
│   │   ├── code_predictor.onnx
│   │   ├── code_predictor_kv.onnx
│   │   ├── text_embedding.onnx
│   │   ├── codec_embedding.onnx
│   │   ├── code_predictor_embed_g*.onnx
│   │   ├── model_cache.npz         # Pre-computed global cache
│   │   └── *.weight                # External data files for ONNX models
│   └── voice_design/               # Voice Design talker models + cache
│       ├── talker_decode.onnx
│       ├── code_predictor.onnx
│       ├── code_predictor_kv.onnx
│       ├── text_embedding.onnx
│       ├── codec_embedding.onnx
│       ├── code_predictor_embed_g*.onnx
│       ├── model_cache.npz
│       └── *.weight
└── onnx/                           # FP32 ONNX models (alternative)
    └── (same structure as fp16/)

Model Size

| Variant          | Approx. size | Note                             |
|------------------|--------------|----------------------------------|
| FP16 only        | ~X GB        | Recommended; smaller and faster  |
| FP32 only        | ~X GB        | Higher precision; larger         |
| Both (full repo) | ~X GB        | Includes both variants           |

Tip: If you only need FP16 (recommended), you can selectively download:

huggingface-cli download <REPO_ID> --include "fp16/**" "tokenizer/**" "*.json" "*.py" "*.txt" --local-dir ./model

Quick Start

1. Install Dependencies

pip install onnxruntime numpy librosa soundfile tokenizers

For GPU acceleration:

pip install onnxruntime-gpu numpy librosa soundfile tokenizers

2. Download the Model

# Using huggingface-cli
pip install -U "huggingface_hub[cli]"
huggingface-cli download <YOUR_HF_REPO_ID> --local-dir ./model

# Or using git lfs
git lfs install
git clone https://huggingface.co/<YOUR_HF_REPO_ID> ./model

3. Three-Step Pipeline

The inference pipeline consists of three independent scripts. Steps 1 and 2 only need to be run once; Step 3 is the actual synthesis.

Step 1: Generate Global Cache (one-time)

Pre-compute constant embeddings shared across all speakers and texts. The cache files (model_cache.npz) are already included in this repository, so you can skip this step unless you re-export the models.

python generate_cache.py --model_dir ./model

Step 2: Create Speaker Profile (once per voice)

Create a reusable speaker profile from either a reference audio or a voice description.

Option A (Voice Clone, from reference audio):

python create_speaker.py \
    --model_dir ./model \
    --ref_audio reference.wav \
    --ref_text "Transcript of the reference audio" \
    --language english \
    --output ./speakers/my_voice.npz

Option B (Voice Design then Clone, from text description):

python create_speaker.py \
    --model_dir ./model \
    --instruct "A warm, gentle young female voice" \
    --design_text "Hello, this is a sample of my voice." \
    --language chinese \
    --output ./speakers/designed_voice.npz

The speaker profile (.npz) contains pre-computed speaker embeddings and codec features, making subsequent synthesis fast.
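As a rough illustration, a speaker profile is an ordinary NumPy `.npz` archive, so it can be inspected and reused with plain `numpy`. The key names and array shapes below are hypothetical placeholders; the actual contents are internal to create_speaker.py:

```python
import numpy as np

# Hypothetical speaker profile. The real create_speaker.py output uses its own
# internal key names; "speaker_embedding" and "codec_features" are placeholders
# standing in for the pre-computed speaker embeddings and codec features.
profile = {
    "speaker_embedding": np.random.rand(1, 192).astype(np.float32),
    "codec_features": np.random.rand(1, 50, 16).astype(np.float32),
}
np.savez("my_voice.npz", **profile)

# Loading the archive back is cheap, which is why reusing a profile makes
# subsequent synthesis fast: the heavy encoder models never run again.
loaded = np.load("my_voice.npz")
print(sorted(loaded.keys()))
print(loaded["speaker_embedding"].shape)
```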

Step 3: Synthesize Speech

Using a speaker profile (Voice Clone):

python synthesize.py \
    --model_dir ./model \
    --speaker ./speakers/my_voice.npz \
    --text "The weather is wonderful today." \
    --output output.wav

Direct Voice Design (no speaker profile needed):

python synthesize.py \
    --model_dir ./model \
    --instruct "Speak in a cheerful, energetic young male voice" \
    --text "The weather is wonderful today." \
    --output output.wav
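When driving synthesis from another Python program, the two invocation styles above can be assembled programmatically. The helper below is a hypothetical wrapper (not part of the repository) that only uses the flags documented in this README:

```python
import subprocess

def build_synthesize_cmd(model_dir, text, output, speaker=None, instruct=None,
                         language="chinese", precision="fp16", seed=None):
    """Assemble a synthesize.py command line from the documented flags.

    Illustrative helper, not part of the repository. Exactly one of
    `speaker` (Voice Clone) or `instruct` (Voice Design) must be given,
    mirroring the two invocation styles shown above.
    """
    if (speaker is None) == (instruct is None):
        raise ValueError("pass exactly one of speaker= or instruct=")
    cmd = ["python", "synthesize.py",
           "--model_dir", model_dir,
           "--text", text,
           "--output", output,
           "--language", language,
           "--precision", precision]
    if speaker is not None:
        cmd += ["--speaker", speaker]
    else:
        cmd += ["--instruct", instruct]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd

# Voice Clone request; run with subprocess.run(cmd, check=True) when ready.
cmd = build_synthesize_cmd("./model", "The weather is wonderful today.",
                           "output.wav", speaker="./speakers/my_voice.npz")
print(" ".join(cmd))
```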

Generation Parameters

All synthesis commands support these optional parameters:

| Parameter            | Default | Description                                        |
|----------------------|---------|----------------------------------------------------|
| --language           | chinese | Language: chinese, english, japanese, korean, auto |
| --precision          | fp16    | Model precision: fp16, fp32                        |
| --temperature        | 0.9     | Sampling temperature (higher = more varied)        |
| --top_k              | 50      | Top-K sampling                                     |
| --top_p              | 1.0     | Top-P (nucleus) sampling                           |
| --repetition_penalty | 1.05    | Repetition penalty (>1.0 to reduce repeats)        |
| --max_tokens         | 2048    | Maximum number of generated tokens                 |
| --seed               | None    | Random seed for reproducibility                    |
| --use_gpu            | False   | Use GPU acceleration                               |
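To make the sampling knobs concrete, here is a generic NumPy sketch of how temperature, top-k, top-p, and repetition penalty typically combine during autoregressive decoding. This is an illustration of the standard technique, not the exact order of operations inside onnx_inference.py:

```python
import numpy as np

def sample_next_token(logits, generated, temperature=0.9, top_k=50,
                      top_p=1.0, repetition_penalty=1.05, rng=None):
    """Generic temperature / top-k / top-p sampling with repetition penalty.

    Illustrative only; the repository's decoder may order these steps
    differently or apply them to a restricted token range.
    """
    rng = rng or np.random.default_rng()
    logits = logits.astype(np.float64).copy()

    # Repetition penalty: dampen logits of tokens already emitted.
    for tok in set(generated):
        logits[tok] = (logits[tok] / repetition_penalty if logits[tok] > 0
                       else logits[tok] * repetition_penalty)

    logits /= temperature  # higher temperature flattens the distribution

    # Top-k: keep only the k highest-scoring tokens.
    if top_k > 0:
        kth = np.sort(logits)[-min(top_k, len(logits))]
        logits[logits < kth] = -np.inf

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p (nucleus): keep the smallest set whose cumulative mass >= top_p.
    if top_p < 1.0:
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        cutoff = order[cum > top_p][1:]  # keep the first token crossing top_p
        probs[cutoff] = 0.0
        probs /= probs.sum()

    return int(rng.choice(len(probs), p=probs))
```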

Pipeline Overview

┌─────────────────────────────────────────────────────┐
│  Step 1: generate_cache.py  (one-time setup)        │
│  Computes model-level constant embeddings           │
│  → model_cache.npz                                  │
└─────────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────┐
│  Step 2: create_speaker.py  (once per voice)        │
│  Mode A: ref_audio + ref_text → speaker.npz         │
│  Mode B: instruct → voice design → speaker.npz      │
└─────────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────┐
│  Step 3: synthesize.py  (each synthesis request)    │
│  speaker.npz + text → audio.wav  (Voice Clone)      │
│  instruct + text → audio.wav     (Voice Design)     │
└─────────────────────────────────────────────────────┘

Model Details

This repository contains ONNX exports of the following Qwen3-TTS components:

| Component             | Description                                                  |
|-----------------------|--------------------------------------------------------------|
| Speaker Encoder       | Extracts a speaker embedding (x-vector) from reference audio |
| Speech Tokenizer      | Encoder/decoder between audio and discrete codec codes       |
| Text Embedding        | Maps text token IDs to embedding vectors                     |
| Codec Embedding       | Maps codec token IDs to embedding vectors                    |
| Talker (Voice Clone)  | Autoregressive transformer for voice clone synthesis         |
| Talker (Voice Design) | Autoregressive transformer for voice design synthesis        |
| Code Predictor        | Predicts multi-codebook codes from talker hidden states      |
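Every component above is loaded the same way in onnxruntime. The helper below is a hypothetical sketch that maps the repository layout described earlier (fp16/ for FP16, onnx/ for FP32) and the --use_gpu flag onto an ONNX file path and an execution-provider list:

```python
import os

def session_config(model_dir, component, precision="fp16", use_gpu=False):
    """Resolve the ONNX file path and execution providers for one component.

    Hypothetical helper mirroring the documented repository layout:
    fp16/ holds FP16 models, onnx/ holds FP32 models, and each contains
    shared/, voice_clone/, and voice_design/ subdirectories.
    """
    subdir = "fp16" if precision == "fp16" else "onnx"
    path = os.path.join(model_dir, subdir, component)
    providers = (["CUDAExecutionProvider", "CPUExecutionProvider"]
                 if use_gpu else ["CPUExecutionProvider"])
    return path, providers

# The pair feeds straight into onnxruntime, e.g.:
#   import onnxruntime as ort
#   sess = ort.InferenceSession(path, providers=providers)
path, providers = session_config("./model", "shared/speaker_encoder.onnx")
print(path, providers)
```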

Supported Languages

Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian

Requirements

  • Python 3.9+
  • onnxruntime >= 1.16 (or onnxruntime-gpu for GPU)
  • numpy >= 1.24
  • librosa >= 0.10
  • soundfile >= 0.12
  • tokenizers >= 0.15 (recommended, pure Rust, fast) or tiktoken (fallback)

Credits

This is an ONNX conversion of Qwen3-TTS by the Qwen team at Alibaba. All credit for the model architecture, training, and research goes to the original authors.

Citation

If you use this model, please cite the original paper:

@article{Qwen3-TTS,
  title={Qwen3-TTS Technical Report},
  author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}

License

Please refer to the original Qwen3-TTS license for usage terms.
