Qwen3-TTS-12Hz-0.6B-Base-ONNX
Overview
Qwen3-TTS-12Hz-0.6B-Base-ONNX is the ONNX-optimized version of the Qwen3-TTS-12Hz-0.6B-Base model. This model enables efficient, cross-platform inference with support for both CPU and NVIDIA GPU acceleration.
Model Description
Qwen3-TTS-12Hz-0.6B-Base is a base model capable of rapid voice cloning from as little as 3 seconds of user audio. It supports 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) and features streaming generation with ultra-low latency.
Key Features
- Fast Voice Cloning: Clone voices from just 3 seconds of reference audio
- Multi-language Support: 10 major languages with native pronunciation
- Streaming Generation: End-to-end synthesis latency as low as 97ms
- Cross-platform Inference: CPU and NVIDIA GPU support via ONNX Runtime
- Voice Similarity: Achieves 96-99% speaker-embedding cosine similarity on cloned voices
Architecture
The model uses a discrete multi-codec LM architecture with:
- Talker: 28-layer transformer (hidden size: 1024, 8 KV heads)
- Code Predictor: 5-layer transformer for multi-codec generation
- Vocoder: BigVGAN-based speech decoder
- Speaker Encoder: ECAPA-TDNN for speaker embedding extraction
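The data flow between these four components can be sketched with shape-only stand-ins. Everything below is an illustrative stub (the function bodies are placeholders, not the real models); only the tensor shapes and the 12 Hz / 16-codebook / 24 kHz bookkeeping come from this card:

```python
import numpy as np

FRAME_RATE_HZ = 12        # codec frames per second
NUM_CODEBOOKS = 16        # multi-codec depth
SAMPLE_RATE = 24000       # vocoder output rate
SAMPLES_PER_FRAME = SAMPLE_RATE // FRAME_RATE_HZ  # 2000 samples per frame

def fake_speaker_encoder(mel):
    # ECAPA-TDNN stand-in: mel spectrogram -> 1024-dim speaker embedding
    return np.zeros(1024, dtype=np.float32)

def fake_talker(text_ids, spk_emb, num_frames):
    # 28-layer talker stand-in: one base codec token per 12 Hz frame
    return np.zeros(num_frames, dtype=np.int64)

def fake_code_predictor(base_tokens):
    # 5-layer code-predictor stand-in: expands each frame to 16 codebook codes
    return np.zeros((NUM_CODEBOOKS, base_tokens.shape[0]), dtype=np.int64)

def fake_vocoder(codes):
    # BigVGAN stand-in: (16, T) codec codes -> mono waveform at 24 kHz
    return np.zeros(codes.shape[1] * SAMPLES_PER_FRAME, dtype=np.float32)

spk = fake_speaker_encoder(np.zeros((80, 100)))
base = fake_talker(np.arange(10), spk, num_frames=24)   # 24 frames ~= 2 s
codes = fake_code_predictor(base)                       # shape (16, 24)
audio = fake_vocoder(codes)                             # 48000 samples
print(audio.shape[0] / SAMPLE_RATE)                     # -> 2.0 seconds
```

At 12 Hz, each codec frame covers 2000 audio samples, so frame count alone determines output duration.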
Supported Languages
Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
Installation
```bash
pip install onnxruntime librosa soundfile numpy torch transformers
```
For GPU inference, install the CUDA build of ONNX Runtime instead:
```bash
pip install onnxruntime-gpu
```
Quick Start
```bash
# Clone voice from reference audio
python infer_onnx.py \
  --text "Hello, this is a voice cloning test." \
  --ref reference.wav \
  --out output.wav \
  --device cuda \
  --language english
```
Parameters
- `--text`: Text to synthesize (required)
- `--ref`: Path to reference audio for voice cloning (required)
- `--out`: Output WAV file path (default: output.wav)
- `--device`: Device to use: `cpu` or `cuda` (default: cpu)
- `--language`: Language of text: chinese, english, japanese, korean, german, french, russian, portuguese, spanish, italian (default: english)
- `--max-seconds`: Maximum output duration in seconds (default: 20.0)
- `--temperature`: Sampling temperature (default: 0.9)
- `--top-k`: Top-k sampling (default: 50)
- `--top-p`: Top-p sampling (default: 1.0)
- `--sub-temperature`: Code predictor temperature (default: 0.9)
- `--sub-top-k`: Code predictor top-k (default: 50)
- `--greedy`: Use greedy decoding (temperature=0)
- `--target-rms`: Target audio RMS (default: 0.1)
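The sampling flags follow the usual temperature / top-k / top-p (nucleus) scheme. The exact implementation inside `infer_onnx.py` is not shown here; this is a generic NumPy sketch of what those knobs mean, with `--greedy` mapping to plain argmax:

```python
import numpy as np

def sample_token(logits, temperature=0.9, top_k=50, top_p=1.0, rng=None):
    """Illustrative temperature / top-k / top-p sampling over a logit vector."""
    rng = rng if rng is not None else np.random.default_rng(0)
    if temperature == 0:                      # --greedy: plain argmax
        return int(np.argmax(logits))
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if 0 < top_k < logits.shape[0]:           # top-k: keep the k best tokens
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_p < 1.0:                           # nucleus: smallest mass >= top_p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = cum - probs[order] < top_p     # always keeps the top token
        mask = np.zeros(probs.shape[0], dtype=bool)
        mask[order[keep]] = True
        probs = np.where(mask, probs, 0.0)
        probs /= probs.sum()
    return int(rng.choice(probs.shape[0], p=probs))

logits = np.array([0.0, 10.0, 0.0])
print(sample_token(logits, temperature=0.0))  # greedy -> 1
print(sample_token(logits, top_k=1))          # only the best token survives -> 1
```

The `--sub-*` flags apply the same idea to the code predictor's per-codebook distributions.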
Model Components
The ONNX export includes 6 optimized components:
| Component | Description | Input | Output |
|---|---|---|---|
| talker_prefill.onnx | Text prefill | embeddings, mask, positions | logits, hidden, KV cache |
| talker_decode.onnx | Token generation | embeddings, mask, positions, KV cache | logits, hidden, new KV cache |
| code_predictor.onnx | Multi-codec prediction | embeddings, steps, KV cache | logits, KV cache |
| vocoder.onnx | Speech synthesis | codec codes (16 codebooks) | audio waveform (24 kHz) |
| speaker_encoder.onnx | Speaker embedding | mel spectrogram | 1024-dim embedding |
| Embeddings | Token embeddings | .npy files | embedding vectors |
Performance
| Platform | Speed | Memory |
|---|---|---|
| CPU (4 cores) | ~10x real-time | ~2GB |
| NVIDIA GPU (RTX 3090) | ~90x real-time | ~4GB |
Voice Cloning Quality
The model achieves the following voice similarity scores on reference audio:
- Portuguese: 98.8%
- English: 97.5%
- Chinese: 97.2%
- Other languages: 96-98%
Similarity is measured as cosine similarity between speaker embeddings.
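The metric itself is straightforward; a minimal sketch of cosine similarity between two speaker embeddings (the 1024-dim vectors produced by speaker_encoder.onnx):

```python
import numpy as np

def voice_similarity(emb_a, emb_b):
    """Cosine similarity between two speaker embedding vectors."""
    emb_a = np.asarray(emb_a, dtype=np.float64)
    emb_b = np.asarray(emb_b, dtype=np.float64)
    return float(emb_a @ emb_b /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

# Identical embeddings score 1.0; orthogonal embeddings score 0.0.
print(voice_similarity(np.ones(1024), np.ones(1024)))  # -> 1.0
```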
Technical Details
Model Specs
- Hidden Size: 1024
- Layers: 28 (talker), 5 (code predictor)
- Attention Heads: 16 (talker), 16 (code predictor)
- KV Heads: 8
- Head Dim: 128
- Vocabulary: 3072 (talker), 2048 (code predictor)
- Codec Codebooks: 16
Audio Specs
- Sample Rate: 24000 Hz
- Bit Depth: 32-bit float
- Channels: Mono
- Codec: 16 codebooks @ 12Hz
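These specs imply some fixed bookkeeping: at a 12 Hz frame rate over 24 kHz audio, each codec frame spans 2000 samples, and with 16 codebooks the model emits 192 codec tokens per second of speech. A quick check of that arithmetic:

```python
# Derived constants from the audio specs above.
SAMPLE_RATE = 24000   # Hz
FRAME_RATE = 12       # codec frames per second
CODEBOOKS = 16        # codes per frame

samples_per_frame = SAMPLE_RATE // FRAME_RATE   # audio samples per codec frame
tokens_per_second = FRAME_RATE * CODEBOOKS      # codec tokens per second

print(samples_per_frame, tokens_per_second)     # -> 2000 192
```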
ONNX Specifications
- Opset Version: 18
- Optimization: Full graph optimization
- External Data: Models > 2GB use external data files
Limitations
- Voice cloning requires reference audio of at least 1 second
- Maximum output duration is limited by max-seconds parameter
- GPU inference requires CUDA-compatible NVIDIA GPU
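The 1-second minimum is easy to guard against before running inference. A hypothetical pre-check (the `check_reference` helper is illustrative, not part of `infer_onnx.py`):

```python
import numpy as np

MIN_REF_SECONDS = 1.0   # minimum reference-audio length noted above

def check_reference(samples, sample_rate=24000):
    """Return the clip duration, rejecting reference audio under 1 second."""
    duration = len(samples) / sample_rate
    if duration < MIN_REF_SECONDS:
        raise ValueError(f"reference audio too short: {duration:.2f}s")
    return duration

print(check_reference(np.zeros(72000)))  # a 3 s clip at 24 kHz -> 3.0
```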
Ethical Considerations
This model should be used responsibly. Be aware that:
- Voice cloning raises ethical concerns around consent and impersonation
- Generated content should not be used to deceive or harm others
- Respect copyright and intellectual property when using reference audio
License
Apache 2.0
Citation
If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)
```bibtex
@article{Qwen3-TTS,
  title={Qwen3-TTS Technical Report},
  author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}
```
Original Model
This ONNX export is based on Qwen/Qwen3-TTS-12Hz-0.6B-Base.