Qwen3-TTS-12Hz-0.6B-Base-ONNX
Overview
Qwen3-TTS-12Hz-0.6B-Base-ONNX is the ONNX-optimized version of the Qwen3-TTS-12Hz-0.6B-Base model. This model enables efficient, cross-platform inference with support for both CPU and NVIDIA GPU acceleration.
Model Description
Qwen3-TTS-12Hz-0.6B-Base is a base model capable of rapid voice cloning from as little as 3 seconds of user audio. It supports 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) and features streaming generation with ultra-low latency.
Key Features
- Fast Voice Cloning: Clone voices from just 3 seconds of reference audio
- Multi-language Support: 10 major languages with native pronunciation
- Streaming Generation: End-to-end synthesis latency as low as 97ms
- Cross-platform Inference: CPU and NVIDIA GPU support via ONNX Runtime
- Voice Similarity: Achieves 96-99% speaker-embedding cosine similarity on cloned voices
Architecture
The model uses a discrete multi-codec LM architecture with:
- Talker: 28-layer transformer (hidden size: 1024, 8 KV heads)
- Code Predictor: 5-layer transformer for multi-codec generation
- Vocoder: BigVGAN-based speech decoder
- Speaker Encoder: ECAPA-TDNN for speaker embedding extraction
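The data flow between these four components can be sketched with shape-only stand-ins. Everything below is an illustrative stub (the function bodies are placeholders, not the real models); only the tensor shapes and the 12 Hz / 16-codebook / 24 kHz bookkeeping come from this card:

```python
import numpy as np

FRAME_RATE_HZ = 12        # codec frames per second
NUM_CODEBOOKS = 16        # multi-codec depth
SAMPLE_RATE = 24000       # vocoder output rate
SAMPLES_PER_FRAME = SAMPLE_RATE // FRAME_RATE_HZ  # 2000 samples per frame

def fake_speaker_encoder(mel):
    # ECAPA-TDNN stand-in: mel spectrogram -> 1024-dim speaker embedding
    return np.zeros(1024, dtype=np.float32)

def fake_talker(text_ids, spk_emb, num_frames):
    # 28-layer talker stand-in: one base codec token per 12 Hz frame
    return np.zeros(num_frames, dtype=np.int64)

def fake_code_predictor(base_tokens):
    # 5-layer code-predictor stand-in: expands each frame to 16 codebook codes
    return np.zeros((NUM_CODEBOOKS, base_tokens.shape[0]), dtype=np.int64)

def fake_vocoder(codes):
    # BigVGAN stand-in: (16, T) codec codes -> mono waveform at 24 kHz
    return np.zeros(codes.shape[1] * SAMPLES_PER_FRAME, dtype=np.float32)

spk = fake_speaker_encoder(np.zeros((80, 100)))
base = fake_talker(np.arange(10), spk, num_frames=24)   # 24 frames ~= 2 s
codes = fake_code_predictor(base)                       # shape (16, 24)
audio = fake_vocoder(codes)                             # 48000 samples
print(audio.shape[0] / SAMPLE_RATE)                     # -> 2.0 seconds
```

At 12 Hz, each codec frame covers 2000 audio samples, so frame count alone determines output duration.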
Supported Languages
Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
Installation
```bash
pip install onnxruntime librosa soundfile numpy torch transformers
```
For GPU inference, install the CUDA build of ONNX Runtime instead:
```bash
pip install onnxruntime-gpu
```
Quick Start
```bash
# Clone voice from reference audio
python infer_onnx.py \
  --text "Hello, this is a voice cloning test." \
  --ref reference.wav \
  --out output.wav \
  --device cuda \
  --language english
```
Parameters
- `--text`: Text to synthesize (required)
- `--ref`: Path to reference audio for voice cloning (required)
- `--out`: Output WAV file path (default: output.wav)
- `--device`: Device to use: `cpu` or `cuda` (default: cpu)
- `--language`: Language of text: chinese, english, japanese, korean, german, french, russian, portuguese, spanish, italian (default: english)
- `--max-seconds`: Maximum output duration in seconds (default: 20.0)
- `--temperature`: Sampling temperature (default: 0.9)
- `--top-k`: Top-k sampling (default: 50)
- `--top-p`: Top-p sampling (default: 1.0)
- `--sub-temperature`: Code predictor temperature (default: 0.9)
- `--sub-top-k`: Code predictor top-k (default: 50)
- `--greedy`: Use greedy decoding (temperature=0)
- `--target-rms`: Target audio RMS (default: 0.1)
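The sampling flags follow the usual temperature / top-k / top-p (nucleus) scheme. The exact implementation inside `infer_onnx.py` is not shown here; this is a generic NumPy sketch of what those knobs mean, with `--greedy` mapping to plain argmax:

```python
import numpy as np

def sample_token(logits, temperature=0.9, top_k=50, top_p=1.0, rng=None):
    """Illustrative temperature / top-k / top-p sampling over a logit vector."""
    rng = rng if rng is not None else np.random.default_rng(0)
    if temperature == 0:                      # --greedy: plain argmax
        return int(np.argmax(logits))
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if 0 < top_k < logits.shape[0]:           # top-k: keep the k best tokens
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_p < 1.0:                           # nucleus: smallest mass >= top_p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = cum - probs[order] < top_p     # always keeps the top token
        mask = np.zeros(probs.shape[0], dtype=bool)
        mask[order[keep]] = True
        probs = np.where(mask, probs, 0.0)
        probs /= probs.sum()
    return int(rng.choice(probs.shape[0], p=probs))

logits = np.array([0.0, 10.0, 0.0])
print(sample_token(logits, temperature=0.0))  # greedy -> 1
print(sample_token(logits, top_k=1))          # only the best token survives -> 1
```

The `--sub-*` flags apply the same idea to the code predictor's per-codebook distributions.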
Model Components
The ONNX export includes 6 optimized components:
| Component | Description | Input | Output |
|---|---|---|---|
| talker_prefill.onnx | Text prefill | embeddings, mask, positions | logits, hidden, KV cache |
| talker_decode.onnx | Token generation | embeddings, mask, positions, KV cache | logits, hidden, new KV cache |
| code_predictor.onnx | Multi-codec prediction | embeddings, steps, KV cache | logits, KV cache |
| vocoder.onnx | Speech synthesis | codec codes (16 codebooks) | audio waveform (24 kHz) |
| speaker_encoder.onnx | Speaker embedding | mel spectrogram | 1024-dim embedding |
| Embeddings | Token embeddings | .npy files | embedding vectors |
Performance
| Platform | Speed | Memory |
|---|---|---|
| CPU (4 cores) | ~10x real-time | ~2GB |
| NVIDIA GPU (RTX 3090) | ~90x real-time | ~4GB |
Voice Cloning Quality
The model achieves the following voice similarity scores on reference audio:
- Portuguese: 98.8%
- English: 97.5%
- Chinese: 97.2%
- Other languages: 96-98%
Similarity is measured as cosine similarity between speaker embeddings.
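The metric itself is straightforward; a minimal sketch of cosine similarity between two speaker embeddings (the 1024-dim vectors produced by speaker_encoder.onnx):

```python
import numpy as np

def voice_similarity(emb_a, emb_b):
    """Cosine similarity between two speaker embedding vectors."""
    emb_a = np.asarray(emb_a, dtype=np.float64)
    emb_b = np.asarray(emb_b, dtype=np.float64)
    return float(emb_a @ emb_b /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

# Identical embeddings score 1.0; orthogonal embeddings score 0.0.
print(voice_similarity(np.ones(1024), np.ones(1024)))  # -> 1.0
```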
Technical Details
Model Specs
- Hidden Size: 1024
- Layers: 28 (talker), 5 (code predictor)
- Attention Heads: 16 (talker), 16 (code predictor)
- KV Heads: 8
- Head Dim: 128
- Vocabulary: 3072 (talker), 2048 (code predictor)
- Codec Codebooks: 16
Audio Specs
- Sample Rate: 24000 Hz
- Bit Depth: 32-bit float
- Channels: Mono
- Codec: 16 codebooks @ 12Hz
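These specs imply some fixed bookkeeping: at a 12 Hz frame rate over 24 kHz audio, each codec frame spans 2000 samples, and with 16 codebooks the model emits 192 codec tokens per second of speech. A quick check of that arithmetic:

```python
# Derived constants from the audio specs above.
SAMPLE_RATE = 24000   # Hz
FRAME_RATE = 12       # codec frames per second
CODEBOOKS = 16        # codes per frame

samples_per_frame = SAMPLE_RATE // FRAME_RATE   # audio samples per codec frame
tokens_per_second = FRAME_RATE * CODEBOOKS      # codec tokens per second

print(samples_per_frame, tokens_per_second)     # -> 2000 192
```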
ONNX Specifications
- Opset Version: 18
- Optimization: Full graph optimization
- External Data: Models > 2GB use external data files
Limitations
- Voice cloning requires reference audio of at least 1 second
- Maximum output duration is limited by max-seconds parameter
- GPU inference requires CUDA-compatible NVIDIA GPU
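The 1-second minimum is easy to guard against before running inference. A hypothetical pre-check (the `check_reference` helper is illustrative, not part of `infer_onnx.py`):

```python
import numpy as np

MIN_REF_SECONDS = 1.0   # minimum reference-audio length noted above

def check_reference(samples, sample_rate=24000):
    """Return the clip duration, rejecting reference audio under 1 second."""
    duration = len(samples) / sample_rate
    if duration < MIN_REF_SECONDS:
        raise ValueError(f"reference audio too short: {duration:.2f}s")
    return duration

print(check_reference(np.zeros(72000)))  # a 3 s clip at 24 kHz -> 3.0
```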
Ethical Considerations
This model should be used responsibly. Be aware that:
- Voice cloning raises ethical concerns around consent and impersonation
- Generated content should not be used to deceive or harm others
- Respect copyright and intellectual property when using reference audio
License
Apache 2.0
Citation
If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)
```bibtex
@article{Qwen3-TTS,
  title={Qwen3-TTS Technical Report},
  author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}
```
Original Model
This ONNX export is based on Qwen/Qwen3-TTS-12Hz-0.6B-Base.