Qwen3-TTS-12Hz-0.6B-Base-ONNX

Overview

Qwen3-TTS-12Hz-0.6B-Base-ONNX is the ONNX-optimized version of the Qwen3-TTS-12Hz-0.6B-Base model. This model enables efficient, cross-platform inference with support for both CPU and NVIDIA GPU acceleration.

Model Description

Qwen3-TTS-12Hz-0.6B-Base is a base model capable of rapid voice cloning from as little as 3 seconds of user-provided reference audio. It supports 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) and features streaming generation with ultra-low latency.

Key Features

  • Fast Voice Cloning: Clone voices from just 3 seconds of reference audio
  • Multi-language Support: 10 major languages with native pronunciation
  • Streaming Generation: End-to-end synthesis latency as low as 97ms
  • Cross-platform Inference: CPU and NVIDIA GPU support via ONNX Runtime
  • Voice Similarity: speaker-embedding cosine similarity of roughly 96-99% across languages

Architecture

The model uses a discrete multi-codec LM architecture with:

  • Talker: 28-layer transformer (hidden size: 1024, 8 KV heads)
  • Code Predictor: 5-layer transformer for multi-codec generation
  • Vocoder: BigVGAN-based speech decoder
  • Speaker Encoder: ECAPA-TDNN for speaker embedding extraction
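The components above chain into an autoregressive pipeline: the talker emits hidden states step by step, the code predictor turns each hidden state into one 12 Hz frame of 16 codec codes, and the vocoder decodes the frames into a 24 kHz waveform. A toy sketch of that flow, with stand-in function bodies (the real ONNX components take and return tensors with different signatures; only the frame rate, codebook count, and sample rate come from this card's specs):

```python
CODEBOOKS = 16          # codec codebooks per frame
FRAME_RATE = 12         # frames per second (the "12Hz" in the model name)
SAMPLE_RATE = 24_000    # output sample rate
SAMPLES_PER_FRAME = SAMPLE_RATE // FRAME_RATE  # 2000 samples per frame

def talker_step(state):
    """Stand-in for talker_decode.onnx: one autoregressive step."""
    return state + 1, [0.0] * 1024  # (new state, 1024-dim hidden vector)

def code_predictor(hidden):
    """Stand-in for code_predictor.onnx: 16 codes per 12 Hz frame."""
    return [0] * CODEBOOKS

def vocoder(frames):
    """Stand-in for vocoder.onnx: codec frames -> 24 kHz waveform."""
    return [0.0] * (len(frames) * SAMPLES_PER_FRAME)

def synthesize(num_frames):
    state, frames = 0, []
    for _ in range(num_frames):
        state, hidden = talker_step(state)
        frames.append(code_predictor(hidden))
    return vocoder(frames)

audio = synthesize(num_frames=12)  # 12 frames = 1 second of audio
print(len(audio))                  # 24000 samples
```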

Supported Languages

Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian

Installation

pip install onnxruntime librosa soundfile numpy torch transformers

For GPU inference, install the CUDA build of ONNX Runtime instead of the CPU package (the two packages should not be installed side by side):

pip install onnxruntime-gpu
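A minimal sketch of how a script might pick an execution-provider list for the --device flag, falling back to CPU when CUDA is unavailable. The helper name is hypothetical (not from infer_onnx.py); the provider names are the standard ONNX Runtime ones:

```python
def pick_providers(device, available):
    """Choose an ONNX Runtime provider list for the requested device.

    `available` is what onnxruntime.get_available_providers() returns;
    it is passed in here so the logic is easy to test in isolation.
    """
    if device == "cuda" and "CUDAExecutionProvider" in available:
        # CPU stays in the list as a fallback for unsupported ops.
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]
```

The returned list would then be passed as the `providers` argument to `onnxruntime.InferenceSession`.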

Quick Start

# Clone voice from reference audio
python infer_onnx.py \
    --text "Hello, this is a voice cloning test." \
    --ref reference.wav \
    --out output.wav \
    --device cuda \
    --language english

Command Line Usage

python infer_onnx.py \
    --text "Your text here" \
    --ref reference.wav \
    --out output.wav \
    --device cuda \
    --language english

Parameters

  • --text: Text to synthesize (required)
  • --ref: Path to reference audio for voice cloning (required)
  • --out: Output WAV file path (default: output.wav)
  • --device: Device to use: cpu or cuda (default: cpu)
  • --language: Language of text: chinese, english, japanese, korean, german, french, russian, portuguese, spanish, italian (default: english)
  • --max-seconds: Maximum output duration in seconds (default: 20.0)
  • --temperature: Sampling temperature (default: 0.9)
  • --top-k: Top-k sampling (default: 50)
  • --top-p: Top-p sampling (default: 1.0)
  • --sub-temperature: Code predictor temperature (default: 0.9)
  • --sub-top-k: Code predictor top-k (default: 50)
  • --greedy: Use greedy decoding (temperature=0)
  • --target-rms: Target audio RMS (default: 0.1)
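The sampling knobs above interact in the usual way: logits are divided by the temperature, truncated to the top-k candidates, then to the smallest nucleus whose probability mass reaches top-p, and a token is drawn from what remains. A toy reimplementation of that scheme (an illustration of the standard technique, not the actual code in infer_onnx.py, which may differ in detail):

```python
import math
import random

def sample_token(logits, temperature=0.9, top_k=50, top_p=1.0, seed=None):
    """Temperature / top-k / top-p sampling over a list of raw logits."""
    if temperature == 0:
        # --greedy: just take the argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    # top-k: keep only the k highest-scoring token indices.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    m = max(scaled[i] for i in order)
    probs = [(i, math.exp(scaled[i] - m)) for i in order]  # stable softmax
    total = sum(p for _, p in probs)
    probs = [(i, p / total) for i, p in probs]
    # top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break
    # Draw from the surviving candidates, renormalized to their mass.
    r = random.Random(seed).random() * mass
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```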

Model Components

The ONNX export includes 6 optimized components:

  • talker_prefill.onnx: Text prefill. Inputs: embeddings, mask, positions. Outputs: logits, hidden, KV cache
  • talker_decode.onnx: Token generation. Inputs: embeddings, mask, positions, KV cache. Outputs: logits, hidden, new KV cache
  • code_predictor.onnx: Multi-codec prediction. Inputs: embeddings, steps, KV cache. Outputs: logits, KV cache
  • vocoder.onnx: Speech synthesis. Input: codec codes (16 codebooks). Output: audio waveform (24 kHz)
  • speaker_encoder.onnx: Speaker embedding. Input: mel spectrogram. Output: 1024-dim embedding
  • Embeddings: Token embedding tables. Input: .npy files. Output: embedding vectors

Performance

  • CPU (4 cores): ~10x real-time, ~2 GB memory
  • NVIDIA GPU (RTX 3090): ~90x real-time, ~4 GB memory
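The "x real-time" figures are a speedup ratio: seconds of audio produced per second of wall-clock time, where values above 1 mean faster than playback. A one-liner showing how such a number would be computed (a generic definition, not taken from the benchmark script):

```python
def real_time_factor(audio_seconds, wall_seconds):
    """Speedup over real time: >1 means synthesis outpaces playback."""
    return audio_seconds / wall_seconds

# e.g. 20 s of audio synthesized in 2 s of wall time:
print(real_time_factor(20.0, 2.0))  # 10.0, i.e. "10x real-time"
```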

Voice Cloning Quality

The model achieves the following voice similarity scores on reference audio:

  • Portuguese: 98.8%
  • English: 97.5%
  • Chinese: 97.2%
  • Other languages: 96-98%

Similarity is measured as cosine similarity between speaker embeddings.
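Concretely, that metric is the cosine of the angle between the two speaker embeddings (here, the 1024-dim vectors produced by speaker_encoder.onnx for the reference and the generated audio). A self-contained sketch of the computation:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```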

Technical Details

Model Specs

  • Hidden Size: 1024
  • Layers: 28 (talker), 5 (code predictor)
  • Attention Heads: 16 (talker), 16 (code predictor)
  • KV Heads: 8
  • Head Dim: 128
  • Vocabulary: 3072 (talker), 2048 (code predictor)
  • Codec Codebooks: 16

Audio Specs

  • Sample Rate: 24000 Hz
  • Bit Depth: 32-bit float
  • Channels: Mono
  • Codec: 16 codebooks @ 12Hz
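Putting those numbers together: one second of audio is 12 frames of 16 codes each, i.e. 192 codec tokens, decoded into 24,000 samples (2,000 samples per frame). A small helper, hypothetical but derived directly from the specs above, showing the token budget implied by a duration cap such as --max-seconds:

```python
FRAME_RATE_HZ = 12
SAMPLE_RATE_HZ = 24_000
NUM_CODEBOOKS = 16

def codec_budget(seconds):
    """(frames, codec tokens) needed to cover a given duration."""
    frames = int(seconds * FRAME_RATE_HZ)
    return frames, frames * NUM_CODEBOOKS

print(SAMPLE_RATE_HZ // FRAME_RATE_HZ)  # 2000 samples per frame
print(codec_budget(20.0))               # (240, 3840) for the default cap
```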

ONNX Specifications

  • Opset Version: 18
  • Optimization: Full graph optimization
  • External Data: Models > 2GB use external data files

Limitations

  • Voice cloning requires reference audio of at least 1 second
  • Maximum output duration is limited by max-seconds parameter
  • GPU inference requires CUDA-compatible NVIDIA GPU

Ethical Considerations

This model should be used responsibly. Be aware that:

  • Voice cloning raises ethical concerns around consent and impersonation
  • Generated content should not be used to deceive or harm others
  • Respect copyright and intellectual property when using reference audio

License

Apache 2.0

Citation

If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)

@article{Qwen3-TTS,
  title={Qwen3-TTS Technical Report},
  author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}

Original Model

This ONNX export is based on Qwen/Qwen3-TTS-12Hz-0.6B-Base.
