Qwen3-TTS-0.6B-CustomVoice-bf16-pruned-vocab-lite

An optimized version of Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice for efficient on-device inference with MLX on Apple Silicon.

Research

Efficient On-Device Text-to-Speech: A Post-Training Compression Pipeline for Qwen3 TTS on Apple Silicon

We present a compression pipeline that reduces Qwen3 TTS from 2.35 GB to 808 MB (67% reduction) while preserving audio quality. Techniques include vocabulary pruning via token map indirection, speech tokenizer pruning, and 4-bit quantization.

Read the paper | PDF

Optimizations

This model applies the two pruning techniques from the paper (4-bit quantization is not applied here), reducing size from 2.35 GB to 1.5 GB (~36%) while preserving full audio quality:

| Optimization | Before | After | Savings |
|---|---|---|---|
| Vocabulary Pruning | 151,936 tokens (622 MB) | 47,427 tokens (194 MB) | 428 MB (69%) |
| SpeechTokenizer Lite | Encoder + Decoder, fp32 (651 MB) | Decoder-only, fp16 (218 MB) | 433 MB (67%) |
| Total | 2.35 GB | 1.5 GB | ~860 MB (36%) |
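
The table's totals can be sanity-checked with simple arithmetic (sizes in MB, taken from the rows above):

```python
# Sizes in MB, as listed in the optimization table.
vocab_before, vocab_after = 622, 194
tok_before, tok_after = 651, 218
total_before_mb = 2.35 * 1024  # ~2406 MB

vocab_saved = vocab_before - vocab_after  # 428 MB (69% of 622)
tok_saved = tok_before - tok_after        # 433 MB (67% of 651)
total_saved = vocab_saved + tok_saved     # 861 MB, i.e. the table's ~860 MB

total_after_mb = total_before_mb - total_saved
print(round(total_after_mb / 1024, 2))              # ~1.51 GB remaining
print(round(total_saved / total_before_mb * 100))   # ~36% overall reduction
```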

Vocabulary Pruning

The original Qwen3 TTS inherits the full 152K-token multilingual LLM vocabulary, but the TTS model only ever uses ~47K of those tokens. We extract just the needed embedding rows and apply a text_token_map for ID remapping at inference time. This step is lossless: the kept embedding rows are bit-identical to the originals.
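
The mechanism can be sketched in a few lines of plain Python (a minimal illustration with made-up sizes and IDs; the real pruning operates on the model's safetensors embedding matrix):

```python
import random

# Hypothetical full embedding table, scaled down for illustration
# (the real table has 151,936 rows).
full_vocab, hidden = 1000, 8
embed = [[random.random() for _ in range(hidden)] for _ in range(full_vocab)]

# Token IDs the TTS path actually uses (really ~47,427 of 151,936).
kept_ids = [3, 17, 42, 999]

# Prune: copy only the needed rows. Lossless for every kept token.
pruned_embed = [embed[i] for i in kept_ids]

# text_token_map: original token ID -> compact row index.
text_token_map = {orig: new for new, orig in enumerate(kept_ids)}

# At inference time, remap the tokenizer's ID before the embedding lookup.
orig_id = 42
row = pruned_embed[text_token_map[orig_id]]
assert row == embed[orig_id]  # identical to the original embedding row
```

The indirection through the map costs one dictionary lookup per token but removes ~69% of the embedding storage.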

SpeechTokenizer Lite

The SpeechTokenizer encoder is used only for voice cloning, which CustomVoice mode does not need, so it is stripped entirely. The remaining decoder weights are converted from float32 to float16, halving their size; in practice this has no audible effect on generated audio.
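
The storage halving can be illustrated with Python's half-precision struct format (a sketch only; the real conversion operates on safetensors tensors, and float16 rounds any value needing more than 10 mantissa bits, so the example values below are chosen to be exactly representable):

```python
import struct

# Hypothetical decoder weight values, all exactly representable in fp16.
weights_fp32 = [0.15625, -2.5, 3.0]

# float32 -> float16 storage: 4 bytes per value become 2 bytes per value.
packed_fp16 = struct.pack(f"<{len(weights_fp32)}e", *weights_fp32)
restored = list(struct.unpack(f"<{len(weights_fp32)}e", packed_fp16))

print(len(packed_fp16))          # 6 bytes instead of 12: half the footprint
print(restored == weights_fp32)  # True for these exactly representable values
```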

Model Details

  • Architecture: Qwen3TTSForConditionalGeneration
  • Precision: bfloat16
  • Codec Rate: 12.5 Hz (80ms per token)
  • Audio Output: 24kHz mono WAV
  • Talker: 28 layers, 1024 hidden size, 16 attention heads (GQA with 8 KV heads)
  • CodePredictor: 5 layers, 16 codebook groups
  • RoPE: M-RoPE with sections [24, 20, 20]
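
At a 12.5 Hz codec rate, token counts scale directly with audio duration, which is where the 80 ms figure above comes from (a quick sanity check):

```python
CODEC_RATE_HZ = 12.5                  # codec frames per second of audio
MS_PER_TOKEN = 1000 / CODEC_RATE_HZ   # milliseconds of audio per codec token

print(MS_PER_TOKEN)                   # 80.0 ms per token, as stated above
print(10 * CODEC_RATE_HZ)             # 125.0 codec frames for 10 s of audio
```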

Speakers

| Speaker | Languages |
|---|---|
| Aiden | Multilingual |
| Serena | Multilingual |
| Vivian | Multilingual |
| Ryan | Multilingual |
| Uncle Fu | Multilingual |
| Ono Anna | Multilingual |
| Sohee | Multilingual |
| Eric | Sichuan Dialect |
| Dylan | Beijing Dialect |

Supported Languages

English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian, Beijing Dialect, Sichuan Dialect

Usage with Swift (Apple Silicon)

This model is designed for use with swift-qwen3-tts, a native Swift implementation using Apple MLX.

git clone https://github.com/AtomGradient/swift-qwen3-tts.git
cd swift-qwen3-tts

swift run Qwen3TTSDemo \
  --model /path/to/Qwen3-TTS-0.6B-CustomVoice-bf16-pruned-vocab-lite \
  --speaker Aiden \
  --text "Hello, my name is Aiden. Nice to meet you!" \
  --output output.wav

With Emotion Control

swift run Qwen3TTSDemo \
  --model /path/to/Qwen3-TTS-0.6B-CustomVoice-bf16-pruned-vocab-lite \
  --speaker Vivian \
  --instruct "Very happy and excited." \
  --text "I just got the best news ever!" \
  --output happy.wav

Performance

Tested on Apple Silicon (M-series):

| Metric | Value |
|---|---|
| Peak Memory | ~2.4 GB |
| Real-time Factor | ~0.8x (faster than real-time) |
| Model Load Time | ~2.5s |
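
Real-time factor is wall-clock synthesis time divided by the duration of the audio produced; values below 1.0 mean faster than real-time. A minimal illustration of the metric (the numbers are examples, not measurements):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock time to synthesize / duration of audio produced."""
    return generation_seconds / audio_seconds

# Example: 8 s of wall-clock time to synthesize 10 s of audio.
rtf = real_time_factor(8.0, 10.0)
print(rtf)        # 0.8 -> faster than real-time
print(rtf < 1.0)  # True
```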

Acknowledgments

Based on Qwen3-TTS by the Qwen Team at Alibaba Cloud.

