Qwen3-TTS-0.6B-CustomVoice-bf16-pruned-vocab-lite

An optimized version of Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice for efficient on-device inference with MLX on Apple Silicon.

Research

Efficient On-Device Text-to-Speech: A Post-Training Compression Pipeline for Qwen3 TTS on Apple Silicon

We present a compression pipeline that reduces Qwen3 TTS from 2.35 GB to 808 MB (67% reduction) while preserving audio quality. Techniques include vocabulary pruning via token map indirection, speech tokenizer pruning, and 4-bit quantization.

Read the paper | PDF

Optimizations

This model applies the two pruning techniques from the paper (4-bit quantization is not applied here), reducing size from 2.35 GB to 1.5 GB (~36%) while preserving full audio quality:

| Optimization | Before | After | Savings |
|---|---|---|---|
| Vocabulary Pruning | 151,936 tokens (622 MB) | 47,427 tokens (194 MB) | 428 MB (69%) |
| SpeechTokenizer Lite | Encoder + Decoder, fp32 (651 MB) | Decoder-only, fp16 (218 MB) | 433 MB (67%) |
| Total | 2.35 GB | 1.5 GB | ~860 MB (36%) |
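
The table's totals can be sanity-checked with simple arithmetic (sizes in MB, taken from the rows above):

```python
# Sizes in MB, as listed in the optimization table.
vocab_before, vocab_after = 622, 194
tok_before, tok_after = 651, 218
total_before_mb = 2.35 * 1024  # ~2406 MB

vocab_saved = vocab_before - vocab_after  # 428 MB (69% of 622)
tok_saved = tok_before - tok_after        # 433 MB (67% of 651)
total_saved = vocab_saved + tok_saved     # 861 MB, i.e. the table's ~860 MB

total_after_mb = total_before_mb - total_saved
print(round(total_after_mb / 1024, 2))              # ~1.51 GB remaining
print(round(total_saved / total_before_mb * 100))   # ~36% overall reduction
```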

Vocabulary Pruning

The original Qwen3 TTS inherits the full 152K-token multilingual LLM vocabulary, but the TTS model only ever uses ~47K of those tokens. We extract just the needed embedding rows and apply a text_token_map for ID remapping at inference time. This step is lossless: the kept embedding rows are bit-identical to the originals.
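
The mechanism can be sketched in a few lines of plain Python (a minimal illustration with made-up sizes and IDs; the real pruning operates on the model's safetensors embedding matrix):

```python
import random

# Hypothetical full embedding table, scaled down for illustration
# (the real table has 151,936 rows).
full_vocab, hidden = 1000, 8
embed = [[random.random() for _ in range(hidden)] for _ in range(full_vocab)]

# Token IDs the TTS path actually uses (really ~47,427 of 151,936).
kept_ids = [3, 17, 42, 999]

# Prune: copy only the needed rows. Lossless for every kept token.
pruned_embed = [embed[i] for i in kept_ids]

# text_token_map: original token ID -> compact row index.
text_token_map = {orig: new for new, orig in enumerate(kept_ids)}

# At inference time, remap the tokenizer's ID before the embedding lookup.
orig_id = 42
row = pruned_embed[text_token_map[orig_id]]
assert row == embed[orig_id]  # identical to the original embedding row
```

The indirection through the map costs one dictionary lookup per token but removes ~69% of the embedding storage.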

SpeechTokenizer Lite

The SpeechTokenizer encoder is used only for voice cloning, which CustomVoice mode does not need, so it is stripped entirely. The remaining decoder weights are converted from float32 to float16, halving their size; in practice this has no audible effect on generated audio.
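
The storage halving can be illustrated with Python's half-precision struct format (a sketch only; the real conversion operates on safetensors tensors, and float16 rounds any value needing more than 10 mantissa bits, so the example values below are chosen to be exactly representable):

```python
import struct

# Hypothetical decoder weight values, all exactly representable in fp16.
weights_fp32 = [0.15625, -2.5, 3.0]

# float32 -> float16 storage: 4 bytes per value become 2 bytes per value.
packed_fp16 = struct.pack(f"<{len(weights_fp32)}e", *weights_fp32)
restored = list(struct.unpack(f"<{len(weights_fp32)}e", packed_fp16))

print(len(packed_fp16))          # 6 bytes instead of 12: half the footprint
print(restored == weights_fp32)  # True for these exactly representable values
```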

Model Details

  • Architecture: Qwen3TTSForConditionalGeneration
  • Precision: bfloat16
  • Codec Rate: 12.5 Hz (80ms per token)
  • Audio Output: 24kHz mono WAV
  • Talker: 28 layers, 1024 hidden size, 16 attention heads (GQA with 8 KV heads)
  • CodePredictor: 5 layers, 16 codebook groups
  • RoPE: M-RoPE with sections [24, 20, 20]
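
At a 12.5 Hz codec rate, token counts scale directly with audio duration, which is where the 80 ms figure above comes from (a quick sanity check):

```python
CODEC_RATE_HZ = 12.5                  # codec frames per second of audio
MS_PER_TOKEN = 1000 / CODEC_RATE_HZ   # milliseconds of audio per codec token

print(MS_PER_TOKEN)                   # 80.0 ms per token, as stated above
print(10 * CODEC_RATE_HZ)             # 125.0 codec frames for 10 s of audio
```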

Speakers

| Speaker | Languages |
|---|---|
| Aiden | Multilingual |
| Serena | Multilingual |
| Vivian | Multilingual |
| Ryan | Multilingual |
| Uncle Fu | Multilingual |
| Ono Anna | Multilingual |
| Sohee | Multilingual |
| Eric | Sichuan Dialect |
| Dylan | Beijing Dialect |

Supported Languages

English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian, Beijing Dialect, Sichuan Dialect

Usage with Swift (Apple Silicon)

This model is designed for use with swift-qwen3-tts, a native Swift implementation using Apple MLX.

git clone https://github.com/AtomGradient/swift-qwen3-tts.git
cd swift-qwen3-tts

swift run Qwen3TTSDemo \
  --model /path/to/Qwen3-TTS-0.6B-CustomVoice-bf16-pruned-vocab-lite \
  --speaker Aiden \
  --text "Hello, my name is Aiden. Nice to meet you!" \
  --output output.wav

With Emotion Control

swift run Qwen3TTSDemo \
  --model /path/to/Qwen3-TTS-0.6B-CustomVoice-bf16-pruned-vocab-lite \
  --speaker Vivian \
  --instruct "Very happy and excited." \
  --text "I just got the best news ever!" \
  --output happy.wav

Performance

Tested on Apple Silicon (M-series):

| Metric | Value |
|---|---|
| Peak Memory | ~2.4 GB |
| Real-time Factor | ~0.8x (faster than real-time) |
| Model Load Time | ~2.5s |
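
Real-time factor is wall-clock synthesis time divided by the duration of the audio produced; values below 1.0 mean faster than real-time. A minimal illustration of the metric (the numbers are examples, not measurements):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock time to synthesize / duration of audio produced."""
    return generation_seconds / audio_seconds

# Example: 8 s of wall-clock time to synthesize 10 s of audio.
rtf = real_time_factor(8.0, 10.0)
print(rtf)        # 0.8 -> faster than real-time
print(rtf < 1.0)  # True
```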

Acknowledgments

Based on Qwen3-TTS by the Qwen Team at Alibaba Cloud.

