Qwen3-TTS-0.6B-CustomVoice-4bit-pruned-vocab-lite
A heavily optimized version of Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice for efficient on-device inference with MLX on Apple Silicon. Combines vocabulary pruning, SpeechTokenizer pruning, and 4-bit quantization for maximum size reduction.
Research
We present a compression pipeline that reduces Qwen3 TTS from 2.35 GB to 780 MB (67% reduction) while preserving audio quality. Techniques include vocabulary pruning via token-map indirection, speech tokenizer pruning, and 4-bit quantization.
Optimizations
Three techniques are applied to reduce total size by ~67%:
| Optimization | Before | After | Savings |
|---|---|---|---|
| Vocabulary Pruning | 151,936 tokens (622 MB) | 47,427 tokens (194 MB) | 428 MB (69%) |
| SpeechTokenizer Lite | Encoder + Decoder, fp32 (651 MB) | Decoder-only, fp16 (218 MB) | 433 MB (67%) |
| 4-bit Quantization | bf16 linear layers (1.3 GB) | 4-bit affine (552 MB) | ~750 MB (58%) |
| Total | 2.35 GB | 780 MB | ~1.57 GB (67%) |
4-bit Quantization Details
- Method: Affine quantization (min/max scaling per group)
- Group Size: 64
- Scope: 249 linear layers quantized, 17 embedding layers kept in bf16
- Embeddings preserved: `text_embedding`, `codec_embedding`, and `speech_tokenizer` weights remain in bf16 to preserve speech quality and pacing

Quantizing embedding layers can cause speech pacing issues (longer audio output), so all embedding layers are intentionally kept in bf16.
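The min/max affine scheme above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the actual MLX quantizer: each group of 64 weights gets its own scale and offset derived from the group's min and max, and 4-bit codes index 16 evenly spaced levels between them.

```python
import numpy as np

def quantize_affine(w, group_size=64, bits=4):
    """Affine (min/max) quantization, one scale/offset per group of weights."""
    levels = 2**bits - 1                       # 15 steps -> 16 levels for 4-bit
    w = w.reshape(-1, group_size)              # one group per row
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)   # guard all-constant groups
    q = np.round((w - w_min) / scale).astype(np.uint8)  # codes in [0, 15]
    return q, scale, w_min

def dequantize_affine(q, scale, w_min):
    return q * scale + w_min

w = np.random.randn(4, 64).astype(np.float32)  # toy weight matrix, 4 groups of 64
q, scale, w_min = quantize_affine(w)
w_hat = dequantize_affine(q, scale, w_min)
# rounding error is bounded by half a quantization step per group
assert np.all(np.abs(w - w_hat) <= scale / 2 + 1e-6)
```

The per-group scaling is what makes group size 64 matter: smaller groups track local weight ranges more tightly (less error) at the cost of storing more scales and offsets.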
Vocabulary Pruning
The original Qwen3 TTS inherits the full 152K multilingual LLM vocabulary, but TTS only uses ~47K tokens. We extract only the needed embedding rows and use a `text_token_map` for ID remapping at inference time. This is lossless: the kept embeddings are identical to the originals.
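The indirection works as follows (a minimal sketch with toy sizes; the real `kept_ids` set contains the 47,427 token IDs the TTS frontend can actually emit):

```python
import numpy as np

full_vocab, dim = 151_936, 1024
embedding = np.random.randn(full_vocab, dim).astype(np.float32)

# Toy stand-in for the ~47K ids actually used by TTS
kept_ids = np.array([0, 5, 42, 151_935])
pruned_embedding = embedding[kept_ids]          # extract only the needed rows

# text_token_map: original token id -> row index in the pruned table
text_token_map = {int(orig): new for new, orig in enumerate(kept_ids)}

# At inference, tokenizer output is remapped before the embedding lookup
token_ids = [5, 151_935]
rows = [text_token_map[t] for t in token_ids]
assert np.array_equal(pruned_embedding[rows], embedding[token_ids])  # lossless
```

Because the kept rows are copied verbatim, any token the model can actually see produces bit-identical embeddings to the original checkpoint.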
SpeechTokenizer Lite
The SpeechTokenizer encoder (used only for voice cloning) is stripped since CustomVoice mode doesn't need it. Decoder weights are converted from float32 to float16.
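Conceptually the "lite" conversion is a key filter plus a dtype cast over the checkpoint's state dict. The key names below are hypothetical placeholders, not the real checkpoint layout:

```python
import numpy as np

# Toy state dict standing in for the SpeechTokenizer checkpoint
state = {
    "encoder.conv.weight": np.zeros((8, 8), dtype=np.float32),   # voice cloning only
    "decoder.conv.weight": np.random.randn(8, 8).astype(np.float32),
}

# Drop encoder weights (unused in CustomVoice mode), cast the rest fp32 -> fp16
lite = {
    k: v.astype(np.float16)
    for k, v in state.items()
    if not k.startswith("encoder.")
}

assert list(lite) == ["decoder.conv.weight"]
assert lite["decoder.conv.weight"].dtype == np.float16
```

Unlike the vocabulary pruning, the fp32 to fp16 cast is technically lossy, but decoder weights rarely need more than fp16 dynamic range, which is why the quality impact is negligible.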
Model Details
- Architecture: Qwen3TTSForConditionalGeneration
- Precision: 4-bit (linear layers), bf16 (embeddings)
- Codec Rate: 12.5 Hz (80ms per token)
- Audio Output: 24kHz mono WAV
- Talker: 28 layers, 1024 hidden, 16 heads (GQA 8 KV heads)
- CodePredictor: 5 layers, 16 codebook groups
- RoPE: M-RoPE with sections [24, 20, 20]
Speakers
| Speaker | Languages |
|---|---|
| Aiden | Multilingual |
| Serena | Multilingual |
| Vivian | Multilingual |
| Ryan | Multilingual |
| Uncle Fu | Multilingual |
| Ono Anna | Multilingual |
| Sohee | Multilingual |
| Eric | Sichuan Dialect |
| Dylan | Beijing Dialect |
Supported Languages
English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian, Beijing Dialect, Sichuan Dialect
Usage with Swift (Apple Silicon)
This model is designed for use with swift-qwen3-tts, a native Swift implementation using Apple MLX.
```shell
git clone https://github.com/AtomGradient/swift-qwen3-tts.git
cd swift-qwen3-tts
swift run Qwen3TTSDemo \
  --model /path/to/Qwen3-TTS-0.6B-CustomVoice-4bit-pruned-vocab-lite \
  --speaker Aiden \
  --text "Hello, my name is Aiden. Nice to meet you!" \
  --output output.wav
```
With Emotion Control
```shell
swift run Qwen3TTSDemo \
  --model /path/to/Qwen3-TTS-0.6B-CustomVoice-4bit-pruned-vocab-lite \
  --speaker Vivian \
  --instruct "Very happy and excited." \
  --text "I just got the best news ever!" \
  --output happy.wav
```
Performance
Tested on Apple Silicon (M-series):
| Metric | Value |
|---|---|
| Peak Memory | ~2.4 GB |
| Real-time Factor | ~0.8x (faster than real-time) |
| Model Load Time | ~2.5s |
| Audio Quality | Near-identical to bf16 |
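Real-time factor here means wall-clock generation time divided by the duration of the audio produced, so values under 1.0 are faster than real time. A trivial illustration (the 4-second figure is a hypothetical example, not a measured benchmark):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock generation time / duration of audio produced."""
    return generation_seconds / audio_seconds

# e.g. generating 4.0 s of audio in 3.2 s of wall-clock time gives RTF 0.8
rtf = real_time_factor(3.2, 4.0)
assert abs(rtf - 0.8) < 1e-9
```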
Audio Quality Notes
4-bit quantization introduces minimal quality degradation. In A/B testing, the audio is perceptually indistinguishable from the bf16 version. Due to stochastic sampling (temperature=0.9), audio length varies naturally between runs (typically 3-7 seconds for a short sentence), which is normal behavior and not related to quantization.
Acknowledgments
Based on Qwen3-TTS by the Qwen Team at Alibaba Cloud.
Links
- Swift Inference Engine: swift-qwen3-tts
- bf16 Version: Qwen3-TTS-0.6B-CustomVoice-bf16-pruned-vocab-lite
- Original Model: Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice