Qwen3-TTS-0.6B-CustomVoice-4bit-pruned-vocab-lite
A heavily optimized version of Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice for efficient on-device inference with MLX on Apple Silicon. Combines vocabulary pruning, SpeechTokenizer pruning, and 4-bit quantization for maximum size reduction.
Research
We present a compression pipeline that reduces Qwen3 TTS from 2.35 GB to 780 MB (67% reduction) while preserving audio quality. Techniques include vocabulary pruning via token-map indirection, speech tokenizer pruning, and 4-bit quantization.
Optimizations
Three techniques are applied to reduce total size by ~67%:
| Optimization | Before | After | Savings |
|---|---|---|---|
| Vocabulary Pruning | 151,936 tokens (622 MB) | 47,427 tokens (194 MB) | 428 MB (69%) |
| SpeechTokenizer Lite | Encoder + Decoder, fp32 (651 MB) | Decoder-only, fp16 (218 MB) | 433 MB (67%) |
| 4-bit Quantization | bf16 linear layers (1.3 GB) | 4-bit affine (552 MB) | ~750 MB (58%) |
| Total | 2.35 GB | 780 MB | ~1.57 GB (67%) |
4-bit Quantization Details
- Method: Affine quantization (min/max scaling per group)
- Group Size: 64
- Scope: 249 linear layers quantized, 17 embedding layers kept in bf16
- Embeddings preserved: `text_embedding`, `codec_embedding`, and `speech_tokenizer` weights remain in bf16 to preserve speech quality and pacing

Quantizing embedding layers can cause speech pacing issues (longer audio output), so all embedding layers are intentionally kept in bf16.
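The min/max affine scheme above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the actual MLX quantizer: each group of 64 weights gets its own scale and offset derived from the group's min and max, and 4-bit codes index 16 evenly spaced levels between them.

```python
import numpy as np

def quantize_affine(w, group_size=64, bits=4):
    """Affine (min/max) quantization, one scale/offset per group of weights."""
    levels = 2**bits - 1                       # 15 steps -> 16 levels for 4-bit
    w = w.reshape(-1, group_size)              # one group per row
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)   # guard all-constant groups
    q = np.round((w - w_min) / scale).astype(np.uint8)  # codes in [0, 15]
    return q, scale, w_min

def dequantize_affine(q, scale, w_min):
    return q * scale + w_min

w = np.random.randn(4, 64).astype(np.float32)  # toy weight matrix, 4 groups of 64
q, scale, w_min = quantize_affine(w)
w_hat = dequantize_affine(q, scale, w_min)
# rounding error is bounded by half a quantization step per group
assert np.all(np.abs(w - w_hat) <= scale / 2 + 1e-6)
```

The per-group scaling is what makes group size 64 matter: smaller groups track local weight ranges more tightly (less error) at the cost of storing more scales and offsets.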
Vocabulary Pruning
The original Qwen3 TTS inherits the full 152K multilingual LLM vocabulary, but TTS only uses ~47K tokens. We extract only the needed embedding rows and use a `text_token_map` for ID remapping at inference time. This is lossless: the kept embeddings are identical to the originals.
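The indirection works as follows (a minimal sketch with toy sizes; the real `kept_ids` set contains the 47,427 token IDs the TTS frontend can actually emit):

```python
import numpy as np

full_vocab, dim = 151_936, 1024
embedding = np.random.randn(full_vocab, dim).astype(np.float32)

# Toy stand-in for the ~47K ids actually used by TTS
kept_ids = np.array([0, 5, 42, 151_935])
pruned_embedding = embedding[kept_ids]          # extract only the needed rows

# text_token_map: original token id -> row index in the pruned table
text_token_map = {int(orig): new for new, orig in enumerate(kept_ids)}

# At inference, tokenizer output is remapped before the embedding lookup
token_ids = [5, 151_935]
rows = [text_token_map[t] for t in token_ids]
assert np.array_equal(pruned_embedding[rows], embedding[token_ids])  # lossless
```

Because the kept rows are copied verbatim, any token the model can actually see produces bit-identical embeddings to the original checkpoint.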
SpeechTokenizer Lite
The SpeechTokenizer encoder (used only for voice cloning) is stripped since CustomVoice mode doesn't need it. Decoder weights are converted from float32 to float16.
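Conceptually the "lite" conversion is a key filter plus a dtype cast over the checkpoint's state dict. The key names below are hypothetical placeholders, not the real checkpoint layout:

```python
import numpy as np

# Toy state dict standing in for the SpeechTokenizer checkpoint
state = {
    "encoder.conv.weight": np.zeros((8, 8), dtype=np.float32),   # voice cloning only
    "decoder.conv.weight": np.random.randn(8, 8).astype(np.float32),
}

# Drop encoder weights (unused in CustomVoice mode), cast the rest fp32 -> fp16
lite = {
    k: v.astype(np.float16)
    for k, v in state.items()
    if not k.startswith("encoder.")
}

assert list(lite) == ["decoder.conv.weight"]
assert lite["decoder.conv.weight"].dtype == np.float16
```

Unlike the vocabulary pruning, the fp32 to fp16 cast is technically lossy, but decoder weights rarely need more than fp16 dynamic range, which is why the quality impact is negligible.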
Model Details
- Architecture: Qwen3TTSForConditionalGeneration
- Precision: 4-bit (linear layers), bf16 (embeddings)
- Codec Rate: 12.5 Hz (80ms per token)
- Audio Output: 24kHz mono WAV
- Talker: 28 layers, 1024 hidden, 16 heads (GQA 8 KV heads)
- CodePredictor: 5 layers, 16 codebook groups
- RoPE: M-RoPE with sections [24, 20, 20]
Speakers
| Speaker | Languages |
|---|---|
| Aiden | Multilingual |
| Serena | Multilingual |
| Vivian | Multilingual |
| Ryan | Multilingual |
| Uncle Fu | Multilingual |
| Ono Anna | Multilingual |
| Sohee | Multilingual |
| Eric | Sichuan Dialect |
| Dylan | Beijing Dialect |
Supported Languages
English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian, Beijing Dialect, Sichuan Dialect
Usage with Swift (Apple Silicon)
This model is designed for use with swift-qwen3-tts, a native Swift implementation using Apple MLX.
```shell
git clone https://github.com/AtomGradient/swift-qwen3-tts.git
cd swift-qwen3-tts
swift run Qwen3TTSDemo \
  --model /path/to/Qwen3-TTS-0.6B-CustomVoice-4bit-pruned-vocab-lite \
  --speaker Aiden \
  --text "Hello, my name is Aiden. Nice to meet you!" \
  --output output.wav
```
With Emotion Control
```shell
swift run Qwen3TTSDemo \
  --model /path/to/Qwen3-TTS-0.6B-CustomVoice-4bit-pruned-vocab-lite \
  --speaker Vivian \
  --instruct "Very happy and excited." \
  --text "I just got the best news ever!" \
  --output happy.wav
```
Performance
Tested on Apple Silicon (M-series):
| Metric | Value |
|---|---|
| Peak Memory | ~2.4 GB |
| Real-time Factor | ~0.8x (faster than real-time) |
| Model Load Time | ~2.5s |
| Audio Quality | Near-identical to bf16 |
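Real-time factor here means wall-clock generation time divided by the duration of the audio produced, so values under 1.0 are faster than real time. A trivial illustration (the 4-second figure is a hypothetical example, not a measured benchmark):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock generation time / duration of audio produced."""
    return generation_seconds / audio_seconds

# e.g. generating 4.0 s of audio in 3.2 s of wall-clock time gives RTF 0.8
rtf = real_time_factor(3.2, 4.0)
assert abs(rtf - 0.8) < 1e-9
```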
Audio Quality Notes
4-bit quantization introduces minimal quality degradation. In A/B testing, the audio is perceptually indistinguishable from the bf16 version. Due to stochastic sampling (temperature=0.9), audio length varies naturally between runs (typically 3-7 seconds for a short sentence), which is normal behavior and not related to quantization.
Acknowledgments
Based on Qwen3-TTS by the Qwen Team at Alibaba Cloud.
Links
- Swift Inference Engine: swift-qwen3-tts
- bf16 Version: Qwen3-TTS-0.6B-CustomVoice-bf16-pruned-vocab-lite
- Original Model: Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice