Qwen3-TTS-0.6B-CustomVoice-4bit-pruned-vocab-lite

A heavily optimized version of Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice for efficient on-device inference with MLX on Apple Silicon. Combines vocabulary pruning, SpeechTokenizer pruning, and 4-bit quantization for maximum size reduction.

Research

Efficient On-Device Text-to-Speech: A Post-Training Compression Pipeline for Qwen3 TTS on Apple Silicon

We present a compression pipeline that reduces Qwen3 TTS from 2.35 GB to 780 MB (67% reduction) while preserving audio quality. Techniques include vocabulary pruning via token map indirection, speech tokenizer pruning, and 4-bit quantization.

Read the paper | PDF

Optimizations

Three techniques are applied to reduce total size by ~67%:

| Optimization | Before | After | Savings |
|---|---|---|---|
| Vocabulary Pruning | 151,936 tokens (622 MB) | 47,427 tokens (194 MB) | 428 MB (69%) |
| SpeechTokenizer Lite | Encoder + Decoder, fp32 (651 MB) | Decoder-only, fp16 (218 MB) | 433 MB (67%) |
| 4-bit Quantization | bf16 linear layers (1.3 GB) | 4-bit affine (552 MB) | ~750 MB (58%) |
| Total | 2.35 GB | 780 MB | ~1.57 GB (67%) |

4-bit Quantization Details

  • Method: Affine quantization (min/max scaling per group)
  • Group Size: 64
  • Scope: 249 linear layers quantized, 17 embedding layers kept in bf16
  • Embeddings preserved: text_embedding, codec_embedding, and speech_tokenizer weights remain in bf16 to preserve speech quality and pacing

Quantizing embedding layers can cause speech pacing issues (longer audio output). All embedding layers are intentionally kept in bf16.
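The per-group affine scheme described above can be sketched in a few lines. This is an illustrative NumPy sketch (not the MLX implementation) of min/max affine quantization with group size 64 at 4 bits; in a real deployment the 4-bit codes would additionally be packed two per byte.

```python
import numpy as np

def quantize_affine(w, bits=4, group_size=64):
    """Affine (asymmetric min/max) quantization, one scale/offset per group."""
    flat = w.reshape(-1, group_size)
    wmin = flat.min(axis=1, keepdims=True)
    wmax = flat.max(axis=1, keepdims=True)
    levels = 2**bits - 1                        # 15 for 4-bit
    scale = (wmax - wmin) / levels
    scale = np.where(scale == 0, 1.0, scale)    # guard all-constant groups
    q = np.round((flat - wmin) / scale).astype(np.uint8)  # codes in [0, 15]
    return q, scale, wmin

def dequantize_affine(q, scale, wmin, shape):
    return (q.astype(np.float32) * scale + wmin).reshape(shape)

# Toy stand-in for one linear-layer weight matrix
rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)
q, scale, wmin = quantize_affine(w)
w_hat = dequantize_affine(q, scale, wmin, w.shape)
err = float(np.abs(w - w_hat).max())            # bounded by scale/2 per group
```

The per-element reconstruction error is bounded by half the group's scale, which is why small groups (64 here) keep 4-bit quantization accurate.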

Vocabulary Pruning

The original Qwen3 TTS inherits the full 152K-token multilingual LLM vocabulary, but TTS only uses ~47K of those tokens. We extract only the needed embedding rows and use a text_token_map to remap token IDs at inference time. This step is lossless: the kept embedding rows are identical to the originals.
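The indirection can be illustrated with a toy NumPy example (array and variable names here are hypothetical; the real pipeline operates on the model's safetensors embedding matrix):

```python
import numpy as np

# Toy stand-in for the full text embedding matrix (real: 151,936 x hidden)
full_vocab, dim = 1000, 16
rng = np.random.default_rng(0)
full_embedding = rng.standard_normal((full_vocab, dim)).astype(np.float32)

# Token IDs the TTS model actually uses (real: the ~47K kept tokens)
kept_ids = np.array([0, 5, 17, 42, 999])

# Prune: copy only the needed rows, and build an old-ID -> new-ID map
pruned_embedding = full_embedding[kept_ids]          # lossless row copy
text_token_map = {int(old): new for new, old in enumerate(kept_ids)}

# At inference, remap tokenizer output before the embedding lookup
token_ids = [42, 5, 0]
remapped = [text_token_map[t] for t in token_ids]
vecs = pruned_embedding[remapped]                    # identical to original rows
```

Because the kept rows are copied verbatim, the only runtime cost of the 69% size saving is the dictionary lookup per token.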

SpeechTokenizer Lite

The SpeechTokenizer encoder is used only for voice cloning, which CustomVoice mode doesn't need, so it is stripped entirely. The remaining decoder weights are converted from float32 to float16.
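Conceptually this is a filter-and-cast over the tokenizer's state dict. A minimal sketch with hypothetical key names (the real checkpoint's keys differ):

```python
import numpy as np

# Toy stand-in for the SpeechTokenizer state dict
state = {
    "encoder.conv.weight": np.ones((8, 8), dtype=np.float32),
    "decoder.conv.weight": np.full((8, 8), 0.5, dtype=np.float32),
    "decoder.out.bias":    np.zeros(8, dtype=np.float32),
}

# Drop encoder weights (voice cloning only) and cast the rest to fp16
lite = {k: v.astype(np.float16) for k, v in state.items()
        if not k.startswith("encoder.")}

bytes_before = sum(v.nbytes for v in state.values())
bytes_after = sum(v.nbytes for v in lite.values())
```

Dropping the encoder and halving the decoder's precision together account for the 651 MB → 218 MB reduction in the table above.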

Model Details

  • Architecture: Qwen3TTSForConditionalGeneration
  • Precision: 4-bit (linear layers), bf16 (embeddings)
  • Codec Rate: 12.5 Hz (80 ms per token)
  • Audio Output: 24 kHz mono WAV
  • Talker: 28 layers, 1024 hidden, 16 heads (GQA 8 KV heads)
  • CodePredictor: 5 layers, 16 codebook groups
  • RoPE: M-RoPE with sections [24, 20, 20]
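The codec-rate figures above follow directly from the 12.5 Hz token rate and the 24 kHz output rate; the arithmetic:

```python
codec_rate_hz = 12.5
sample_rate = 24_000

ms_per_token = 1000 / codec_rate_hz               # 80.0 ms of audio per codec step
samples_per_token = sample_rate / codec_rate_hz   # 1920 output samples per step
tokens_for_5s = 5 * codec_rate_hz                 # ~63 codec steps for a 5 s clip
```

Each codec step also carries 16 codebook group indices (one per CodePredictor group), so total token throughput is 16x the 12.5 Hz step rate.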

Speakers

| Speaker | Languages |
|---|---|
| Aiden | Multilingual |
| Serena | Multilingual |
| Vivian | Multilingual |
| Ryan | Multilingual |
| Uncle Fu | Multilingual |
| Ono Anna | Multilingual |
| Sohee | Multilingual |
| Eric | Sichuan Dialect |
| Dylan | Beijing Dialect |

Supported Languages

English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian, Beijing Dialect, Sichuan Dialect

Usage with Swift (Apple Silicon)

This model is designed for use with swift-qwen3-tts, a native Swift implementation using Apple MLX.

git clone https://github.com/AtomGradient/swift-qwen3-tts.git
cd swift-qwen3-tts

swift run Qwen3TTSDemo \
  --model /path/to/Qwen3-TTS-0.6B-CustomVoice-4bit-pruned-vocab-lite \
  --speaker Aiden \
  --text "Hello, my name is Aiden. Nice to meet you!" \
  --output output.wav

With Emotion Control

swift run Qwen3TTSDemo \
  --model /path/to/Qwen3-TTS-0.6B-CustomVoice-4bit-pruned-vocab-lite \
  --speaker Vivian \
  --instruct "Very happy and excited." \
  --text "I just got the best news ever!" \
  --output happy.wav

Performance

Tested on Apple Silicon (M-series):

| Metric | Value |
|---|---|
| Peak Memory | ~2.4 GB |
| Real-time Factor | ~0.8x (faster than real-time) |
| Model Load Time | ~2.5 s |
| Audio Quality | Near-identical to bf16 |
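For reference, the real-time factor above is synthesis wall-clock time divided by generated audio duration; values below 1.0 mean generation outruns playback:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means synthesis runs faster than real-time playback."""
    return synthesis_seconds / audio_seconds

# Example: 4 s of wall-clock time to generate a 5 s clip -> RTF 0.8
rtf = real_time_factor(4.0, 5.0)
```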

Audio Quality Notes

4-bit quantization introduces minimal quality degradation. In A/B testing, the audio is perceptually indistinguishable from the bf16 version. Due to stochastic sampling (temperature=0.9), audio length varies naturally between runs (typically 3-7 seconds for a short sentence), which is normal behavior and not related to quantization.

Acknowledgments

Based on Qwen3-TTS by the Qwen Team at Alibaba Cloud.
