Qwen3-TTS-12Hz-1.7B-Base – MLX fp32

Full-precision (float32) MLX conversion of Qwen/Qwen3-TTS-12Hz-1.7B-Base for Apple Silicon.

Converted using mlx-audio 0.3.2 (from GitHub main, 2026-02-08).

Why fp32?

The existing mlx-community conversions are bf16. On Apple Silicon (tested on M5 MacBook Pro):

  • fp32 and bf16 run at the same speed: the model is memory-bandwidth-bound during decode, not compute-bound
  • fp32 produces noticeably better audio quality: higher-precision weights give cleaner voice cloning

There's no reason to use bf16 on Apple Silicon for this model. You get free quality improvement with zero speed penalty.

Benchmark (M5 MacBook Pro, macOS 26.2)

Model               Wall time   Audio duration   RTF    Quality
1.7B fp32 (this)    ~25s        ~9s              2.8x   Best
1.7B bf16           ~25s        ~9s              2.8x   Good
0.6B fp32           ~15s        ~8s              1.8x   Noticeably worse

RTF = Real-Time Factor (wall time divided by audio duration; lower is faster). All models generate slower than real time on Apple Silicon: TTS autoregressive decode is memory-bandwidth-bound (M5: 153 GB/s). Qwen's paper reports RTF = 0.31x, but that figure is on an NVIDIA GPU with vLLM + torch.compile + CUDA Graph.
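As a quick sanity check on the table, RTF here is simply wall-clock generation time divided by the duration of the generated audio (values below are the approximate 1.7B fp32 numbers from the benchmark):

```python
# RTF = wall-clock generation time / duration of generated audio.
# Approximate benchmark numbers from the table above (M5 MacBook Pro).
def rtf(wall_s: float, audio_s: float) -> float:
    return wall_s / audio_s

print(round(rtf(25, 9), 1))  # ~2.8 for 1.7B fp32
```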

Usage

Requires mlx-audio>=0.3.2 and transformers==5.0.0rc3:

pip install git+https://github.com/Blaizzy/mlx-audio.git transformers==5.0.0rc3

from mlx_audio.tts.utils import load_model
import mlx.core as mx
import numpy as np
import soundfile as sf

model = load_model("cr2k2/Qwen3-TTS-12Hz-1.7B-Base-fp32")

results = list(model.generate(
    text="Hello, this is a test of voice cloning.",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
))

audio = np.concatenate([np.array(r.audio, copy=False) for r in results])
sf.write("output.wav", audio.astype(np.float32), 24000)
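If soundfile is not available, the float output can also be written as 16-bit PCM with the stdlib wave module. This is a minimal sketch: a synthetic sine stands in for the concatenated model output, and the 24000 Hz rate matches the model's output sample rate.

```python
# Sketch: write 24 kHz mono float audio as 16-bit PCM WAV with the stdlib
# wave module. A synthetic sine stands in for the model output here.
import wave

import numpy as np

SR = 24000  # Qwen3-TTS output sample rate
t = np.linspace(0, 1.0, SR, endpoint=False)
audio = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # placeholder for `audio` above

# Clamp to [-1, 1] and scale to signed 16-bit integer range.
pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)

with wave.open("output.wav", "wb") as w:
    w.setnchannels(1)   # mono
    w.setsampwidth(2)   # 16-bit samples
    w.setframerate(SR)
    w.writeframes(pcm.tobytes())
```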

Conversion

Converted from the original HuggingFace weights using mlx-audio 0.3.2 (installed from GitHub main):

pip install git+https://github.com/Blaizzy/mlx-audio.git
python -m mlx_audio.convert --hf-path Qwen/Qwen3-TTS-12Hz-1.7B-Base --mlx-path models/Qwen3-TTS-12Hz-1.7B-Base-fp32 --dtype float32

Previously converted with an older mlx-audio version. Reconverted on 2026-02-08 with 0.3.2 to pick up fixes including #407 (convert skipping subdirs, voice cloning silence fix).
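A converted shard can be sanity-checked for float32 ("F32") tensors without loading the model, by reading just the safetensors header (8-byte little-endian header length, then a JSON header mapping tensor names to dtype/shape/offsets). The sketch below builds a tiny valid file by hand to demonstrate the check; in practice you would point it at the .safetensors file(s) in the converted directory (filenames vary by conversion).

```python
# Sketch: read only the safetensors JSON header to list tensor dtypes.
# Layout: u64 little-endian header length, then JSON header, then raw data.
import json
import struct

def tensor_dtypes(path):
    with open(path, "rb") as f:
        (hlen,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(hlen))
    # "__metadata__" is an optional non-tensor entry in the header.
    return {k: v["dtype"] for k, v in header.items() if k != "__metadata__"}

# Build a minimal valid safetensors file with one float32 tensor to demo;
# replace the path with a shard from the converted model directory.
data = struct.pack("<4f", 0.0, 1.0, 2.0, 3.0)
hdr = json.dumps(
    {"w": {"dtype": "F32", "shape": [4], "data_offsets": [0, len(data)]}}
).encode()
with open("demo.safetensors", "wb") as f:
    f.write(struct.pack("<Q", len(hdr)) + hdr + data)

print(tensor_dtypes("demo.safetensors"))  # {'w': 'F32'}
```

A fully fp32 conversion should report "F32" for every tensor; any "BF16" entries indicate half-precision weights slipped through.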

Model Details

  • Base model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
  • Format: MLX safetensors (float32)
  • Size: ~4.2 GB
  • Architecture: Qwen3-TTS dual-track LM with 12.5Hz multi-codebook tokenizer
  • Parameters: 1.7B
  • Languages: English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
  • Capabilities: Voice cloning (3-second reference), multilingual TTS
  • License: Apache 2.0