Qwen3-TTS-12Hz-1.7B-Base – MLX fp32

Full-precision (float32) MLX conversion of Qwen/Qwen3-TTS-12Hz-1.7B-Base for Apple Silicon.

Converted using mlx-audio 0.3.2 (from GitHub main, 2026-02-08).

Why fp32?

The existing mlx-community conversions are bf16. On Apple Silicon (tested on M5 MacBook Pro):

  • fp32 and bf16 run at the same speed: the model is memory-bandwidth-bound during decode, not compute-bound
  • fp32 produces noticeably better audio quality: higher-precision weights give cleaner voice cloning

There's no reason to use bf16 on Apple Silicon for this model. You get free quality improvement with zero speed penalty.

Benchmark (M5 MacBook Pro, macOS 26.2)

Model               Wall time   Audio duration   RTF    Quality
1.7B fp32 (this)    ~25s        ~9s              2.8x   Best
1.7B bf16           ~25s        ~9s              2.8x   Good
0.6B fp32           ~15s        ~8s              1.8x   Noticeably worse

RTF = Real-Time Factor (wall time divided by audio duration; lower is faster). All models generate slower than real time on Apple Silicon: TTS autoregressive decode is memory-bandwidth-bound (M5: 153 GB/s). Qwen's paper reports RTF = 0.31x, but that figure is on an NVIDIA GPU with vLLM + torch.compile + CUDA Graph.
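As a quick sanity check on the table, RTF here is simply wall-clock generation time divided by the duration of the generated audio (values below are the approximate 1.7B fp32 numbers from the benchmark):

```python
# RTF = wall-clock generation time / duration of generated audio.
# Approximate benchmark numbers from the table above (M5 MacBook Pro).
def rtf(wall_s: float, audio_s: float) -> float:
    return wall_s / audio_s

print(round(rtf(25, 9), 1))  # ~2.8 for 1.7B fp32
```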

Usage

Requires mlx-audio>=0.3.2 and transformers==5.0.0rc3:

pip install git+https://github.com/Blaizzy/mlx-audio.git transformers==5.0.0rc3

from mlx_audio.tts.utils import load_model
import mlx.core as mx
import numpy as np
import soundfile as sf

model = load_model("cr2k2/Qwen3-TTS-12Hz-1.7B-Base-fp32")

results = list(model.generate(
    text="Hello, this is a test of voice cloning.",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
))

audio = np.concatenate([np.array(r.audio, copy=False) for r in results])
sf.write("output.wav", audio.astype(np.float32), 24000)
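If soundfile is not available, the float output can also be written as 16-bit PCM with the stdlib wave module. This is a minimal sketch: a synthetic sine stands in for the concatenated model output, and the 24000 Hz rate matches the model's output sample rate.

```python
# Sketch: write 24 kHz mono float audio as 16-bit PCM WAV with the stdlib
# wave module. A synthetic sine stands in for the model output here.
import wave

import numpy as np

SR = 24000  # Qwen3-TTS output sample rate
t = np.linspace(0, 1.0, SR, endpoint=False)
audio = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # placeholder for `audio` above

# Clamp to [-1, 1] and scale to signed 16-bit integer range.
pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)

with wave.open("output.wav", "wb") as w:
    w.setnchannels(1)   # mono
    w.setsampwidth(2)   # 16-bit samples
    w.setframerate(SR)
    w.writeframes(pcm.tobytes())
```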

Conversion

Converted from the original HuggingFace weights using mlx-audio 0.3.2 (installed from GitHub main):

pip install git+https://github.com/Blaizzy/mlx-audio.git
python -m mlx_audio.convert --hf-path Qwen/Qwen3-TTS-12Hz-1.7B-Base --mlx-path models/Qwen3-TTS-12Hz-1.7B-Base-fp32 --dtype float32

Previously converted with an older mlx-audio version. Reconverted on 2026-02-08 with 0.3.2 to pick up fixes including #407 (convert skipping subdirs, voice cloning silence fix).
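A converted shard can be sanity-checked for float32 ("F32") tensors without loading the model, by reading just the safetensors header (8-byte little-endian header length, then a JSON header mapping tensor names to dtype/shape/offsets). The sketch below builds a tiny valid file by hand to demonstrate the check; in practice you would point it at the .safetensors file(s) in the converted directory (filenames vary by conversion).

```python
# Sketch: read only the safetensors JSON header to list tensor dtypes.
# Layout: u64 little-endian header length, then JSON header, then raw data.
import json
import struct

def tensor_dtypes(path):
    with open(path, "rb") as f:
        (hlen,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(hlen))
    # "__metadata__" is an optional non-tensor entry in the header.
    return {k: v["dtype"] for k, v in header.items() if k != "__metadata__"}

# Build a minimal valid safetensors file with one float32 tensor to demo;
# replace the path with a shard from the converted model directory.
data = struct.pack("<4f", 0.0, 1.0, 2.0, 3.0)
hdr = json.dumps(
    {"w": {"dtype": "F32", "shape": [4], "data_offsets": [0, len(data)]}}
).encode()
with open("demo.safetensors", "wb") as f:
    f.write(struct.pack("<Q", len(hdr)) + hdr + data)

print(tensor_dtypes("demo.safetensors"))  # {'w': 'F32'}
```

A fully fp32 conversion should report "F32" for every tensor; any "BF16" entries indicate half-precision weights slipped through.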

Model Details

  • Base model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
  • Format: MLX safetensors (float32)
  • Size: ~4.2 GB
  • Architecture: Qwen3-TTS dual-track LM with 12.5Hz multi-codebook tokenizer
  • Parameters: 1.7B
  • Languages: English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
  • Capabilities: Voice cloning (3-second reference), multilingual TTS
  • License: Apache 2.0