Fine Tuning Qwen3-TTS-12Hz-1.7B/0.6B-Base

🚀 Try the live demo on HuggingFace Spaces

How Arabic support was added

The base model (Qwen3-TTS-12Hz-1.7B-Base) ships with a fixed set of languages in its codec token vocabulary. Arabic was not among them. Adding it required changes at three levels: the language embedding table, the input sequence format, and the model config.

1. Arabic language embedding: warm-start initialization

The codec stream is conditioned on a language token ID. Arabic was assigned codec token ID 2072. Rather than initialising this embedding randomly, it was set to the mean of all existing language embeddings before training began:

import torch

ARABIC_LANG_ID = 2072
codec_emb = qwen3tts.model.talker.model.codec_embedding
existing_ids = [v for k, v in config.talker_config.codec_language_id.items() if k != 'arabic']
# Write the mean of the existing language embeddings in place, matching the
# table's dtype and bypassing autograd (direct assignment to a parameter that
# requires grad would otherwise raise).
with torch.no_grad():
    avg = codec_emb.weight[existing_ids].float().mean(0)
    codec_emb.weight[ARABIC_LANG_ID] = avg.to(codec_emb.weight.dtype)

This warm start gives the optimizer a stable starting point instead of a random direction, which matters especially for a language with a different script and reading direction.

2. Language-conditioned codec prefix (4-token think block)

The original model uses a 3-token prefix before the speaker embedding slot. To pass the explicit language ID through the codec channel, a 4-token block was introduced:

pos 3: codec_think_id
pos 4: codec_think_bos_id
pos 5: lang_id          ← Arabic token 2072 injected here
pos 6: codec_think_eos_id
pos 7: speaker embedding slot  ← shifted by +1 vs. base model

This required adjusting the sequence offsets throughout the collator in dataset.py (the +9 vs. the original +8 offset) and setting codec_embedding_mask[7] = False so the speaker embedding at position 7 is injected directly from the speaker encoder rather than looked up from the embedding table.
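The layout above amounts to simple list construction. In the sketch below the think/bos/eos token IDs are placeholders (the real values come from the model config); only 2072 is the actual Arabic ID:

```python
CODEC_THINK_ID = 2060      # placeholder value
CODEC_THINK_BOS_ID = 2061  # placeholder value
CODEC_THINK_EOS_ID = 2062  # placeholder value
ARABIC_LANG_ID = 2072      # actual Arabic codec token ID

def build_codec_prefix(base_prefix, lang_id):
    """Append the language-conditioned think block to the 3-token base prefix."""
    return base_prefix + [
        CODEC_THINK_ID,      # pos 3
        CODEC_THINK_BOS_ID,  # pos 4
        lang_id,             # pos 5: explicit language ID
        CODEC_THINK_EOS_ID,  # pos 6
    ]                        # pos 7 is then the speaker embedding slot

prefix = build_codec_prefix([10, 11, 12], ARABIC_LANG_ID)
assert len(prefix) == 7  # speaker embedding lands at index 7, one later than in the base model
```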

3. Language auto-detection in the dataset

dataset.py detects Arabic automatically from Unicode range \u0600–\u06FF, so mixed-language datasets don't need an explicit language field per sample:

def _detect_language(self, text: str) -> str:
    for c in text:
        if '\u0600' <= c <= '\u06FF':
            return 'arabic'
    return 'english'
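The same logic can be exercised outside the dataset class as a quick sanity check:

```python
# Standalone version of the detection logic above.
def detect_language(text: str) -> str:
    for c in text:
        if '\u0600' <= c <= '\u06FF':  # Arabic Unicode block
            return 'arabic'
    return 'english'

assert detect_language("She said she would be here by noon.") == 'english'
assert detect_language("مرحبا") == 'arabic'
assert detect_language("Price: 100 درهم") == 'arabic'  # one Arabic character is enough
```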

4. Emirati speaker registration

A new speaker ID (3000) was registered for the Emirati voice. The speaker embedding is extracted from a reference audio clip by the frozen speaker encoder, then injected at position 7 of each sequence during the forward pass. At checkpoint save time the embedding is written directly into the safetensors weights so the saved model is fully self-contained.
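A minimal sketch of this flow, with stand-in modules in place of the real speaker encoder and embedding table (the module names, shapes, and table layout are illustrative, not the actual API):

```python
import torch

EMIRATI_SPEAKER_ID = 3000  # new speaker ID from this model card

# Stand-ins for the real components; names and shapes are illustrative.
speaker_encoder = torch.nn.Linear(16000, 1024)   # plays the frozen speaker encoder
speaker_table = torch.nn.Embedding(4096, 1024)   # plays the speaker embedding table
reference_waveform = torch.randn(16000)          # dummy reference audio clip

# 1) Extract the embedding once, with the encoder frozen (no gradients).
with torch.no_grad():
    spk_emb = speaker_encoder(reference_waveform)

# 2) During training the collator injects spk_emb at sequence position 7
#    instead of looking it up (codec_embedding_mask[7] = False).

# 3) At checkpoint save time, write the embedding into the table weights
#    so the saved model is self-contained.
with torch.no_grad():
    speaker_table.weight[EMIRATI_SPEAKER_ID] = spk_emb
```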

5. Training setup

| Setting | Value |
| --- | --- |
| Base model | Qwen/Qwen3-TTS-12Hz-1.7B-Base |
| Optimizer | AdamW, lr=2e-6, weight decay=0.01 |
| Precision | bf16 mixed precision |
| Gradient accumulation | 4 steps (effective batch ~32) |
| Gradient clipping | 1.0 |
| Epochs | 10 |
| Loss | talker_loss + 0.3 × sub_talker_loss |

All model parameters were fine-tuned (no LoRA). The speaker encoder was kept frozen during training (embeddings extracted with torch.no_grad()).
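The loss combination in the table reduces to simple arithmetic per micro-batch. A sketch (loss names from the table; the divide-by-accumulation-steps scaling is the usual convention and is assumed here, not stated):

```python
SUB_TALKER_WEIGHT = 0.3
ACCUM_STEPS = 4        # micro-batches per optimizer step
MAX_GRAD_NORM = 1.0    # applied with torch.nn.utils.clip_grad_norm_ before optimizer.step()

def combined_loss(talker_loss: float, sub_talker_loss: float) -> float:
    """Per-micro-batch contribution: (talker + 0.3 * sub_talker) / accumulation steps."""
    return (talker_loss + SUB_TALKER_WEIGHT * sub_talker_loss) / ACCUM_STEPS
```

For example, with talker_loss = 2.0 and sub_talker_loss = 1.0, each micro-batch contributes (2.0 + 0.3) / 4 = 0.575 to the accumulated gradient.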


Inference

Use the included infer.py to synthesize speech from any checkpoint.

Install dependencies:

pip install qwen-tts soundfile torch

Single utterance (loads from HuggingFace automatically):

python infer.py \
    --text "كيف كان يومك اليوم؟ إن شاء الله كان مليان خير." \
    --output out.wav

Multiple utterances from a file (one sentence per line):

python infer.py \
    --text_file sentences.txt \
    --output_dir outputs/

Use a local checkpoint instead:

python infer.py \
    --checkpoint output/checkpoint-epoch-9 \
    --text "كيف كان يومك اليوم؟ إن شاء الله كان مليان خير." \
    --output out.wav

All arguments:

| Argument | Default | Description |
| --- | --- | --- |
| --checkpoint | vadimbelsky/qwen3.5-TTS-Emirati | HuggingFace model ID or local checkpoint path |
| --text | – | Single text string to synthesize |
| --text_file | – | Text file with one utterance per line |
| --language | arabic | Language of the input text |
| --speaker | emirati_speaker | Speaker name stored in the checkpoint |
| --output | output.wav | Output path for single-utterance mode |
| --output_dir | – | Output directory for multi-utterance mode |
| --device | cuda:0 | Torch device |
| --max_new_tokens | 2048 | Maximum codec tokens to generate (increase for longer texts) |
| --temperature | 0.9 | Sampling temperature |

The script prints the duration of each generated file. Outputs are saved as 24 kHz mono WAV.
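As a rough guide for sizing --max_new_tokens: if the 12Hz tokenizer emits about 12 codec frames per second of audio (an assumption implied by the model name, not stated here), the budget for a target duration can be estimated as:

```python
# Rough --max_new_tokens sizing, ASSUMING ~12 codec frames per second
# (implied by the "12Hz" model name, not confirmed in this card).
FRAMES_PER_SECOND = 12

def max_tokens_for(seconds: float, margin: float = 1.2) -> int:
    """Codec token budget for a target duration, with 20% headroom."""
    return round(seconds * FRAMES_PER_SECOND * margin)
```

Under that assumption, the default of 2048 tokens covers just under three minutes of audio (2048 / 12 ≈ 170 s).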


Samples

Audio samples generated with the fine-tuned model:

Emirati Arabic: ~1 minute sample (epoch 9 checkpoint)

Emirati dialect text covering everyday topics: morning market visit, coffee with a friend, a walk along the corniche, traditional lunch (harees & mashakik), and an evening majlis with Leiwah music.

Listen: emirati_epoch9_combined.wav


Reference text (epoch 9): "كيف كان يومك اليوم؟ إن شاء الله كان مليان خير."

Listen: ref_text_epoch9.wav


Fine-tuning workflow

The Qwen3-TTS-12Hz-1.7B/0.6B-Base model series currently supports single-speaker fine-tuning. First install the package with pip install qwen-tts, then clone the repository:

git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS/finetuning

Then follow the steps below to complete the entire fine-tuning workflow. Multi-speaker fine-tuning and other advanced fine-tuning features will be supported in future releases.

1) Input JSONL format

Prepare your training file as a JSONL (one JSON object per line). Each line must contain:

  • audio: path to the target training audio (wav)
  • text: transcript corresponding to audio
  • ref_audio: path to the reference speaker audio (wav)

Example:

{"audio":"./data/utt0001.wav","text":"其实我真的有发现，我是一个特别善于观察别人情绪的人。","ref_audio":"./data/ref.wav"}
{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","ref_audio":"./data/ref.wav"}

ref_audio recommendation:

  • Strongly recommended: use the same ref_audio for all samples.
  • Keeping ref_audio identical across the dataset usually improves speaker consistency and stability during generation.
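Before running prepare_data.py, it can be worth validating the raw JSONL with a quick pre-flight check (the required keys come from the list above; this helper itself is a sketch, not part of the repo):

```python
# Minimal pre-flight check for the raw JSONL: every line must parse as JSON
# and carry the three required keys.
import json

REQUIRED_KEYS = {"audio", "text", "ref_audio"}

def validate_jsonl(path: str) -> int:
    """Return the number of valid rows; raise on the first bad line."""
    n = 0
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            missing = REQUIRED_KEYS - json.loads(line).keys()
            if missing:
                raise ValueError(f"line {i}: missing keys {sorted(missing)}")
            n += 1
    return n
```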

2) Prepare data (extract audio_codes)

Convert train_raw.jsonl into a training JSONL that includes audio_codes:

python prepare_data.py \
  --device cuda:0 \
  --tokenizer_model_path Qwen/Qwen3-TTS-Tokenizer-12Hz \
  --input_jsonl train_raw.jsonl \
  --output_jsonl train_with_codes.jsonl
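One quick way to confirm the conversion succeeded is to check that every output line now carries the audio_codes field (a small helper sketch, not part of the repo):

```python
# Spot-check that prepare_data.py added audio_codes to every line.
import json

def has_audio_codes(path: str) -> bool:
    with open(path, encoding="utf-8") as f:
        return all("audio_codes" in json.loads(line) for line in f)
```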

3) Fine-tune

Run SFT using the prepared JSONL:

python sft_12hz.py \
  --init_model_path Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --output_model_path output \
  --train_jsonl train_with_codes.jsonl \
  --batch_size 32 \
  --lr 2e-6 \
  --num_epochs 10 \
  --speaker_name speaker_test

Checkpoints will be written to:

  • output/checkpoint-epoch-0
  • output/checkpoint-epoch-1
  • output/checkpoint-epoch-2
  • ...

4) Quick inference test

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

device = "cuda:0"
tts = Qwen3TTSModel.from_pretrained(
    "output/checkpoint-epoch-2",
    device_map=device,
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = tts.generate_custom_voice(
    text="She said she would be here by noon.",
    speaker="speaker_test",
)
sf.write("output.wav", wavs[0], sr)

One-click shell script example

#!/usr/bin/env bash
set -e

DEVICE="cuda:0"
TOKENIZER_MODEL_PATH="Qwen/Qwen3-TTS-Tokenizer-12Hz"
INIT_MODEL_PATH="Qwen/Qwen3-TTS-12Hz-1.7B-Base"

RAW_JSONL="train_raw.jsonl"
TRAIN_JSONL="train_with_codes.jsonl"
OUTPUT_DIR="output"

BATCH_SIZE=2
LR=2e-5
EPOCHS=3
SPEAKER_NAME="speaker_1"

python prepare_data.py \
  --device ${DEVICE} \
  --tokenizer_model_path ${TOKENIZER_MODEL_PATH} \
  --input_jsonl ${RAW_JSONL} \
  --output_jsonl ${TRAIN_JSONL}

python sft_12hz.py \
  --init_model_path ${INIT_MODEL_PATH} \
  --output_model_path ${OUTPUT_DIR} \
  --train_jsonl ${TRAIN_JSONL} \
  --batch_size ${BATCH_SIZE} \
  --lr ${LR} \
  --num_epochs ${EPOCHS} \
  --speaker_name ${SPEAKER_NAME}