Fine Tuning Qwen3-TTS-12Hz-1.7B/0.6B-Base
Try the live demo on HuggingFace Spaces
How Arabic support was added
The base model (Qwen3-TTS-12Hz-1.7B-Base) ships with a fixed set of languages in its codec token vocabulary. Arabic was not among them. Adding it required changes at three levels: the language embedding table, the input sequence format, and the model config.
1. Arabic language embedding: warm-start initialization
The codec stream is conditioned on a language token ID. Arabic was assigned codec token ID 2072. Rather than initialising this embedding randomly, it was set to the mean of all existing language embeddings before training began:
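import torch

ARABIC_LANG_ID = 2072
codec_emb = qwen3tts.model.talker.model.codec_embedding
# IDs of every language already listed in the config, excluding Arabic itself
existing_ids = [v for k, v in config.talker_config.codec_language_id.items() if k != 'arabic']
with torch.no_grad():  # an in-place write to a parameter must not be tracked by autograd
    avg = codec_emb.weight[existing_ids].float().mean(0)
    codec_emb.weight[ARABIC_LANG_ID] = avg.to(codec_emb.weight.dtype)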
This warm start gives the optimizer a stable starting point instead of a random direction, which matters especially for a language with a different script and reading direction.
2. Language-conditioned codec prefix (4-token think block)
The original model uses a 3-token prefix before the speaker embedding slot. To pass the explicit language ID through the codec channel, a 4-token block was introduced:
pos 3: codec_think_id
pos 4: codec_think_bos_id
pos 5: lang_id ← Arabic token 2072 injected here
pos 6: codec_think_eos_id
pos 7: speaker embedding slot ← shifted by +1 vs. base model
This required adjusting the sequence offsets throughout the collator in dataset.py (the +9 vs. the original +8 offset) and setting codec_embedding_mask[7] = False so the speaker embedding at position 7 is injected directly from the speaker encoder rather than looked up from the embedding table.
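The position arithmetic above can be sketched in a few lines. This is an illustration only, not the collator from dataset.py: `speaker_slot` and `think_block` are hypothetical helper names, the think tokens are kept as symbolic strings because their numeric IDs are not given here, and the mask models positions 3 to 7 only.

```python
ARABIC_LANG_ID = 2072  # value given in the text
PREFIX_START = 3       # first position of the think block, per the layout above


def speaker_slot(prefix_len, start=PREFIX_START):
    """Position of the speaker embedding slot, directly after the prefix."""
    return start + prefix_len


def think_block(lang_id, start=PREFIX_START):
    """Map each position to the symbolic token (or raw ID) placed there."""
    tokens = ["codec_think_id", "codec_think_bos_id", lang_id, "codec_think_eos_id"]
    layout = {start + i: tok for i, tok in enumerate(tokens)}
    layout[speaker_slot(len(tokens), start)] = "<speaker embedding>"
    return layout


layout = think_block(ARABIC_LANG_ID)
for pos in sorted(layout):
    print(pos, layout[pos])

# True: vector is looked up in the codec embedding table.
# False: vector is injected from elsewhere (here, the speaker encoder).
embedding_mask = {pos: True for pos in layout}
embedding_mask[speaker_slot(4)] = False
```

With a 3-token prefix `speaker_slot(3)` gives 6, and with the 4-token block `speaker_slot(4)` gives 7, which is the one-position shift that the collator offsets have to absorb.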
3. Language auto-detection in the dataset
dataset.py detects Arabic automatically from the Unicode range \u0600-\u06FF (the basic Arabic block), so mixed-language datasets don't need an explicit language field per sample. Text with no character in that range is labelled English:
def _detect_language(self, text: str) -> str:
    for c in text:
        if '\u0600' <= c <= '\u06FF':
            return 'arabic'
    return 'english'
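A standalone copy of the same check makes the behaviour easy to try. `detect_language` below is a free-function version written for illustration, and the sample strings are made up. Note that a single character from the Arabic block is enough to label the whole string as Arabic.

```python
def detect_language(text: str) -> str:
    """Return 'arabic' if any character is in U+0600..U+06FF, else 'english'."""
    for c in text:
        if '\u0600' <= c <= '\u06FF':
            return 'arabic'
    return 'english'


samples = [
    "She said she would be here by noon.",
    "\u0645\u0631\u062d\u0628\u0627",           # Arabic only ("marhaba")
    "Order 42 \u0645\u0631\u062d\u0628\u0627",  # mixed: one Arabic character is enough
    "",                                         # empty string falls through to the default
]
for s in samples:
    print(repr(s), "->", detect_language(s))
```

Because 'english' is the fallback, text in any other language, such as a Chinese transcript, is also labelled English by this check.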
4. Emirati speaker registration
A new speaker ID (3000) was registered for the Emirati voice. The speaker embedding is extracted from a reference audio clip by the frozen speaker encoder, then injected at position 7 of each sequence during the forward pass. At checkpoint save time the embedding is written directly into the safetensors weights so the saved model is fully self-contained.
5. Training setup
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3-TTS-12Hz-1.7B-Base |
| Optimizer | AdamW, lr=2e-6, weight decay=0.01 |
| Precision | bf16 mixed precision |
| Gradient accumulation | 4 steps (effective batch ~32) |
| Gradient clipping | 1.0 |
| Epochs | 10 |
| Loss | talker_loss + 0.3 × sub_talker_loss |
All model parameters were fine-tuned (no LoRA). The speaker encoder was kept frozen during training (embeddings extracted with torch.no_grad()).
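Two rows of the table reduce to simple arithmetic. The sketch below assumes a single device and a per-step batch size of 8, which is inferred from 4 × 8 = 32 and not stated in the table. The function names are illustrative.

```python
SUB_TALKER_WEIGHT = 0.3  # from the Loss row
GRAD_ACCUM_STEPS = 4     # from the Gradient accumulation row


def combined_loss(talker_loss, sub_talker_loss, weight=SUB_TALKER_WEIGHT):
    """Total loss = talker_loss + weight * sub_talker_loss."""
    return talker_loss + weight * sub_talker_loss


def effective_batch_size(per_step_batch, accum_steps=GRAD_ACCUM_STEPS, num_devices=1):
    """Number of samples contributing to each optimizer update."""
    return per_step_batch * accum_steps * num_devices


print(combined_loss(2.0, 1.0))
print(effective_batch_size(8))  # 32 (per-step batch of 8 is an assumption)
```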
Inference
Use the included infer.py to synthesize speech from any checkpoint.
Install dependencies:
pip install qwen-tts soundfile torch
Single utterance (loads from HuggingFace automatically):
python infer.py \
  --text "كيف كان يومك اليوم؟ إن شاء الله كان مليان خير." \
  --output out.wav
Multiple utterances from a file (one sentence per line):
python infer.py \
--text_file sentences.txt \
--output_dir outputs/
Use a local checkpoint instead:
python infer.py \
  --checkpoint output/checkpoint-epoch-9 \
  --text "كيف كان يومك اليوم؟ إن شاء الله كان مليان خير." \
  --output out.wav
All arguments:
| Argument | Default | Description |
|---|---|---|
| --checkpoint | vadimbelsky/qwen3.5-TTS-Emirati | HuggingFace model ID or local checkpoint path |
| --text | - | Single text string to synthesize |
| --text_file | - | Text file with one utterance per line |
| --language | arabic | Language of the input text |
| --speaker | emirati_speaker | Speaker name stored in the checkpoint |
| --output | output.wav | Output path for single-utterance mode |
| --output_dir | - | Output directory for multi-utterance mode |
| --device | cuda:0 | Torch device |
| --max_new_tokens | 2048 | Maximum codec tokens to generate (increase for longer texts) |
| --temperature | 0.9 | Sampling temperature |
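The argument table maps naturally onto an argparse parser. The sketch below is an assumption about how such a parser could look, built only from the table. The actual infer.py may define its arguments differently, for example by requiring that exactly one of --text and --text_file is given.

```python
import argparse


def build_parser():
    """Parser mirroring the argument table (illustrative, not the real infer.py)."""
    p = argparse.ArgumentParser(description="Synthesize speech from a checkpoint (sketch)")
    p.add_argument("--checkpoint", default="vadimbelsky/qwen3.5-TTS-Emirati",
                   help="HuggingFace model ID or local checkpoint path")
    p.add_argument("--text", default=None, help="Single text string to synthesize")
    p.add_argument("--text_file", default=None, help="Text file with one utterance per line")
    p.add_argument("--language", default="arabic", help="Language of the input text")
    p.add_argument("--speaker", default="emirati_speaker",
                   help="Speaker name stored in the checkpoint")
    p.add_argument("--output", default="output.wav", help="Output path for single-utterance mode")
    p.add_argument("--output_dir", default=None, help="Output directory for multi-utterance mode")
    p.add_argument("--device", default="cuda:0", help="Torch device")
    p.add_argument("--max_new_tokens", type=int, default=2048,
                   help="Maximum codec tokens to generate")
    p.add_argument("--temperature", type=float, default=0.9, help="Sampling temperature")
    return p


args = build_parser().parse_args(["--text", "hello", "--temperature", "0.7"])
print(args.checkpoint, args.temperature)
```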
The script prints the duration of each generated file. Outputs are saved as 24 kHz mono WAV.
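To check an output file independently of the script, the standard-library wave module is enough. This sketch writes one second of silence at 24 kHz mono to a temporary file and reads its properties back. `wav_info` is a hypothetical helper, and it assumes 16-bit PCM encoding, which the text above does not specify; the wave module cannot read floating-point WAV files.

```python
import os
import tempfile
import wave


def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return rate, w.getnchannels(), w.getnframes() / rate


# Demo file: 1 s of 16-bit silence at 24 kHz mono.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)      # 2 bytes per sample = 16-bit PCM
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 24000)

print(wav_info(demo_path))  # (24000, 1, 1.0)
```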
Samples
Audio samples generated with the fine-tuned model:
Emirati Arabic: ~1 minute sample (epoch 9 checkpoint)
Emirati dialect text covering everyday topics: morning market visit, coffee with a friend, a walk along the corniche, traditional lunch (harees & mashakik), and an evening majlis with Leiwah music.
Listen: emirati_epoch9_combined.wav
Reference text (epoch 9): "كيف كان يومك اليوم؟ إن شاء الله كان مليان خير." (English: "How was your day today? God willing, it was full of good.")
The Qwen3-TTS-12Hz-1.7B/0.6B-Base model series currently supports single-speaker fine-tuning. Run pip install qwen-tts first, then run the commands below:
git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS/finetuning
Then follow the steps below to complete the entire fine-tuning workflow. Multi-speaker fine-tuning and other advanced fine-tuning features will be supported in future releases.
1) Input JSONL format
Prepare your training file as a JSONL (one JSON object per line). Each line must contain:
- audio: path to the target training audio (wav)
- text: transcript corresponding to audio
- ref_audio: path to the reference speaker audio (wav)
Example:
{"audio":"./data/utt0001.wav","text":"其实我真的有发现，我是一个特别善于观察别人情绪的人。","ref_audio":"./data/ref.wav"}
{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","ref_audio":"./data/ref.wav"}
ref_audio recommendation:
- Strongly recommended: use the same ref_audio for all samples.
- Keeping ref_audio identical across the dataset usually improves speaker consistency and stability during generation.
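Before running the data preparation step, it can help to check the raw JSONL against the format and the ref_audio recommendation above. The following is a minimal sketch: `check_jsonl_lines` is a hypothetical helper, not part of the repository, and it only checks that the keys are present, not that the audio files exist.

```python
import json

REQUIRED_KEYS = ("audio", "text", "ref_audio")


def check_jsonl_lines(lines):
    """Validate JSONL records; return (records, set of distinct ref_audio paths)."""
    records, refs = [], set()
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        missing = [k for k in REQUIRED_KEYS if k not in rec]
        if missing:
            raise ValueError(f"line {n}: missing keys {missing}")
        records.append(rec)
        refs.add(rec["ref_audio"])
    return records, refs


example = [
    '{"audio":"./data/utt0001.wav","text":"First sentence.","ref_audio":"./data/ref.wav"}',
    '{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","ref_audio":"./data/ref.wav"}',
]
records, refs = check_jsonl_lines(example)
print(len(records), "records;", len(refs), "distinct ref_audio")
```

More than one distinct ref_audio path is not an error, but it goes against the recommendation above and is worth a second look.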
2) Prepare data (extract audio_codes)
Convert train_raw.jsonl into a training JSONL that includes audio_codes:
python prepare_data.py \
--device cuda:0 \
--tokenizer_model_path Qwen/Qwen3-TTS-Tokenizer-12Hz \
--input_jsonl train_raw.jsonl \
--output_jsonl train_with_codes.jsonl
3) Fine-tune
Run SFT using the prepared JSONL:
python sft_12hz.py \
--init_model_path Qwen/Qwen3-TTS-12Hz-1.7B-Base \
--output_model_path output \
--train_jsonl train_with_codes.jsonl \
--batch_size 32 \
--lr 2e-6 \
--num_epochs 10 \
--speaker_name speaker_test
Checkpoints will be written to:
- output/checkpoint-epoch-0
- output/checkpoint-epoch-1
- output/checkpoint-epoch-2
- ...
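When scripting evaluation, the most recent checkpoint can be picked by parsing the epoch number from the directory name. This is a sketch that assumes the checkpoint-epoch-N naming shown above; `latest_checkpoint` is a hypothetical helper. Comparing epochs as integers matters once training passes epoch 9, because a plain string sort would place checkpoint-epoch-10 before checkpoint-epoch-9.

```python
import re
import tempfile
from pathlib import Path

CKPT_RE = re.compile(r"checkpoint-epoch-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the checkpoint directory with the highest epoch number, or None."""
    best = None
    for p in Path(output_dir).iterdir():
        m = CKPT_RE.match(p.name)
        if p.is_dir() and m:
            epoch = int(m.group(1))
            if best is None or epoch > best[0]:
                best = (epoch, p)
    return best[1] if best else None


# Demo on a temporary directory that stands in for the real output folder.
root = Path(tempfile.mkdtemp())
for e in (0, 1, 9, 10):
    (root / f"checkpoint-epoch-{e}").mkdir()
print(latest_checkpoint(root).name)  # checkpoint-epoch-10
```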
4) Quick inference test
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

device = "cuda:0"
tts = Qwen3TTSModel.from_pretrained(
    "output/checkpoint-epoch-2",
    device_map=device,
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = tts.generate_custom_voice(
    text="She said she would be here by noon.",
    speaker="speaker_test",
)
sf.write("output.wav", wavs[0], sr)
One-click shell script example
#!/usr/bin/env bash
set -e
DEVICE="cuda:0"
TOKENIZER_MODEL_PATH="Qwen/Qwen3-TTS-Tokenizer-12Hz"
INIT_MODEL_PATH="Qwen/Qwen3-TTS-12Hz-1.7B-Base"
RAW_JSONL="train_raw.jsonl"
TRAIN_JSONL="train_with_codes.jsonl"
OUTPUT_DIR="output"
BATCH_SIZE=2
LR=2e-5
EPOCHS=3
SPEAKER_NAME="speaker_1"
python prepare_data.py \
  --device "${DEVICE}" \
  --tokenizer_model_path "${TOKENIZER_MODEL_PATH}" \
  --input_jsonl "${RAW_JSONL}" \
  --output_jsonl "${TRAIN_JSONL}"
python sft_12hz.py \
  --init_model_path "${INIT_MODEL_PATH}" \
  --output_model_path "${OUTPUT_DIR}" \
  --train_jsonl "${TRAIN_JSONL}" \
  --batch_size "${BATCH_SIZE}" \
  --lr "${LR}" \
  --num_epochs "${EPOCHS}" \
  --speaker_name "${SPEAKER_NAME}"