---
language: [kk, ru, en, uz]
license: cc-by-nc-4.0
pipeline_tag: text-to-speech
library_name: transformers
tags:
  - tts
  - voice-cloning
  - multilingual
  - kazakh
  - uzbek
  - qwen3-tts
---

# AIT-Syn 4L — Multilingual TTS with Voice Cloning

A multilingual text-to-speech model supporting **Kazakh**, **Russian**, **English**, and **Uzbek** with cross-lingual voice cloning, fine-tuned from `Qwen3-TTS-12Hz-1.7B-Base`.

## Features

- **4 languages**: Kazakh (kk), Russian (ru), English (en), Uzbek (uz)
- **Voice cloning**: clone a voice from a short reference clip (~5–10 s)
- **Two cloning modes**: x-vector-only (no transcript needed) or in-context learning (ICL), which uses the reference transcript for higher quality
- **12.5 Hz codec**: efficient autoregressive generation
- **24 kHz output**: PCM 16-bit WAV

## Quick Start

### Installation

```bash
pip install qwen-tts torch soundfile
```

### Generate Speech

```python
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel
import soundfile as sf

model = Qwen3TTSModel.from_pretrained(
    "nur-dev/ait-syn-4L",
    dtype="bfloat16",
    device_map="cuda:0",
)
model.model.eval()

# X-vector-only mode (no ref transcript needed)
wavs, sr = model.generate_voice_clone(
    text="Сәлеметсіз бе, бұл сынақ сөйлем.",
    language="kazakh",
    ref_audio="ref_audio_kk.wav",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr)

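# ICL mode: pass the reference transcript for better quality.
# This call is also cross-lingual: Russian text, Kazakh reference voice.
wavs, sr = model.generate_voice_clone(
    text="Привет, это тестовое предложение.",
    language="russian",
    ref_audio="ref_audio_kk.wav",
    # ref_text should match what is actually spoken in ref_audio
    ref_text="Бұл анықтамалық аудио.",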
    x_vector_only_mode=False,
    non_streaming_mode=True,
)
sf.write("output_icl.wav", wavs[0], sr)
```

## API Reference

### `generate_voice_clone()`

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `text` | `str` or `list[str]` | required | Text to synthesize |
| `language` | `str` | required | Language name: `kazakh`, `russian`, `english`, `uzbek` |
| `ref_audio` | `str` or `(ndarray, sr)` | required | Reference audio: file path, URL, base64, or `(waveform, sample_rate)` |
| `ref_text` | `str` or `None` | `None` | Transcript of ref audio (enables ICL mode) |
| `x_vector_only_mode` | `bool` | `False` | If `True`, use only x-vector speaker embedding (no ICL) |
| `non_streaming_mode` | `bool` | `False` | If `True`, return complete audio; if `False`, return generator |
| `temperature` | `float` | `0.9` | Sampling temperature |
| `top_k` | `int` | `50` | Top-k sampling |
| `top_p` | `float` | `1.0` | Nucleus sampling threshold |
| `repetition_penalty` | `float` | `1.05` | Repetition penalty |

**Returns**: `(list[np.ndarray], int)` — list of waveforms and sample rate (24000).
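
When `text` is a list, the returned list can be walked entry by entry. The following is a minimal sketch, assuming one 1-D mono waveform is returned per input text; `summarize_outputs` and the file-name pattern are illustrative helpers, not part of `qwen_tts`.

```python
def summarize_outputs(wavs, sr, prefix="output"):
    """Pair each waveform with a file name and its duration in seconds."""
    summary = []
    for i, wav in enumerate(wavs):
        seconds = len(wav) / sr  # sample count divided by samples per second
        summary.append((f"{prefix}_{i}.wav", round(seconds, 3)))
    return summary


# Stand-in data: two silent clips at 24 kHz (1.0 s and 0.5 s).
fake_wavs = [[0.0] * 24000, [0.0] * 12000]
for name, seconds in summarize_outputs(fake_wavs, 24000):
    print(name, seconds)
```

With real output, each `(name, wav)` pair could then be passed to `sf.write(name, wav, sr)` as in the Quick Start.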

## Voice Cloning Modes

### X-vector-only (`x_vector_only_mode=True`)

Uses only the speaker embedding extracted from the reference audio, so no transcript is needed. Useful for quick cloning when a transcript of the reference is unavailable.

### ICL Mode (`x_vector_only_mode=False`, provide `ref_text`)

In-context learning mode: the model sees both the reference audio and its transcript, producing higher-fidelity voice matching. Recommended when a transcript is available.
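
The choice between the two modes can be made automatically from whether a transcript is on hand. This is a sketch only: `clone_kwargs` is a hypothetical helper, and it uses just the `ref_audio`, `ref_text`, and `x_vector_only_mode` parameters listed in the API reference.

```python
def clone_kwargs(ref_audio, ref_text=None):
    """Build keyword arguments for generate_voice_clone().

    Falls back to x-vector-only mode when no usable transcript is given.
    """
    transcript = (ref_text or "").strip()
    kwargs = {"ref_audio": ref_audio, "x_vector_only_mode": not transcript}
    if transcript:
        # ICL mode needs the transcript of the reference audio.
        kwargs["ref_text"] = transcript
    return kwargs


print(clone_kwargs("ref_audio_kk.wav"))
print(clone_kwargs("ref_audio_kk.wav", "Бұл анықтамалық аудио."))
```

The result can be unpacked into the call, for example `model.generate_voice_clone(text=..., language="kazakh", non_streaming_mode=True, **clone_kwargs("ref_audio_kk.wav"))`.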

## Serving

A FastAPI server is available for production deployment:

```bash
pip install fastapi uvicorn python-multipart soundfile

# Start server
python serve_tts.py --model nur-dev/ait-syn-4L --port 8000

# Or with uvicorn directly
CUDA_VISIBLE_DEVICES=0 TTS_MODEL_PATH=nur-dev/ait-syn-4L uvicorn serve_tts:app --host 0.0.0.0 --port 8000
```

### API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/tts` | POST | Synthesize speech (returns WAV) |
| `/tts/batch` | POST | Batch synthesis (returns ZIP of WAVs) |
| `/health` | GET | Health check |
| `/languages` | GET | List supported languages |

### Example Request

```bash
curl -X POST http://localhost:8000/tts \
  -F "text=Сәлеметсіз бе" \
  -F "language=kk" \
  -F "ref_audio=@ref_audio_kk.wav" \
  --output output.wav
```
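
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the form fields of the curl call (`text`, `language`, `ref_audio`); the helper names `build_multipart` and `synthesize` are illustrative, and the server is assumed to reply with raw WAV bytes as the endpoint table states.

```python
import os
import urllib.request
import uuid


def build_multipart(fields, files):
    """Encode text fields and files as multipart/form-data.

    fields: dict of name -> str
    files:  dict of name -> (filename, bytes)
    Returns (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        head = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        parts.append(head.encode("utf-8") + value.encode("utf-8") + b"\r\n")
    for name, (filename, data) in files.items():
        head = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\nContent-Type: audio/wav\r\n\r\n'
        )
        parts.append(head.encode("utf-8") + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def synthesize(text, language, ref_path, url="http://localhost:8000/tts"):
    """POST to /tts and return the response body (expected to be WAV bytes)."""
    with open(ref_path, "rb") as f:
        ref_bytes = f.read()
    body, content_type = build_multipart(
        {"text": text, "language": language},
        {"ref_audio": (os.path.basename(ref_path), ref_bytes)},
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the server running, `synthesize("Сәлеметсіз бе", "kk", "ref_audio_kk.wav")` returns bytes that can be written to `output.wav` in binary mode.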

## Technical Specs

| Spec | Value |
|------|-------|
| Parameters | 1.7B |
| Architecture | Qwen3TTSForConditionalGeneration |
| Codec rate | 12.5 Hz (16 sub-codecs) |
| Output sample rate | 24 kHz |
| Precision | bf16 |
| Max generation length | 8192 tokens (~11 min of audio) |
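
The relationship between token count and audio length follows from the codec rate. A short worked calculation, assuming one generated token corresponds to one 12.5 Hz codec frame (the constant and function names are illustrative):

```python
CODEC_RATE_HZ = 12.5   # codec frames per second (from the specs table)
MAX_TOKENS = 8192      # maximum generation length
SAMPLE_RATE = 24000    # output samples per second


def tokens_to_seconds(n_tokens, rate_hz=CODEC_RATE_HZ):
    """Convert a number of codec frames to seconds of audio."""
    return n_tokens / rate_hz


max_seconds = tokens_to_seconds(MAX_TOKENS)
print(max_seconds)                       # 655.36
print(round(max_seconds / 60, 2))        # 10.92
print(int(SAMPLE_RATE / CODEC_RATE_HZ))  # 1920 output samples per codec frame
```

Under that assumption, the maximum generation length works out to just under 11 minutes of audio.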

## Reference Audio

A sample reference clip of a Kazakh male speaker is included as `ref_audio_kk.wav` (mono, 24 kHz, ~10 s).
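
Before cloning from your own recording, it can help to check its basic properties. The sketch below reads a PCM WAV header with the standard-library `wave` module; `inspect_reference` is a hypothetical helper, and the 5 to 10 s bounds come from the rough guideline in Features rather than from any hard limit in the model.

```python
import wave


def inspect_reference(path, min_s=5.0, max_s=10.0):
    """Read WAV header fields and flag clips outside the suggested length."""
    with wave.open(path, "rb") as f:
        channels = f.getnchannels()
        sample_rate = f.getframerate()
        seconds = f.getnframes() / sample_rate
    return {
        "channels": channels,
        "sample_rate": sample_rate,
        "seconds": round(seconds, 2),
        "length_ok": min_s <= seconds <= max_s,
    }
```

Calling `inspect_reference("ref_audio_kk.wav")` returns a dict of these fields. The `wave` module only reads uncompressed PCM WAV files, so other formats would need a library such as `soundfile`.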

## License

CC-BY-NC-4.0