---
license: cc-by-nc-4.0
language:
  - kk
  - ru
  - en
tags:
  - text-to-speech
  - tts
  - voice-cloning
  - qwen3-tts
  - kazakh
  - multilingual
library_name: qwen-tts
pipeline_tag: text-to-speech
base_model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
---

# AIT-Syn — Multilingual Text-to-Speech with Voice Cloning

**AIT-Syn** is a multilingual text-to-speech model supporting **Kazakh**, **Russian**, and **English**, with voice cloning capability. It is built on the Qwen3-TTS architecture and fine-tuned from `Qwen/Qwen3-TTS-12Hz-1.7B-Base`.

## Supported Languages

| Language | Code |
|----------|------|
| Kazakh | `kazakh` |
| Russian | `russian` |
| English | `english` |

## Model Details

| Property | Value |
|----------|-------|
| Base model | `Qwen/Qwen3-TTS-12Hz-1.7B-Base` |
| Parameters | 1.7B |
| Output sample rate | 24 kHz |

## Installation

```bash
pip install qwen-tts torch soundfile
# Optional: faster attention
pip install flash-attn
```

## Usage

### Voice Cloning with Transcript (Recommended)

Providing the transcript of the reference audio gives the best voice-matching quality:

```python
import torch
import soundfile as sf
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel

# Use FlashAttention 2 when available, otherwise fall back to eager attention.
try:
    import flash_attn  # noqa: F401
    attn_impl = "flash_attention_2"
except ImportError:
    attn_impl = "eager"

model = Qwen3TTSModel.from_pretrained(
    "nur-dev/ait-syn",
    dtype=torch.bfloat16,
    attn_implementation=attn_impl,
    device_map="cuda:0",
)
model.model.eval()

# Kazakh example
wavs, sr = model.generate_voice_clone(
    text="Сәлеметсіз бе, бұл сынақ сөйлемі.",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
    language="kazakh",
    x_vector_only_mode=False,
    non_streaming_mode=True,
    temperature=0.9,
    top_k=50,
    do_sample=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

### Voice Cloning without Transcript

If you only have the reference audio (no transcript):

```python
wavs, sr = model.generate_voice_clone(
    text="Hello, this is a test sentence.",
    ref_audio="reference.wav",
    language="english",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

### Russian Example

```python
wavs, sr = model.generate_voice_clone(
    text="Добрый день! Это тестовое предложение на русском языке.",
    ref_audio="reference.wav",
    language="russian",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

## Generation Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `temperature` | 0.9 | Sampling temperature — lower = more stable, higher = more expressive |
| `top_k` | 50 | Top-k sampling |
| `top_p` | 1.0 | Nucleus sampling |
| `repetition_penalty` | 1.0 | Repetition penalty |
| `do_sample` | `True` | Sampling vs. greedy decoding |
| `non_streaming_mode` | `True` | Generate the full audio before returning |

## Tips

- Output audio is 24 kHz mono.
- Reference audio should be clean speech, 5–15 seconds long.
- Use full language names: `"kazakh"`, `"russian"`, `"english"` (not ISO codes).
- ICL mode (`x_vector_only_mode=False` with `ref_text`) gives better voice matching than x-vector-only mode.

## License

This model is released under **CC BY-NC 4.0** (non-commercial use only).

## Commercial Use

For commercial licensing, please contact: **nurgaliqadyrbek@gmail.com**
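## Synthesizing Long Texts

For long inputs, it is often more reliable to synthesize sentence-sized chunks and concatenate the resulting waveforms. The helper below is **not** part of `qwen-tts`; it is a minimal sketch using only the Python standard library, with a naive punctuation-based sentence splitter:

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Split text on sentence boundaries (., !, ?), then pack sentences
    into chunks of at most max_chars characters. A single sentence longer
    than max_chars is kept whole rather than cut mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


# Each chunk can then be passed to model.generate_voice_clone(...) in turn,
# and the returned waveforms concatenated (e.g. with numpy.concatenate)
# before writing a single output file.
```

Note that the splitter is punctuation-based and language-agnostic; for Kazakh and Russian text with unusual punctuation, you may want to adapt the regular expression.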