---
language: [kk, ru, en, uz]
license: cc-by-nc-4.0
pipeline_tag: text-to-speech
library_name: transformers
tags:
  - tts
  - voice-cloning
  - multilingual
  - kazakh
  - uzbek
  - qwen3-tts
---

# AIT-Syn 4L: Multilingual TTS with Voice Cloning

A multilingual text-to-speech model supporting **Kazakh**, **Russian**, **English**, and **Uzbek** with cross-lingual voice cloning. Fine-tuned from Qwen3-TTS-12Hz-1.7B-Base.

## Features

- **4 languages**: Kazakh (kk), Russian (ru), English (en), Uzbek (uz)
- **Voice cloning**: clone any voice from a short reference audio (~5–10 s)
- **Two cloning modes**: x-vector-only (no transcript needed) or ICL (with ref transcript, higher quality)
- **12.5 Hz codec**: efficient autoregressive generation
- **24 kHz output**: PCM 16-bit WAV

## Quick Start

### Installation

```bash
pip install qwen-tts torch soundfile
```

### Generate Speech

```python
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel
import soundfile as sf

model = Qwen3TTSModel.from_pretrained(
    "nur-dev/ait-syn-4L",
    dtype="bfloat16",
    device_map="cuda:0",
)
model.model.eval()

# X-vector-only mode (no ref transcript needed)
wavs, sr = model.generate_voice_clone(
    text="Сәлеметсіз бе, бұл сынақ сөйлем.",
    language="kazakh",
    ref_audio="ref_audio_kk.wav",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr)

# ICL mode (provide ref transcript for better quality)
wavs, sr = model.generate_voice_clone(
    text="Привет, это тестовое предложение.",
    language="russian",
    ref_audio="ref_audio_kk.wav",
    ref_text="Бұл анықтамалық аудио.",
    x_vector_only_mode=False,
    non_streaming_mode=True,
)
sf.write("output_icl.wav", wavs[0], sr)
```

## API Reference

### `generate_voice_clone()`

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `text` | `str` or `list[str]` | required | Text to synthesize |
| `language` | `str` | required | Language name: `kazakh`, `russian`, `english`, `uzbek` |
| `ref_audio` | `str` or `(ndarray, sr)` | required | Reference audio: file path, URL, base64, or `(waveform, sample_rate)` |
| `ref_text` | `str` or `None` | `None` | Transcript of ref audio (enables ICL mode) |
| `x_vector_only_mode` | `bool` | `False` | If `True`, use only x-vector speaker embedding (no ICL) |
| `non_streaming_mode` | `bool` | `False` | If `True`, return complete audio; if `False`, return generator |
| `temperature` | `float` | `0.9` | Sampling temperature |
| `top_k` | `int` | `50` | Top-k sampling |
| `top_p` | `float` | `1.0` | Nucleus sampling threshold |
| `repetition_penalty` | `float` | `1.05` | Repetition penalty |

**Returns**: `(list[np.ndarray], int)`, a list of waveforms and the sample rate (24000).

## Voice Cloning Modes

### X-vector-only (`x_vector_only_mode=True`)

Uses only the speaker embedding extracted from reference audio. No transcript of the reference is needed. Good for quick cloning when you don't have a transcript.

### ICL Mode (`x_vector_only_mode=False`, provide `ref_text`)

In-context learning mode: the model sees both the reference audio and its transcript, producing higher-fidelity voice matching. Recommended when a transcript is available.

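
The choice between the two modes can be made automatically, depending on whether a reference transcript is available. The following is a minimal sketch, not part of the model's API: `build_clone_kwargs` is a hypothetical helper that only assembles the keyword arguments documented above for `generate_voice_clone()`, so it runs without loading the model.

```python
from typing import Any, Optional


def build_clone_kwargs(
    text: str,
    language: str,
    ref_audio: str,
    ref_text: Optional[str] = None,
) -> dict[str, Any]:
    """Assemble keyword arguments for generate_voice_clone() (hypothetical helper)."""
    # Treat a missing or whitespace-only transcript as "no transcript".
    has_transcript = ref_text is not None and ref_text.strip() != ""
    kwargs: dict[str, Any] = {
        "text": text,
        "language": language,
        "ref_audio": ref_audio,
        # Fall back to x-vector-only mode when there is no transcript for ICL.
        "x_vector_only_mode": not has_transcript,
        "non_streaming_mode": True,
    }
    if has_transcript:
        kwargs["ref_text"] = ref_text
    return kwargs


kwargs = build_clone_kwargs(
    text="Hello, this is a test sentence.",
    language="english",
    ref_audio="ref_audio_kk.wav",
)
# With a loaded model: wavs, sr = model.generate_voice_clone(**kwargs)
```

Keeping the mode decision in one place avoids passing `x_vector_only_mode=False` without a `ref_text`, a combination the sections above do not describe.
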
## Serving

A FastAPI server is available for production deployment:

```bash
pip install fastapi uvicorn python-multipart soundfile

# Start server
python serve_tts.py --model nur-dev/ait-syn-4L --port 8000

# Or with uvicorn directly
CUDA_VISIBLE_DEVICES=0 TTS_MODEL_PATH=nur-dev/ait-syn-4L uvicorn serve_tts:app --host 0.0.0.0 --port 8000
```

### API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/tts` | POST | Synthesize speech (returns WAV) |
| `/tts/batch` | POST | Batch synthesis (returns ZIP of WAVs) |
| `/health` | GET | Health check |
| `/languages` | GET | List supported languages |

### Example Request

```bash
curl -X POST http://localhost:8000/tts \
  -F "text=Сәлеметсіз бе" \
  -F "language=kk" \
  -F "ref_audio=@ref_audio_kk.wav" \
  --output output.wav
```

## Technical Specs

| Spec | Value |
|------|-------|
| Parameters | 1.7B |
| Architecture | Qwen3TTSForConditionalGeneration |
| Codec rate | 12.5 Hz (16 sub-codecs) |
| Output sample rate | 24 kHz |
| Precision | bf16 |
| Max generation length | 8192 tokens (~10 min audio) |

## Reference Audio

A sample Kazakh male reference audio is included as `ref_audio_kk.wav` (mono, 24 kHz, ~10 s).

## License

CC-BY-NC-4.0
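
## Appendix: Generation Length Estimate

As a rough cross-check of the generation-length figure in Technical Specs, the sketch below converts a token budget into audio duration. It assumes one generated token corresponds to one 12.5 Hz codec frame, which the specs suggest but do not state. `max_audio_seconds` is a hypothetical helper name.

```python
CODEC_RATE_HZ = 12.5  # codec frames per second of audio (from Technical Specs)
MAX_TOKENS = 8192     # maximum generation length (from Technical Specs)


def max_audio_seconds(max_tokens: int, codec_rate_hz: float) -> float:
    """Audio duration covered by a token budget, assuming one token per codec frame."""
    return max_tokens / codec_rate_hz


seconds = max_audio_seconds(MAX_TOKENS, CODEC_RATE_HZ)
print(f"{seconds:.2f} s = {seconds / 60:.1f} min")  # 655.36 s = 10.9 min
```

Under that assumption the limit works out to just under 11 minutes, consistent with the approximate figure in the table.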