---
license: cc-by-nc-4.0
language:
  - kk
  - ru
  - en
tags:
  - text-to-speech
  - tts
  - voice-cloning
  - qwen3-tts
  - kazakh
  - multilingual
library_name: qwen-tts
pipeline_tag: text-to-speech
base_model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
---

# AIT-Syn — Multilingual Text-to-Speech with Voice Cloning

**AIT-Syn** is a multilingual text-to-speech model supporting **Kazakh**, **Russian**, and **English**, with voice cloning capability. It is built on the Qwen3-TTS architecture and fine-tuned from `Qwen/Qwen3-TTS-12Hz-1.7B-Base`.

## Supported Languages

| Language | Code |
|----------|------|
| Kazakh | `kazakh` |
| Russian | `russian` |
| English | `english` |

## Model Details

| Property | Value |
|----------|-------|
| Base model | `Qwen/Qwen3-TTS-12Hz-1.7B-Base` |
| Parameters | 1.7B |
| Output sample rate | 24 kHz |

## Installation

```bash
pip install qwen-tts torch soundfile
# Optional: faster attention
pip install flash-attn
```

## Usage

### Voice Cloning with Transcript (Recommended)

Providing the transcript of the reference audio gives the best voice-matching quality:

```python
import torch
import soundfile as sf
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel

# Use FlashAttention 2 when available, otherwise fall back to eager attention.
try:
    import flash_attn  # noqa: F401
    attn_impl = "flash_attention_2"
except ImportError:
    attn_impl = "eager"

model = Qwen3TTSModel.from_pretrained(
    "nur-dev/ait-syn",
    dtype=torch.bfloat16,
    attn_implementation=attn_impl,
    device_map="cuda:0",
)
model.model.eval()

# Kazakh example
wavs, sr = model.generate_voice_clone(
    text="Сәлеметсіз бе, бұл сынақ сөйлемі.",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
    language="kazakh",
    x_vector_only_mode=False,
    non_streaming_mode=True,
    temperature=0.9,
    top_k=50,
    do_sample=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

### Voice Cloning without Transcript

If you only have the reference audio (no transcript):

```python
wavs, sr = model.generate_voice_clone(
    text="Hello, this is a test sentence.",
    ref_audio="reference.wav",
    language="english",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

### Russian Example

```python
wavs, sr = model.generate_voice_clone(
    text="Добрый день! Это тестовое предложение на русском языке.",
    ref_audio="reference.wav",
    language="russian",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

## Generation Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `temperature` | 0.9 | Sampling temperature — lower = more stable, higher = more expressive |
| `top_k` | 50 | Top-k sampling |
| `top_p` | 1.0 | Nucleus sampling |
| `repetition_penalty` | 1.0 | Repetition penalty |
| `do_sample` | `True` | Sampling vs. greedy decoding |
| `non_streaming_mode` | `True` | Generate the full audio before returning |

## Tips

- Output audio is 24 kHz mono.
- Reference audio should be clean speech, 5–15 seconds long.
- Use full language names: `"kazakh"`, `"russian"`, `"english"` (not ISO codes).
- ICL mode (`x_vector_only_mode=False` with `ref_text`) gives better voice matching than x-vector-only mode.

## License

This model is released under **CC BY-NC 4.0** (non-commercial use only).

## Commercial Use

For commercial licensing, please contact: **nurgaliqadyrbek@gmail.com**
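## Synthesizing Long Texts

For long inputs, it is often more reliable to synthesize sentence-sized chunks and concatenate the resulting waveforms. The helper below is **not** part of `qwen-tts`; it is a minimal sketch using only the Python standard library, with a naive punctuation-based sentence splitter:

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Split text on sentence boundaries (., !, ?), then pack sentences
    into chunks of at most max_chars characters. A single sentence longer
    than max_chars is kept whole rather than cut mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


# Each chunk can then be passed to model.generate_voice_clone(...) in turn,
# and the returned waveforms concatenated (e.g. with numpy.concatenate)
# before writing a single output file.
```

Note that the splitter is punctuation-based and language-agnostic; for Kazakh and Russian text with unusual punctuation, you may want to adapt the regular expression.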