Fix incorrect usage example for VoiceDesign model
The README.md previously contained a copy-paste error from the CustomVoice model documentation.
It showed loading the `CustomVoice` model and using `generate_custom_voice` with a speaker parameter. I have updated the code snippet to correctly load the `VoiceDesign` model and use the `generate_voice_design` method with the `instruct` parameter, as this model does not support speaker selection.
README.md CHANGED

@@ -21,15 +21,15 @@ tags:
   🤗 <a href="https://huggingface.co/collections/Qwen/qwen3-tts">Hugging Face</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤖 <a href="https://modelscope.cn/collections/Qwen/Qwen3-TTS">ModelScope</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="https://qwen.ai/blog?id=qwen3tts-0115">Blog</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="https://huggingface.co/papers/2601.15621">Paper</a>&nbsp;&nbsp; | &nbsp;&nbsp;💻 <a href="https://github.com/QwenLM/Qwen3-TTS">GitHub</a>
 </p>
 
-We release **Qwen3-TTS**, a series of powerful speech generation models developed by Qwen
+We release **Qwen3-TTS**, a series of powerful speech generation models developed by Qwen. This specific checkpoint (**VoiceDesign**) offers comprehensive support for generating new voices from natural language descriptions.
 
 ## Overview
-Qwen3-TTS covers 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian)
+Qwen3-TTS covers 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian). Key features:
 
+* **Voice Design**: Create unique voices from scratch using natural language instructions (e.g., "An elderly man with a deep, raspy voice").
 * **Powerful Speech Representation**: Powered by the self-developed Qwen3-TTS-Tokenizer-12Hz, it achieves efficient acoustic compression and high-dimensional semantic modeling.
 * **Universal End-to-End Architecture**: Utilizing a discrete multi-codebook LM architecture to bypass traditional information bottlenecks.
 * **Extreme Low-Latency Streaming Generation**: Supports streaming generation with end-to-end synthesis latency as low as 97ms.
-* **Intelligent Voice Control**: Supports speech generation driven by natural language instructions for flexible control over timbre, emotion, and prosody.
 
 ## Quickstart
 
@@ -48,22 +48,22 @@ import torch
 import soundfile as sf
 from qwen_tts import Qwen3TTSModel
 
-# Load the model
+# Load the VoiceDesign model
 model = Qwen3TTSModel.from_pretrained(
-    "Qwen/Qwen3-TTS-12Hz-1.7B-
+    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
     device_map="cuda:0",
     dtype=torch.bfloat16,
     attn_implementation="flash_attention_2",
 )
 
-#
-
-
-
-
-    instruct="
+# Voice Design Generation
+# Create a new voice by describing it in the 'instruct' parameter.
+wavs, sr = model.generate_voice_design(
+    text="This voice does not exist in the real world; it is created from your text description.",
+    language="English",
+    instruct="A deep, resonant male voice speaking with a calm and authoritative tone.",
 )
-sf.write("
+sf.write("output_design.wav", wavs[0], sr)
 ```
 
 ## Evaluation