Fix incorrect usage example for VoiceDesign model
The README.md previously contained a copy-paste error from the CustomVoice model documentation.
It showed loading the `CustomVoice` model and using `generate_custom_voice` with a speaker parameter. I have updated the code snippet to correctly load the `VoiceDesign` model and use the `generate_voice_design` method with the `instruct` parameter, as this model does not support speaker selection.
README.md CHANGED

@@ -21,15 +21,15 @@ tags:
   🤗 <a href="https://huggingface.co/collections/Qwen/qwen3-tts">Hugging Face</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤖 <a href="https://modelscope.cn/collections/Qwen/Qwen3-TTS">ModelScope</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="https://qwen.ai/blog?id=qwen3tts-0115">Blog</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="https://huggingface.co/papers/2601.15621">Paper</a>&nbsp;&nbsp; | &nbsp;&nbsp;💻 <a href="https://github.com/QwenLM/Qwen3-TTS">GitHub</a>
 </p>
 
-We release **Qwen3-TTS**, a series of powerful speech generation models developed by Qwen
+We release **Qwen3-TTS**, a series of powerful speech generation models developed by Qwen. This specific checkpoint (**VoiceDesign**) offers comprehensive support for generating new voices from natural language descriptions.
 
 ## Overview
-Qwen3-TTS covers 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian)
+Qwen3-TTS covers 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian). Key features:
 
+* **Voice Design**: Create unique voices from scratch using natural language instructions (e.g., "An elderly man with a deep, raspy voice").
 * **Powerful Speech Representation**: Powered by the self-developed Qwen3-TTS-Tokenizer-12Hz, it achieves efficient acoustic compression and high-dimensional semantic modeling.
 * **Universal End-to-End Architecture**: Utilizing a discrete multi-codebook LM architecture to bypass traditional information bottlenecks.
 * **Extreme Low-Latency Streaming Generation**: Supports streaming generation with end-to-end synthesis latency as low as 97ms.
-* **Intelligent Voice Control**: Supports speech generation driven by natural language instructions for flexible control over timbre, emotion, and prosody.
 
 ## Quickstart
 
@@ -48,22 +48,22 @@ import torch
 import soundfile as sf
 from qwen_tts import Qwen3TTSModel
 
-# Load the model
+# Load the VoiceDesign model
 model = Qwen3TTSModel.from_pretrained(
-    "Qwen/Qwen3-TTS-12Hz-1.7B-
+    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
     device_map="cuda:0",
     dtype=torch.bfloat16,
     attn_implementation="flash_attention_2",
 )
 
-#
-
-
-
-
-    instruct="
+# Voice Design Generation
+# Create a new voice by describing it in the 'instruct' parameter.
+wavs, sr = model.generate_voice_design(
+    text="This voice does not exist in the real world; it is created from your text description.",
+    language="English",
+    instruct="A deep, resonant male voice speaking with a calm and authoritative tone.",
 )
-sf.write("
+sf.write("output_design.wav", wavs[0], sr)
 ```
 
 ## Evaluation