Daler0001 committed on
Commit
adb6d71
·
verified ·
1 Parent(s): 5ecdb67

Fix incorrect usage example for VoiceDesign model


The README.md previously contained a copy-paste error from the CustomVoice model documentation.

It showed loading the `CustomVoice` model and calling `generate_custom_voice` with a `speaker` parameter. I have updated the code snippet to load the `VoiceDesign` model and use the `generate_voice_design` method with the `instruct` parameter instead, since this model does not support speaker selection.
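To make the parameter difference concrete: the helper below is purely hypothetical (it is not part of the `qwen_tts` API) and exists only to illustrate the calling convention this commit fixes, namely that `generate_voice_design` is driven entirely by a natural-language `instruct` description and, unlike `generate_custom_voice`, accepts no `speaker` argument.

```python
# Hypothetical helper -- not part of the qwen_tts package -- sketching the
# keyword arguments that generate_voice_design() expects per the README.
def voice_design_kwargs(text: str, language: str, instruct: str) -> dict:
    """Assemble keyword arguments for a VoiceDesign generation call."""
    return {"text": text, "language": language, "instruct": instruct}

kwargs = voice_design_kwargs(
    text="Hello from a designed voice.",
    language="English",
    instruct="A deep, resonant male voice speaking with a calm tone.",
)
# Note there is intentionally no "speaker" key: speaker selection belongs
# to the CustomVoice checkpoint, not to VoiceDesign.
assert "speaker" not in kwargs
```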

Files changed (1)
  1. README.md +12 -12
README.md CHANGED
@@ -21,15 +21,15 @@ tags:
   &nbsp&nbsp🤗 <a href="https://huggingface.co/collections/Qwen/qwen3-tts">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/collections/Qwen/Qwen3-TTS">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp📑 <a href="https://qwen.ai/blog?id=qwen3tts-0115">Blog</a>&nbsp&nbsp | &nbsp&nbsp📑 <a href="https://huggingface.co/papers/2601.15621">Paper</a>&nbsp&nbsp | &nbsp&nbsp💻 <a href="https://github.com/QwenLM/Qwen3-TTS">GitHub</a>
 </p>
 
-We release **Qwen3-TTS**, a series of powerful speech generation models developed by Qwen, offering comprehensive support for voice cloning, voice design, ultra-high-quality human-like speech generation, and natural language-based voice control.
+We release **Qwen3-TTS**, a series of powerful speech generation models developed by Qwen. This specific checkpoint (**VoiceDesign**) offers comprehensive support for generating new voices from natural language descriptions.
 
 ## Overview
-Qwen3-TTS covers 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) as well as multiple dialectal voice profiles. Key features:
+Qwen3-TTS covers 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian). Key features:
 
+* **Voice Design**: Create unique voices from scratch using natural language instructions (e.g., "An elderly man with a deep, raspy voice").
 * **Powerful Speech Representation**: Powered by the self-developed Qwen3-TTS-Tokenizer-12Hz, it achieves efficient acoustic compression and high-dimensional semantic modeling.
 * **Universal End-to-End Architecture**: Utilizing a discrete multi-codebook LM architecture to bypass traditional information bottlenecks.
 * **Extreme Low-Latency Streaming Generation**: Supports streaming generation with end-to-end synthesis latency as low as 97ms.
-* **Intelligent Voice Control**: Supports speech generation driven by natural language instructions for flexible control over timbre, emotion, and prosody.
 
 ## Quickstart
 
@@ -48,22 +48,22 @@ import torch
 import soundfile as sf
 from qwen_tts import Qwen3TTSModel
 
-# Load the model
+# Load the VoiceDesign model
 model = Qwen3TTSModel.from_pretrained(
-    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
+    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
     device_map="cuda:0",
     dtype=torch.bfloat16,
     attn_implementation="flash_attention_2",
 )
 
-# Custom Voice Generation
-wavs, sr = model.generate_custom_voice(
-    text="其实我真的有发现,我是一个特别善于观察别人情绪的人。",
-    language="Chinese",
-    speaker="Vivian",
-    instruct="用特别愤怒的语气说",
+# Voice Design Generation
+# Create a new voice by describing it in the 'instruct' parameter.
+wavs, sr = model.generate_voice_design(
+    text="This voice does not exist in the real world; it is created from your text description.",
+    language="English",
+    instruct="A deep, resonant male voice speaking with a calm and authoritative tone.",
 )
-sf.write("output.wav", wavs[0], sr)
+sf.write("output_design.wav", wavs[0], sr)
 ```
 
 ## Evaluation