Qwen3‑TTS‑12Hz‑1.7B‑Base pronounces “he’s” as “his” in English
#7
by russellbal - opened
Summary
When synthesizing English speech, Qwen3‑TTS‑12Hz‑1.7B‑Base consistently pronounces the word “he’s” as “his.” This happens even with clean input text and appears to be a text‑frontend or grapheme‑to‑phoneme issue, not a problem with voice cloning data.
Environment
- Model: Qwen/Qwen3-TTS-12Hz-1.7B-Base (and optionally …-CustomVoice if affected)
- Task: English TTS (standard inference, no special prompts)
- Input format: Plain text UTF‑8, no SSML
- Approximate date of testing: January 2026
(You can add: framework, Python version, GPU/CPU, inference script/SDK.)
Steps to Reproduce
- Load Qwen3-TTS-12Hz-1.7B-Base using the recommended pipeline from the model card or the Qwen3‑TTS technical report.
- Use a simple English prompt such as the following (with both straight ' and smart ’ apostrophes):
  - he's happy today.
  - He’s going to the store.
- Generate audio with default settings (no special style or language tags).
- Listen to the output or inspect the transcription with an external ASR tool.
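To make sure both apostrophe styles are actually tested, the prompts can be generated programmatically instead of typed by hand. The following is a minimal sketch; the names `BASE_PROMPTS`, `apostrophe_variants`, and `build_prompts` are illustrative and not part of any Qwen3‑TTS tooling, and the synthesis call itself is left out on purpose.

```python
# Build reproduction prompts in both apostrophe styles.
# All names here are hypothetical helpers, not part of the model's API.
STRAIGHT = "'"       # U+0027 APOSTROPHE
CURLY = "\u2019"     # U+2019 RIGHT SINGLE QUOTATION MARK

BASE_PROMPTS = [
    "he's happy today.",
    "He's going to the store.",
]


def apostrophe_variants(text: str) -> dict[str, str]:
    """Return the same sentence with straight and with curly apostrophes."""
    straight = text.replace(CURLY, STRAIGHT)
    curly = straight.replace(STRAIGHT, CURLY)
    return {"straight": straight, "curly": curly}


def build_prompts(prompts=BASE_PROMPTS) -> list[tuple[str, str]]:
    """Return (style, prompt) pairs covering every prompt in both styles."""
    pairs = []
    for prompt in prompts:
        for style, variant in apostrophe_variants(prompt).items():
            pairs.append((style, variant))
    return pairs


if __name__ == "__main__":
    for style, prompt in build_prompts():
        # Print the code points so the apostrophe style is unambiguous.
        marks = [hex(ord(c)) for c in prompt if c in (STRAIGHT, CURLY)]
        print(style, repr(prompt), marks)
```

Each resulting prompt can then be passed to whichever inference script is in use, which keeps the straight and curly cases directly comparable.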
Expected Behavior
- The model should pronounce “he’s” as the contraction of “he is,” i.e., phonetically close to /hiːz/.
- Sentences containing “he’s” should sound natural and distinct from “his.”
Actual Behavior
- “he’s” is pronounced as “his” (phonetically closer to /hɪz/).
- This occurs reliably with different sentences that contain “he’s.”
- Other words in the sentence are pronounced correctly, suggesting the issue is localized to this contraction.
Scope and Additional Observations
- The issue appears even with clean, synthetic text, so it is likely a frontend or G2P/normalization problem rather than an artifact of specific cloning samples.
- “he is” is pronounced correctly, indicating the model can realize the intended phonemes when the contraction is expanded.
- The problem may extend to other contractions with apostrophes (e.g., “she’s,” “that’s”), but this has not been fully tested yet.
Temporary Workaround
- As a workaround, preprocessing the text by expanding contractions (e.g., mapping he's → he is) before sending it to the model avoids the mispronunciation and yields correct audio.
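The workaround can be sketched as a small preprocessing step. This is only an illustration: the `CONTRACTIONS` table and the `expand_contractions` helper are assumptions made for this example, the table is deliberately tiny, and it always reads "he's" as "he is" even though it can also mean "he has" (as in "he's gone").

```python
import re

# Hypothetical, deliberately small mapping; extend as needed.
# Caveat: "'s" can stand for "is" or "has"; this sketch always picks "is".
CONTRACTIONS = {
    "he's": "he is",
    "she's": "she is",
    "that's": "that is",
    "what's": "what is",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Expand known contractions before the text is sent to the TTS model."""
    # Treat curly apostrophes (U+2019) like straight ones (U+0027).
    text = text.replace("\u2019", "'")

    def _replace(match: re.Match) -> str:
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        # Preserve a leading capital, e.g. "He's" -> "He is".
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(_replace, text)


if __name__ == "__main__":
    print(expand_contractions("He\u2019s going to the store."))
```

Words that merely resemble a contraction, such as "his", are left untouched because the pattern only matches the listed forms at word boundaries.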
Suggested Fix / Request
- Review and adjust the English text‑normalization or grapheme‑to‑phoneme rules used before acoustic generation, particularly for apostrophe‑based contractions.
- Add tests for common English contractions (“he’s,” “she’s,” “that’s,” “what’s,” etc.) to catch similar issues in future releases.
- If the model expects a specific apostrophe or tokenization pattern, document this clearly in the model card or technical report so users can normalize text accordingly.
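A regression check for contractions could look roughly like the sketch below. It does not call Qwen3‑TTS or any real ASR library: `synthesize` and `transcribe` are hypothetical callables that the caller supplies, and `CASES` is an illustrative list, not an official test set. One limitation to keep in mind is that an ASR system may "correct" a mispronounced word from context, so a passing check is weaker evidence than a failing one.

```python
import re
from typing import Callable

# Hypothetical test sentences paired with the transcript forms
# that count as a correct rendering of the contraction.
CASES = [
    ("he's happy today.", {"he's", "he is"}),
    ("she's happy today.", {"she's", "she is"}),
    ("that's my coat.", {"that's", "that is"}),
    ("what's the time?", {"what's", "what is"}),
]


def _normalize(text: str) -> str:
    """Lowercase, unify apostrophes, drop punctuation, collapse spaces."""
    text = text.lower().replace("\u2019", "'")
    text = re.sub(r"[^a-z' ]+", " ", text)
    return " ".join(text.split())


def failing_cases(
    synthesize: Callable[[str], object],
    transcribe: Callable[[object], str],
    cases=CASES,
) -> list[str]:
    """Return the sentences whose transcript lacks every accepted form."""
    failures = []
    for sentence, accepted in cases:
        transcript = _normalize(transcribe(synthesize(sentence)))
        padded = f" {transcript} "
        if not any(f" {form} " in padded for form in accepted):
            failures.append(sentence)
    return failures
```

With real callables plugged in, a non-empty result lists the sentences worth listening to manually.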