Qwen3‑TTS‑12Hz‑1.7B‑Base pronounces “he’s” as “his” in English

by russellbal - opened Jan 28

Summary

When synthesizing English speech, Qwen3‑TTS‑12Hz‑1.7B‑Base consistently pronounces the word “he’s” as “his.” This happens even with clean input text and appears to be a text‑frontend or grapheme‑to‑phoneme issue, not a problem with voice cloning data.

Environment

Model: Qwen/Qwen3-TTS-12Hz-1.7B-Base (and optionally …-CustomVoice if affected)
Task: English TTS (standard inference, no special prompts)
Input format: Plain text UTF‑8, no SSML
Approximate date of testing: January 2026

Steps to Reproduce

Load Qwen3-TTS-12Hz-1.7B-Base using the recommended pipeline from the model card or the Qwen3‑TTS technical report.
Use a simple English prompt such as:
- he's happy today.
- He’s going to the store. (with both straight ' and smart ’ apostrophes)
Generate audio with default settings (no special style or language tags).
Listen to the output, or transcribe it with an external ASR tool and inspect the transcript.
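
For the ASR check in the last step, a small helper along these lines can label each transcript. This is only a sketch: `classify_transcript` is a made-up name, it assumes the prompt contained “he’s” and no genuine “his,” and it takes the transcript as a plain string from whichever ASR tool you use. Keep in mind that an ASR language model may silently "correct" a spoken “his” to “he’s” from context, so an "ok" label is weaker evidence than listening.

```python
import re

# "he's" written with a straight (') or curly (’) apostrophe.
HES = re.compile(r"\bhe['’]s\b", re.IGNORECASE)
HE_IS = re.compile(r"\bhe is\b", re.IGNORECASE)
# Whole word "his" only; does not match "this".
HIS = re.compile(r"\bhis\b", re.IGNORECASE)


def classify_transcript(transcript: str) -> str:
    """Label an ASR transcript of a prompt that contained "he's" but no "his"."""
    if HIS.search(transcript):
        return "mispronounced"  # the ASR tool heard "his"
    if HES.search(transcript) or HE_IS.search(transcript):
        return "ok"
    return "inconclusive"  # neither form was recognized
```

Running this over several generations per prompt gives a rough count of how often the contraction comes out as “his.”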

Expected Behavior

The model should pronounce “he’s” as the contraction of “he is,” i.e., phonetically close to /hiːz/.
Sentences containing “he’s” should sound natural and distinct from “his.”

Actual Behavior

“he’s” is pronounced as “his” (phonetically closer to /hɪz/).
This occurs reliably with different sentences that contain “he’s.”
Other words in the sentence are pronounced correctly, suggesting the issue is localized to this contraction.

Scope and Additional Observations

The issue appears even with clean, synthetic text, so it is likely a frontend or G2P/normalization problem rather than an artifact of specific cloning samples.
“he is” is pronounced correctly, indicating the model can realize the intended phonemes when the contraction is expanded.
The problem may extend to other contractions with apostrophes (e.g., “she’s,” “that’s”), but this has not been fully tested yet.
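
To test the other contractions systematically, a prompt sweep could be generated like this. It is a rough sketch: the carrier sentence, `build_prompts`, and the dictionary keys are all hypothetical choices, and the contraction list is just the one mentioned in this report.

```python
from itertools import product

# Contractions named in this report; extend as needed.
CONTRACTIONS = ["he's", "she's", "that's", "what's"]
# Both apostrophe styles used in the reproduction steps.
APOSTROPHES = {"straight": "'", "curly": "\u2019"}


def build_prompts(contractions=CONTRACTIONS, carrier="I think {} fine."):
    """Return one prompt per (contraction, apostrophe style) pair."""
    prompts = []
    for word, (style, mark) in product(contractions, APOSTROPHES.items()):
        prompts.append({
            "word": word,
            "apostrophe": style,
            "text": carrier.format(word.replace("'", mark)),
        })
    return prompts
```

Each `text` value can then be synthesized and checked by ear or with ASR, which would show whether the problem is specific to “he’s” or tied to the apostrophe style.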

Temporary Workaround

As a workaround, expanding contractions during preprocessing (e.g., mapping he's → he is) before sending the text to the model avoids the mispronunciation and yields correct audio.
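
A minimal sketch of that preprocessing step, using only the standard library. The function name and the set of covered pronouns are my own choices, not anything from the Qwen3‑TTS tooling. It always expands 's to "is", which is wrong for "he has" cases such as "he's got," so treat it as a stopgap.

```python
import re

# Matches he's / she's / that's / what's with a straight or curly apostrophe.
_CONTRACTION = re.compile(r"\b(he|she|that|what)['’]s\b", re.IGNORECASE)


def expand_contractions(text: str) -> str:
    """Expand the 's contraction to " is", keeping the pronoun's original casing."""
    return _CONTRACTION.sub(lambda m: m.group(1) + " is", text)


print(expand_contractions("He’s going to the store."))
```

The expanded string is what gets passed to the model in place of the original text. Possessive “his” and unrelated words are left untouched.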

Suggested Fix / Request

Review and adjust the English text‑normalization or grapheme‑to‑phoneme rules used before acoustic generation, particularly for apostrophe‑based contractions.
Add tests for common English contractions (“he’s,” “she’s,” “that’s,” “what’s,” etc.) to catch similar issues in future releases.
If the model expects a specific apostrophe or tokenization pattern, document this clearly in the model card or technical report so users can normalize text accordingly.
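
Until that is documented, user-side normalization could look roughly like the sketch below. Mapping everything to the straight ASCII apostrophe is purely an assumption on my part; I do not know which form the model's tokenizer actually prefers, and `normalize_apostrophes` is a hypothetical helper name.

```python
# Code points that commonly stand in for an apostrophe in English text.
# Target form (ASCII ') is an assumption, not documented model behavior.
APOSTROPHE_LIKE = {
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK (the "smart" apostrophe)
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE
    "\u0060": "'",  # GRAVE ACCENT
    "\u00b4": "'",  # ACUTE ACCENT
}
_TABLE = str.maketrans(APOSTROPHE_LIKE)


def normalize_apostrophes(text: str) -> str:
    """Replace apostrophe look-alikes with the ASCII apostrophe."""
    return text.translate(_TABLE)
```

Note that this also rewrites single quotation marks used as quotes, which is usually harmless for TTS input but worth knowing about.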
