Qwen3‑TTS‑12Hz‑1.7B‑Base pronounces “he’s” as “his” in English

#7
by russellbal - opened

Summary

When synthesizing English speech, Qwen3‑TTS‑12Hz‑1.7B‑Base consistently pronounces the word “he’s” as “his.” This happens even with clean input text and appears to be a text‑frontend or grapheme‑to‑phoneme issue, not a problem with voice cloning data.

Environment

  • Model: Qwen/Qwen3-TTS-12Hz-1.7B-Base (and optionally …-CustomVoice if affected)
  • Task: English TTS (standard inference, no special prompts)
  • Input format: Plain text UTF‑8, no SSML
  • Approximate date of testing: January 2026

(You can add: framework, Python version, GPU/CPU, inference script/SDK.)

Steps to Reproduce

  1. Load Qwen3-TTS-12Hz-1.7B-Base using the recommended pipeline from the model card or the Qwen3‑TTS technical report.
  2. Use a simple English prompt such as:
    • he's happy today.
    • He’s going to the store. (with both straight ' and smart apostrophes)
  3. Generate audio with default settings (no special style or language tags).
  4. Listen to the output or inspect the transcription with an external ASR tool.
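
Step 2 can be scripted so that every prompt is tried with both apostrophe forms. The sketch below only builds the input strings; the helper name `apostrophe_variants` is hypothetical, and no model call is shown because the inference API is whatever the model card recommends.

```python
# Build reproduction prompts with both apostrophe forms.
# All names here are illustrative, not part of any Qwen3-TTS API.
STRAIGHT = "'"       # U+0027, ASCII apostrophe
CURLY = "\u2019"     # U+2019, typographic ("smart") apostrophe

BASE_PROMPTS = [
    "he's happy today.",
    "He's going to the store.",
]


def apostrophe_variants(text: str) -> list:
    """Return [straight-apostrophe version, curly-apostrophe version] of text."""
    straight = text.replace(CURLY, STRAIGHT)
    curly = straight.replace(STRAIGHT, CURLY)
    return [straight, curly]


# Flat list of every prompt in both forms, ready to feed to the TTS pipeline.
prompts = [variant for p in BASE_PROMPTS for variant in apostrophe_variants(p)]

for p in prompts:
    print(repr(p))
```

Running each entry of `prompts` through the same pipeline with identical settings makes it easier to tell whether the apostrophe character itself influences the result.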

Expected Behavior

  • The model should pronounce “he’s” as the contraction of “he is,” i.e., phonetically close to /hiːz/.
  • Sentences containing “he’s” should sound natural and distinct from “his.”

Actual Behavior

  • “he’s” is pronounced as “his” (phonetically closer to /hɪz/).
  • This occurs reliably with different sentences that contain “he’s.”
  • Other words in the sentence are pronounced correctly, suggesting the issue is localized to this contraction.

Scope and Additional Observations

  • The issue appears even with clean, synthetic text, so it is likely a frontend or G2P/normalization problem rather than an artifact of specific cloning samples.
  • “he is” is pronounced correctly, indicating the model can realize the intended phonemes when the contraction is expanded.
  • The problem may extend to other contractions with apostrophes (e.g., “she’s,” “that’s”), but this has not been fully tested yet.

Temporary Workaround

  • Expanding contractions in preprocessing (e.g., mapping he's → he is) before sending the text to the model avoids the mispronunciation and yields correct audio.
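
A minimal sketch of this preprocessing step, assuming a small hand-picked set of contractions. The function name `expand_contractions` and the word list are illustrative. Note that "'s" is ambiguous ("he's gone" can mean "he has gone"), so always expanding to "is" is a simplification that follows the workaround described above.

```python
import re

# Accept both the ASCII apostrophe and the typographic one (U+2019).
APOSTROPHES = "'\u2019"

# Hypothetical, deliberately small list of stems; extend as needed.
_PATTERN = re.compile(
    r"\b(he|she|that|what)[" + APOSTROPHES + r"]s\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Rewrite e.g. "he's" as "he is", keeping the stem's original casing."""
    return _PATTERN.sub(lambda m: m.group(1) + " is", text)


print(expand_contractions("He\u2019s going to the store."))
print(expand_contractions("he's happy today."))
```

The word boundary in the pattern keeps the possessive "his" and unrelated words untouched, and only the matched contraction is rewritten, so other punctuation in the text is left as typed.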

Suggested Fix / Request

  • Review and adjust the English text‑normalization or grapheme‑to‑phoneme rules used before acoustic generation, particularly for apostrophe‑based contractions.
  • Add tests for common English contractions (“he’s,” “she’s,” “that’s,” “what’s,” etc.) to catch similar issues in future releases.
  • If the model expects a specific apostrophe or tokenization pattern, document this clearly in the model card or technical report so users can normalize text accordingly.
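
One possible shape for such a contraction test is a round trip through an external ASR tool, as in step 4 of the reproduction. This is a sketch under assumptions: `synthesize` and `transcribe` are hypothetical callables supplied by the caller (text to audio bytes, and audio bytes to text), not functions from any Qwen3‑TTS or ASR library. ASR output can itself be biased by context, so a failure here is a signal to listen to the audio, not proof of a mispronunciation.

```python
import re
from typing import Callable, Dict, List

# Illustrative test sentences, one per contraction under test.
TEST_SENTENCES: Dict[str, str] = {
    "he's": "he's happy today.",
    "she's": "she's happy today.",
    "that's": "that's a good idea.",
    "what's": "what's the time?",
}


def words(text: str) -> List[str]:
    """Lowercase, map curly apostrophes to ASCII, and split into word tokens."""
    return re.findall(r"[a-z']+", text.lower().replace("\u2019", "'"))


def failing_contractions(
    synthesize: Callable[[str], bytes],   # hypothetical: text -> audio bytes
    transcribe: Callable[[bytes], str],   # hypothetical: audio bytes -> text
) -> List[str]:
    """Return the contractions that do not survive a TTS -> ASR round trip."""
    failures = []
    for contraction, sentence in TEST_SENTENCES.items():
        heard = words(transcribe(synthesize(sentence)))
        if contraction not in heard:
            failures.append(contraction)
    return failures
```

Passing the two callables in keeps the check independent of any particular inference script or ASR tool, so the same list of sentences can be reused across releases.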
