Can't Stream With The Model Locally

#4
by Sarkkoth77 - opened

Hi! First off, great work on Qwen3-TTS — the audio quality is impressive, and the model card’s description of the dual-track streaming architecture is exciting.

I wanted to ask for clarification around streaming support as exposed to end users.

The model card and website mention “Extreme Low-Latency Streaming Generation,” with the first audio packet available almost immediately (~97 ms). However, in the currently released Python tooling and examples (e.g. generate_voice_clone), audio appears to be returned only as a single, fully generated waveform, which makes true low-latency, real-time playback difficult in interactive systems such as voice assistants.

To be specific:

Is there a publicly exposed streaming / incremental audio generation API (e.g. a generator, callback, websocket, or chunked output) that users are expected to use?

Or is streaming currently supported only in internal demos / research code, with a public streaming interface planned for a future update?

I’m asking because the architectural claims strongly suggest real-time audio streaming is possible, but it’s not obvious how to access that capability from the released code.
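To illustrate what I mean, here is a rough sketch of the kind of generator-style interface I'm hoping for. The function name and signature below are entirely hypothetical (I made them up for illustration; nothing like this exists in the released code as far as I can tell), and the "synthesis" is just a placeholder waveform:

```python
import numpy as np

# Hypothetical streaming variant of generate_voice_clone: yields PCM
# chunks as they are decoded instead of one full waveform at the end.
# Name, signature, and the placeholder synthesis are illustrative only.
def generate_voice_clone_stream(text, chunk_samples=2400):
    waveform = np.zeros(24000, dtype=np.float32)  # stand-in for real synthesis
    for start in range(0, len(waveform), chunk_samples):
        yield waveform[start:start + chunk_samples]

# Consumer side: each chunk could go straight to an audio device,
# websocket, or playback buffer as soon as it arrives.
received = [chunk for chunk in generate_voice_clone_stream("hello")]
```

Even a callback or chunked-output equivalent would work just as well; the key point is getting audio out incrementally rather than waiting for the whole utterance.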

Thanks again for the release — any clarification on roadmap or recommended usage would be greatly appreciated.

Hey, I actually tackled this over the weekend since I needed it for a project.

Here is the repo: https://github.com/CloudWells/qwen3-tts-realtime-streaming

It implements streaming by manually stepping the decoder with the KV cache. Just a heads-up: I deliberately traded some of that theoretical "97 ms" latency for stability. My engine buffers the first ~38 tokens (about 3 s of audio) before sending the first chunk. In my testing, without that initial context lock, the cloned voice tends to drift or glitch right at the start.
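The buffering idea is simple enough to sketch in isolation. This is a simplified toy version, not the actual engine code from the repo: `decode` stands in for the model's token-to-PCM step, and the 38-token threshold mirrors the warmup described above:

```python
import numpy as np

def buffered_stream(token_chunks, decode, warmup_tokens=38):
    """Yield audio chunks, holding output back until `warmup_tokens`
    tokens have been generated (the ~3 s context lock), then streaming
    each subsequent chunk immediately."""
    buffer = []   # decoded audio held back during warmup
    seen = 0
    for tokens in token_chunks:
        seen += len(tokens)
        audio = decode(tokens)
        if seen < warmup_tokens:
            buffer.append(audio)              # still warming up: hold back
            continue
        if buffer:
            buffer.append(audio)
            yield np.concatenate(buffer)      # first chunk: warmup + current
            buffer = []
        else:
            yield audio                       # steady state: stream as decoded

# Toy decoder: pretend each token expands to 80 samples of silence.
fake_decode = lambda toks: np.zeros(len(toks) * 80, dtype=np.float32)
chunks = list(buffered_stream([list(range(10))] * 6, fake_decode))
# First yielded chunk is large (the warmup buffer); the rest are small.
```

So the listener hears one slightly delayed first chunk, and everything after that arrives at decode speed.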

After that initial ramp-up, it streams smoothly. Hope this helps.
