Can't Stream With The Model Locally
Hi! First off, great work on Qwen3-TTS — the audio quality is impressive, and the model card’s description of the dual-track streaming architecture is exciting.
I wanted to ask for clarification around streaming support as exposed to end users.
The model card and website mention “Extreme Low-Latency Streaming Generation” with first audio packets available almost immediately (~97 ms). However, in the currently released Python tooling / examples (e.g. generate_voice_clone), audio appears to be returned only as a fully generated waveform, which makes it difficult to achieve true low-latency, real-time playback in interactive systems such as voice assistants.
To be specific:
- Is there a publicly exposed streaming / incremental audio generation API (e.g. generator, callback, websocket, chunked output) that users should be using?
- Or is streaming currently supported only in internal demos / research code, with a public streaming interface planned for a future update?
I’m asking because the architectural claims strongly suggest real-time audio streaming is possible, but it’s not obvious how to access that capability from the released code.
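For concreteness, here is the kind of interface I was hoping to find. This is purely hypothetical; `stream_voice_clone` and the chunk semantics are invented for illustration and do not exist in the released tooling as far as I can tell:

```python
# Hypothetical generator-style streaming API (invented for illustration).
# A real implementation would yield PCM chunks as decoding proceeds,
# instead of returning one finished waveform at the end.

def stream_voice_clone(text, ref_audio):
    """Stand-in for a streaming TTS call; yields audio chunks incrementally."""
    for piece in text.split():  # stand-in for incremental decode steps
        yield piece.encode()    # stand-in for a PCM16 audio chunk

chunks = []
for chunk in stream_voice_clone("hello streaming world", ref_audio=None):
    # In a real app this is where you'd push the chunk to the audio device
    # immediately, rather than waiting for the full waveform.
    chunks.append(chunk)

print(len(chunks))  # prints 3
```

Even a callback or chunked-output variant of the same idea would be enough to build real-time playback on top of.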
Thanks again for the release — any clarification on roadmap or recommended usage would be greatly appreciated.
Hey, I actually tackled this over the weekend since I needed it for a project.
Here is the repo: https://github.com/CloudWells/qwen3-tts-realtime-streaming
It implements streaming via manual decode stepping with the KV cache. Just a heads-up: I deliberately traded some of that "97 ms" theoretical latency for stability. My engine buffers the first ~38 tokens (about 3 s of audio) before sending the first chunk. In my testing, without that initial context lock, the cloned voice tends to drift or glitch right at the start.
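In case it helps, the warm-up logic is roughly this. A simplified sketch: the real engine steps the model with its KV cache, and `buffered_stream`, the integer "tokens", and the exact counts here are just stand-ins:

```python
# Simplified sketch of the buffer-then-stream warm-up described above.
# Real code decodes audio tokens from the model step by step; here plain
# integers stand in for tokens so the logic is runnable on its own.

def buffered_stream(tokens, warmup=38):
    """Hold back the first `warmup` tokens, then stream the rest as they come."""
    buf = []
    started = False
    for tok in tokens:
        if not started:
            buf.append(tok)
            if len(buf) >= warmup:
                started = True
                yield buf  # first (large) chunk: locks in the voice context
                buf = []
        else:
            yield [tok]  # after warm-up, emit each new token immediately

# 50 tokens in: one 38-token warm-up chunk, then 12 single-token chunks.
chunks = list(buffered_stream(range(50), warmup=38))
print(len(chunks))  # prints 13
```

The trade-off is simply that the listener hears nothing for the warm-up window, but everything after it arrives with minimal added latency.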
After that initial ramp-up, it streams smoothly. Hope this helps.