Can't Stream The Model Locally
Hi! First off, great work on Qwen3-TTS — the audio quality is impressive, and the model card’s description of the dual-track streaming architecture is exciting.
I wanted to ask for clarification around streaming support as exposed to end users.
The model card and website mention “Extreme Low-Latency Streaming Generation” with first audio packets available almost immediately (~97 ms). However, in the currently released Python tooling / examples (e.g. generate_voice_clone), audio appears to be returned only as a fully generated waveform, which makes it difficult to achieve true low-latency, real-time playback in interactive systems (e.g. voice assistants).
To be specific:
Is there a publicly exposed streaming / incremental audio generation API (e.g. generator, callback, websocket, chunked output) that users should be using?
Or is streaming currently supported only in internal demos / research code, with a public streaming interface planned for a future update?
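For concreteness, this is the kind of generator-style interface I have in mind. Everything here is hypothetical — `stream_voice_clone` is not part of the released tooling; the stub just simulates chunked output so the consumption pattern (incremental playback plus a TTFA measurement) is runnable:

```python
import time
from typing import Iterator

def stream_voice_clone(text: str, chunk_ms: int = 80) -> Iterator[bytes]:
    """Hypothetical streaming API: yield raw PCM chunks as they are decoded.
    Stub implementation: emits silence so the consumer below can run."""
    sample_rate, sample_width = 24000, 2      # 24 kHz, 16-bit mono (assumed)
    chunk_bytes = sample_rate * sample_width * chunk_ms // 1000
    n_chunks = max(1, len(text) // 10)        # stand-in for text-driven length
    for _ in range(n_chunks):
        time.sleep(0.01)                      # stand-in for decode latency
        yield b"\x00" * chunk_bytes

# Consumer: measure time-to-first-audio (TTFA) and hand chunks to a player.
start = time.monotonic()
ttfa = None
received = 0
for chunk in stream_voice_clone("Hello there, this is a streaming test."):
    if ttfa is None:
        ttfa = time.monotonic() - start       # latency of the first packet
    received += len(chunk)
    # player.write(chunk)  # e.g. a sounddevice/pyaudio output stream

print(f"TTFA: {ttfa * 1000:.0f} ms, bytes received: {received}")
```

The point is that playback can begin after the first yielded chunk instead of waiting for the full waveform — a callback or chunked-HTTP/websocket variant would serve the same purpose.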
I’m asking because the architectural claims strongly suggest real-time audio streaming is possible, but it’s not obvious how to access that capability from the released code.
Thanks again for the release — any clarification on roadmap or recommended usage would be greatly appreciated.
Have you checked the vllm-omni implementation?
No streaming: Audio is generated completely before being returned. Streaming will be supported after the pipeline is disaggregated (see RFC #938).
Yeah, unfortunately what I'm looking for is low-latency streaming.
https://github.com/tsdocode/nano-qwen3tts-vllm
This gentleman created a vLLM version with audio streaming: RTF 0.3-0.4, with TTFA as low as 80-90 ms for the 1.7B-12Hz model.
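For anyone unfamiliar with the figures: RTF (real-time factor) is wall-clock synthesis time divided by the duration of the audio produced, and TTFA is the delay before the first audio chunk arrives. A quick sanity check of what an RTF in that range implies (the 3.5 s / 10 s example is illustrative, not a measurement from the repo):

```python
# Real-time factor: wall-clock synthesis time / duration of generated audio.
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

# Illustrative example: 10 s of speech synthesized in 3.5 s.
r = rtf(3.5, 10.0)
print(f"RTF = {r:.2f} -> {1 / r:.1f}x faster than real time")

# Any RTF below 1.0 is what makes gapless streaming playback possible:
# the model produces audio faster than the speaker consumes it.
```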
Thanks! That's closer to what I'm after since it's really fast. It's still not the plug-and-play conversational streaming I was looking for, but maybe I can patch something together from this.