Can't Stream The Model Locally
Hi! First off, great work on Qwen3-TTS — the audio quality is impressive, and the model card’s description of the dual-track streaming architecture is exciting.
I wanted to ask for clarification around streaming support as exposed to end users.
The model card and website mention “Extreme Low-Latency Streaming Generation” with first audio packets available almost immediately (~97 ms). However, in the currently released Python tooling / examples (e.g. generate_voice_clone), audio appears to be returned only as a fully generated waveform, which makes it difficult to achieve true low-latency, real-time playback in interactive systems (e.g. voice assistants).
To be specific:
Is there a publicly exposed streaming / incremental audio generation API (e.g. generator, callback, websocket, chunked output) that users should be using?
Or is streaming currently supported only in internal demos / research code, with a public streaming interface planned for a future update?
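For concreteness, this is the kind of generator-style interface I have in mind. Everything here is hypothetical — `stream_voice_clone` is not part of the released tooling; the stub just simulates chunked output so the consumption pattern (incremental playback plus a TTFA measurement) is runnable:

```python
import time
from typing import Iterator

def stream_voice_clone(text: str, chunk_ms: int = 80) -> Iterator[bytes]:
    """Hypothetical streaming API: yield raw PCM chunks as they are decoded.
    Stub implementation: emits silence so the consumer below can run."""
    sample_rate, sample_width = 24000, 2      # 24 kHz, 16-bit mono (assumed)
    chunk_bytes = sample_rate * sample_width * chunk_ms // 1000
    n_chunks = max(1, len(text) // 10)        # stand-in for text-driven length
    for _ in range(n_chunks):
        time.sleep(0.01)                      # stand-in for decode latency
        yield b"\x00" * chunk_bytes

# Consumer: measure time-to-first-audio (TTFA) and hand chunks to a player.
start = time.monotonic()
ttfa = None
received = 0
for chunk in stream_voice_clone("Hello there, this is a streaming test."):
    if ttfa is None:
        ttfa = time.monotonic() - start       # latency of the first packet
    received += len(chunk)
    # player.write(chunk)  # e.g. a sounddevice/pyaudio output stream

print(f"TTFA: {ttfa * 1000:.0f} ms, bytes received: {received}")
```

The point is that playback can begin after the first yielded chunk instead of waiting for the full waveform — a callback or chunked-HTTP/websocket variant would serve the same purpose.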
I’m asking because the architectural claims strongly suggest real-time audio streaming is possible, but it’s not obvious how to access that capability from the released code.
Thanks again for the release — any clarification on roadmap or recommended usage would be greatly appreciated.
Have you checked the vllm-omni implementation?
No streaming: Audio is generated completely before being returned. Streaming will be supported after the pipeline is disaggregated (see RFC #938).
Yeah, unfortunately what I'm looking for is low-latency streaming.
https://github.com/tsdocode/nano-qwen3tts-vllm
This gentleman created a vLLM version with audio streaming: RTF 0.3-0.4, with TTFA as low as 80-90 ms for the 1.7B-12Hz model.
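For anyone unfamiliar with the figures: RTF (real-time factor) is wall-clock synthesis time divided by the duration of the audio produced, and TTFA is the delay before the first audio chunk arrives. A quick sanity check of what an RTF in that range implies (the 3.5 s / 10 s example is illustrative, not a measurement from the repo):

```python
# Real-time factor: wall-clock synthesis time / duration of generated audio.
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

# Illustrative example: 10 s of speech synthesized in 3.5 s.
r = rtf(3.5, 10.0)
print(f"RTF = {r:.2f} -> {1 / r:.1f}x faster than real time")

# Any RTF below 1.0 is what makes gapless streaming playback possible:
# the model produces audio faster than the speaker consumes it.
```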
Thanks! That's closer to what I'm after since it's really fast. It's still not the plug-and-play conversational streaming I was looking for, but maybe I can patch something together from this.