Low generation speed and low GPU utilization (~12%) during inference
I'm running Qwen3-TTS-12Hz-1.7B-CustomVoice for TTS generation and observing very low GPU utilization during inference (~12%), which results in low generation speed. RTF is ~3.0, which means 1 minute of audio needs 3 minutes of generation time.
Environment:
GPU: NVIDIA RTX 3090 (24GB VRAM)
Model: Qwen2.5-Omni-3B (TTS mode only)
Framework: Transformers + torch.compile()
CUDA: 12.x
PyTorch: 2.5.0
Driver: 560.x
During TTS generation, nvidia-smi shows:
GPU Utilization: Only 10-12%
Memory Usage: ~8GB allocated
Power Draw: ~80-100W (vs 350W TDP)
RTF: ~3.0 (generation takes 3x the audio duration) - very low generation speed
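For reference, the RTF figures in this thread are wall-clock generation time divided by audio duration; a quick helper to compute it:

```python
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: wall-clock generation time / audio duration.

    RTF < 1.0 means faster than real time; RTF = 3.0 means one minute
    of audio takes three minutes to generate.
    """
    return generation_seconds / audio_seconds

print(rtf(180.0, 60.0))  # → 3.0
```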
Is this GPU utilization expected for autoregressive TTS generation?
Are there any recommended optimizations for single-request inference?
Would batching multiple requests improve GPU utilization?
Is there a vLLM-compatible version planned for better throughput? (The current vLLM route has large VRAM usage and RTF ~2.8.)
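On the batching question: single-request autoregressive decode is typically memory-bandwidth bound, so low GPU utilization is expected, and throughput usually improves when several requests share one forward pass. A minimal sketch of grouping texts into fixed-size batches; the batched `model.generate(batch)` call is hypothetical and the real Qwen3-TTS API may differ:

```python
from typing import List

def make_batches(sentences: List[str], batch_size: int) -> List[List[str]]:
    """Group texts into fixed-size batches so one forward pass can serve
    several requests (hypothetical pre-processing, not part of the Qwen API)."""
    return [sentences[i:i + batch_size]
            for i in range(0, len(sentences), batch_size)]

sentences = ["First sentence.", "Second.", "Third.", "Fourth.", "Fifth."]
for batch in make_batches(sentences, batch_size=2):
    # audio = model.generate(batch, ...)  # hypothetical batched call
    print(len(batch))  # → 2, 2, 1
```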
Same here on my setup:
$ nvidia-smi
| NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
| 1 NVIDIA GeForce RTX 5090 Off | 00000000:65:00.0 Off | N/A |
model = Qwen3TTSModel.from_pretrained(
"Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
device_map="cuda:0",
dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
from transformers.utils import is_flash_attn_2_available
print("FA2 available:", is_flash_attn_2_available())  # prints True
GPU Utilization: 14-16%
Power Draw: ~ 110W
execution time: ~2x audio duration (RTF ~2.0)
Same on my RTX 4090
In fact, I am getting the same (or somewhat better) speed with attn_implementation="eager".
This is a really slow TTS model. If I break a longer paragraph into sentences, the outputs are significantly different from each other,
even with the same configuration (speaker, instructions, etc.).
I do not think this is usable.
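The chunk-to-chunk variation may partly come from sampling randomness rather than the model itself; if you can reseed torch (or pass a seed, if the generate call exposes one) before each chunk, the draws become repeatable. A toy illustration of reseeding before sampling, not the Qwen API:

```python
import torch

def sample_with_seed(logits: torch.Tensor, seed: int) -> torch.Tensor:
    """Sample token ids reproducibly by reseeding before each draw."""
    torch.manual_seed(seed)
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

logits = torch.randn(4, 32)        # stand-in for per-step model logits
a = sample_with_seed(logits, seed=1234)
b = sample_with_seed(logits, seed=1234)
print(torch.equal(a, b))  # → True: same seed, same samples
```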
Hear the full 2-minute audio.
try this github.com/QwenLM/Qwen3-TTS/issues/89#issuecomment-3799395212
This is good, but even a 3x speedup still leaves the RTF too low. For a 1.7B-parameter model, 100 s of audio should ideally be generated in ~10-15 seconds on an RTX 4090. I get that speed with VibeVoice and Coqui.
My RTF is still 3.0
3080 Mobile GPU, RTF 4.0,
but the quality is super nice!