flow.decoder.estimator.fp16.onnx returns NaNs on CUDA (CPU works)
Hi! I’m running pure ONNX inference for CosyVoice3 and I’m hitting a GPU-specific NaN issue in the Flow estimator.
When I execute the estimator on CUDAExecutionProvider, the output becomes all NaNs on the very first call, even though all inputs are finite (no NaN/Inf). The same model behaves correctly on CPUExecutionProvider.
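For reference, this is roughly how I verify that the inputs are finite before blaming the EP (a minimal sketch — the feed names and shapes below are placeholders, not the real CosyVoice3 estimator signature):

```python
import numpy as np

def assert_finite(name, arr):
    """Raise if an array contains NaN/Inf, to rule out bad inputs
    before suspecting the CUDA kernels."""
    bad = np.count_nonzero(~np.isfinite(arr))
    if bad:
        raise ValueError(f"{name}: {bad} non-finite values")

# Hypothetical estimator feeds (placeholder names/shapes).
feeds = {
    "x": np.random.randn(1, 80, 100).astype(np.float32),
    "t": np.array([0.5], dtype=np.float32),
}
for name, arr in feeds.items():
    assert_finite(name, arr)

# out = session.run(None, feeds)   # CUDAExecutionProvider session
# assert_finite("output", out[0])  # raises on CUDA, passes on CPU in my setup
```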
Environment (Docker):
- Base image: nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
- Python: 3.10
- ONNX Runtime: onnxruntime-gpu==1.18.0
- numpy: 1.26.4
- scipy: 1.13.1
- librosa: 0.10.2
- soundfile: 0.12.1
- transformers: 4.51.3
This looks like a CUDA EP numerical issue (possibly an FP16 model + CUDA kernel fusion, e.g. LayerNorm/Softmax/Exp), or a problematic FP16 export that only breaks on CUDA kernels. Since the model input types are float32, it seems to happen inside the graph.
Questions:
- Is there an official flow.decoder.estimator.fp32.onnx for CosyVoice3 Flow decoding that is expected to be used on CUDA?
- Are there any recommended ORT/CUDA versions or export settings to avoid NaNs?
- Is this a known issue with the FP16 estimator on CUDA?
Could you also share/provide an ONNX export of the Flow estimator in FP32 (e.g. flow.decoder.estimator.fp32.onnx) for comparison/testing?
Right now the issue reproduces with flow.decoder.estimator.fp16.onnx on CUDA (NaNs immediately), while CPU behaves normally. Having the FP32 estimator would help confirm whether this is strictly an FP16+CUDA kernel/fusion problem and would also provide a practical workaround if FP32 runs stably on GPU.
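In the meantime, the workaround I'm using is to pin only this submodel to the CPU EP while the rest of the pipeline stays on CUDA (a sketch; `estimator_providers` is a hypothetical helper, not part of CosyVoice):

```python
def estimator_providers(force_cpu: bool):
    """Hypothetical helper: pin only the Flow estimator to the CPU EP,
    while other submodels keep CUDA (with CPU fallback)."""
    if force_cpu:
        return ["CPUExecutionProvider"]
    return ["CUDAExecutionProvider", "CPUExecutionProvider"]

# import onnxruntime as ort
# est = ort.InferenceSession("flow.decoder.estimator.fp16.onnx",
#                            providers=estimator_providers(force_cpu=True))
```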
Thanks! Update: I found that an FP32 version of the Flow estimator actually exists in the official model package (flow.decoder.estimator.fp32.onnx), which explains why it isn't included/produced by this repo's export scripts.
That said, I won’t close this issue yet, because the FP16 estimator still returns NaNs on the CUDA EP in my setup. It would be great to fix the FP16-on-CUDA stability (or document a recommended workaround / export settings) so GPU inference can run reliably without forcing FP32 for this submodel.