flow.decoder.estimator.fp16.onnx returns NaNs on CUDA (CPU works)
Hi! I’m running pure ONNX inference for CosyVoice3 and I’m hitting a GPU-specific NaN issue in the Flow estimator.
When I execute the estimator on CUDAExecutionProvider, the output becomes all NaNs on the very first call, even though all inputs are finite (no NaN/Inf). The same model behaves correctly on CPUExecutionProvider.
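For reference, this is roughly how I verify that the inputs are finite before blaming the EP (a minimal sketch — the feed names and shapes below are placeholders, not the real CosyVoice3 estimator signature):

```python
import numpy as np

def assert_finite(name, arr):
    """Raise if an array contains NaN/Inf, to rule out bad inputs
    before suspecting the CUDA kernels."""
    bad = np.count_nonzero(~np.isfinite(arr))
    if bad:
        raise ValueError(f"{name}: {bad} non-finite values")

# Hypothetical estimator feeds (placeholder names/shapes).
feeds = {
    "x": np.random.randn(1, 80, 100).astype(np.float32),
    "t": np.array([0.5], dtype=np.float32),
}
for name, arr in feeds.items():
    assert_finite(name, arr)

# out = session.run(None, feeds)   # CUDAExecutionProvider session
# assert_finite("output", out[0])  # raises on CUDA, passes on CPU in my setup
```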
Environment (Docker):
- Base image: nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
- Python: 3.10
- ONNX Runtime: onnxruntime-gpu==1.18.0
- numpy: 1.26.4
- scipy: 1.13.1
- librosa: 0.10.2
- soundfile: 0.12.1
- transformers: 4.51.3
This looks like a CUDA EP numerical issue (possibly an FP16 model + CUDA kernel fusion, e.g. LayerNorm/Softmax/Exp), or a problematic FP16 export that only breaks on CUDA kernels. Since the model input types are float32, it seems to happen inside the graph.
Questions:
- Is there an official flow.decoder.estimator.fp32.onnx for CosyVoice3 Flow decoding that is expected to be used on CUDA?
- Are there any recommended ORT/CUDA versions or export settings to avoid NaNs?
- Is this a known issue with the FP16 estimator on CUDA?
Could you also share/provide an ONNX export of the Flow estimator in FP32 (e.g. flow.decoder.estimator.fp32.onnx) for comparison/testing?
Right now the issue reproduces with flow.decoder.estimator.fp16.onnx on CUDA (NaNs immediately), while CPU behaves normally. Having the FP32 estimator would help confirm whether this is strictly an FP16+CUDA kernel/fusion problem and would also provide a practical workaround if FP32 runs stably on GPU.
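In the meantime, the workaround I'm using is to pin only this submodel to the CPU EP while the rest of the pipeline stays on CUDA (a sketch; `estimator_providers` is a hypothetical helper, not part of CosyVoice):

```python
def estimator_providers(force_cpu: bool):
    """Hypothetical helper: pin only the Flow estimator to the CPU EP,
    while other submodels keep CUDA (with CPU fallback)."""
    if force_cpu:
        return ["CPUExecutionProvider"]
    return ["CUDAExecutionProvider", "CPUExecutionProvider"]

# import onnxruntime as ort
# est = ort.InferenceSession("flow.decoder.estimator.fp16.onnx",
#                            providers=estimator_providers(force_cpu=True))
```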
Thanks! Update: I found that an FP32 version of the Flow estimator actually exists in the official model package (flow.decoder.estimator.fp32.onnx), which explains why it isn't included/produced by this repo's export scripts.
That said, I won’t close this issue yet, because the FP16 estimator still returns NaNs on the CUDA EP in my setup. It would be great to fix the FP16-on-CUDA stability (or document a recommended workaround / export settings) so GPU inference can run reliably without forcing FP32 for this submodel.