flow.decoder.estimator.fp16.onnx returns NaNs on CUDA (CPU works)

#2
by selectorrrr - opened

Hi! I’m running pure ONNX inference for CosyVoice3 and I’m hitting a GPU-specific NaN issue in the Flow estimator.

When I execute the estimator on CUDAExecutionProvider, the output becomes all NaNs on the very first call, even though all inputs are finite (no NaN/Inf). The same model behaves correctly on CPUExecutionProvider.
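For reference, this is roughly the (numpy-only) check I run on the feed dict before the session call to rule out bad inputs; the input names here are placeholders, not the estimator's actual input names:

```python
import numpy as np

def assert_finite_feeds(feeds: dict) -> None:
    """Raise if any input tensor contains NaN or Inf."""
    for name, arr in feeds.items():
        if not np.isfinite(arr).all():
            bad = int(np.size(arr) - np.isfinite(arr).sum())
            raise ValueError(f"{name}: {bad} non-finite values")

# Dummy float32 arrays standing in for the real estimator feeds
feeds = {
    "x": np.random.randn(1, 80, 100).astype(np.float32),
    "t": np.asarray([0.5], dtype=np.float32),
}
assert_finite_feeds(feeds)  # passes: every input is finite
```

All inputs pass this check, yet the CUDA output is still all NaN.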

Environment (Docker)

  • Base image: nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
  • Python: 3.10
  • ONNX Runtime: onnxruntime-gpu==1.18.0
  • numpy: 1.26.4
  • scipy: 1.13.1
  • librosa: 0.10.2
  • soundfile: 0.12.1
  • transformers: 4.51.3

This looks like a CUDA EP numerical issue (possibly the FP16 model hitting a fused CUDA kernel, e.g. LayerNorm/Softmax/Exp), or a problematic FP16 export that only breaks on CUDA kernels. Since the model's input types are float32 and all inputs are finite, the NaNs must originate inside the graph.
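To illustrate why an FP16 Softmax/Exp is a plausible culprit (this is a generic numpy demonstration, not the actual estimator graph): a naive softmax computed in float16 already produces NaN on modest logits, because exp overflows float16's max of ~65504 to inf, and inf/inf is NaN. The max-subtracted form stays finite:

```python
import numpy as np

logits = np.array([12.0, 11.0, 10.0], dtype=np.float16)

# Naive softmax: exp(12) ~ 1.6e5 overflows float16 (max ~65504) -> inf,
# so the normalization computes inf/inf -> NaN.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.exp(logits).sum()

# Numerically stable softmax: subtract the max before exponentiating.
stable = np.exp(logits - logits.max())
stable = stable / stable.sum()

print(np.isnan(naive).any())      # True
print(np.isfinite(stable).all())  # True
```

A fused CUDA kernel that keeps intermediates in FP16 could hit exactly this pattern, while the CPU EP (or an unfused graph accumulating in FP32) would not.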

Questions:

  • Is there an official flow.decoder.estimator.fp32.onnx for CosyVoice3 Flow decoding that is expected to be used on CUDA?
  • Are there any recommended ORT/CUDA versions or export settings to avoid NaNs?
  • Is this a known issue with the FP16 estimator on CUDA?

Could you also provide an FP32 ONNX export of the Flow estimator (e.g. flow.decoder.estimator.fp32.onnx) for comparison/testing?
Right now the issue reproduces with flow.decoder.estimator.fp16.onnx on CUDA (NaNs immediately), while CPU behaves normally. An FP32 estimator would help confirm whether this is strictly an FP16+CUDA kernel/fusion problem, and would also be a practical workaround if FP32 runs stably on GPU.
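Once both variants are available, a small numpy helper like this could quantify the divergence between the two runs (the helper name is mine; shown here on dummy arrays standing in for the CPU-FP32 and CUDA-FP16 outputs):

```python
import numpy as np

def compare_outputs(ref: np.ndarray, test: np.ndarray) -> dict:
    """Summarize how a test output diverges from a reference output."""
    nan_count = int(np.isnan(test).sum())
    finite = np.isfinite(ref) & np.isfinite(test)
    max_abs_diff = (
        float(np.abs(ref[finite] - test[finite]).max()) if finite.any() else float("nan")
    )
    return {"nan_count": nan_count, "max_abs_diff": max_abs_diff}

# Dummy data: a round-trip through float16 mimics FP16 precision loss,
# and one injected NaN mimics the broken GPU output.
ref = np.random.randn(1, 80, 100).astype(np.float32)
test = ref.astype(np.float16).astype(np.float32)
test[0, 0, 0] = np.nan

report = compare_outputs(ref, test)
print(report["nan_count"])  # 1
```

A healthy FP16 run should show zero NaNs and only small `max_abs_diff`; the current CUDA run would report the whole tensor as NaN.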

Thanks! Update: I found that an FP32 version of the Flow estimator already exists in the official model package (flow.decoder.estimator.fp32.onnx), which explains why it isn't produced by this repo's export scripts.

That said, I won’t close this issue yet, because the FP16 estimator still returns NaNs on the CUDA EP in my setup. It would be great if the FP16-on-CUDA stability could be fixed (or a recommended workaround / export settings documented), so GPU inference can run reliably without forcing FP32 for this submodel.
