nvidia/nemotron-speech-streaming-en-0.6b · missing punctuation marks

#11

by Kerwin11 - opened Feb 7

edited Feb 7

Dear NVIDIA Team,

Thank you very much for open-sourcing the Nemotron model; it has been incredibly helpful for our work. During usage, I've encountered some issues and would appreciate your guidance.

My Environment Setup:

Python Version: 3.13.11
NeMo Version: 2.6.2

Script Used:
https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py

Test Audio File:
alan_podcast2_convert.wav from https://huggingface.co/datasets/Kerwin11/audio_data/tree/main
The audio information is as follows:
samplerate: 16000 Hz
channels: 1
duration: 1e+01:51.563 min (the minutes field was printed in scientific notation, i.e. roughly 10 min 51.563 s)
format: WAV (Microsoft) [WAV]
subtype: Signed 16 bit PCM [PCM_16]
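
For anyone who wants to double-check these properties independently, here is a minimal sketch using only Python's standard wave module. The function name describe_wav is hypothetical (it is not part of NeMo), and the sketch assumes an uncompressed PCM WAV file like the one above.

```python
import wave


def describe_wav(path):
    """Return basic properties of an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "samplerate": rate,
            "channels": wav.getnchannels(),
            # 2 bytes per sample corresponds to 16-bit PCM
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": frames / rate,
        }


# Example call (the path is the one used in this post):
# print(describe_wav("/workspace/alan_podcast2_convert.wav"))
```

For the test file I would expect a sample rate of 16000, one channel, and a sample width of 2 bytes, matching the listing above.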

Command Used:

python speech_to_text_cache_aware_streaming_infer.py \
    model_path="/workspace/nemotron-speech-streaming-en-0.6b.nemo" \
    audio_file="/workspace/alan_podcast2_convert.wav" \
    compare_vs_offline=false \
    att_context_size="[70,13]" \
    amp=false \
    debug_mode=false \
    rnnt_decoding.greedy.use_cuda_graph_decoder=false

I have to set rnnt_decoding.greedy.use_cuda_graph_decoder to false; otherwise, the program crashes. I have reproduced the crash on both an RTX 2080 Ti and an RTX 4090, so I am unsure whether it is related to the GPU model or to the CUDA version. A similar issue has been reported on GitHub, and the relevant line of the error log follows it:

https://github.com/NVIDIA-NeMo/NeMo/issues/15340

ValueError: not enough values to unpack (expected 6, got 5)

Disabling the CUDA graph decoder does not appear to affect the overall inference results. However, the generated transcription is missing punctuation toward the end of the audio. The generated text file is available at:
speech_text.txt from https://huggingface.co/datasets/Kerwin11/audio_data/tree/main

I believe I'm using the model correctly as per the documentation and examples, yet the punctuation is still lost. Could you please advise on what might be causing this? Is it related to the caching mechanism, the context size setting (att_context_size), or some other factor? I'd appreciate your analysis and any suggestions for fixing it.
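
To make it easier to see where in the transcript the punctuation drops off, a rough sketch like the following could be used. It splits the text into fixed-size windows of words and reports the fraction of words in each window that end with a punctuation mark. The function name punctuation_density, the window size, and the set of marks are my own assumptions for illustration, not anything provided by NeMo.

```python
# Marks counted as punctuation; this set is an assumption and can be adjusted.
SENTENCE_MARKS = set(".,?!")


def punctuation_density(text, window=50):
    """For each consecutive window of `window` words, return the fraction
    of words whose last character is one of SENTENCE_MARKS."""
    words = text.split()
    densities = []
    for start in range(0, len(words), window):
        chunk = words[start:start + window]
        punctuated = sum(1 for word in chunk if word[-1] in SENTENCE_MARKS)
        densities.append(punctuated / len(chunk))
    return densities


# Example call on the generated transcript:
# with open("speech_text.txt", encoding="utf-8") as f:
#     for i, d in enumerate(punctuation_density(f.read())):
#         print(f"window {i}: {d:.2f}")
```

A run of windows with a density of 0.0 at the end of the output would show at which point in the transcript the punctuation stops being produced.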

Thank you for your time!
