License: CC BY-NC 4.0. This repository is gated on Hugging Face: you must review and accept the license terms before accessing its files, and may only use the model under those terms. Access requests are processed immediately.

ONNX cache-aware streaming ASR Nemo (Conformer-RNNT) [EN-1.12s]

  • Device: CPU

  • Language: English

  • Latency: 1120 ms ((1 current + 13 future-context chunks) × 8 frames per chunk × 10 ms per frame)
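The quoted latency follows directly from the chunking parameters in the bullet above; a quick sanity check:

```python
# Latency sanity check for the cache-aware streaming setup above.
frame_ms = 10          # 1 frame = 10 ms
frames_per_chunk = 8   # 1 chunk = 8 frames
current_chunks = 1     # the chunk currently being decoded
future_chunks = 13     # future (right-context) chunks the model waits for

latency_ms = (current_chunks + future_chunks) * frames_per_chunk * frame_ms
print(latency_ms)  # 1120
```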

Streaming Speech Transcription Pipeline

Real-time English speech transcription: Audio In → ASR → Transcription

Transcribe spoken English into text with streaming input over WebSocket.

Input is English-only for now (a limitation of the NeMo ASR model).

Architecture

Audio Input → ASR (ONNX) → Transcript Output
  (PCM16)   Conformer RNN-T

See ARCHITECTURE.md for detailed design documentation.
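The server ingests raw PCM16 audio, while ONNX acoustic models typically expect float samples normalized to [-1, 1]. A minimal stdlib-only sketch of that conversion (the function name is illustrative, not taken from this repo, and native little-endian byte order is assumed):

```python
import array

def pcm16_to_float(pcm_bytes: bytes) -> list[float]:
    """Convert little-endian PCM16 bytes to floats in [-1.0, 1.0]."""
    samples = array.array("h")   # signed 16-bit, native (assumed little-endian) order
    samples.frombytes(pcm_bytes)
    return [s / 32768.0 for s in samples]

# 10 ms of silence at 16 kHz = 160 samples = 320 bytes
chunk = bytes(320)
print(len(pcm16_to_float(chunk)))  # 160
```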

Requirements

  • Python 3.10+
  • Model files:
    • ASR: NeMo Conformer RNN-T ONNX model directory

Installation

pip install -r requirements.txt

System Dependencies

# Ubuntu/Debian
apt-get install libsndfile1 libportaudio2

Usage

Start the Server

  • A CPU with at least 4 cores is recommended, e.g., AWS c5a.xlarge or m8a.xlarge.
python app.py \
  --asr-onnx-path models/ \
  --host 0.0.0.0 \
  --port 8765

CLI Options

Flag               Default     Description
--asr-onnx-path    (required)  ASR ONNX model directory
--asr-chunk-ms     10          ASR audio chunk duration (ms)
--asr-sample-rate  16000       ASR expected sample rate (Hz)
--audio-queue-max  256         Audio input queue max size
--host             0.0.0.0     Server bind host
--port             8765        Server port
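For example, overriding the defaults to run with a larger audio queue on a non-default port (the model path is illustrative):

```shell
python app.py \
  --asr-onnx-path models/ \
  --asr-sample-rate 16000 \
  --audio-queue-max 512 \
  --port 9000
```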

Python Client

Captures microphone audio and prints the text transcription.

pip install -r requirements_client.txt
python clients/python_client.py --uri ws://localhost:8765

Web Client

TBD

WebSocket Protocol

Direction  Type    Format  Description
Client→    Binary  PCM16   Raw audio at the declared sample rate
Client→    Text    JSON    {"action": "start", "sample_rate": 16000}
Client→    Text    JSON    {"action": "stop"}
→Client    Binary  PCM16   Synthesized audio at 24 kHz
→Client    Text    JSON    {"type": "transcript", "text": "..."}
→Client    Text    JSON    {"type": "status", "status": "started"}
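Assuming the message shapes above, the client side of the protocol can be sketched with the stdlib alone (helper names are illustrative; this is not the repo's client code, and actually sending the messages over WebSocket is left to a library such as websockets):

```python
import json

def start_msg(sample_rate: int = 16000) -> str:
    """Text frame that opens a transcription session."""
    return json.dumps({"action": "start", "sample_rate": sample_rate})

def stop_msg() -> str:
    """Text frame that ends the session."""
    return json.dumps({"action": "stop"})

def audio_frame(samples_pcm16: bytes) -> bytes:
    """Binary frames carry raw little-endian PCM16 at the declared rate."""
    return samples_pcm16

print(start_msg())  # {"action": "start", "sample_rate": 16000}
```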

Project Structure

nemo-asr-cache-aware-streaming-1120ms-en-onnx/
├── app.py                              # Main entry point
├── requirements.txt
├── README.md
├── ARCHITECTURE.md
├── models/
│   ├── (ONNX model files)
│   ├── config.json
│   └── vocab.txt
├── src/
│   ├── asr/
│   │   ├── streaming_asr.py            # StreamingASR wrapper
│   │   ├── cache_aware_modules.py      # Audio buffer + streaming ASR
│   │   ├── cache_aware_modules_config.py
│   │   ├── modules.py                  # ONNX model loading
│   │   ├── modules_config.py
│   │   ├── onnx_utils.py
│   │   └── utils.py                    # Audio utilities
│   ├── pipeline/
│   │   ├── orchestrator.py             # PipelineOrchestrator
│   │   └── config.py                   # PipelineConfig
│   └── server/
│       └── websocket_server.py         # WebSocket server
└── clients/
    └── python_client.py                # Python CLI client

Model origin: https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b (January2026-branch)

ONNX reference: https://github.com/istupakov/onnx-asr

By: Patrick Lumbantobing
Copyright © VertoX-AI