License: CC BY-NC 4.0. This repository is gated on Hugging Face: you must review and accept the license terms before accessing its files, and may only use the model under those terms. Access requests are processed immediately.

ONNX cache-aware streaming ASR Nemo (Conformer-RNNT) [EN-1.12s]

  • Device: CPU

  • Language: English

  • Latency: 1120 ms ((1 current + 13 future-context chunks) × 8 frames per chunk × 10 ms per frame)
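The quoted latency follows directly from the chunking parameters in the bullet above; a quick sanity check:

```python
# Latency sanity check for the cache-aware streaming setup above.
frame_ms = 10          # 1 frame = 10 ms
frames_per_chunk = 8   # 1 chunk = 8 frames
current_chunks = 1     # the chunk currently being decoded
future_chunks = 13     # future (right-context) chunks the model waits for

latency_ms = (current_chunks + future_chunks) * frames_per_chunk * frame_ms
print(latency_ms)  # 1120
```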

Streaming Speech Transcription Pipeline

Real-time English speech transcription: Audio In → ASR → Transcription

Transcribe spoken English into text with streaming input over WebSocket.

Input is English-only for now (a limitation of the NeMo ASR model).

Architecture

Audio Input → ASR (ONNX) → Transcript Output
  (PCM16)   Conformer RNN-T

See ARCHITECTURE.md for detailed design documentation.
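The server ingests raw PCM16 audio, while ONNX acoustic models typically expect float samples normalized to [-1, 1]. A minimal stdlib-only sketch of that conversion (the function name is illustrative, not taken from this repo, and native little-endian byte order is assumed):

```python
import array

def pcm16_to_float(pcm_bytes: bytes) -> list[float]:
    """Convert little-endian PCM16 bytes to floats in [-1.0, 1.0]."""
    samples = array.array("h")   # signed 16-bit, native (assumed little-endian) order
    samples.frombytes(pcm_bytes)
    return [s / 32768.0 for s in samples]

# 10 ms of silence at 16 kHz = 160 samples = 320 bytes
chunk = bytes(320)
print(len(pcm16_to_float(chunk)))  # 160
```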

Requirements

  • Python 3.10+
  • Model files:
    • ASR: NeMo Conformer RNN-T ONNX model directory

Installation

pip install -r requirements.txt

System Dependencies

# Ubuntu/Debian
apt-get install libsndfile1 libportaudio2

Usage

Start the Server

  • A CPU with at least 4 cores is recommended, e.g., AWS c5a.xlarge or m8a.xlarge.
python app.py \
  --asr-onnx-path models/ \
  --host 0.0.0.0 \
  --port 8765

CLI Options

Flag               Default     Description
--asr-onnx-path    (required)  ASR ONNX model directory
--asr-chunk-ms     10          ASR audio chunk duration (ms)
--asr-sample-rate  16000       ASR expected sample rate (Hz)
--audio-queue-max  256         Audio input queue max size
--host             0.0.0.0     Server bind host
--port             8765        Server port
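For example, overriding the defaults to run with a larger audio queue on a non-default port (the model path is illustrative):

```shell
python app.py \
  --asr-onnx-path models/ \
  --asr-sample-rate 16000 \
  --audio-queue-max 512 \
  --port 9000
```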

Python Client

Captures microphone audio and prints the text transcription.

pip install -r requirements_client.txt
python clients/python_client.py --uri ws://localhost:8765

Web Client

TBD

WebSocket Protocol

Direction  Type    Format  Description
Client→    Binary  PCM16   Raw audio at the declared sample rate
Client→    Text    JSON    {"action": "start", "sample_rate": 16000}
Client→    Text    JSON    {"action": "stop"}
→Client    Binary  PCM16   Synthesized audio at 24 kHz
→Client    Text    JSON    {"type": "transcript", "text": "..."}
→Client    Text    JSON    {"type": "status", "status": "started"}
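Assuming the message shapes above, the client side of the protocol can be sketched with the stdlib alone (helper names are illustrative; this is not the repo's client code, and actually sending the messages over WebSocket is left to a library such as websockets):

```python
import json

def start_msg(sample_rate: int = 16000) -> str:
    """Text frame that opens a transcription session."""
    return json.dumps({"action": "start", "sample_rate": sample_rate})

def stop_msg() -> str:
    """Text frame that ends the session."""
    return json.dumps({"action": "stop"})

def audio_frame(samples_pcm16: bytes) -> bytes:
    """Binary frames carry raw little-endian PCM16 at the declared rate."""
    return samples_pcm16

print(start_msg())  # {"action": "start", "sample_rate": 16000}
```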

Project Structure

nemo-asr-cache-aware-streaming-1120ms-en-onnx/
├── app.py                              # Main entry point
├── requirements.txt
├── README.md
├── ARCHITECTURE.md
├── models/
│   ├── (ONNX model files)
│   ├── config.json
│   └── vocab.txt
├── src/
│   ├── asr/
│   │   ├── streaming_asr.py            # StreamingASR wrapper
│   │   ├── cache_aware_modules.py      # Audio buffer + streaming ASR
│   │   ├── cache_aware_modules_config.py
│   │   ├── modules.py                  # ONNX model loading
│   │   ├── modules_config.py
│   │   ├── onnx_utils.py
│   │   └── utils.py                    # Audio utilities
│   ├── pipeline/
│   │   ├── orchestrator.py             # PipelineOrchestrator
│   │   └── config.py                   # PipelineConfig
│   └── server/
│       └── websocket_server.py         # WebSocket server
└── clients/
    └── python_client.py                # Python CLI client

Model origin: https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b (January2026-branch)

ONNX reference: https://github.com/istupakov/onnx-asr

By: Patrick Lumbantobing
Copyright © VertoX-AI