Nepali Automatic Speech Recognition (ONNX CTC)

This repository contains a Nepali ASR model converted to ONNX and tuned for noisy conditions. The base architecture is from: https://huggingface.co/ai4bharat/indicconformer_stt_ne_hybrid_ctc_rnnt_large

Model summary:

  • Parameters: ~129M
  • Decoder: CTC (greedy decoding)
  • Variants:
    • model_ctc.onnx (full-precision model, 420.22 MB)
    • model_ctc_quantized.onnx (INT8 quantized, 133.94 MB)

Files

  • model_ctc.onnx: full-size ONNX model
  • model_ctc_quantized.onnx: quantized ONNX model
  • model_config.yaml: NeMo config (preprocessor + vocabulary)
  • local_onnx_asr_inference.ipynb: notebook for local testing
  • instruction.txt: post-upload guide for Hugging Face Spaces

Load and Use the Model

Install dependencies:

pip install onnxruntime soundfile scipy numpy pyyaml omegaconf torch "nemo_toolkit[asr]"

Example inference script:

import numpy as np
import onnxruntime as ort
import soundfile as sf
import torch
import yaml
from omegaconf import OmegaConf
from scipy.signal import resample_poly
from nemo.collections.asr.modules import AudioToMelSpectrogramPreprocessor

ONNX_PATH = "model_ctc_quantized.onnx"  # or "model_ctc.onnx"
CONFIG_PATH = "model_config.yaml"
AUDIO_PATH = "sample.wav"

# Load config
try:
    conf = OmegaConf.load(CONFIG_PATH)
except Exception:
    with open(CONFIG_PATH, "r", encoding="utf-8") as f:
        conf = OmegaConf.create(yaml.safe_load(f))

preprocessor_cfg = OmegaConf.to_container(conf.preprocessor, resolve=True)
preprocessor_cfg.pop("_target_", None)
preprocessor = AudioToMelSpectrogramPreprocessor(**preprocessor_cfg)
preprocessor.eval()
SAMPLE_RATE = preprocessor_cfg["sample_rate"]

vocabulary = (
    conf.get("aux_ctc", {}).get("decoder", {}).get("vocabulary", None)
    or conf.get("decoder", {}).get("vocabulary", None)
)

session = ort.InferenceSession(ONNX_PATH, providers=["CPUExecutionProvider"])
session_ins = session.get_inputs()
main_input = next((x for x in session_ins if "length" not in x.name.lower()), session_ins[0])
length_input = next((x for x in session_ins if "length" in x.name.lower()), None)

def _length_dtype(meta):
    return np.int32 if meta and "int32" in meta.type else np.int64

def decode_ctc(logits, encoded_len, vocab):
    greedy = logits[0].argmax(axis=-1)[: int(encoded_len[0])]
    blank_id = logits.shape[-1] - 1
    collapsed, prev = [], None
    for t in greedy:
        t = int(t)
        if t == prev or t == blank_id:
            prev = t
            continue
        collapsed.append(t)
        prev = t

    if not vocab:
        return str(collapsed)

    text = ""
    for i in collapsed:
        if 0 <= i < len(vocab):
            tok = vocab[i]
            if tok.startswith("##"):
                text += tok[2:]
            elif tok.startswith("▁"):
                text += " " + tok[1:]
            else:
                text += tok
    return text.strip().replace("▁", " ")

def transcribe(audio_path: str) -> str:
    audio, sr = sf.read(audio_path)
    if audio.ndim == 2:
        audio = audio.mean(axis=1)
    if sr != SAMPLE_RATE:
        audio = resample_poly(audio, SAMPLE_RATE, sr)

    audio = np.clip(audio, -1.0, 1.0).astype(np.float32)
    audio_len = np.array([audio.shape[0]], dtype=np.int64)

    ort_inputs = {}
    if len(main_input.shape) == 2:
        ort_inputs[main_input.name] = audio[None, :]
        if length_input is not None:
            ort_inputs[length_input.name] = audio_len.astype(_length_dtype(length_input))
    elif len(main_input.shape) == 3:
        with torch.no_grad():
            mel, mel_len = preprocessor(
                input_signal=torch.from_numpy(audio[None, :]),
                length=torch.from_numpy(audio_len),
            )
        ort_inputs[main_input.name] = mel.numpy().astype(np.float32)
        if length_input is not None:
            ort_inputs[length_input.name] = mel_len.numpy().astype(_length_dtype(length_input))

    outputs = session.run(None, ort_inputs)
    logits = next((x for x in outputs if getattr(x, "ndim", 0) == 3), None)
    encoded_len = next((x for x in outputs if getattr(x, "ndim", 0) == 1), None)
    if encoded_len is None:
        encoded_len = np.array([logits.shape[1]], dtype=np.int64)

    return decode_ctc(logits, encoded_len, vocabulary)

print(transcribe(AUDIO_PATH))
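
The decode_ctc routine above applies the standard CTC greedy rule: take the argmax token per frame, merge consecutive repeats, then drop blanks. A minimal standalone illustration of that collapse step, using toy logits and a hypothetical 4-symbol alphabet (ids 0–2 plus blank id 3):

```python
import numpy as np

def ctc_greedy_collapse(logits, blank_id):
    """Collapse a greedy CTC path: merge consecutive repeats, then drop blanks."""
    path = logits.argmax(axis=-1)
    out, prev = [], None
    for t in path:
        t = int(t)
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out

# Toy logits over 6 time steps; the greedy path is [0, 0, 3, 1, 1, 2].
# Merging repeats and removing the blank (id 3) yields [0, 1, 2].
logits = np.array([
    [9, 0, 0, 0],
    [9, 0, 0, 0],
    [0, 0, 0, 9],
    [0, 9, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 9, 0],
], dtype=np.float32)

print(ctc_greedy_collapse(logits, blank_id=3))  # [0, 1, 2]
```

Note that a blank between two identical tokens keeps them separate (e.g. the path [0, 3, 0] decodes to [0, 0]), which is why repeats are merged before blanks are removed.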

Use Directly from Hugging Face Hub

from huggingface_hub import hf_hub_download
 
repo_id = "gam30/nepali-automatic-speech-recognition"
onnx_path = hf_hub_download(repo_id=repo_id, filename="model_ctc_quantized.onnx")
config_path = hf_hub_download(repo_id=repo_id, filename="model_config.yaml")

Then run inference with the same code above, replacing ONNX_PATH and CONFIG_PATH.

Notes

  • The notebook local_onnx_asr_inference.ipynb is the reference test workflow.
  • For better quality, use clean 16 kHz mono audio where possible.
  • The quantized model is smaller and faster; the full-precision model is slightly more accurate.
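
If your audio is not already 16 kHz mono, convert it before transcription. A small sketch of the same preprocessing the script above performs (stereo downmix plus polyphase resampling via scipy.signal.resample_poly), assuming a 44.1 kHz input:

```python
import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # expected model sample rate

def to_16k_mono(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix (samples, channels) audio to mono and resample to 16 kHz."""
    if audio.ndim == 2:
        audio = audio.mean(axis=1)          # average channels -> mono
    if sr != TARGET_SR:
        audio = resample_poly(audio, TARGET_SR, sr)  # polyphase resampling
    return np.clip(audio, -1.0, 1.0).astype(np.float32)

# One second of 44.1 kHz audio becomes one second (16 000 samples) at 16 kHz.
x = np.random.randn(44_100).astype(np.float32)
y = to_16k_mono(x, 44_100)
print(y.shape)  # (16000,)
```

resample_poly is preferable to naive decimation here because it low-pass filters before downsampling, avoiding aliasing artifacts that degrade ASR quality.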

Citation

If you use this model in your research, project, or application, please cite it as follows:
  APA:
gam30. (2025). Nepali Automatic Speech Recognition (ONNX CTC) [Model]. Hugging Face. https://huggingface.co/gam30/nepali-automatic-speech-recognition
 
BibTeX:
@misc{gam30_nepali_asr,
  author       = {sangam},
  title        = {Nepali Automatic Speech Recognition (ONNX CTC)},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {https://huggingface.co/gam30/nepali-automatic-speech-recognition}
}
Note: This model is based on the architecture of ai4bharat/indicconformer_stt_ne_hybrid_ctc_rnnt_large. When citing, please also acknowledge the original base model authors.

License

If you use or redistribute this model, you must credit gam30/nepali-automatic-speech-recognition as the source and also credit AI4Bharat for the base model.
