# Nepali Automatic Speech Recognition (ONNX CTC)
This repository contains a Nepali ASR model converted to ONNX and tuned for noisy conditions. The base architecture is from: https://huggingface.co/ai4bharat/indicconformer_stt_ne_hybrid_ctc_rnnt_large
## Model summary

- Parameters: ~129M
- Decoder: CTC (greedy decoding)
- Variants:
  - `model_ctc.onnx` (full-precision model, 420.22 MB)
  - `model_ctc_quantized.onnx` (INT8 quantized, 133.94 MB)
## Files
- `model_ctc.onnx`: full-size ONNX model
- `model_ctc_quantized.onnx`: quantized ONNX model
- `model_config.yaml`: NeMo config (preprocessor + vocabulary)
- `local_onnx_asr_inference.ipynb`: notebook for local testing
- `instruction.txt`: post-upload guide for Hugging Face Spaces
## Load and Use the Model
Install dependencies:
```bash
pip install onnxruntime soundfile scipy numpy pyyaml omegaconf torch "nemo_toolkit[asr]"
```
Example inference script:
```python
import numpy as np
import onnxruntime as ort
import soundfile as sf
import torch
import yaml
from omegaconf import OmegaConf
from scipy.signal import resample_poly
from nemo.collections.asr.modules import AudioToMelSpectrogramPreprocessor

ONNX_PATH = "model_ctc_quantized.onnx"  # or "model_ctc.onnx"
CONFIG_PATH = "model_config.yaml"
AUDIO_PATH = "sample.wav"

# Load the NeMo config (preprocessor settings + vocabulary)
try:
    conf = OmegaConf.load(CONFIG_PATH)
except Exception:
    with open(CONFIG_PATH, "r", encoding="utf-8") as f:
        conf = OmegaConf.create(yaml.safe_load(f))

preprocessor_cfg = OmegaConf.to_container(conf.preprocessor, resolve=True)
preprocessor_cfg.pop("_target_", None)
preprocessor = AudioToMelSpectrogramPreprocessor(**preprocessor_cfg)
preprocessor.eval()
SAMPLE_RATE = preprocessor_cfg["sample_rate"]

vocabulary = (
    conf.get("aux_ctc", {}).get("decoder", {}).get("vocabulary", None)
    or conf.get("decoder", {}).get("vocabulary", None)
)

session = ort.InferenceSession(ONNX_PATH, providers=["CPUExecutionProvider"])
session_ins = session.get_inputs()
main_input = next((x for x in session_ins if "length" not in x.name.lower()), session_ins[0])
length_input = next((x for x in session_ins if "length" in x.name.lower()), None)


def _length_dtype(meta):
    return np.int32 if meta and "int32" in meta.type else np.int64


def decode_ctc(logits, encoded_len, vocab):
    # Greedy decode: argmax per frame, then collapse repeats and drop blanks
    greedy = logits[0].argmax(axis=-1)[: int(encoded_len[0])]
    blank_id = logits.shape[-1] - 1  # NeMo CTC exports place the blank token last
    collapsed, prev = [], None
    for t in greedy:
        t = int(t)
        if t == prev or t == blank_id:
            prev = t
            continue
        collapsed.append(t)
        prev = t
    if not vocab:
        return str(collapsed)
    text = ""
    for i in collapsed:
        if 0 <= i < len(vocab):
            tok = vocab[i]
            if tok.startswith("##"):
                text += tok[2:]
            elif tok.startswith("▁"):  # SentencePiece word-boundary marker
                text += " " + tok[1:]
            else:
                text += tok
    return text.strip().replace("▁", " ")


def transcribe(audio_path: str) -> str:
    audio, sr = sf.read(audio_path)
    if audio.ndim == 2:  # downmix stereo to mono
        audio = audio.mean(axis=1)
    if sr != SAMPLE_RATE:
        audio = resample_poly(audio, SAMPLE_RATE, sr)
    audio = np.clip(audio, -1.0, 1.0).astype(np.float32)
    audio_len = np.array([audio.shape[0]], dtype=np.int64)

    ort_inputs = {}
    if len(main_input.shape) == 2:
        # Model expects raw audio: (batch, samples)
        ort_inputs[main_input.name] = audio[None, :]
        if length_input is not None:
            ort_inputs[length_input.name] = audio_len.astype(_length_dtype(length_input))
    elif len(main_input.shape) == 3:
        # Model expects mel features: compute them with the NeMo preprocessor
        with torch.no_grad():
            mel, mel_len = preprocessor(
                input_signal=torch.from_numpy(audio[None, :]),
                length=torch.from_numpy(audio_len),
            )
        ort_inputs[main_input.name] = mel.numpy().astype(np.float32)
        if length_input is not None:
            ort_inputs[length_input.name] = mel_len.numpy().astype(_length_dtype(length_input))

    outputs = session.run(None, ort_inputs)
    logits = next((x for x in outputs if getattr(x, "ndim", 0) == 3), None)
    encoded_len = next((x for x in outputs if getattr(x, "ndim", 0) == 1), None)
    if encoded_len is None:
        encoded_len = np.array([logits.shape[1]], dtype=np.int64)
    return decode_ctc(logits, encoded_len, vocabulary)


print(transcribe(AUDIO_PATH))
```
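The greedy CTC decoding step can be exercised in isolation. A minimal sketch with a toy two-token vocabulary (the logit values and tokens below are made up for illustration):

```python
import numpy as np


def ctc_greedy(logits, vocab, blank_id=None):
    """Collapse repeated frame predictions, then drop blanks (CTC greedy decode)."""
    if blank_id is None:
        blank_id = logits.shape[-1] - 1  # blank assumed last, as in the script above
    ids = logits.argmax(axis=-1)
    out, prev = [], None
    for t in ids:
        t = int(t)
        if t != prev and t != blank_id:
            out.append(vocab[t])
        prev = t
    return "".join(out).replace("▁", " ").strip()


# Toy example: 6 frames over the vocabulary [▁क, ा] plus a final blank slot
vocab = ["▁क", "ा"]
logits = np.array([
    [5.0, 0.0, 0.0],  # ▁क
    [5.0, 0.0, 0.0],  # ▁क (repeat, collapsed away)
    [0.0, 0.0, 5.0],  # blank
    [0.0, 5.0, 0.0],  # ा
    [0.0, 0.0, 5.0],  # blank
    [0.0, 0.0, 5.0],  # blank
])
print(ctc_greedy(logits, vocab))  # → का
```

Note how the blank separates genuinely repeated tokens: without it, two identical consecutive characters would collapse into one.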
## Use Directly from Hugging Face Hub
```python
from huggingface_hub import hf_hub_download

repo_id = "gam30/nepali-automatic-speech-recognition"
onnx_path = hf_hub_download(repo_id=repo_id, filename="model_ctc_quantized.onnx")
config_path = hf_hub_download(repo_id=repo_id, filename="model_config.yaml")
```
Then run inference with the same code above, pointing `ONNX_PATH` and `CONFIG_PATH` at the downloaded files.
## Notes
- The notebook `local_onnx_asr_inference.ipynb` is the reference test workflow.
- For best quality, use clean 16 kHz mono audio where possible.
- The quantized model is smaller and faster; the full-precision model is only marginally more accurate.
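If your recordings are not already at 16 kHz, they can be resampled the same way the inference script does, with `scipy.signal.resample_poly`. A small standalone sketch (the sample rates and test tone are illustrative):

```python
import numpy as np
from scipy.signal import resample_poly

SRC_SR, DST_SR = 44100, 16000  # e.g. CD-quality input down to the model's 16 kHz

# One second of a 440 Hz sine tone at the source rate, as a stand-in for speech
t = np.arange(SRC_SR) / SRC_SR
audio = 0.5 * np.sin(2 * np.pi * 440 * t).astype(np.float32)

# Polyphase resampling: resample_poly reduces the 16000/44100 ratio internally
resampled = resample_poly(audio, DST_SR, SRC_SR).astype(np.float32)
print(len(resampled))  # one second of audio at 16 kHz → 16000 samples
```

Polyphase resampling includes an anti-aliasing filter, so it is generally preferable to naive decimation for speech input.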
## Citation
If you use this model in your research, project, or application, please cite it as follows:
APA:
gam30. (2025). Nepali Automatic Speech Recognition (ONNX CTC) [Model]. Hugging Face. https://huggingface.co/gam30/nepali-automatic-speech-recognition
BibTeX:
```bibtex
@misc{gam30_nepali_asr,
  author       = {sangam},
  title        = {Nepali Automatic Speech Recognition (ONNX CTC)},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {https://huggingface.co/gam30/nepali-automatic-speech-recognition}
}
```
> Please note: This model is based on the architecture of `ai4bharat/indicconformer_stt_ne_hybrid_ctc_rnnt_large`. When citing, please also acknowledge the original base model authors.
## License
If you use or redistribute this model, you must credit `gam30/nepali-automatic-speech-recognition` as the source, as well as AI4Bharat for the base architecture.