---
language:
  - kk
  - ru
  - uz
  - en
license: cc-by-nc-4.0
tags:
  - automatic-speech-recognition
  - nemo
  - fastconformer
  - ctc
  - multilingual
library_name: nemo
pipeline_tag: automatic-speech-recognition
---

# FastConformer Multilingual ASR

A multilingual automatic speech recognition model supporting Kazakh, Russian, Uzbek, and English. Built on NVIDIA NeMo's FastConformer-CTC architecture.

## Languages

| Language      | Code |
|---------------|------|
| Kazakh        | `kk` |
| Russian       | `ru` |
| Uzbek (Latin) | `uz` |
| English       | `en` |

## Results

Full test set (76,739 samples):

| Language    | Samples | CER   | WER    |
|-------------|--------:|------:|-------:|
| Russian     | 10,203  | 2.34% | 9.16%  |
| Kazakh      | 33,964  | 8.27% | 14.09% |
| Uzbek       | 16,184  | 7.10% | 28.82% |
| English     | 16,388  | 9.53% | 22.29% |
| **Overall** | 76,739  | 7.73% | 16.86% |
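
For reference, WER and CER are edit distances normalized by reference length, computed over words and characters respectively. A minimal sketch of the computation (not the exact evaluation script used for the numbers above):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via a rolling DP row."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # min of deletion, insertion, substitution (or match)
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]


def wer(ref, hyp):
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)


print(f"{wer('ол үйге барды', 'ол үйге барады'):.2%}")  # 33.33%
print(f"{cer('ол үйге барды', 'ол үйге барады'):.2%}")  # 7.69%
```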

## Usage

### Requirements

```shell
pip install "nemo_toolkit[asr]"
```

### Transcribe Audio Files

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.restore_from("fastconformer_multilingual.nemo")
model.freeze()

transcriptions = model.transcribe(["audio.wav"])
# Recent NeMo versions may return Hypothesis objects instead of plain strings
first = transcriptions[0]
print(first if isinstance(first, str) else first.text)
```

### Real-Time Streaming Transcription

```python
import os
import queue
import tempfile

import numpy as np
import sounddevice as sd
import soundfile as sf
import nemo.collections.asr as nemo_asr

SAMPLE_RATE = 16000
CHUNK_SEC = 3
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SEC

model = nemo_asr.models.ASRModel.restore_from("fastconformer_multilingual.nemo")
model.freeze()

audio_queue = queue.Queue()


def audio_callback(indata, frames, time, status):
    # Runs on the audio thread; keep it fast and just hand off the samples
    audio_queue.put(indata[:, 0].copy())


def transcribe_stream():
    buffer = np.array([], dtype=np.float32)
    with sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=1,
        callback=audio_callback,
        blocksize=SAMPLE_RATE,
    ):
        print("Listening... (Ctrl+C to stop)")
        while True:
            chunk = audio_queue.get()
            buffer = np.concatenate([buffer, chunk])
            if len(buffer) >= CHUNK_SAMPLES:
                # Write the chunk to a temp WAV; close the handle first so
                # sf.write can reopen the file by name (required on Windows)
                tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
                tmp.close()
                sf.write(tmp.name, buffer[:CHUNK_SAMPLES], SAMPLE_RATE)
                result = model.transcribe([tmp.name])
                # transcribe() may return strings or Hypothesis objects
                text = result[0] if isinstance(result[0], str) else result[0].text
                if text.strip():
                    print(f"> {text}")
                os.unlink(tmp.name)
                buffer = buffer[CHUNK_SAMPLES:]


if __name__ == "__main__":
    transcribe_stream()
```

Fixed 3-second chunks can split words at chunk boundaries; larger or overlapping chunks reduce this at the cost of latency.

### Batch Transcription

```python
import nemo.collections.asr as nemo_asr
from pathlib import Path

model = nemo_asr.models.ASRModel.restore_from("fastconformer_multilingual.nemo")
model.freeze()

audio_files = list(Path("audio_dir").glob("*.wav"))
transcriptions = model.transcribe([str(f) for f in audio_files], batch_size=32)

for path, text in zip(audio_files, transcriptions):
    # transcribe() may return strings or Hypothesis objects
    t = text if isinstance(text, str) else text.text
    print(f"{path.name}: {t}")
```
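
To keep batch results, one option is JSON Lines, the manifest-style format commonly used around NeMo tooling. The records below are illustrative, and `pred_text` is a conventional field name, not one required by this model:

```python
import json

# Hypothetical results collected from a transcription loop
results = [
    {"audio_filepath": "audio_dir/a.wav", "pred_text": "salem"},
    {"audio_filepath": "audio_dir/b.wav", "pred_text": "hello world"},
]

# One JSON object per line; keep non-ASCII characters readable
with open("transcriptions.jsonl", "w", encoding="utf-8") as f:
    for row in results:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```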

## Model Details

- **Architecture:** FastConformer-CTC
- **Framework:** NVIDIA NeMo
- **Audio:** 16 kHz mono WAV
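
Audio in other sample rates or channel layouts must be converted before transcription. A minimal sketch using NumPy linear interpolation for the resample (a real pipeline would prefer a dedicated resampler such as librosa or sox; the function name here is illustrative):

```python
import numpy as np


def to_model_format(audio, orig_sr, target_sr=16000):
    """Down-mix to mono and resample to target_sr (quick linear interpolation)."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim > 1:
        # Average the channels to get mono
        audio = audio.mean(axis=1)
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr
    n_target = int(round(duration * target_sr))
    t_orig = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    t_new = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(t_new, t_orig, audio).astype(np.float32)


stereo_44k = np.random.randn(44100, 2).astype(np.float32)  # 1 s of 44.1 kHz stereo
mono_16k = to_model_format(stereo_44k, 44100)
print(mono_16k.shape)  # (16000,)
```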

## Limitations

- Optimized for clear speech; performance may degrade on noisy audio
- No punctuation or capitalization in output
- Language is auto-detected; there is no way to force a specific language

## License

This model is released under CC BY-NC 4.0. Non-commercial use only.