---
language:
- kk
- ru
- uz
- en
license: cc-by-nc-4.0
tags:
- automatic-speech-recognition
- nemo
- fastconformer
- ctc
- multilingual
library_name: nemo
pipeline_tag: automatic-speech-recognition
---

# FastConformer Multilingual ASR

A multilingual automatic speech recognition model supporting **Kazakh, Russian, Uzbek, and English**, built on NVIDIA NeMo's FastConformer-CTC architecture.

## Languages

| Language | Code |
|----------|------|
| Kazakh | `kk` |
| Russian | `ru` |
| Uzbek (Latin) | `uz` |
| English | `en` |

## Results

Full test set (76,739 samples):

| Language | Samples | CER | WER |
|----------|---------|-----|-----|
| Russian | 10,203 | 2.34% | 9.16% |
| Kazakh | 33,964 | 8.27% | 14.09% |
| Uzbek | 16,184 | 7.10% | 28.82% |
| English | 16,388 | 9.53% | 22.29% |
| **Overall** | **76,739** | **7.73%** | **16.86%** |

## Usage

### Requirements

```bash
pip install "nemo_toolkit[asr]"
```

### Transcribe Audio Files

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.restore_from("fastconformer_multilingual.nemo")
model.freeze()

transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0])
```

### Real-Time Streaming Transcription

```python
import os
import queue
import tempfile

import numpy as np
import sounddevice as sd
import soundfile as sf
import nemo.collections.asr as nemo_asr

SAMPLE_RATE = 16000
CHUNK_SEC = 3
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SEC

model = nemo_asr.models.ASRModel.restore_from("fastconformer_multilingual.nemo")
model.freeze()

audio_queue = queue.Queue()

def audio_callback(indata, frames, time, status):
    # Runs on the audio thread: push the mono channel into the queue
    audio_queue.put(indata[:, 0].copy())

def transcribe_stream():
    buffer = np.array([], dtype=np.float32)
    with sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=1,
        callback=audio_callback,
        blocksize=SAMPLE_RATE,
    ):
        print("Listening... (Ctrl+C to stop)")
        while True:
            chunk = audio_queue.get()
            buffer = np.concatenate([buffer, chunk])
            if len(buffer) >= CHUNK_SAMPLES:
                # Write the current chunk to a temporary WAV file for transcription
                tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
                sf.write(tmp.name, buffer[:CHUNK_SAMPLES], SAMPLE_RATE)
                result = model.transcribe([tmp.name])
                # NeMo may return plain strings or Hypothesis objects
                text = result[0] if isinstance(result[0], str) else result[0].text
                if text.strip():
                    print(f"> {text}")
                os.unlink(tmp.name)
                buffer = buffer[CHUNK_SAMPLES:]

if __name__ == "__main__":
    transcribe_stream()
```

### Batch Transcription

```python
from pathlib import Path

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.restore_from("fastconformer_multilingual.nemo")
model.freeze()

audio_files = list(Path("audio_dir").glob("*.wav"))
transcriptions = model.transcribe([str(f) for f in audio_files], batch_size=32)

for path, text in zip(audio_files, transcriptions):
    # NeMo may return plain strings or Hypothesis objects
    t = text if isinstance(text, str) else text.text
    print(f"{path.name}: {t}")
```

## Model Details

- **Architecture**: FastConformer-CTC
- **Framework**: NVIDIA NeMo
- **Audio**: 16 kHz mono WAV

## Limitations

- Optimized for clear speech; performance may degrade on noisy audio
- No punctuation or capitalization in output
- Language is auto-detected, not explicitly specified

## License

This model is released under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). Non-commercial use only.
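
## Preparing Audio

The model expects 16 kHz mono WAV input, so other recordings need conversion first. Below is a minimal, NumPy-only sketch (downmix by channel averaging, then linear-interpolation resampling); the function name `to_mono_16k` is illustrative, and a dedicated resampler such as `librosa` or `ffmpeg` will give better quality in practice:

```python
import numpy as np

TARGET_RATE = 16000

def to_mono_16k(audio: np.ndarray, source_rate: int) -> np.ndarray:
    """Downmix (samples, channels) or (samples,) audio to mono at 16 kHz."""
    if audio.ndim == 2:
        # Average channels to produce a mono signal
        audio = audio.mean(axis=1)
    if source_rate == TARGET_RATE:
        return audio.astype(np.float32)
    duration = len(audio) / source_rate
    n_out = int(round(duration * TARGET_RATE))
    # Linearly interpolate the signal onto the 16 kHz time grid
    src_t = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    dst_t = np.linspace(0.0, duration, num=n_out, endpoint=False)
    return np.interp(dst_t, src_t, audio).astype(np.float32)

# Example: one second of 44.1 kHz stereo becomes 16,000 mono samples
stereo = np.random.randn(44100, 2).astype(np.float32)
mono_16k = to_mono_16k(stereo, 44100)
print(mono_16k.shape)  # (16000,)
```

The converted array can then be written with `soundfile.write(path, mono_16k, 16000)` and passed to `model.transcribe`.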