GigaAM: Efficient Self-Supervised Learner for Speech Recognition
Paper • 2506.01192 • Published
GigaAM-v3 is a Conformer-based foundation model with 220–240M parameters, pretrained on diverse Russian speech data using the HuBERT-CTC objective. It is the third generation of the GigaAM family and provides state-of-the-art performance on Russian ASR across a wide range of domains.
Sherpa-ONNX Compatibility: These models have been modified for use with the sherpa-onnx runtime. The ONNX files include embedded metadata and have been adapted to meet the inference engine requirements.
GigaAM-v3-sherpa-onnx includes the following model variants:
- `ctc` – ASR model fine-tuned with a CTC decoder
- `rnnt` – ASR model fine-tuned with an RNN-T decoder
- `e2e_ctc` – end-to-end CTC model with punctuation and text normalization
- `e2e_rnnt` – end-to-end RNN-T model with punctuation and text normalization

To start the offline websocket server with the `e2e_ctc` model:

```bash
./sherpa-onnx-offline-websocket-server \
  --nemo-ctc-model=gigaam_v3_e2e_ctc.onnx \
  --tokens=gigaam_v3_e2e_ctc_tokens.txt \
  --port=6006
```
For the `e2e_rnnt` model, pass the encoder, decoder, and joiner separately:

```bash
./sherpa-onnx-offline-websocket-server \
  --encoder=gigaam_v3_e2e_rnnt_encoder.onnx \
  --decoder=gigaam_v3_e2e_rnnt_decoder.onnx \
  --joiner=gigaam_v3_e2e_rnnt_joint.onnx \
  --tokens=gigaam_v3_e2e_rnnt_tokens.txt \
  --port=6006
```
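The Python client below expects a 16 kHz mono 16-bit WAV named `example.wav`. If you just want to test the wiring, a minimal placeholder file can be generated like this (a sketch using only the standard `wave` module and NumPy; the file is silent, so the transcript will be empty — substitute real speech for meaningful output):

```python
import wave

import numpy as np

# Write a 1-second, 16 kHz, mono, 16-bit PCM WAV of silence.
# Any 16 kHz mono PCM WAV with real speech works the same way.
sample_rate = 16000
samples = np.zeros(sample_rate, dtype=np.int16)

with wave.open("example.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    f.setframerate(sample_rate)
    f.writeframes(samples.tobytes())
```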
A minimal Python websocket client for the server above:

```python
import asyncio
import wave

import numpy
import websockets

SHERPA_ONNX_SERVER = "127.0.0.1:6006"


async def transcribe_audio(wave_filename: str):
    # Read 16-bit PCM samples and scale to float32 in [-1, 1)
    with wave.open(wave_filename, "rb") as f:
        samples_int16 = numpy.frombuffer(f.readframes(-1), dtype=numpy.int16)
    samples_float32 = samples_int16.astype(numpy.float32) / 32768.0

    async with websockets.connect(f"ws://{SHERPA_ONNX_SERVER}") as websocket:
        # Header: 4-byte little-endian sample rate,
        # then 4-byte little-endian payload size in bytes
        buf = (16000).to_bytes(4, "little")
        buf += samples_float32.nbytes.to_bytes(4, "little")
        buf += samples_float32.tobytes()

        # Send the header and audio in 10 KiB chunks
        payload_len = 10240
        for i in range(0, len(buf), payload_len):
            await websocket.send(buf[i : i + payload_len])

        # Receive the transcript, then tell the server we are done
        result = await websocket.recv()
        await websocket.send("Done")

    return result


answer = asyncio.run(transcribe_audio("example.wav"))
print(answer)
```
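The wire format used by the client above is a simple length-prefixed binary message. As a standalone sanity check (a sketch independent of any running server), the framing can be built and decoded like this:

```python
import numpy as np

# Rebuild the framing the client uses: 4-byte little-endian sample
# rate, 4-byte little-endian payload size, then raw float32 samples.
sample_rate = 16000
samples = np.zeros(160, dtype=np.float32)  # 10 ms of silence at 16 kHz

buf = sample_rate.to_bytes(4, "little")
buf += samples.nbytes.to_bytes(4, "little")
buf += samples.tobytes()

# The header is 8 bytes; each float32 sample occupies 4 bytes.
assert len(buf) == 8 + 4 * len(samples)
# Both header fields decode back to what was written.
assert int.from_bytes(buf[:4], "little") == sample_rate
assert int.from_bytes(buf[4:8], "little") == samples.nbytes
```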
License: MIT
Paper: GigaAM: Efficient Self-Supervised Learner for Speech Recognition (Interspeech 2025)