---
license: mit
language: ru
library_name: openvino
tags:
- speech-recognition
- russian
- openvino
- rnn-t
- conformer
- gigaam
base_model: ai-sage/GigaAM-v3
pipeline_tag: automatic-speech-recognition
---

# GigaAM-v3 e2e_rnnt (OpenVINO IR, pre-converted)

OpenVINO IR port of [ai-sage/GigaAM-v3](https://huggingface.co/ai-sage/GigaAM-v3), revision `e2e_rnnt`: Sber's SOTA Russian ASR model (220M parameters, Conformer + RNN-T with end-to-end punctuation and capitalization).

The conversion was done with:

```python
import openvino as ov
import torch
from transformers import AutoModel

# Step 1: export the encoder, decoder, and joint networks to ONNX.
model = AutoModel.from_pretrained("ai-sage/GigaAM-v3", revision="e2e_rnnt", trust_remote_code=True)
model.to_onnx(dir_path="onnx", dtype=torch.float16)

# Step 2: convert each ONNX file to OpenVINO IR (.xml + .bin).
for f in ["encoder", "decoder", "joint"]:
    m = ov.convert_model(f"onnx/v3_e2e_rnnt_{f}.onnx")
    ov.save_model(m, f"v3_e2e_rnnt_{f}.xml")
```

## Files

| File | Purpose | Size |
|------|---------|------|
| `v3_e2e_rnnt_encoder.xml/.bin` | Conformer encoder (main cost) | ~425 MB FP16 |
| `v3_e2e_rnnt_decoder.xml/.bin` | RNN-T decoder (prediction network) | ~2 MB |
| `v3_e2e_rnnt_joint.xml/.bin` | Joint network | ~1.3 MB |
| `tokenizer.model` | SentencePiece vocabulary (1024 subwords) | 250 KB |
| `config.json` | Original model config (for reference) | 2 KB |

## Device compatibility (Intel hardware)

Verified on Intel Core Ultra 9 285H (OpenVINO 2025.4.1):

| Device | Encoder | Decoder | Joint | Usable? |
|--------|---------|---------|-------|---------|
| CPU | ✅ | ✅ | ✅ | Yes (~34× RTFx on a 10 s chunk) |
| GPU.0 (Arc Xe2 iGPU) | ✅ | ✅ | ✅ | **Yes (~520× RTFx on encoder alone)** |
| NPU | ❌ (dynamic shapes) | ✅ | ❌ (dynamic shapes) | Partial only |

**Recommended device: Intel Arc iGPU (GPU.0).** It is the fastest option and does not compete with an NVIDIA GPU for VRAM.

The NPU fails to compile the encoder and joint networks because the exported ONNX has dynamic input shapes (upper bounds `9223372036854775807`). A re-export with a static reshape at 10 s chunks would likely unlock the NPU.
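The static reshape suggested above might look like the following sketch. It has not been verified on the NPU. The helper names (`num_frames`, `reshape_encoder_to_static`) are hypothetical, and two assumptions are baked in: that the STFT size equals the 20 ms window, and that any dynamic axis 0 of an encoder input is the batch while any other dynamic axis is time. Check `model.inputs` on the real IR before relying on either.

```python
def num_frames(duration_s, sample_rate=16000, win_ms=20, hop_ms=10, center=False):
    """Number of feature frames for a clip, using the window/hop from the Usage notes.

    Assumption: n_fft equals the window length. Whether the upstream
    preprocessor centers (pads) the signal is not stated here, so it is a flag.
    """
    samples = int(round(duration_s * sample_rate))
    win = sample_rate * win_ms // 1000
    hop = sample_rate * hop_ms // 1000
    if center:
        # A centered STFT pads half a window on each side of the signal.
        return 1 + samples // hop
    return 1 + (samples - win) // hop


def reshape_encoder_to_static(xml_in, xml_out, frames, batch=1):
    """Replace every dynamic input dimension with a fixed size and save a new IR."""
    # Imported lazily so that num_frames() is usable without OpenVINO installed.
    import openvino as ov

    model = ov.Core().read_model(xml_in)
    new_shapes = {}
    for port in model.inputs:
        pshape = port.get_partial_shape()
        dims = []
        for i in range(pshape.rank.get_length()):
            dim = pshape[i]
            if dim.is_static:
                dims.append(dim.get_length())
            else:
                # Assumption: dynamic axis 0 is batch, other dynamic axes are time.
                dims.append(batch if i == 0 else frames)
        new_shapes[port.get_any_name()] = dims
    model.reshape(new_shapes)
    ov.save_model(model, xml_out)
```

A call such as `reshape_encoder_to_static("v3_e2e_rnnt_encoder.xml", "v3_e2e_rnnt_encoder_static.xml", num_frames(10.0))` would write a fixed-shape copy (the output filename is illustrative). Audio shorter than the chunk would then need padding to the fixed length, and the joint network would need its own static shapes with a different axis mapping.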

## Usage (Python, pure OpenVINO)

```python
import openvino as ov

core = ov.Core()
encoder = core.compile_model("v3_e2e_rnnt_encoder.xml", "GPU.0")
decoder = core.compile_model("v3_e2e_rnnt_decoder.xml", "GPU.0")
joint = core.compile_model("v3_e2e_rnnt_joint.xml", "GPU.0")

# Preprocess: audio 16 kHz mono -> log-mel (64 bins, 20 ms win, 10 ms hop)
# Encoder: features -> encoder outputs
# Decoder + Joint: RNN-T greedy decode loop -> token IDs
# SentencePieceProcessor(tokenizer.model).decode(ids) -> text
```

A reference Python backend is available in the [Voice Scribe](https://github.com/andrewsabn/voice-scribe) project (MIT license).

## Credits

- Original model: [Sber / ai-sage/GigaAM-v3](https://huggingface.co/ai-sage/GigaAM-v3) (MIT)
- OpenVINO conversion: [Voice Scribe project](https://github.com/andrewsabn/voice-scribe)

## License

MIT (matches upstream ai-sage/GigaAM-v3).
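
## Appendix: greedy decode sketch

The "RNN-T greedy decode loop" step in the Usage comments can be sketched generically as below. This is the textbook greedy algorithm, not necessarily the loop used upstream or in Voice Scribe. The callables `decoder_step` and `joint_step` are hypothetical wrappers around the compiled decoder and joint models. Priming the prediction network with the blank id is an assumption, and so is the per-frame symbol cap. The real `blank_id` has to be taken from the model config.

```python
from typing import Any, Callable, List, Sequence, Tuple


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def rnnt_greedy_decode(
    enc_frames: Sequence[Any],
    decoder_step: Callable[[int, Any], Tuple[Any, Any]],
    joint_step: Callable[[Any, Any], Sequence[float]],
    blank_id: int,
    max_symbols_per_frame: int = 10,
) -> List[int]:
    """Greedy RNN-T decoding over a sequence of encoder frames.

    decoder_step(token, state) -> (decoder_output, new_state)
    joint_step(encoder_frame, decoder_output) -> logits over vocab + blank
    """
    tokens: List[int] = []
    # Assumption: the prediction network starts from the blank id and no state.
    dec_out, state = decoder_step(blank_id, None)
    for frame in enc_frames:
        # Emit symbols for this frame until blank wins or the cap is reached.
        for _ in range(max_symbols_per_frame):
            token = argmax(joint_step(frame, dec_out))
            if token == blank_id:
                break  # blank: move on to the next encoder frame
            tokens.append(token)
            # The prediction network advances only on non-blank tokens.
            dec_out, state = decoder_step(token, state)
    return tokens
```

The returned ids can then be passed to `sentencepiece.SentencePieceProcessor(model_file="tokenizer.model").decode(ids)` to obtain text, as in the last Usage comment.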