---
license: mit
language: ru
library_name: openvino
tags:
- speech-recognition
- russian
- openvino
- rnn-t
- conformer
- gigaam
base_model: ai-sage/GigaAM-v3
pipeline_tag: automatic-speech-recognition
---

# GigaAM-v3 e2e_rnnt (OpenVINO IR, pre-converted)

OpenVINO IR port of [ai-sage/GigaAM-v3](https://huggingface.co/ai-sage/GigaAM-v3), revision `e2e_rnnt`: Sber's SOTA Russian ASR model (220M parameters, Conformer + RNN-T with end-to-end punctuation and capitalization).

The conversion was done with:

```python
import torch
from transformers import AutoModel

# Step 1: export the e2e_rnnt revision to ONNX in FP16.
model = AutoModel.from_pretrained("ai-sage/GigaAM-v3", revision="e2e_rnnt", trust_remote_code=True)
model.to_onnx(dir_path="onnx", dtype=torch.float16)

# Step 2: convert each ONNX graph to OpenVINO IR (.xml + .bin).
import openvino as ov

for f in ["encoder", "decoder", "joint"]:
    m = ov.convert_model(f"onnx/v3_e2e_rnnt_{f}.onnx")
    ov.save_model(m, f"v3_e2e_rnnt_{f}.xml")
```

## Files

| File | Purpose | Size |
|------|---------|------|
| `v3_e2e_rnnt_encoder.xml/.bin` | Conformer encoder (main cost) | ~425 MB FP16 |
| `v3_e2e_rnnt_decoder.xml/.bin` | RNN-T decoder (prediction network) | ~2 MB |
| `v3_e2e_rnnt_joint.xml/.bin` | Joint network | ~1.3 MB |
| `tokenizer.model` | SentencePiece vocabulary (1024 subwords) | 250 KB |
| `config.json` | Original model config (for reference) | 2 KB |

## Device compatibility (Intel hardware)

Verified on Intel Core Ultra 9 285H (OpenVINO 2025.4.1):

| Device | Encoder | Decoder | Joint | Usable? |
|--------|---------|---------|-------|---------|
| CPU | ✅ | ✅ | ✅ | Yes (~34× RTFx on 10 s chunk) |
| GPU.0 (Arc Xe2 iGPU) | ✅ | ✅ | ✅ | **Yes (~520× RTFx on encoder alone)** |
| NPU | ❌ (dynamic shapes) | ✅ | ❌ (dynamic shapes) | Partial only |

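RTFx in the table above is read here as the inverse real-time factor: seconds of audio processed per second of wall-clock time. The following is a minimal sketch of that arithmetic, not the benchmark script used for this card. `rtfx` and `measure_rtfx` are hypothetical helper names, and the callable passed to `measure_rtfx` stands in for a real inference call.

```python
import time
from typing import Callable


def rtfx(audio_seconds: float, elapsed_seconds: float) -> float:
    """Inverse real-time factor: audio duration divided by processing time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return audio_seconds / elapsed_seconds


def measure_rtfx(run: Callable[[], object], audio_seconds: float, repeats: int = 5) -> float:
    """Time `run` several times and report RTFx for the fastest run."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()  # hypothetical workload, e.g. one encoder inference on a 10 s chunk
        best = min(best, time.perf_counter() - start)
    return rtfx(audio_seconds, best)


# A 10 s chunk processed in 0.25 s corresponds to 40x RTFx.
print(rtfx(10.0, 0.25))
```

Read the other way round, ~34× RTFx on a 10 s chunk means roughly 10 / 34 ≈ 0.29 s of processing time.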

**Recommended device: Intel Arc iGPU (GPU.0).** It is the fastest option and does not compete with a discrete NVIDIA GPU for VRAM.

The NPU fails to compile the encoder and joint networks because the exported ONNX has dynamic input shapes (upper bounds of `9223372036854775807`, the int64 maximum, i.e. effectively unbounded). A re-export with a static reshape to 10 s chunks would likely unlock the NPU.

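One possible way to try the static reshape is sketched below. It has not been verified on the NPU. `mel_frames` and `save_static_ir` are hypothetical helpers. The frame count assumes an STFT without centre padding and uses the 20 ms window and 10 ms hop listed in this card. The input order and the `[batch, mel_bins, frames]` layout in the commented call are assumptions that should be checked against `model.inputs` first.

```python
from typing import Dict, List


def mel_frames(seconds: float, sample_rate: int = 16000,
               win_ms: int = 20, hop_ms: int = 10, center: bool = False) -> int:
    """Number of STFT frames for a clip of the given length."""
    samples = int(round(seconds * sample_rate))
    win = sample_rate * win_ms // 1000
    hop = sample_rate * hop_ms // 1000
    if center:
        # Centre padding adds half a window on each side.
        return 1 + samples // hop
    return 1 + (samples - win) // hop


def save_static_ir(xml_in: str, xml_out: str, shapes: Dict[int, List[int]]) -> None:
    """Fix input shapes (keyed by input index) and save a new IR."""
    import openvino as ov  # imported lazily so mel_frames works without OpenVINO

    model = ov.Core().read_model(xml_in)
    model.reshape(shapes)
    if model.is_dynamic():
        raise RuntimeError("model still has dynamic dimensions after reshape")
    ov.save_model(model, xml_out)


frames = mel_frames(10.0)
print(frames)

# Hypothetical call; input order and layout are assumptions, not taken from the export:
# save_static_ir("v3_e2e_rnnt_encoder.xml", "encoder_static.xml",
#                {0: [1, 64, frames], 1: [1]})
```

With a fixed frame count, shorter audio would need to be padded to the full chunk length before inference.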

## Usage (Python, pure OpenVINO)

```python
import openvino as ov

core = ov.Core()
encoder = core.compile_model("v3_e2e_rnnt_encoder.xml", "GPU.0")
decoder = core.compile_model("v3_e2e_rnnt_decoder.xml", "GPU.0")
joint = core.compile_model("v3_e2e_rnnt_joint.xml", "GPU.0")

# Pipeline outline (steps not implemented here):
# 1. Preprocess: 16 kHz mono audio -> log-mel features (64 bins, 20 ms window, 10 ms hop)
# 2. Encoder: features -> encoder outputs
# 3. Decoder + Joint: RNN-T greedy decode loop -> token IDs
# 4. SentencePieceProcessor(model_file="tokenizer.model").decode(ids) -> text
```

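The RNN-T greedy decode loop named in the outline can be sketched independently of OpenVINO. This is a generic illustration of the technique, not the Voice Scribe implementation. `decoder_step` and `joint_step` are hypothetical callables that would wrap the compiled decoder and joint models. `blank_id` and `max_symbols_per_frame` are assumptions, not values read from the model config.

```python
from typing import Any, Callable, List, Sequence, Tuple


def rnnt_greedy_decode(
    enc_frames: Sequence[Any],
    decoder_step: Callable[[int, Any], Tuple[Any, Any]],
    joint_step: Callable[[Any, Any], Sequence[float]],
    blank_id: int,
    max_symbols_per_frame: int = 10,
) -> List[int]:
    """Greedy RNN-T decoding: emit the best token per step until blank."""
    tokens: List[int] = []
    last_token, state = blank_id, None  # start from blank with an empty state
    for frame in enc_frames:
        for _ in range(max_symbols_per_frame):
            dec_out, new_state = decoder_step(last_token, state)
            scores = joint_step(frame, dec_out)
            best = max(range(len(scores)), key=scores.__getitem__)
            if best == blank_id:
                break  # blank: move on to the next encoder frame
            tokens.append(best)
            last_token, state = best, new_state  # commit state only on emission
    return tokens


# Toy stand-ins: each "frame" names the token it wants, and the joint
# answers blank once the decoder has already seen that token.
BLANK = 2


def toy_decoder(token, state):
    return token, None


def toy_joint(frame, dec_out):
    want = BLANK if dec_out == frame else frame
    return [1.0 if i == want else 0.0 for i in range(3)]


print(rnnt_greedy_decode([0, 1, 1, 2], toy_decoder, toy_joint, BLANK))  # prints [0, 1]
```

In a real backend, `decoder_step` would presumably feed the previous token and the recurrent state to the compiled decoder (with zero-initialised state in place of `None`), and `joint_step` would run the joint network on one encoder frame. The resulting token IDs then go to the SentencePiece tokenizer.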

A reference Python backend is available in the [Voice Scribe](https://github.com/andrewsabn/voice-scribe) project (MIT license).

## Credits

- Original model: [Sber / ai-sage/GigaAM-v3](https://huggingface.co/ai-sage/GigaAM-v3) (MIT)
- OpenVINO conversion: [Voice Scribe project](https://github.com/andrewsabn/voice-scribe)

## License

MIT (matches upstream ai-sage/GigaAM-v3).