---
license: mit
language: ru
library_name: openvino
tags:
  - speech-recognition
  - russian
  - openvino
  - rnn-t
  - conformer
  - gigaam
base_model: ai-sage/GigaAM-v3
pipeline_tag: automatic-speech-recognition
---

# GigaAM-v3 e2e_rnnt (OpenVINO IR, pre-converted)

OpenVINO IR port of [ai-sage/GigaAM-v3](https://huggingface.co/ai-sage/GigaAM-v3), revision `e2e_rnnt`: Sber's state-of-the-art Russian ASR model (220M parameters, Conformer encoder with an RNN-T decoder, end-to-end punctuation and capitalization).

The conversion was done with:

```python
import openvino as ov
import torch
from transformers import AutoModel

# Step 1: export the encoder, decoder, and joint network to ONNX in FP16.
model = AutoModel.from_pretrained(
    "ai-sage/GigaAM-v3", revision="e2e_rnnt", trust_remote_code=True
)
model.to_onnx(dir_path="onnx", dtype=torch.float16)

# Step 2: convert each ONNX file to OpenVINO IR (.xml + .bin).
for part in ["encoder", "decoder", "joint"]:
    ov_model = ov.convert_model(f"onnx/v3_e2e_rnnt_{part}.onnx")
    ov.save_model(ov_model, f"v3_e2e_rnnt_{part}.xml")
```

## Files

| File | Purpose | Size |
|------|---------|------|
| `v3_e2e_rnnt_encoder.xml/.bin` | Conformer encoder (main cost) | ~425 MB FP16 |
| `v3_e2e_rnnt_decoder.xml/.bin` | RNN-T decoder (prediction network) | ~2 MB |
| `v3_e2e_rnnt_joint.xml/.bin` | Joint network | ~1.3 MB |
| `tokenizer.model` | SentencePiece vocabulary (1024 subwords) | 250 KB |
| `config.json` | Original model config (for reference) | 2 KB |

## Device compatibility (Intel hardware)

Verified on Intel Core Ultra 9 285H (OpenVINO 2025.4.1):

| Device | Encoder | Decoder | Joint | Usable? |
|--------|---------|---------|-------|---------|
| CPU | ✅ | ✅ | ✅ | Yes (~34× RTFx on a 10 s chunk) |
| GPU.0 (Arc Xe2 iGPU) | ✅ | ✅ | ✅ | **Yes (~520× RTFx on the encoder alone)** |
| NPU | ❌ (dynamic shapes) | ✅ | ❌ (dynamic shapes) | Partial only |
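
The card does not define RTFx. Assuming it means the inverse real-time factor (seconds of audio processed per second of wall-clock time), ~34× on a 10 s chunk corresponds to roughly 0.3 s of processing. The following is a minimal sketch of how such a figure could be measured; `measure_rtfx` and its parameters are hypothetical helpers, not part of this repository.

```python
import time


def rtfx(audio_seconds, elapsed_seconds):
    """Inverse real-time factor: audio duration divided by processing time."""
    return audio_seconds / elapsed_seconds


def measure_rtfx(fn, audio_seconds, warmup=1, runs=5):
    """Time a zero-argument callable and return its average RTFx.

    `fn` is assumed to run one full inference over `audio_seconds` of audio.
    Warm-up runs are excluded so one-off compilation cost is not counted.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = (time.perf_counter() - start) / runs
    return rtfx(audio_seconds, elapsed)
```

For example, `measure_rtfx(lambda: encoder(inputs), audio_seconds=10.0)` would time a compiled encoder on a 10 s chunk, where `inputs` is whatever input mapping the model expects.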

**Recommended device: Intel Arc iGPU (`GPU.0`).** It is the fastest option and does not compete for VRAM with workloads running on an NVIDIA GPU.
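
On machines without an Intel GPU, a fallback to CPU may be useful. The helper below is a small sketch of that choice; `pick_device` is a hypothetical name, and the preference order is an assumption based on the table above rather than something the original project prescribes.

```python
def pick_device(available, preference=("GPU", "CPU")):
    """Return the first available device matching the preference order.

    `available` is a list of OpenVINO device names, for example the value of
    `openvino.Core().available_devices`. A prefix such as "GPU" matches both
    "GPU" and indexed names like "GPU.0".
    """
    for prefix in preference:
        for name in available:
            if name == prefix or name.startswith(prefix + "."):
                return name
    raise RuntimeError(f"None of {preference} found in {available}")


# Usage (requires OpenVINO to be installed):
# import openvino as ov
# device = pick_device(ov.Core().available_devices)
```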

NPU compilation fails for the encoder and the joint network because the exported ONNX has dynamic input shapes (upper bounds of `9223372036854775807`, the int64 maximum, which means unbounded). A re-export with a static reshape to 10 s chunks would likely unlock the NPU.
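
As an untested sketch of the static-shape idea, OpenVINO's `Model.reshape` can pin dynamic dimensions before compilation. The input order, the `[batch, mel_bins, frames]` layout, and the one-frame-per-hop count are assumptions here; inspect the printed shapes of the actual IR before relying on them. `frames_for_chunk` and `load_static` are hypothetical helper names.

```python
def frames_for_chunk(seconds, sample_rate=16000, hop_ms=10):
    """Feature frames in a fixed-length chunk, assuming one frame per hop."""
    hop = sample_rate * hop_ms // 1000  # 160 samples at 16 kHz
    return int(seconds * sample_rate) // hop


def load_static(xml_path, shapes):
    """Read an IR model, print its input shapes, and pin them to `shapes`.

    `shapes` maps an input index or tensor name to a fully static shape.
    """
    import openvino as ov  # imported lazily so the helper above works without it

    model = ov.Core().read_model(xml_path)
    for inp in model.inputs:
        print(inp.get_any_name(), inp.get_partial_shape())
    model.reshape(shapes)
    return model


# Usage (layout and input index are assumptions, and any length input
# would need a static shape as well):
# frames = frames_for_chunk(10)
# encoder = load_static("v3_e2e_rnnt_encoder.xml", {0: [1, 64, frames]})
# compiled = ov.Core().compile_model(encoder, "NPU")
```

Audio shorter than the fixed chunk would then need padding to exactly that length, and whether the NPU compiler accepts the resulting graph is not verified here.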

## Usage (Python, pure OpenVINO)

```python
import openvino as ov
core = ov.Core()
encoder = core.compile_model("v3_e2e_rnnt_encoder.xml", "GPU.0")
decoder = core.compile_model("v3_e2e_rnnt_decoder.xml", "GPU.0")
joint   = core.compile_model("v3_e2e_rnnt_joint.xml",   "GPU.0")

# The remaining pipeline steps are outlined here, not implemented:
# 1. Preprocess: 16 kHz mono audio -> log-mel features (64 bins, 20 ms window, 10 ms hop)
# 2. Encoder: features -> encoder outputs
# 3. Decoder + joint: RNN-T greedy decode loop -> token IDs
# 4. SentencePieceProcessor(model_file="tokenizer.model").decode(ids) -> text
```
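
The greedy decode loop outlined in the comments can be sketched independently of the model files. The version below is a generic RNN-T greedy decoder, not the reference backend's code: `decoder_step` and `joint_step` are hypothetical callables that would wrap the compiled decoder and joint models, and using the blank ID as the start token is an assumption.

```python
def argmax(values):
    """Index of the largest element; works for lists and 1-D NumPy arrays."""
    return max(range(len(values)), key=lambda i: values[i])


def rnnt_greedy_decode(enc_out, decoder_step, joint_step, blank_id,
                       max_symbols_per_frame=10):
    """Greedy RNN-T decoding over a sequence of encoder frames.

    enc_out:      iterable of encoder frames, one per time step
    decoder_step: (token, state) -> (prediction, new_state)
    joint_step:   (frame, prediction) -> logits over the vocabulary
    """
    tokens = []
    # Prime the prediction network (assumption: blank doubles as start token).
    pred, state = decoder_step(blank_id, None)
    for frame in enc_out:
        # Emit symbols for this frame until blank, with a cap against loops.
        for _ in range(max_symbols_per_frame):
            token = argmax(joint_step(frame, pred))
            if token == blank_id:
                break  # blank: move on to the next encoder frame
            tokens.append(token)
            # Only non-blank symbols advance the prediction network.
            pred, state = decoder_step(token, state)
    return tokens
```

The tensor names and state layout of the actual decoder and joint IR files are not documented in this card, so the two callables have to be written after inspecting `decoder.inputs` and `joint.inputs` on the compiled models.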

A reference Python backend is available in the [Voice Scribe](https://github.com/andrewsabn/voice-scribe) project (MIT license).

## Credits

- Original model: [Sber / ai-sage/GigaAM-v3](https://huggingface.co/ai-sage/GigaAM-v3) (MIT)
- OpenVINO conversion: [Voice Scribe project](https://github.com/andrewsabn/voice-scribe)

## License

MIT (matches upstream ai-sage/GigaAM-v3).