---
license: mit
language: ru
library_name: openvino
tags:
- speech-recognition
- russian
- openvino
- rnn-t
- conformer
- gigaam
base_model: ai-sage/GigaAM-v3
pipeline_tag: automatic-speech-recognition
---
# GigaAM-v3 e2e_rnnt (OpenVINO IR, pre-converted)
OpenVINO IR port of [ai-sage/GigaAM-v3](https://huggingface.co/ai-sage/GigaAM-v3), revision `e2e_rnnt`: Sber's SOTA Russian ASR model (220M parameters, Conformer + RNN-T with end-to-end punctuation and capitalization).
The conversion was done with:
```python
from transformers import AutoModel
import torch
model = AutoModel.from_pretrained("ai-sage/GigaAM-v3", revision="e2e_rnnt", trust_remote_code=True)
model.to_onnx(dir_path="onnx", dtype=torch.float16)
# Then convert each exported ONNX graph to OpenVINO IR (.xml + .bin):
import openvino as ov
for f in ["encoder", "decoder", "joint"]:
    m = ov.convert_model(f"onnx/v3_e2e_rnnt_{f}.onnx")
    ov.save_model(m, f"v3_e2e_rnnt_{f}.xml")
```
## Files
| File | Purpose | Size |
|------|---------|------|
| `v3_e2e_rnnt_encoder.xml/.bin` | Conformer encoder (main cost) | ~425 MB FP16 |
| `v3_e2e_rnnt_decoder.xml/.bin` | RNN-T decoder (prediction network) | ~2 MB |
| `v3_e2e_rnnt_joint.xml/.bin` | Joint network | ~1.3 MB |
| `tokenizer.model` | SentencePiece vocabulary (1024 subwords) | 250 KB |
| `config.json` | Original model config (for reference) | 2 KB |
## Device compatibility (Intel hardware)
Verified on Intel Core Ultra 9 285H (OpenVINO 2025.4.1):
| Device | Encoder | Decoder | Joint | Usable? |
|--------|---------|---------|-------|---------|
| CPU | ✅ | ✅ | ✅ | Yes (~34× RTFx on a 10 s chunk) |
| GPU.0 (Arc Xe2 iGPU) | ✅ | ✅ | ✅ | **Yes (~520× RTFx on encoder alone)** |
| NPU | ❌ (dynamic shapes) | ✅ | ❌ (dynamic shapes) | Partial only |
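
To put the RTFx figures in the table into wall-clock terms: RTFx is the ratio of audio duration to processing time, so dividing the chunk length by RTFx gives the time spent per chunk. The sketch below assumes the GPU figure was also measured on a 10 s chunk, which the table does not state; `processing_time_s` is an illustrative helper, not part of any library.

```python
def processing_time_s(audio_s: float, rtfx: float) -> float:
    """Seconds of compute needed for `audio_s` seconds of audio at a given RTFx."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_s / rtfx


# RTFx values copied from the table; the 10 s chunk length for GPU.0 is an assumption.
for device, rtfx in [("CPU", 34.0), ("GPU.0, encoder only", 520.0)]:
    ms = processing_time_s(10.0, rtfx) * 1000
    print(f"{device}: ~{ms:.0f} ms per 10 s chunk")
```

Under these assumptions the CPU needs roughly 0.29 s per 10 s chunk, and the iGPU encoder roughly 19 ms.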
**Recommended device: Intel Arc iGPU (GPU.0).** It is the fastest option and does not compete for VRAM with workloads running on an NVIDIA GPU.
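
On machines without an Intel iGPU, a fallback to CPU is probably wanted. The helper below is a minimal sketch of that choice. `pick_device` and its preference order are hypothetical; the list of names would normally come from `ov.Core().available_devices`, which may report a single GPU as `GPU` rather than `GPU.0`.

```python
from typing import Iterable, Sequence


def pick_device(
    available: Iterable[str],
    preferred: Sequence[str] = ("GPU.0", "GPU", "CPU"),
) -> str:
    """Return the first preferred device name that is actually available."""
    available = list(available)
    for name in preferred:
        if name in available:
            return name
    raise RuntimeError(f"none of {list(preferred)} found in {available}")


# Typical use (requires openvino; shown as a comment to keep the sketch standalone):
#   import openvino as ov
#   device = pick_device(ov.Core().available_devices)
print(pick_device(["CPU", "GPU", "NPU"]))
```

NPU is left out of the default preference order because only the decoder compiles there.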
NPU compilation fails for the encoder and joint networks because the exported ONNX has dynamic input shapes (upper bounds `9223372036854775807`, the int64 maximum, i.e. effectively unbounded). A re-export with static shapes fixed at 10 s chunks would likely unlock the NPU.
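
A static re-export needs a fixed frame count for the 10 s chunk. That count follows from the preprocessing parameters listed in the usage section (16 kHz, 20 ms window, 10 ms hop), but it also depends on whether the front end pads the signal. The sketch below uses the usual STFT framing formulas; the `center` flag is an assumption to check against the upstream preprocessor, and `num_frames` is an illustrative helper.

```python
def num_frames(
    duration_s: float,
    sample_rate: int = 16_000,
    win_ms: int = 20,
    hop_ms: int = 10,
    center: bool = False,
) -> int:
    """Number of STFT frames for a clip, with or without centered (padded) framing."""
    samples = round(duration_s * sample_rate)
    win = sample_rate * win_ms // 1000   # 320 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000   # 160 samples at 16 kHz
    if center:
        # Signal is padded by win // 2 on both sides before framing.
        return 1 + samples // hop
    # No padding: only windows that fit entirely inside the signal count.
    return max(0, 1 + (samples - win) // hop)


print(num_frames(10.0, center=False))  # 999
print(num_frames(10.0, center=True))   # 1001
```

Whichever count matches the real front end would be the time dimension to fix, alongside the 64 mel bins, when re-exporting.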
## Usage (Python, pure OpenVINO)
```python
import openvino as ov
core = ov.Core()
encoder = core.compile_model("v3_e2e_rnnt_encoder.xml", "GPU.0")
decoder = core.compile_model("v3_e2e_rnnt_decoder.xml", "GPU.0")
joint = core.compile_model("v3_e2e_rnnt_joint.xml", "GPU.0")
# Preprocess: audio 16 kHz mono -> log-mel (64 bins, 20 ms win, 10 ms hop)
# Encoder: features -> encoder outputs
# Decoder + Joint: RNN-T greedy decode loop -> token IDs
# SentencePieceProcessor(tokenizer.model).decode(ids) -> text
```
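
The decode loop named in the comments above can be sketched independently of OpenVINO. The version below is a generic RNN-T greedy search, not the Voice Scribe implementation. `predict` and `joint` are hypothetical callables standing in for the compiled decoder and joint networks, and `blank_id` and `max_symbols_per_frame` are assumptions to check against the upstream config.

```python
from typing import Any, Callable, List, Sequence, Tuple


def rnnt_greedy_decode(
    enc_frames: Sequence[Any],
    predict: Callable[[int, Any], Tuple[Any, Any]],
    joint: Callable[[Any, Any], Sequence[float]],
    blank_id: int,
    max_symbols_per_frame: int = 10,
) -> List[int]:
    """Greedy RNN-T search: emit non-blank tokens per frame until blank is predicted."""
    tokens: List[int] = []
    # Prime the prediction network with blank as the start symbol (an assumption).
    dec_out, state = predict(blank_id, None)
    for frame in enc_frames:
        # Cap emissions per frame so a misbehaving model cannot loop forever.
        for _ in range(max_symbols_per_frame):
            logits = joint(frame, dec_out)
            best = max(range(len(logits)), key=lambda i: logits[i])
            if best == blank_id:
                break  # blank: advance to the next encoder frame
            tokens.append(best)
            # Non-blank: update the prediction network and stay on this frame.
            dec_out, state = predict(best, state)
    return tokens
```

With real models, `predict` and `joint` would wrap calls to the compiled decoder and joint networks, and the returned IDs would go to the SentencePiece tokenizer for decoding.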
A reference Python backend is available in the [Voice Scribe](https://github.com/andrewsabn/voice-scribe) project (MIT license).
## Credits
- Original model: [Sber / ai-sage/GigaAM-v3](https://huggingface.co/ai-sage/GigaAM-v3) (MIT)
- OpenVINO conversion: [Voice Scribe project](https://github.com/andrewsabn/voice-scribe)
## License
MIT (matches upstream ai-sage/GigaAM-v3).