Voice Scribe mirror gigaam from Andrewsab/gigaam-v3-e2e-rnnt-ov@dff16933a640

	---
	license: mit
	language: ru
	library_name: openvino
	tags:
	- speech-recognition
	- russian
	- openvino
	- rnn-t
	- conformer
	- gigaam
	base_model: ai-sage/GigaAM-v3
	pipeline_tag: automatic-speech-recognition
	---

	# GigaAM-v3 e2e_rnnt (OpenVINO IR, pre-converted)

	OpenVINO IR port of [ai-sage/GigaAM-v3](https://huggingface.co/ai-sage/GigaAM-v3) revision `e2e_rnnt` — Sber's SOTA Russian ASR model (220M parameters, Conformer + RNN-T with end-to-end punctuation and capitalization).

	Conversion done with:

	```python
	from transformers import AutoModel
	import torch
	model = AutoModel.from_pretrained("ai-sage/GigaAM-v3", revision="e2e_rnnt", trust_remote_code=True)
	model.to_onnx(dir_path="onnx", dtype=torch.float16)
	# Then convert each exported ONNX graph to OpenVINO IR (.xml + .bin):
	import openvino as ov
	for f in ["encoder", "decoder", "joint"]:
	    m = ov.convert_model(f"onnx/v3_e2e_rnnt_{f}.onnx")
	    ov.save_model(m, f"v3_e2e_rnnt_{f}.xml")
	```

	## Files

	\| File \| Purpose \| Size \|
	\|------\|---------\|------\|
	\| `v3_e2e_rnnt_encoder.xml/.bin` \| Conformer encoder (main cost) \| ~425 MB FP16 \|
	\| `v3_e2e_rnnt_decoder.xml/.bin` \| RNN-T decoder (prediction network) \| ~2 MB \|
	\| `v3_e2e_rnnt_joint.xml/.bin` \| Joint network \| ~1.3 MB \|
	\| `tokenizer.model` \| SentencePiece vocabulary (1024 subwords) \| 250 KB \|
	\| `config.json` \| Original model config (for reference) \| 2 KB \|

	## Device compatibility (Intel hardware)

	Verified on Intel Core Ultra 9 285H (OpenVINO 2025.4.1):

	\| Device \| Encoder \| Decoder \| Joint \| Usable? \|
	\|--------\|---------\|---------\|-------\|---------\|
	\| CPU \| ✅ \| ✅ \| ✅ \| Yes (~34× RTFx on 10 s chunk) \|
	\| GPU.0 (Arc Xe2 iGPU) \| ✅ \| ✅ \| ✅ \| Yes (~520× RTFx on encoder alone) \|
	\| NPU \| ❌ (dynamic shapes) \| ✅ \| ❌ (dynamic shapes) \| Partial only \|
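
The RTFx figures in the table can be turned into wall-clock latency. The sketch below assumes the usual definition of RTFx (seconds of audio processed per second of compute), which this card does not spell out; the helper names are illustrative only.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """Wall-clock time implied by a given RTFx for a clip of the given length."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


# Figures taken from the table above, for a 10 s chunk.
cpu_seconds = processing_time(10.0, 34.0)    # CPU row
gpu_seconds = processing_time(10.0, 520.0)   # GPU.0 row (encoder only)
print(f"CPU: {cpu_seconds * 1000:.0f} ms, GPU encoder: {gpu_seconds * 1000:.0f} ms")
```

Under that definition, a 10 s chunk takes roughly 0.29 s on the CPU, and the encoder pass on GPU.0 takes roughly 0.02 s.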

	Recommended device: Intel Arc iGPU (GPU.0). It was the fastest device tested and leaves the VRAM of any NVIDIA GPU in the system free for other workloads.
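
A minimal sketch of how an application might apply this recommendation with a CPU fallback. `pick_device` and its preference order are hypothetical, not part of this repository; the only OpenVINO API referenced (in the trailing comment) is the `available_devices` property of `ov.Core`.

```python
from typing import Iterable, Sequence


def pick_device(
    available: Iterable[str],
    preferred: Sequence[str] = ("GPU.0", "GPU", "CPU"),
) -> str:
    """Return the first preferred device that is present, falling back to CPU.

    NPU is deliberately absent from the default preference list because the
    encoder and joint network do not compile on it (see the table above).
    """
    present = set(available)
    for name in preferred:
        if name in present:
            return name
    return "CPU"


# With OpenVINO installed, the device list would come from the runtime:
#   import openvino as ov
#   device = pick_device(ov.Core().available_devices)
print(pick_device(["CPU", "GPU.0", "GPU.1", "NPU"]))
```

OpenVINO typically reports a single GPU as `GPU` and several as `GPU.0`, `GPU.1`, and so on, which is why both spellings are in the default preference list.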

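	Compilation for the NPU fails on the encoder and the joint network because the exported ONNX graphs have dynamic input shapes (upper bound `9223372036854775807`, i.e. the maximum signed 64-bit integer, meaning unbounded). A re-export with static shapes fixed to 10 s chunks would likely unlock the NPU.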
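
Choosing a static shape requires knowing how many feature frames a 10 s chunk produces. The sketch below uses the front-end parameters listed in the Usage section (16 kHz, 20 ms window, 10 ms hop, 64 mel bins). Whether the front end pads the signal (the `centered` flag) and the `(batch, mels, frames)` layout are assumptions; check the actual encoder input before relying on them.

```python
SAMPLE_RATE = 16_000  # Hz
WIN_MS = 20           # analysis window length
HOP_MS = 10           # hop between consecutive frames
N_MELS = 64           # mel bins per frame


def num_frames(num_samples: int, centered: bool = False) -> int:
    """Number of STFT frames for a signal of `num_samples` samples.

    centered=False: frames only where a full window fits in the signal.
    centered=True:  signal is padded by half a window on each side.
    """
    win = SAMPLE_RATE * WIN_MS // 1000  # 320 samples
    hop = SAMPLE_RATE * HOP_MS // 1000  # 160 samples
    if centered:
        return 1 + num_samples // hop
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop


chunk_samples = 10 * SAMPLE_RATE
frames = num_frames(chunk_samples)
# Candidate static encoder input shape under the assumed layout.
static_shape = (1, N_MELS, frames)
print(static_shape)
```

With these assumptions a 10 s chunk yields 999 frames without padding, or 1001 with centered padding. A static shape of this kind could then be applied before compilation, for example through OpenVINO's `Model.reshape`, using the input names reported by the model itself.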

	## Usage (Python, pure OpenVINO)

	```python
	import openvino as ov
	core = ov.Core()
	encoder = core.compile_model("v3_e2e_rnnt_encoder.xml", "GPU.0")
	decoder = core.compile_model("v3_e2e_rnnt_decoder.xml", "GPU.0")
	joint = core.compile_model("v3_e2e_rnnt_joint.xml", "GPU.0")

	# Preprocess: audio 16 kHz mono -> log-mel (64 bins, 20 ms win, 10 ms hop)
	# Encoder: features -> encoder outputs
	# Decoder + Joint: RNN-T greedy decode loop -> token IDs
	# sentencepiece.SentencePieceProcessor(model_file="tokenizer.model").decode(ids) -> text
	```
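
The decode step outlined in the comments above follows the standard greedy RNN-T scheme, sketched here in plain Python. The `predict` and `joint` callables are stand-ins for wrappers around the compiled decoder and joint models. Their signatures, the blank index, the state layout, and the per-frame symbol cap are all assumptions made for illustration, not this model's actual interface.

```python
from typing import Any, Callable, List, Sequence, Tuple


def greedy_rnnt_decode(
    encoder_frames: Sequence[Any],
    predict: Callable[[int, Any], Tuple[Any, Any]],
    joint: Callable[[Any, Any], Sequence[float]],
    blank_id: int,
    initial_state: Any = None,
    max_symbols_per_frame: int = 10,
) -> List[int]:
    """Greedy RNN-T decoding over precomputed encoder frames."""
    tokens: List[int] = []
    # Prime the prediction network with blank as the start symbol.
    pred_out, state = predict(blank_id, initial_state)
    for frame in encoder_frames:
        # Emit symbols for this frame until blank wins or the cap is hit.
        for _ in range(max_symbols_per_frame):
            logits = joint(frame, pred_out)
            best = max(range(len(logits)), key=lambda i: logits[i])
            if best == blank_id:
                break  # move on to the next encoder frame
            tokens.append(best)
            # Only a non-blank emission advances the prediction network.
            pred_out, state = predict(best, state)
    return tokens


# Toy stand-ins: each "frame" names the token it wants emitted once.
BLANK = 3


def toy_predict(token: int, state: Any) -> Tuple[int, Any]:
    return token, state  # the "prediction output" is just the last token


def toy_joint(frame_target: int, last_token: int) -> List[float]:
    logits = [0.0] * 4
    logits[BLANK if frame_target == last_token else frame_target] = 1.0
    return logits


print(greedy_rnnt_decode([0, 0, 2, 1], toy_predict, toy_joint, blank_id=BLANK))
# prints [0, 2, 1]
```

In a real pipeline the returned IDs would be passed to the SentencePiece tokenizer for decoding into text.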

	A reference Python backend is available in the [Voice Scribe](https://github.com/andrewsabn/voice-scribe) project (MIT license).

	## Credits

	- Original model: [Sber / ai-sage/GigaAM-v3](https://huggingface.co/ai-sage/GigaAM-v3) (MIT)
	- OpenVINO conversion: [Voice Scribe project](https://github.com/andrewsabn/voice-scribe)

	## License

	MIT (matches upstream ai-sage/GigaAM-v3).