GigaAM-v3 e2e_rnnt — calibrated build for Intel NPU
OpenVINO IR of the encoder of Sber GigaAM-v3
(e2e_rnnt revision), quantized and calibrated for the Intel AI Boost NPU.
- Static input shape: audio_signal [1, 64, 3000], length [1] (30 s mel chunks).
- Quantization: nncf.quantize_with_accuracy_control (MIXED preset) with cosine-similarity validation, max_drop=0.02 absolute. The API performs full PTQ and then selectively reverts individual layers to their original precision until the validation metric is within the threshold, so the emitted IR stays within the specified drop on the validation set.
- Calibration corpus: 740 Russian voice samples drawn from 500 Sber Golos (OpenSLR 114) clips stratified by audio length, 200 Common Voice Spontaneous Speech 3.0 Russian clips, 56 diverse edge-tts synthetic clips, 8 of the author's own recordings, and 2 test samples.
- Target: Intel NPU plugin in OpenVINO >= 2025.4. Also runs on CPU/GPU.
- Size: 215 MB (.bin).
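Because the input shape is static, every chunk has to be brought to exactly 3000 mel frames before inference. The sketch below shows one way to do that on plain Python lists. The zero pad value, the helper name pad_or_trim, and the convention that length carries the number of valid (unpadded) frames are assumptions for illustration, not details confirmed by this model card.

```python
from typing import List, Tuple

N_MELS = 64      # mel bins, from the static shape [1, 64, 3000]
N_FRAMES = 3000  # frames per 30 s chunk, from the static shape


def pad_or_trim(
    mel: List[List[float]],
    n_mels: int = N_MELS,
    n_frames: int = N_FRAMES,
    pad_value: float = 0.0,  # assumption: pad with zeros
) -> Tuple[List[List[float]], int]:
    """Return (mel padded or trimmed to n_frames, number of valid frames)."""
    if len(mel) != n_mels:
        raise ValueError(f"expected {n_mels} mel bins, got {len(mel)}")
    valid_frames = min(len(mel[0]), n_frames)
    fixed = []
    for row in mel:
        kept = list(row[:n_frames])                         # trim long input
        kept.extend([pad_value] * (n_frames - len(kept)))   # pad short input
        fixed.append(kept)
    return fixed, valid_frames


# A 12 s clip at 100 frames per second has 1200 frames.
short_clip = [[1.0] * 1200 for _ in range(N_MELS)]
padded, length = pad_or_trim(short_clip)
```

The padded array would then be wrapped in a leading batch dimension to form audio_signal, with length passed as a one-element tensor.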
Accuracy
Benchmark: 28-second real Russian dictation scored against the CPU FP32 reference transcript.
| Device / build | Bag-of-words recall | Character Error Rate |
|---|---|---|
| This NPU build (quantize_with_accuracy_control) | 96.9% | 1.6% |
| NPU INT8 weight-only (no activation calibration) | 71.9% | 23.4% |
| NPU FP16 baseline (compress_to_fp16) | 71.9% | 22.6% |
| Intel Arc iGPU FP16 (canonical sibling, for reference) | 100.0% | 0.0% |
| CPU FP32 (reference) | 100.0% | 0.0% |
Sample output vs reference (only one substitution):
reference : ...Меня зовут Андрей Сабынин. Я сетеом в Новакарт. Я работаю над проектами...
NPU build : ...Меня зовут Андрей Сабынин. Я сетевой Новакарт. Я работаю над проектами...
                                            ^^^^^^^^
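The two metrics in the table can be sketched in a few lines of standard-library Python. The exact definitions used for the benchmark are not given here, so this assumes the usual ones: Character Error Rate as character-level Levenshtein distance divided by reference length, and bag-of-words recall as the fraction of reference words (counted with multiplicity, case-insensitive) that also appear in the hypothesis. The function names are illustrative.

```python
from collections import Counter


def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def bag_of_words_recall(reference: str, hypothesis: str) -> float:
    """Share of reference words that also occur in the hypothesis."""
    ref_words = Counter(reference.lower().split())
    hyp_words = Counter(hypothesis.lower().split())
    if not ref_words:
        raise ValueError("reference must not contain zero words")
    matched = sum((ref_words & hyp_words).values())
    return matched / sum(ref_words.values())
```

Scoring against the CPU FP32 transcript rather than a human transcript means both metrics measure drift from the reference build, not absolute recognition quality.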
Why quantize_with_accuracy_control wins: the standard PTQ modes (MIXED, PERFORMANCE)
apply activation quantization aggressively and fall to ~65–72% BoW on this architecture.
Accuracy-control mode measures the actual output drift per layer and rolls back quantization
exactly where it hurts, emitting a hybrid model. The resulting IR keeps the numerically
most sensitive layers (softmax, layer-norms, selected attention projections) in their
original precision.
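The rollback idea can be illustrated with a toy model. This is not NNCF's implementation: it assumes each quantized layer contributes a fixed, additive drop to the validation metric, and it greedily reverts the worst offender until the total drop fits under max_drop. The layer names and drop values are made up for the example.

```python
from typing import Dict, List, Tuple


def rollback_until_within_drop(
    layer_drops: Dict[str, float], max_drop: float
) -> Tuple[List[str], List[str]]:
    """Toy accuracy control.

    layer_drops maps a layer name to the metric drop it causes when
    quantized (assumed additive). Returns (layers kept quantized,
    layers reverted to original precision, in rollback order).
    """
    quantized = dict(layer_drops)
    reverted: List[str] = []
    while quantized and sum(quantized.values()) > max_drop:
        worst = max(quantized, key=quantized.get)  # most harmful layer
        reverted.append(worst)
        del quantized[worst]
    return sorted(quantized), reverted


# Hypothetical per-layer drops, for illustration only.
example_drops = {
    "softmax": 0.20,
    "layer_norm": 0.08,
    "attn_proj": 0.03,
    "ffn": 0.01,
    "conv": 0.005,
}
kept, rolled_back = rollback_until_within_drop(example_drops, max_drop=0.02)
```

In the real pipeline the drop is re-measured on the validation set after each rollback step rather than summed from fixed per-layer values.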
First-compile cost
The IR produced by accuracy control has a hybrid INT8/FP16 topology, and the Intel NPU compiler spends a long time globally optimising it (layout transforms, op fusion, and memory planning over thousands of FakeQuantize boundary nodes).
With the correct NPU plugin properties (see "Fast compile" below), first compile on a
Core Ultra 9 285H drops to ~3.5 minutes (down from 92 min with default properties).
The compiled blob is cached in $GIGAAM_CACHE_DIR
(default %PROGRAMDATA%\Voice Scribe\gigaam_cache\ under Voice Scribe); subsequent
service starts load in ~1 second.
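A minimal sketch of how the cache location described above could be resolved. The helper name resolve_cache_dir and the fallback to C:\ProgramData when PROGRAMDATA is unset are assumptions; how Voice Scribe itself wires this up is not shown in this card. OpenVINO's model cache is enabled through the CACHE_DIR property, shown in the trailing comment.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the compiled-blob cache directory.

    GIGAAM_CACHE_DIR wins if set; otherwise fall back to
    %PROGRAMDATA%\\Voice Scribe\\gigaam_cache.
    """
    env = os.environ if env is None else env
    override = env.get("GIGAAM_CACHE_DIR")
    if override:
        return Path(override)
    # Assumption: use C:\ProgramData if PROGRAMDATA is not defined.
    program_data = env.get("PROGRAMDATA", r"C:\ProgramData")
    return Path(program_data) / "Voice Scribe" / "gigaam_cache"


# Usage with OpenVINO (not executed here):
#   import openvino as ov
#   core = ov.Core()
#   core.set_property({"CACHE_DIR": str(resolve_cache_dir())})
#   compiled = core.compile_model("v3_e2e_rnnt_encoder_static.xml", "NPU")
```

The cache only helps if the same directory survives between service starts, so a per-user temp directory would defeat it.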
Fast compile — required NPU properties
Pass this property set to ov.Core before calling compile_model to cut first-compile
time from 92 min to ~3.5 min (26× speedup) on the same hybrid IR, with zero accuracy
impact (the weights are unchanged; only the compilation strategy changes):
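import openvino as ov

core = ov.Core()
# Scope the properties to the NPU device; set them before core.compile_model(...).
core.set_property("NPU", {
    "PERFORMANCE_HINT": "LATENCY",
    "MODEL_PRIORITY": "HIGH",
    "NPU_TURBO": "YES",
    "NPU_QDQ_OPTIMIZATION_AGGRESSIVE": "YES",
    "COMPILATION_NUM_THREADS": 8,
})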
Measured on 285H AI Boost, OpenVINO 2025.4.1, driver 32.0.100.4023:
| Setup | First compile | Warm (cache) |
|---|---|---|
| Default properties | ~92 min | ~1 s |
| Above knob set | ~3.5 min | ~1 s |
These knobs are listed in the NPU plugin's SUPPORTED_PROPERTIES. They are safe
no-ops on non-NPU targets; Voice Scribe's gigaam_backend.py sets them under try/except
so older drivers or plugins without support fall back gracefully.
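The try/except fallback could look roughly like the sketch below. It is not the code from gigaam_backend.py: applying the properties one at a time, so that a single unsupported knob does not discard the rest, is an assumption, as is the helper name apply_npu_properties. The function only relies on core.set_property(device, {name: value}), so it works with ov.Core or any object with the same method.

```python
from typing import Any, Dict, List

# Property set from the section above.
NPU_COMPILE_PROPERTIES: Dict[str, Any] = {
    "PERFORMANCE_HINT": "LATENCY",
    "MODEL_PRIORITY": "HIGH",
    "NPU_TURBO": "YES",
    "NPU_QDQ_OPTIMIZATION_AGGRESSIVE": "YES",
    "COMPILATION_NUM_THREADS": 8,
}


def apply_npu_properties(
    core: Any,
    properties: Dict[str, Any] = NPU_COMPILE_PROPERTIES,
    device: str = "NPU",
) -> List[str]:
    """Set each property individually; return the names that were accepted."""
    applied: List[str] = []
    for name, value in properties.items():
        try:
            core.set_property(device, {name: value})
            applied.append(name)
        except Exception:
            # Older drivers or plugins may reject a property; skip it and
            # keep the defaults for that knob.
            continue
    return applied


# Usage with OpenVINO (not executed here):
#   import openvino as ov
#   core = ov.Core()
#   accepted = apply_npu_properties(core)
```

Logging the returned list at startup makes it easy to tell whether a slow first compile is caused by a rejected property.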
If 3.5 minutes is still too slow for your deployment, fall back to the canonical Arc iGPU build at Andrewsab/gigaam-v3-e2e-rnnt-ov (15 s compile, 100% accuracy on Arc, but does not run on NPU).
How it was built
Pipeline (reproducer in github.com/andrewsabn/voice-scribe under scratch/):
- transformers.AutoModel.from_pretrained("ai-sage/GigaAM-v3", revision="e2e_rnnt", trust_remote_code=True).to_onnx()
- openvino.convert_model(onnx, input=[("audio_signal", [1, 64, 3000], f32), ("length", [1], i64)])
- Stratified calibration dataset: 740 Russian voice clips bucketed by duration (2 s / 5 s / 8 s / 12 s / 15 s).
- nncf.quantize_with_accuracy_control(model, calibration_dataset, validation_dataset, validation_fn, max_drop=0.02, drop_type=DropType.ABSOLUTE, subset_size=300), with validation_fn returning mean cosine similarity between the quantized and FP32 encoder outputs over 20 held-out samples.
NNCF 3.1.0, OpenVINO 2025.4.
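The validation metric from the last step can be sketched without any framework. The code below assumes each encoder output has already been flattened to a plain sequence of floats and that the FP32 baseline scores 1.0 against itself; the function names are illustrative, and the glue that runs the compiled model inside validation_fn is omitted.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened output vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero output as no agreement
    return dot / (norm_a * norm_b)


def mean_cosine_similarity(
    quantized_outputs: Sequence[Sequence[float]],
    reference_outputs: Sequence[Sequence[float]],
) -> float:
    """Average similarity over the held-out samples."""
    if not quantized_outputs or len(quantized_outputs) != len(reference_outputs):
        raise ValueError("need equally many quantized and reference outputs")
    scores = [
        cosine_similarity(q, r)
        for q, r in zip(quantized_outputs, reference_outputs)
    ]
    return sum(scores) / len(scores)


def within_max_drop(score: float, baseline: float = 1.0, max_drop: float = 0.02) -> bool:
    """Absolute-drop check, mirroring max_drop=0.02 with an ABSOLUTE drop type."""
    return baseline - score <= max_drop
```

With these definitions, a quantized encoder passes when its mean cosine similarity to the FP32 outputs is at least 0.98.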
Usage (with Voice Scribe)
Drop the two files from this repo (v3_e2e_rnnt_encoder_static.xml and v3_e2e_rnnt_encoder_static.bin) alongside the canonical GigaAM payload (from Andrewsab/gigaam-v3-e2e-rnnt-ov) in one directory:
models/gigaam-v3-e2e-rnnt-ov/
├── v3_e2e_rnnt_encoder.xml # canonical (Arc/CPU, dynamic shape)
├── v3_e2e_rnnt_encoder.bin
├── v3_e2e_rnnt_encoder_static.xml # this repo (NPU-calibrated)
├── v3_e2e_rnnt_encoder_static.bin
├── v3_e2e_rnnt_decoder.xml / .bin
├── v3_e2e_rnnt_joint.xml / .bin
└── tokenizer.model
Voice Scribe's GigaAM backend auto-detects the encoder: NPU → static-shape encoder,
Arc/CPU → dynamic encoder. To enable NPU execution, set
DEVICE_GIGAAM=NPU in C:\ProgramData\Voice Scribe\config.env and restart
the service, or pass /GIGAAM=yes /GIGAAM_NPU=yes to the installer.
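The device-to-encoder mapping described above can be sketched as follows. This is an illustration, not Voice Scribe's backend code: the helper names, the simple KEY=VALUE parsing of config.env, and the CPU default when DEVICE_GIGAAM is missing are all assumptions. The file names come from the directory listing above.

```python
from pathlib import Path

STATIC_ENCODER = "v3_e2e_rnnt_encoder_static.xml"  # NPU-calibrated, static shape
DYNAMIC_ENCODER = "v3_e2e_rnnt_encoder.xml"        # canonical, dynamic shape


def read_device(config_text: str, default: str = "CPU") -> str:
    """Extract DEVICE_GIGAAM from KEY=VALUE text (default is an assumption)."""
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "DEVICE_GIGAAM":
            return value.strip() or default
    return default


def select_encoder(model_dir: str, device: str) -> Path:
    """NPU gets the static-shape encoder; every other device the dynamic one."""
    name = STATIC_ENCODER if device.upper().startswith("NPU") else DYNAMIC_ENCODER
    return Path(model_dir) / name


device = read_device("# Voice Scribe config\nDEVICE_GIGAAM=NPU\n")
encoder_path = select_encoder("models/gigaam-v3-e2e-rnnt-ov", device)
```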
License
MIT (inherits from upstream ai-sage/GigaAM-v3).