GigaAM-v3 e2e_rnnt — calibrated build for Intel NPU

OpenVINO IR of the encoder of Sber GigaAM-v3 (e2e_rnnt revision), quantized and calibrated for the Intel AI Boost NPU.

  • Static input shape: audio_signal [1, 64, 3000], length [1] (30 s mel chunks).
  • Quantization: nncf.quantize_with_accuracy_control (MIXED preset), validated by cosine similarity with max_drop=0.02 (absolute). The API runs full PTQ, then reverts individual layers to their original precision until the validation metric is within the threshold, so the emitted IR stays within the specified drop on the validation set.
  • Calibration corpus: 740 Russian voice samples (mostly real recordings) = 500 Sber Golos (OpenSLR 114) stratified by audio length + 200 Common Voice Spontaneous Speech 3.0 Russian + 56 diverse synthetic edge-tts clips + 8 of the author's own recordings + 2 test samples.
  • Target: Intel NPU plugin in OpenVINO >= 2025.4. Also runs on CPU/GPU.
  • Size: 215 MB (.bin).
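
Because the input shape is static, every utterance has to be brought to exactly 3000 mel frames before inference. The following is a minimal sketch of that step, assuming the features arrive as a NumPy array of shape [64, T] and that zero-padding is acceptable for this model; the helper name pad_to_static is hypothetical and is not part of GigaAM or Voice Scribe.

```python
import numpy as np

N_MELS = 64      # mel bins, from the static shape [1, 64, 3000]
N_FRAMES = 3000  # frames per chunk, from the static shape


def pad_to_static(mel: np.ndarray):
    """Zero-pad a [64, T] mel spectrogram to [1, 64, 3000].

    Returns (audio_signal, length); length holds the number of valid frames.
    Zero-padding is an assumption of this sketch, not a documented requirement.
    """
    if mel.ndim != 2 or mel.shape[0] != N_MELS:
        raise ValueError(f"expected [{N_MELS}, T], got {mel.shape}")
    t = mel.shape[1]
    if t > N_FRAMES:
        raise ValueError(f"{t} frames exceed {N_FRAMES}; split the audio first")
    audio_signal = np.zeros((1, N_MELS, N_FRAMES), dtype=np.float32)
    audio_signal[0, :, :t] = mel
    length = np.array([t], dtype=np.int64)
    return audio_signal, length
```

The returned pair matches the two inputs named above (audio_signal as float32, length as int64); longer recordings would need to be split into chunks before this call.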

Accuracy

Benchmark: 28-second real Russian dictation scored against the CPU FP32 reference transcript.

Device / build                                           Bag-of-words recall   Character Error Rate
This NPU build (quantize_with_accuracy_control)          96.9%                 1.6%
NPU INT8 weight-only (no activation calibration)         71.9%                 23.4%
NPU FP16 baseline (compress_to_fp16)                     71.9%                 22.6%
Intel Arc iGPU FP16 (canonical sibling, for reference)   100.0%                0.0%
CPU FP32 (reference)                                     100.0%                0.0%
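
The card does not spell out how the two metrics are computed. As a rough sketch, one common reading is: Character Error Rate as character-level edit distance divided by the reference length, and bag-of-words recall as the share of reference word tokens that also occur in the hypothesis. Both definitions and both function names below are assumptions for illustration, not the author's scoring script.

```python
from collections import Counter


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level Levenshtein distance divided by the reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1] / max(len(reference), 1)


def bow_recall(reference: str, hypothesis: str) -> float:
    """Fraction of reference word tokens (as a multiset) found in the hypothesis."""
    ref = Counter(reference.lower().split())
    hyp = Counter(hypothesis.lower().split())
    matched = sum(min(count, hyp[word]) for word, count in ref.items())
    return matched / max(sum(ref.values()), 1)
```

Under these definitions word order does not affect bag-of-words recall, while CER is sensitive to every character-level difference.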

Sample output vs reference (the only differing span is marked):

reference :  ...Меня зовут Андрей Сабынин. Я сетеом в Новакарт. Я работаю над проектами...
NPU build :  ...Меня зовут Андрей Сабынин. Я сетевой Новакарт.   Я работаю над проектами...
                                                ^^^^^^^^

Why quantize_with_accuracy_control wins: the standard PTQ modes (MIXED, PERFORMANCE) apply activation quantization aggressively and fall to ~65–72% BoW on this architecture. Accuracy-control mode measures the actual output drift per layer and rolls back quantization exactly where it hurts, emitting a hybrid model. The resulting IR keeps the numerically most sensitive layers (softmax, layer-norms, selected attention projections) in their original precision.

First-compile cost

The IR produced by accuracy control has a hybrid INT8/FP16 topology, and the Intel NPU compiler spends a long time optimising it globally (layout transforms, op fusion, and memory planning across thousands of FakeQuantize boundary nodes).

With the correct NPU plugin properties (see "Fast compile" below), first compile on a Core Ultra 9 285H takes ~3.5 minutes, down from ~92 minutes with default properties. The compiled blob is cached in $GIGAAM_CACHE_DIR (under Voice Scribe the default is %PROGRAMDATA%\Voice Scribe\gigaam_cache\); subsequent service starts load in ~1 second.
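
Blob caching of this kind is typically driven by OpenVINO's CACHE_DIR property. The sketch below shows one way to wire it up, assuming the environment variable and default path named above; the function names are hypothetical, and how Voice Scribe actually resolves the directory is not shown in this card.

```python
import os
from pathlib import Path


def resolve_cache_dir(env=os.environ) -> Path:
    """Return $GIGAAM_CACHE_DIR, or a default under %PROGRAMDATA% (assumed layout)."""
    explicit = env.get("GIGAAM_CACHE_DIR")
    if explicit:
        return Path(explicit)
    program_data = env.get("PROGRAMDATA", r"C:\ProgramData")
    return Path(program_data) / "Voice Scribe" / "gigaam_cache"


def compile_with_cache(model_xml: str, device: str = "NPU"):
    """Compile the encoder; later runs can reuse the blob stored in CACHE_DIR."""
    import openvino as ov  # lazy import so resolve_cache_dir works without OpenVINO

    cache_dir = resolve_cache_dir()
    cache_dir.mkdir(parents=True, exist_ok=True)
    core = ov.Core()
    core.set_property({"CACHE_DIR": str(cache_dir)})
    return core.compile_model(model_xml, device)
```

The first call pays the full compile cost; the cache only helps from the second start onward, and only while the model, driver and plugin versions stay the same.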

Fast compile — required NPU properties

Set these properties on ov.Core before calling compile_model. On the same hybrid IR they cut first-compile time from ~92 min to ~3.5 min (about 26× faster) with no accuracy impact, since the weights are unchanged and only the compilation strategy differs:

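import openvino as ov

core = ov.Core()
# Set on the NPU device before calling core.compile_model(..., "NPU").
core.set_property("NPU", {
    "PERFORMANCE_HINT":                "LATENCY",
    "MODEL_PRIORITY":                  "HIGH",
    "NPU_TURBO":                       "YES",
    "NPU_QDQ_OPTIMIZATION_AGGRESSIVE": "YES",
    "COMPILATION_NUM_THREADS":         8,
})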

Measured on 285H AI Boost, OpenVINO 2025.4.1, driver 32.0.100.4023:

Setup                First compile   Warm (cache)
Default properties   ~92 min         ~1 s
Above knob set       ~3.5 min        ~1 s

These knobs are listed in the NPU plugin's SUPPORTED_PROPERTIES. Because they are set on the NPU device only, they do not affect CPU or GPU targets; Voice Scribe's gigaam_backend.py sets them under try/except so that older drivers or plugins without support fall back gracefully.
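
gigaam_backend.py is not reproduced here, so the following is only a sketch of what a graceful fallback could look like: each property is set on its own, so one unsupported key does not prevent the others from being applied. The function name apply_npu_properties is hypothetical.

```python
def apply_npu_properties(core, properties: dict) -> dict:
    """Set each NPU property separately and record which ones were accepted.

    `core` is expected to be an openvino.Core instance.
    Returns a {property_name: bool} map, True where set_property succeeded.
    """
    applied = {}
    for key, value in properties.items():
        try:
            core.set_property("NPU", {key: value})
            applied[key] = True
        except Exception:  # e.g. an older plugin or driver rejecting the key
            applied[key] = False
    return applied
```

Logging the returned map at startup makes it easy to see which knobs the installed driver accepted, which matters when a slow first compile needs to be explained.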

If 3.5 minutes is still too slow for your deployment, fall back to the canonical Arc iGPU build at Andrewsab/gigaam-v3-e2e-rnnt-ov (15 s compile, 100% match with the CPU FP32 reference on the benchmark above, but it does not run on NPU).

How it was built

Pipeline (reproducer in github.com/andrewsabn/voice-scribe under scratch/):

  1. transformers.AutoModel.from_pretrained("ai-sage/GigaAM-v3", revision="e2e_rnnt", trust_remote_code=True).to_onnx()
  2. openvino.convert_model(onnx, input=[("audio_signal", [1, 64, 3000], f32), ("length", [1], i64)])
  3. Stratified calibration dataset: 740 Russian voice clips bucketed by duration (2 s / 5 s / 8 s / 12 s / 15 s).
  4. nncf.quantize_with_accuracy_control(model, calibration_dataset, validation_dataset, validation_fn, max_drop=0.02, drop_type=DropType.ABSOLUTE, subset_size=300) with validation_fn returning mean cosine similarity between the quantized and FP32 encoder outputs over 20 held-out samples.
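
Step 4 can be sketched roughly as follows. This is not the reproducer from the repository: it assumes each validation item is a pair (inputs, fp32_reference_output) with the FP32 encoder output precomputed, that calibration items are plain input dicts, and that the first model output is the one compared. The helper names are hypothetical.

```python
import numpy as np


def mean_cosine_similarity(outputs, references) -> float:
    """Mean cosine similarity between paired, flattened output tensors."""
    sims = []
    for out, ref in zip(outputs, references):
        a = np.asarray(out, dtype=np.float64).ravel()
        b = np.asarray(ref, dtype=np.float64).ravel()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        sims.append(float(a @ b / denom) if denom else 0.0)
    return float(np.mean(sims))


def validation_fn(compiled_model, validation_items) -> float:
    """Called by NNCF with a compiled model and the raw validation items.

    Assumption: each item is (inputs, fp32_reference_output).
    """
    outputs, references = [], []
    for inputs, reference in validation_items:
        outputs.append(compiled_model(inputs)[0])  # first output of the encoder
        references.append(reference)
    return mean_cosine_similarity(outputs, references)


def quantize_encoder(model, calibration_inputs, validation_items):
    """Run accuracy-aware quantization with the settings listed in step 4."""
    import nncf  # lazy import so the helpers above work without NNCF installed

    return nncf.quantize_with_accuracy_control(
        model,
        calibration_dataset=nncf.Dataset(calibration_inputs),
        validation_dataset=nncf.Dataset(validation_items, lambda item: item[0]),
        validation_fn=validation_fn,
        max_drop=0.02,
        drop_type=nncf.DropType.ABSOLUTE,
        preset=nncf.QuantizationPreset.MIXED,
        subset_size=300,
    )
```

Since cosine similarity against the FP32 output is at most 1.0, an absolute max_drop of 0.02 in this setup corresponds to accepting a mean similarity of roughly 0.98 or higher on the held-out samples.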

NNCF 3.1.0, OpenVINO 2025.4.

Usage (with Voice Scribe)

Place this repo's two files (v3_e2e_rnnt_encoder_static.xml and v3_e2e_rnnt_encoder_static.bin) in the same directory as the canonical GigaAM payload from Andrewsab/gigaam-v3-e2e-rnnt-ov:

models/gigaam-v3-e2e-rnnt-ov/
├── v3_e2e_rnnt_encoder.xml         # canonical (Arc/CPU, dynamic shape)
├── v3_e2e_rnnt_encoder.bin
├── v3_e2e_rnnt_encoder_static.xml  # this repo (NPU-calibrated)
├── v3_e2e_rnnt_encoder_static.bin
├── v3_e2e_rnnt_decoder.xml / .bin
├── v3_e2e_rnnt_joint.xml / .bin
└── tokenizer.model

Voice Scribe's GigaAM backend selects the encoder automatically: the static-shape encoder on NPU, the dynamic encoder on Arc/CPU. To enable NPU execution, set DEVICE_GIGAAM=NPU in C:\ProgramData\Voice Scribe\config.env and restart the service, or pass /GIGAAM=yes /GIGAAM_NPU=yes to the installer.
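
For integrations outside Voice Scribe, the same selection rule can be sketched in a few lines. The file names come from the directory listing above; the function name pick_encoder and the exact matching rule are assumptions, not Voice Scribe's actual code.

```python
from pathlib import Path


def pick_encoder(model_dir: str, device: str) -> Path:
    """Return the static-shape encoder for NPU and the dynamic one otherwise."""
    root = Path(model_dir)
    if device.upper().startswith("NPU"):
        return root / "v3_e2e_rnnt_encoder_static.xml"
    return root / "v3_e2e_rnnt_encoder.xml"
```

For example, pick_encoder("models/gigaam-v3-e2e-rnnt-ov", "NPU") points at the NPU-calibrated IR, while any other device string resolves to the canonical dynamic-shape encoder.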

License

MIT (inherits from upstream ai-sage/GigaAM-v3).
