# MedASR INT8 Quantized (ONNX)

Quantized version of `google/medasr` for on-device deployment.
## Models
| File | Format | Size | WER (LibriSpeech test-clean) |
|---|---|---|---|
| `medasr_int8_dynamic.onnx` | INT8 Dynamic | 101 MB | 23.5% |
| `medasr_fp16.onnx` | FP16 | 201 MB | – |
## Key Details
- Base model: `google/medasr` (105M params, LASR/Conformer architecture)
- Quantization: ONNX Runtime dynamic INT8
- Input: Mel spectrogram (`input_features`: `[batch, time, 128]`) + `attention_mask`
- Output: CTC logits (512 vocab) → requires beam search decoding
- Original FP32 size: 402 MB → INT8: 101 MB (4x reduction)
- WER delta vs FP32: +1.1%
- Token-level agreement with FP32: 97.4%
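To illustrate where the 4x size reduction comes from, here is a minimal numpy sketch of symmetric per-tensor INT8 weight quantization. This is only the basic scheme; ONNX Runtime's dynamic mode additionally computes activation scales at runtime and may use per-channel scales, so treat this as an illustration rather than the exact implementation:

```python
import numpy as np

# A dummy FP32 weight matrix standing in for a model layer.
rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128)).astype(np.float32)

# Symmetric per-tensor quantization: one scale maps [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(w).max() / 127.0
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # 1 byte/weight vs 4

# Dequantize to measure the rounding error, which is bounded by half a quantization step.
w_dq = w_q.astype(np.float32) * scale
print(w_q.dtype, float(np.abs(w - w_dq).max()))
```

Each weight shrinks from 4 bytes (FP32) to 1 byte (INT8), which matches the 402 MB → 101 MB figure above.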
## Usage

```python
import onnxruntime as ort
import librosa
import numpy as np
from transformers import AutoProcessor
from pyctcdecode import build_ctcdecoder
from huggingface_hub import hf_hub_download

# Download the quantized model
model_path = hf_hub_download("whitelotus0/medasr-int8-onnx", "medasr_int8_dynamic.onnx")

# Load the processor from the original model
processor = AutoProcessor.from_pretrained("google/medasr", trust_remote_code=True)

# Build the CTC decoder from the tokenizer vocabulary (index 0 is the blank label)
vocab = processor.tokenizer.get_vocab()
sorted_vocab = sorted(vocab.items(), key=lambda x: x[1])
labels = [""] + [sorted_vocab[i][0] for i in range(1, 512)]
decoder = build_ctcdecoder(labels=labels)

# Load audio at 16 kHz
audio, sr = librosa.load("audio.wav", sr=16000)

# Run inference, feeding only the inputs the ONNX graph actually declares
session = ort.InferenceSession(model_path)
inputs = processor(audio, sampling_rate=16000, return_tensors="np", padding=True)
input_names = [inp.name for inp in session.get_inputs()]
feed = {name: inputs[name] for name in input_names}
logits = session.run(None, feed)[0]

# Beam search decode the CTC logits
text = decoder.decode(logits[0])
print(text)
```
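If you want to sanity-check the session's expected tensor shapes without loading the processor, a dummy feed can be built from the I/O description above. The `attention_mask` dtype here is an assumption (integer masks are common, but verify against `session.get_inputs()` for this model):

```python
import numpy as np

T = 200  # number of mel frames (arbitrary for this shape check)

# Dummy feed matching the documented input layout: [batch, time, 128] mel features.
dummy_feed = {
    "input_features": np.zeros((1, T, 128), dtype=np.float32),
    "attention_mask": np.ones((1, T), dtype=np.int64),  # dtype is an assumption
}
print(dummy_feed["input_features"].shape)  # (1, 200, 128)
```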
## Deployment
Designed for offline mobile deployment via:
- ONNX Runtime Mobile (Android NNAPI / iOS CoreML)
- sherpa-onnx SDK (recommended: handles audio preprocessing + CTC decoding)
## Notes
- WER benchmarked on LibriSpeech test-clean (general English). MedASR is optimized for medical dictation where Google reports ~5% WER.
- CTC beam search decoding is required. Greedy argmax produces stuttered/repeated tokens.
- Static INT8 quantization requires 50+ calibration samples for good results.
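The greedy-decoding caveat above can be seen on a toy example: raw per-frame argmax repeats a token for every frame it spans, which is why it reads as stuttered. Proper CTC post-processing merges repeats and drops blanks, but even then it discards the rest of the distribution that beam search exploits. A minimal numpy sketch with a hypothetical 3-token vocabulary (blank = index 0):

```python
import numpy as np

# Toy CTC logits: 6 frames, vocab {0: blank, 1: 'a', 2: 'b'} (hypothetical).
logits = np.array([
    [0.1, 2.0, 0.0],   # 'a'
    [0.1, 2.0, 0.0],   # 'a' again: same emission spans two frames
    [2.0, 0.1, 0.0],   # blank
    [0.0, 0.1, 2.0],   # 'b'
    [0.0, 0.1, 2.0],   # 'b' again
    [2.0, 0.0, 0.1],   # blank
])

ids = logits.argmax(axis=-1)  # raw frame-level argmax: [1 1 0 2 2 0] -> "aa bb" stutter

# CTC collapse: merge consecutive repeats, then remove blanks.
collapsed = [i for i, prev in zip(ids, np.r_[-1, ids[:-1]]) if i != prev]
decoded = [i for i in collapsed if i != 0]
print(decoded)  # [1, 2] -> "ab"
```

Beam search (as done by pyctcdecode in the Usage example) goes further by scoring many collapse hypotheses against the full per-frame distributions instead of committing to the single argmax path.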