cohere-transcribe-03-2026 – INT8 ONNX
INT8 dynamic quantization of CohereLabs/cohere-transcribe-03-2026 (2B-parameter multilingual ASR) exported to ONNX.
Runs on CPU, Apple Silicon, and GPU via ONNX Runtime; no PyTorch is required at inference time.
| | Original (safetensors) | FP32 ONNX | INT8 ONNX | INT4 ONNX |
|---|---|---|---|---|
| Size | 3.80 GB | 7.79 GB | 2.69 GB | 1.95 GB |
| vs Original | baseline | +105% | -29% | -49% |
| Avg WER (5 datasets) | – | 15.55% | 15.77% | 21.81% |
| ΔWER vs FP32 | – | – | +0.22% | +6.26% |
| Speed (M-series CPU) | – | 0.8 s/sample | 0.6 s/sample | 1.4 s/sample |
| Runtime | PyTorch | ONNX Runtime | ONNX Runtime | ONNX Runtime |
Deployment Options
| Platform | Supported | Provider | Notes |
|---|---|---|---|
| CPU (x86/ARM) | Yes | CPUExecutionProvider | Best general-purpose option. INT8 runs ~25% faster than FP32. |
| Apple Silicon (M1/M2/M3/M4) | Yes | CoreMLExecutionProvider or CPUExecutionProvider | Native ARM NEON acceleration. Unified memory means no CPU-GPU transfer cost. |
| NVIDIA GPU (CUDA) | Yes | CUDAExecutionProvider | Requires onnxruntime-gpu. Best for batch/server workloads. |
| NVIDIA TensorRT | Yes | TensorrtExecutionProvider | Maximum GPU throughput. Requires TensorRT installed. |
| DirectML (Windows GPU) | Yes | DmlExecutionProvider | AMD/Intel/NVIDIA GPUs on Windows via DirectX 12. |
| GGUF / llama.cpp | No | – | Not applicable: this is a Conformer ASR model, not an LLM. No GGUF conversion exists. |
| Web / WASM | Experimental | WasmExecutionProvider | Via onnxruntime-web. Model size may be prohibitive for browser use. |
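With several providers to choose from, a small helper can pick the fastest one that is actually installed. This is a sketch, not part of this repo; pass it the result of `onnxruntime.get_available_providers()`:

```python
# Preference order: most specialized first, CPU always last as fallback.
PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CoreMLExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
]

def pick_providers(available):
    """Filter the preference list down to providers that are installed.

    `available` should be onnxruntime.get_available_providers().
    """
    chosen = [p for p in PREFERRED if p in available]
    return chosen or ["CPUExecutionProvider"]
```

Usage: `ort.InferenceSession(path, providers=pick_providers(ort.get_available_providers()))`.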
Quick Start
Install
pip install onnxruntime numpy soundfile transformers
For GPU support:
pip install onnxruntime-gpu # replaces onnxruntime
For Apple Silicon CoreML acceleration:
pip install onnxruntime # CoreML provider is included by default on macOS
Download
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="vigneshlabs/cohere-transcribe-03-2026-int8-onnx",
    local_dir="./cohere-int8",
)
Or with Git LFS:
git lfs install
git clone https://huggingface.co/vigneshlabs/cohere-transcribe-03-2026-int8-onnx
Files
| File | Size | Description |
|---|---|---|
| cohere-encoder.int8.onnx | 101 MB | Conformer encoder graph (INT8) |
| cohere-encoder.int8.onnx.data | 2.4 GB | Encoder weights (external data) |
| cohere-decoder.int8.onnx | 569 KB | Transformer decoder graph (INT8) |
| cohere-decoder.int8.onnx.data | 146 MB | Decoder weights (external data) |
| tokens.txt | 203 KB | Vocabulary (16,384 tokens) |
Total: 2.69 GB
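Because the graphs store their weights as ONNX external data, each `.onnx` file must sit next to its `.onnx.data` sibling or session creation will fail. A quick sanity check (the helper name is illustrative, not part of this repo):

```python
def missing_external_data(filenames):
    """Given a directory listing, return the .onnx.data names that
    should accompany each .onnx graph but are absent.

    Assumes the convention used here: weights for X.onnx live in X.onnx.data.
    """
    names = set(filenames)
    return [f + ".data" for f in sorted(names)
            if f.endswith(".onnx") and f + ".data" not in names]
```

Usage: `missing_external_data(os.listdir("./cohere-int8"))` should return an empty list after a complete download.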
Usage Examples
1. CPU Inference (any platform)
import numpy as np
import onnxruntime as ort
import soundfile as sf
from transformers import AutoProcessor

# ── Config ──────────────────────────────────────────────────────────
MODEL_DIR = "./cohere-int8"
DECODER_START_ID = 13764
EOS_ID = 3
EN_LANG_ID = 62  # see Language Tokens section below
MAX_NEW_TOKENS = 256

# ── Tokenizer ───────────────────────────────────────────────────────
id_to_token = {}
with open(f"{MODEL_DIR}/tokens.txt") as f:
    for line in f:
        parts = line.rstrip("\n").split(" ")
        id_to_token[int(parts[-1])] = " ".join(parts[:-1])

def detokenize(ids):
    pieces = [id_to_token.get(i, "") for i in ids
              if not (id_to_token.get(i, "").startswith("<|")
                      and id_to_token.get(i, "").endswith("|>"))]
    return "".join(pieces).replace("▁", " ").strip()

# ── Load processor (for mel spectrogram extraction) ─────────────────
processor = AutoProcessor.from_pretrained(
    "CohereLabs/cohere-transcribe-03-2026", trust_remote_code=True
)

# ── Load ONNX sessions on CPU ───────────────────────────────────────
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
encoder = ort.InferenceSession(
    f"{MODEL_DIR}/cohere-encoder.int8.onnx", opts,
    providers=["CPUExecutionProvider"]
)
decoder = ort.InferenceSession(
    f"{MODEL_DIR}/cohere-decoder.int8.onnx", opts,
    providers=["CPUExecutionProvider"]
)

# ── Load & preprocess audio ─────────────────────────────────────────
audio, sr = sf.read("audio.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # stereo → mono
mel = processor(
    audio.astype(np.float32), sampling_rate=sr, return_tensors="np"
)["input_features"].astype(np.float32)

# ── Encode ──────────────────────────────────────────────────────────
cross_k, cross_v = encoder.run(
    ["n_layer_cross_k", "n_layer_cross_v"], {"mel": mel}
)

# ── Decode (greedy) ─────────────────────────────────────────────────
n_layers = cross_k.shape[0]  # 8
self_k = np.zeros((n_layers, 1, 8, 1024, 128), dtype=np.float32)
self_v = np.zeros((n_layers, 1, 8, 1024, 128), dtype=np.float32)

tokens = np.array([[DECODER_START_ID, EN_LANG_ID]], dtype=np.int64)
offset = np.array(0, dtype=np.int64)
generated = []

for _ in range(MAX_NEW_TOKENS):
    logits, self_k, self_v = decoder.run(
        ["logits", "out_n_layer_self_k_cache", "out_n_layer_self_v_cache"],
        {
            "tokens": tokens,
            "in_n_layer_self_k_cache": self_k,
            "in_n_layer_self_v_cache": self_v,
            "n_layer_cross_k": cross_k,
            "n_layer_cross_v": cross_v,
            "offset": offset,
        },
    )
    next_token = int(np.argmax(logits[0, -1, :]))
    if next_token == EOS_ID:
        break
    generated.append(next_token)
    offset = np.array(int(offset) + tokens.shape[1], dtype=np.int64)
    tokens = np.array([[next_token]], dtype=np.int64)

print(detokenize(generated))
2. Apple Silicon (CoreML acceleration)
import onnxruntime as ort
# Use the CoreML provider: accelerates via Apple Neural Engine / GPU
encoder = ort.InferenceSession(
"cohere-encoder.int8.onnx",
providers=["CoreMLExecutionProvider", "CPUExecutionProvider"]
)
decoder = ort.InferenceSession(
"cohere-decoder.int8.onnx",
providers=["CoreMLExecutionProvider", "CPUExecutionProvider"]
)
# Rest of the inference code is identical to the CPU example above.
# CoreML will automatically accelerate supported ops on the Neural Engine.
# Unsupported ops fall back to CPU transparently.
Tip: On Apple Silicon, even CPUExecutionProvider is fast because ARM NEON vectorization handles INT8 ops natively. CoreML adds acceleration for the float ops via the Neural Engine.
3. NVIDIA GPU (CUDA)
pip install onnxruntime-gpu
import onnxruntime as ort
# Use CUDA provider
encoder = ort.InferenceSession(
"cohere-encoder.int8.onnx",
providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)
decoder = ort.InferenceSession(
"cohere-decoder.int8.onnx",
providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)
# Rest of the inference code is identical.
# Tensors are transferred to GPU automatically by ONNX Runtime.
4. NVIDIA TensorRT (maximum GPU throughput)
pip install onnxruntime-gpu # TensorRT must also be installed system-wide
import onnxruntime as ort
# TensorRT will compile an optimized engine on first run (cached afterward)
encoder = ort.InferenceSession(
"cohere-encoder.int8.onnx",
providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
)
decoder = ort.InferenceSession(
"cohere-decoder.int8.onnx",
providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
)
# Rest of the inference code is identical.
# First inference is slow (TensorRT engine compilation), subsequent runs are fast.
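To avoid paying the engine-compilation cost on every process start, the TensorRT provider can persist compiled engines to disk. A configuration sketch using ONNX Runtime's documented TensorRT provider options (the cache path is an example):

```python
import onnxruntime as ort

# Persist compiled TensorRT engines so only the first-ever run is slow.
trt_options = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "./trt_cache",
}
encoder = ort.InferenceSession(
    "cohere-encoder.int8.onnx",
    providers=[
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)
```

Subsequent processes reuse the cached engine from `./trt_cache` instead of recompiling.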
5. Windows GPU (DirectML β AMD / Intel / NVIDIA)
pip install onnxruntime-directml
import onnxruntime as ort
encoder = ort.InferenceSession(
"cohere-encoder.int8.onnx",
providers=["DmlExecutionProvider", "CPUExecutionProvider"]
)
decoder = ort.InferenceSession(
"cohere-decoder.int8.onnx",
providers=["DmlExecutionProvider", "CPUExecutionProvider"]
)
# Rest of the inference code is identical.
Evaluation Results
Tested on 30 samples each across 5 diverse English benchmarks (n=150 total).
FP32 vs INT8 – Word Error Rate
LibriSpeech test-clean (read speech, studio quality)
FP32 █ 0.99%
INT8 █ 0.99%   Δ 0.00%
VoxPopuli (parliament, formal speech)
FP32 ██████ 12.16%
INT8 ██████ 12.16%   Δ 0.00%
AMI IHM (meeting, conversational)
FP32 ████████ 16.24%
INT8 ████████ 15.38%   Δ -0.85%
People's Speech (diverse, crowd-sourced)
FP32 ███████████ 21.97%
INT8 ███████████ 22.42%   Δ +0.45%
Earnings22 (financial, spontaneous)
FP32 █████████████ 26.38%
INT8 ██████████████ 27.87%   Δ +1.49%
AVERAGE
FP32 ████████ 15.55%
INT8 ████████ 15.77%   Δ +0.22%
Full Metrics Table
| Dataset | Domain | FP32 WER | INT8 WER | ΔWER | FP32 CER | INT8 CER | ΔCER |
|---|---|---|---|---|---|---|---|
| LibriSpeech test-clean | Read speech | 0.99% | 0.99% | +0.00% | 0.37% | 0.37% | +0.00% |
| VoxPopuli | Parliament | 12.16% | 12.16% | +0.00% | 5.21% | 5.28% | +0.07% |
| AMI IHM | Meeting | 16.24% | 15.38% | -0.85% | 12.25% | 11.33% | -0.91% |
| People's Speech | Diverse | 21.97% | 22.42% | +0.45% | 9.21% | 9.38% | +0.17% |
| Earnings22 | Financial | 26.38% | 27.87% | +1.49% | 16.23% | 18.60% | +2.37% |
| Average | – | 15.55% | 15.77% | +0.22% | 8.65% | 8.99% | +0.34% |
Inference Speed (Apple M-series CPU)
| Dataset | FP32 | INT8 | Speedup |
|---|---|---|---|
| VoxPopuli | 1.1s | 0.9s | 1.2x |
| Earnings22 | 0.8s | 0.6s | 1.3x |
| AMI IHM | 0.3s | 0.2s | 1.5x |
| LibriSpeech | 1.0s | 0.8s | 1.3x |
| People's Speech | 0.7s | 0.6s | 1.2x |
| Average | 0.8s | 0.6s | 1.3x |
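Per-sample latencies like those above can be measured with a simple warmup-then-average harness. This is a generic sketch, not the exact benchmark script used for the table:

```python
import time

def avg_latency(run_once, n_warmup=3, n_runs=10):
    """Average wall-clock seconds per call, after warmup runs.

    `run_once` should perform one full encode + greedy decode
    on a fixed audio sample.
    """
    for _ in range(n_warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(n_runs):
        run_once()
    return (time.perf_counter() - start) / n_runs
```

Warmup matters here: the first calls include session initialization and memory allocation that would otherwise skew the average.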
Model Size Comparison
Original (safetensors) ██████████████████████████████████████ 3.80 GB (baseline)
FP32 ONNX ██████████████████████████████████████████████████████████████████████████████ 7.79 GB (+105%)
INT8 ONNX ███████████████████████████ 2.69 GB (-29%)
INT4 ONNX ████████████████████ 1.95 GB (-49%)
The ONNX FP32 export is larger than the original safetensors because it stores explicit cross-attention K/V projections for all decoder layers (sherpa-onnx Whisper tensor contract).
Language Tokens
This model supports 15 languages. Set the language token ID when decoding:
| Language | Code | Token ID |
|---|---|---|
| Arabic | ar | 28 |
| Chinese | zh | 50 |
| Dutch | nl | 60 |
| English | en | 62 |
| French | fr | 69 |
| German | de | 76 |
| Italian | it | 97 |
| Japanese | ja | 98 |
| Korean | ko | 110 |
| Polish | pl | 148 |
| Portuguese | pt | 149 |
| Russian | ru | 155 |
| Spanish | es | 169 |
| Turkish | tr | 186 |
| Vietnamese | vi | 194 |
# Example: transcribe French audio
FR_LANG_ID = 69
tokens = np.array([[DECODER_START_ID, FR_LANG_ID]], dtype=np.int64)
Full token list available in tokens.txt; look for <|en|>, <|fr|>, etc.
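Rather than hard-coding IDs, the language-token IDs can be recovered from tokens.txt itself, since each line is `<token> <id>` and language tokens look like `<|en|>`. A sketch (the helper name is illustrative):

```python
def language_token_ids(lines):
    """Map 2-letter language codes to token IDs from tokens.txt lines.

    Each line has the form "<token> <id>"; language tokens are <|xx|>.
    """
    lang_ids = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip malformed/blank lines
        token, tok_id = " ".join(parts[:-1]), int(parts[-1])
        if token.startswith("<|") and token.endswith("|>"):
            code = token[2:-2]
            if len(code) == 2:  # keep only language codes, not control tokens
                lang_ids[code] = tok_id
    return lang_ids
```

Usage: `language_token_ids(open("tokens.txt"))` should reproduce the table above.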
Architecture
The base model is a custom cohere_asr architecture (not Whisper) with:
- Encoder: 48-layer Conformer with ConvSubsampling frontend (1280 hidden dim, 8 heads)
- Decoder: 8-layer Transformer with cross-attention (1024 hidden dim, 8 heads)
- Vocab: 16,384 SentencePiece tokens
- Audio: 16 kHz, 128 mel bins, max 35 seconds
The ONNX export splits the model into two graphs:
┌──────────────────────────────────────────────────────┐
│ Encoder                                              │
│   Input:  mel (batch, 128, T)                        │
│   Output: n_layer_cross_k (8, batch, T_enc, 1024)    │
│           n_layer_cross_v (8, batch, T_enc, 1024)    │
└──────────────────────────┬───────────────────────────┘
                           │ cross-attention K/V
┌──────────────────────────┴───────────────────────────┐
│ Decoder (autoregressive)                             │
│   Input:  tokens (batch, n_tokens)                   │
│           self_k/v caches (8, batch, 8, 1024, 128)   │
│           cross_k/v from encoder                     │
│           offset (scalar)                            │
│   Output: logits (batch, n_tokens, 16384)            │
│           updated self_k/v caches                    │
└──────────────────────────────────────────────────────┘
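The decoder's self-attention caches must be allocated up front with the shape shown in the diagram, (n_layers, batch, n_heads, max_len, head_dim). A small helper matching that contract (the function name is illustrative):

```python
import numpy as np

def init_self_caches(n_layers=8, batch=1, n_heads=8,
                     max_len=1024, head_dim=128):
    """Zero-initialized decoder self-attention K/V caches.

    Shape follows the decoder contract:
    (n_layers, batch, n_heads, max_len, head_dim).
    """
    shape = (n_layers, batch, n_heads, max_len, head_dim)
    return (np.zeros(shape, dtype=np.float32),
            np.zeros(shape, dtype=np.float32))
```

The defaults match this model; max_len=1024 bounds the number of decode steps the cache can hold.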
Quantization Details
Quantized using ONNX Runtime dynamic quantization:
- Method: Dynamic INT8 – weights stored as INT8, activations computed in FP32
- No calibration data required – weight-only quantization
- 149 Conv/BatchNorm nodes protected – the Conformer encoder's audio frontend is excluded from quantization (highly sensitive to numeric precision)
- External data format – encoder weights stored in a separate .onnx.data file (protobuf 2 GB limit)
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="cohere-encoder.onnx",
    model_output="cohere-encoder.int8.onnx",
    weight_type=QuantType.QInt8,
    nodes_to_exclude=conv_and_batchnorm_nodes,  # 149 nodes protected
    use_external_data_format=True,
)
Limitations
- Max audio length: 35 seconds per clip. Longer audio must be chunked.
- Spontaneous speech: Higher WER on spontaneous/noisy speech (~27% on Earnings22). This reflects the base model's behavior, not a quantization artifact.
- Non-English: Evaluation was English-only. Other languages may behave differently under INT8.
- Greedy decoding: Examples use argmax decoding. Beam search is not implemented here.
- No GGUF: This is a Conformer encoder-decoder ASR model. GGUF/llama.cpp is for autoregressive LLMs and does not support this architecture.
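For audio beyond the 35-second limit, each clip can be split into overlapping windows and transcribed chunk by chunk. A sketch (the overlap values are illustrative, and merging overlapping transcripts is left naive here):

```python
def chunk_audio(audio, sr=16_000, chunk_s=30.0, overlap_s=2.0):
    """Split a mono waveform into chunks that fit the 35 s model limit.

    Consecutive chunks overlap by overlap_s seconds; transcribe each
    chunk separately and join the texts.
    """
    size = int(chunk_s * sr)
    step = int((chunk_s - overlap_s) * sr)
    if len(audio) <= size:
        return [audio]
    return [audio[i:i + size] for i in range(0, len(audio), step)]
```

The overlap gives each chunk a little shared context so words cut at a boundary appear intact in at least one chunk; a smarter merge would deduplicate the overlapping words.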
Evaluation Setup
- Hardware: Apple M-series CPU, 24 GB unified memory
- Runtime: ONNX Runtime 1.19, CPUExecutionProvider, ORT_ENABLE_ALL
- Metrics: WER/CER via jiwer, hypotheses lowercased
- Samples: 30 per dataset (n=150 total), streamed from HuggingFace parquet
- Datasets:
  - openslr/librispeech_asr (clean, test) – read speech
  - facebook/voxpopuli (en, test) – parliament speeches
  - edinburghcstr/ami (ihm, test) – meeting conversations
  - MLCommons/peoples_speech (clean, test) – diverse crowd-sourced
  - distil-whisper/earnings22 (test) – financial calls
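The reported numbers come from jiwer, but the metric itself is simple: WER is the word-level edit distance divided by the reference word count. An illustrative pure-Python reimplementation (not the jiwer call used in the evaluation); it lowercases both sides, matching the setup above:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # One-row dynamic program: d[j] = edit distance(ref[:i], hyp[:j]).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,               # delete reference word
                d[j - 1] + 1,           # insert hypothesis word
                prev_diag + (r != h),   # substitute (free if words match)
            )
    return d[-1] / max(len(ref), 1)
```

For example, one substitution in a four-word reference gives 0.25 (25% WER).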
Citation
@misc{cohere-transcribe-03-2026,
  author = {CohereLabs},
  title  = {cohere-transcribe-03-2026},
  year   = {2026},
  url    = {https://huggingface.co/CohereLabs/cohere-transcribe-03-2026}
}
License
This model inherits the license of the base model: CC BY-NC 4.0. Non-commercial use only.