cohere-transcribe-03-2026 – INT8 ONNX
INT8 dynamic quantization of CohereLabs/cohere-transcribe-03-2026 (2B-parameter multilingual ASR) exported to ONNX.
Runs on CPU, Apple Silicon, and GPU via ONNX Runtime; no PyTorch is required at inference time.
| | Original (safetensors) | FP32 ONNX | INT8 ONNX | INT4 ONNX |
|---|---|---|---|---|
| Size | 3.80 GB | 7.79 GB | 2.69 GB | 1.95 GB |
| vs Original | baseline | +105% | -29% | -49% |
| Avg WER (5 datasets) | – | 15.55% | 15.77% | 21.81% |
| ΔWER vs FP32 | – | – | +0.22% | +6.26% |
| Speed (M-series CPU) | – | 0.8 s/sample | 0.6 s/sample | 1.4 s/sample |
| Runtime | PyTorch | ONNX Runtime | ONNX Runtime | ONNX Runtime |
Deployment Options
| Platform | Supported | Provider | Notes |
|---|---|---|---|
| CPU (x86/ARM) | Yes | CPUExecutionProvider | Best general-purpose option. INT8 runs ~25% faster than FP32. |
| Apple Silicon (M1/M2/M3/M4) | Yes | CoreMLExecutionProvider or CPUExecutionProvider | Native ARM NEON acceleration. Unified memory means no CPU-GPU transfer cost. |
| NVIDIA GPU (CUDA) | Yes | CUDAExecutionProvider | Requires onnxruntime-gpu. Best for batch/server workloads. |
| NVIDIA TensorRT | Yes | TensorrtExecutionProvider | Maximum GPU throughput. Requires TensorRT installed. |
| DirectML (Windows GPU) | Yes | DmlExecutionProvider | AMD/Intel/NVIDIA GPUs on Windows via DirectX 12. |
| GGUF / llama.cpp | No | – | Not applicable: this is a Conformer ASR model, not an LLM. No GGUF conversion exists. |
| Web / WASM | Experimental | WasmExecutionProvider | Via onnxruntime-web. Model size may be prohibitive for browser use. |
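With several providers to choose from, a small helper can pick the fastest one that is actually installed. This is a sketch, not part of this repo; pass it the result of `onnxruntime.get_available_providers()`:

```python
# Preference order: most specialized first, CPU always last as fallback.
PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CoreMLExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
]

def pick_providers(available):
    """Filter the preference list down to providers that are installed.

    `available` should be onnxruntime.get_available_providers().
    """
    chosen = [p for p in PREFERRED if p in available]
    return chosen or ["CPUExecutionProvider"]
```

Usage: `ort.InferenceSession(path, providers=pick_providers(ort.get_available_providers()))`.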
Quick Start
Install
pip install onnxruntime numpy soundfile transformers
For GPU support:
pip install onnxruntime-gpu # replaces onnxruntime
For Apple Silicon CoreML acceleration:
pip install onnxruntime # CoreML provider is included by default on macOS
Download
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="vigneshlabs/cohere-transcribe-03-2026-int8-onnx",
    local_dir="./cohere-int8",
)
Or with Git LFS:
git lfs install
git clone https://huggingface.co/vigneshlabs/cohere-transcribe-03-2026-int8-onnx
Files
| File | Size | Description |
|---|---|---|
| cohere-encoder.int8.onnx | 101 MB | Conformer encoder graph (INT8) |
| cohere-encoder.int8.onnx.data | 2.4 GB | Encoder weights (external data) |
| cohere-decoder.int8.onnx | 569 KB | Transformer decoder graph (INT8) |
| cohere-decoder.int8.onnx.data | 146 MB | Decoder weights (external data) |
| tokens.txt | 203 KB | Vocabulary (16,384 tokens) |
Total: 2.69 GB
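Because the graphs store their weights as ONNX external data, each `.onnx` file must sit next to its `.onnx.data` sibling or session creation will fail. A quick sanity check (the helper name is illustrative, not part of this repo):

```python
def missing_external_data(filenames):
    """Given a directory listing, return the .onnx.data names that
    should accompany each .onnx graph but are absent.

    Assumes the convention used here: weights for X.onnx live in X.onnx.data.
    """
    names = set(filenames)
    return [f + ".data" for f in sorted(names)
            if f.endswith(".onnx") and f + ".data" not in names]
```

Usage: `missing_external_data(os.listdir("./cohere-int8"))` should return an empty list after a complete download.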
Usage Examples
1. CPU Inference (any platform)
import numpy as np
import onnxruntime as ort
import soundfile as sf
from transformers import AutoProcessor

# ── Config ──────────────────────────────────────────────────────────
MODEL_DIR = "./cohere-int8"
DECODER_START_ID = 13764
EOS_ID = 3
EN_LANG_ID = 62  # see Language Tokens section below
MAX_NEW_TOKENS = 256

# ── Tokenizer ───────────────────────────────────────────────────────
id_to_token = {}
with open(f"{MODEL_DIR}/tokens.txt") as f:
    for line in f:
        parts = line.rstrip("\n").split(" ")
        id_to_token[int(parts[-1])] = " ".join(parts[:-1])

def detokenize(ids):
    pieces = [id_to_token.get(i, "") for i in ids
              if not (id_to_token.get(i, "").startswith("<|")
                      and id_to_token.get(i, "").endswith("|>"))]
    return "".join(pieces).replace("▁", " ").strip()

# ── Load processor (for mel spectrogram extraction) ─────────────────
processor = AutoProcessor.from_pretrained(
    "CohereLabs/cohere-transcribe-03-2026", trust_remote_code=True
)

# ── Load ONNX sessions on CPU ───────────────────────────────────────
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
encoder = ort.InferenceSession(
    f"{MODEL_DIR}/cohere-encoder.int8.onnx", opts,
    providers=["CPUExecutionProvider"]
)
decoder = ort.InferenceSession(
    f"{MODEL_DIR}/cohere-decoder.int8.onnx", opts,
    providers=["CPUExecutionProvider"]
)

# ── Load & preprocess audio ─────────────────────────────────────────
audio, sr = sf.read("audio.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # stereo → mono
mel = processor(
    audio.astype(np.float32), sampling_rate=sr, return_tensors="np"
)["input_features"].astype(np.float32)

# ── Encode ──────────────────────────────────────────────────────────
cross_k, cross_v = encoder.run(
    ["n_layer_cross_k", "n_layer_cross_v"], {"mel": mel}
)

# ── Decode (greedy) ─────────────────────────────────────────────────
n_layers = cross_k.shape[0]  # 8
self_k = np.zeros((n_layers, 1, 8, 1024, 128), dtype=np.float32)
self_v = np.zeros((n_layers, 1, 8, 1024, 128), dtype=np.float32)

tokens = np.array([[DECODER_START_ID, EN_LANG_ID]], dtype=np.int64)
offset = np.array(0, dtype=np.int64)
generated = []

for _ in range(MAX_NEW_TOKENS):
    logits, self_k, self_v = decoder.run(
        ["logits", "out_n_layer_self_k_cache", "out_n_layer_self_v_cache"],
        {
            "tokens": tokens,
            "in_n_layer_self_k_cache": self_k,
            "in_n_layer_self_v_cache": self_v,
            "n_layer_cross_k": cross_k,
            "n_layer_cross_v": cross_v,
            "offset": offset,
        },
    )
    next_token = int(np.argmax(logits[0, -1, :]))
    if next_token == EOS_ID:
        break
    generated.append(next_token)
    offset = np.array(int(offset) + tokens.shape[1], dtype=np.int64)
    tokens = np.array([[next_token]], dtype=np.int64)

print(detokenize(generated))
2. Apple Silicon (CoreML acceleration)
import onnxruntime as ort
# Use the CoreML provider: accelerates via Apple Neural Engine / GPU
encoder = ort.InferenceSession(
"cohere-encoder.int8.onnx",
providers=["CoreMLExecutionProvider", "CPUExecutionProvider"]
)
decoder = ort.InferenceSession(
"cohere-decoder.int8.onnx",
providers=["CoreMLExecutionProvider", "CPUExecutionProvider"]
)
# Rest of the inference code is identical to the CPU example above.
# CoreML will automatically accelerate supported ops on the Neural Engine.
# Unsupported ops fall back to CPU transparently.
Tip: On Apple Silicon, even CPUExecutionProvider is fast because ARM NEON vectorization handles INT8 ops natively. CoreML adds acceleration for the float ops via the Neural Engine.
3. NVIDIA GPU (CUDA)
pip install onnxruntime-gpu
import onnxruntime as ort
# Use CUDA provider
encoder = ort.InferenceSession(
"cohere-encoder.int8.onnx",
providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)
decoder = ort.InferenceSession(
"cohere-decoder.int8.onnx",
providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)
# Rest of the inference code is identical.
# Tensors are transferred to GPU automatically by ONNX Runtime.
4. NVIDIA TensorRT (maximum GPU throughput)
pip install onnxruntime-gpu # TensorRT must also be installed system-wide
import onnxruntime as ort
# TensorRT will compile an optimized engine on first run (cached afterward)
encoder = ort.InferenceSession(
"cohere-encoder.int8.onnx",
providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
)
decoder = ort.InferenceSession(
"cohere-decoder.int8.onnx",
providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
)
# Rest of the inference code is identical.
# First inference is slow (TensorRT engine compilation), subsequent runs are fast.
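To avoid paying the engine-compilation cost on every process start, the TensorRT provider can persist compiled engines to disk. A configuration sketch using ONNX Runtime's documented TensorRT provider options (the cache path is an example):

```python
import onnxruntime as ort

# Persist compiled TensorRT engines so only the first-ever run is slow.
trt_options = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "./trt_cache",
}
encoder = ort.InferenceSession(
    "cohere-encoder.int8.onnx",
    providers=[
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)
```

Subsequent processes reuse the cached engine from `./trt_cache` instead of recompiling.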
5. Windows GPU (DirectML β AMD / Intel / NVIDIA)
pip install onnxruntime-directml
import onnxruntime as ort
encoder = ort.InferenceSession(
"cohere-encoder.int8.onnx",
providers=["DmlExecutionProvider", "CPUExecutionProvider"]
)
decoder = ort.InferenceSession(
"cohere-decoder.int8.onnx",
providers=["DmlExecutionProvider", "CPUExecutionProvider"]
)
# Rest of the inference code is identical.
Evaluation Results
Tested on 30 samples each across 5 diverse English benchmarks (n=150 total).
FP32 vs INT8 – Word Error Rate
LibriSpeech test-clean (read speech, studio quality)
FP32 █ 0.99%
INT8 █ 0.99%   Δ 0.00%
VoxPopuli (parliament, formal speech)
FP32 ██████ 12.16%
INT8 ██████ 12.16%   Δ 0.00%
AMI IHM (meeting, conversational)
FP32 ████████ 16.24%
INT8 ████████ 15.38%   Δ -0.85%
People's Speech (diverse, crowd-sourced)
FP32 ███████████ 21.97%
INT8 ███████████ 22.42%   Δ +0.45%
Earnings22 (financial, spontaneous)
FP32 █████████████ 26.38%
INT8 ██████████████ 27.87%   Δ +1.49%
AVERAGE
FP32 ████████ 15.55%
INT8 ████████ 15.77%   Δ +0.22%
Full Metrics Table
| Dataset | Domain | FP32 WER | INT8 WER | ΔWER | FP32 CER | INT8 CER | ΔCER |
|---|---|---|---|---|---|---|---|
| LibriSpeech test-clean | Read speech | 0.99% | 0.99% | +0.00% | 0.37% | 0.37% | +0.00% |
| VoxPopuli | Parliament | 12.16% | 12.16% | +0.00% | 5.21% | 5.28% | +0.07% |
| AMI IHM | Meeting | 16.24% | 15.38% | -0.85% | 12.25% | 11.33% | -0.91% |
| People's Speech | Diverse | 21.97% | 22.42% | +0.45% | 9.21% | 9.38% | +0.17% |
| Earnings22 | Financial | 26.38% | 27.87% | +1.49% | 16.23% | 18.60% | +2.37% |
| Average | – | 15.55% | 15.77% | +0.22% | 8.65% | 8.99% | +0.34% |
Inference Speed (Apple M-series CPU)
| Dataset | FP32 | INT8 | Speedup |
|---|---|---|---|
| VoxPopuli | 1.1s | 0.9s | 1.2x |
| Earnings22 | 0.8s | 0.6s | 1.3x |
| AMI IHM | 0.3s | 0.2s | 1.5x |
| LibriSpeech | 1.0s | 0.8s | 1.3x |
| People's Speech | 0.7s | 0.6s | 1.2x |
| Average | 0.8s | 0.6s | 1.3x |
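Per-sample latencies like those above can be measured with a simple warmup-then-average harness. This is a generic sketch, not the exact benchmark script used for the table:

```python
import time

def avg_latency(run_once, n_warmup=3, n_runs=10):
    """Average wall-clock seconds per call, after warmup runs.

    `run_once` should perform one full encode + greedy decode
    on a fixed audio sample.
    """
    for _ in range(n_warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(n_runs):
        run_once()
    return (time.perf_counter() - start) / n_runs
```

Warmup matters here: the first calls include session initialization and memory allocation that would otherwise skew the average.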
Model Size Comparison
Original (safetensors) ██████████████████████████████████████ 3.80 GB (baseline)
FP32 ONNX ██████████████████████████████████████████████████████████████████████████████ 7.79 GB (+105%)
INT8 ONNX ███████████████████████████ 2.69 GB (-29%)
INT4 ONNX ████████████████████ 1.95 GB (-49%)
The ONNX FP32 export is larger than the original safetensors because it stores explicit cross-attention K/V projections for all decoder layers (sherpa-onnx Whisper tensor contract).
Language Tokens
This model supports 15 languages. Set the language token ID when decoding:
| Language | Code | Token ID |
|---|---|---|
| Arabic | ar | 28 |
| Chinese | zh | 50 |
| Dutch | nl | 60 |
| English | en | 62 |
| French | fr | 69 |
| German | de | 76 |
| Italian | it | 97 |
| Japanese | ja | 98 |
| Korean | ko | 110 |
| Polish | pl | 148 |
| Portuguese | pt | 149 |
| Russian | ru | 155 |
| Spanish | es | 169 |
| Turkish | tr | 186 |
| Vietnamese | vi | 194 |
# Example: transcribe French audio
FR_LANG_ID = 69
tokens = np.array([[DECODER_START_ID, FR_LANG_ID]], dtype=np.int64)
Full token list available in tokens.txt; look for <|en|>, <|fr|>, etc.
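Rather than hard-coding IDs, the language-token IDs can be recovered from tokens.txt itself, since each line is `<token> <id>` and language tokens look like `<|en|>`. A sketch (the helper name is illustrative):

```python
def language_token_ids(lines):
    """Map 2-letter language codes to token IDs from tokens.txt lines.

    Each line has the form "<token> <id>"; language tokens are <|xx|>.
    """
    lang_ids = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip malformed/blank lines
        token, tok_id = " ".join(parts[:-1]), int(parts[-1])
        if token.startswith("<|") and token.endswith("|>"):
            code = token[2:-2]
            if len(code) == 2:  # keep only language codes, not control tokens
                lang_ids[code] = tok_id
    return lang_ids
```

Usage: `language_token_ids(open("tokens.txt"))` should reproduce the table above.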
Architecture
The base model is a custom cohere_asr architecture (not Whisper) with:
- Encoder: 48-layer Conformer with ConvSubsampling frontend (1280 hidden dim, 8 heads)
- Decoder: 8-layer Transformer with cross-attention (1024 hidden dim, 8 heads)
- Vocab: 16,384 SentencePiece tokens
- Audio: 16 kHz, 128 mel bins, max 35 seconds
The ONNX export splits the model into two graphs:
┌──────────────────────────────────────────────────────┐
│ Encoder                                              │
│   Input:  mel (batch, 128, T)                        │
│   Output: n_layer_cross_k (8, batch, T_enc, 1024)    │
│           n_layer_cross_v (8, batch, T_enc, 1024)    │
└──────────────────────────┬───────────────────────────┘
                           │ cross-attention K/V
┌──────────────────────────┴───────────────────────────┐
│ Decoder (autoregressive)                             │
│   Input:  tokens (batch, n_tokens)                   │
│           self_k/v caches (8, batch, 8, 1024, 128)   │
│           cross_k/v from encoder                     │
│           offset (scalar)                            │
│   Output: logits (batch, n_tokens, 16384)            │
│           updated self_k/v caches                    │
└──────────────────────────────────────────────────────┘
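The decoder's self-attention caches must be allocated up front with the shape shown in the diagram, (n_layers, batch, n_heads, max_len, head_dim). A small helper matching that contract (the function name is illustrative):

```python
import numpy as np

def init_self_caches(n_layers=8, batch=1, n_heads=8,
                     max_len=1024, head_dim=128):
    """Zero-initialized decoder self-attention K/V caches.

    Shape follows the decoder contract:
    (n_layers, batch, n_heads, max_len, head_dim).
    """
    shape = (n_layers, batch, n_heads, max_len, head_dim)
    return (np.zeros(shape, dtype=np.float32),
            np.zeros(shape, dtype=np.float32))
```

The defaults match this model; max_len=1024 bounds the number of decode steps the cache can hold.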
Quantization Details
Quantized using ONNX Runtime dynamic quantization:
- Method: Dynamic INT8 – weights stored as INT8, activations computed in FP32
- No calibration data required – weight-only quantization
- 149 Conv/BatchNorm nodes protected – the Conformer encoder's audio frontend is excluded from quantization (highly sensitive to numeric precision)
- External data format – encoder weights stored in a separate .onnx.data file (protobuf 2 GB limit)
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="cohere-encoder.onnx",
    model_output="cohere-encoder.int8.onnx",
    weight_type=QuantType.QInt8,
    nodes_to_exclude=conv_and_batchnorm_nodes,  # 149 nodes protected
    use_external_data_format=True,
)
Limitations
- Max audio length: 35 seconds per clip. Longer audio must be chunked.
- Spontaneous speech: Higher WER on spontaneous/noisy speech (~27% on Earnings22). This reflects the base model's behavior, not a quantization artifact.
- Non-English: Evaluation was English-only. Other languages may behave differently under INT8.
- Greedy decoding: Examples use argmax decoding. Beam search is not implemented here.
- No GGUF: This is a Conformer encoder-decoder ASR model. GGUF/llama.cpp is for autoregressive LLMs and does not support this architecture.
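For audio beyond the 35-second limit, each clip can be split into overlapping windows and transcribed chunk by chunk. A sketch (the overlap values are illustrative, and merging overlapping transcripts is left naive here):

```python
def chunk_audio(audio, sr=16_000, chunk_s=30.0, overlap_s=2.0):
    """Split a mono waveform into chunks that fit the 35 s model limit.

    Consecutive chunks overlap by overlap_s seconds; transcribe each
    chunk separately and join the texts.
    """
    size = int(chunk_s * sr)
    step = int((chunk_s - overlap_s) * sr)
    if len(audio) <= size:
        return [audio]
    return [audio[i:i + size] for i in range(0, len(audio), step)]
```

The overlap gives each chunk a little shared context so words cut at a boundary appear intact in at least one chunk; a smarter merge would deduplicate the overlapping words.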
Evaluation Setup
- Hardware: Apple M-series CPU, 24 GB unified memory
- Runtime: ONNX Runtime 1.19, CPUExecutionProvider, ORT_ENABLE_ALL
- Metrics: WER/CER via jiwer, hypotheses lowercased
- Samples: 30 per dataset (n=150 total), streamed from HuggingFace parquet
- Datasets:
  - openslr/librispeech_asr (clean, test) – read speech
  - facebook/voxpopuli (en, test) – parliament speeches
  - edinburghcstr/ami (ihm, test) – meeting conversations
  - MLCommons/peoples_speech (clean, test) – diverse crowd-sourced
  - distil-whisper/earnings22 (test) – financial calls
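The reported numbers come from jiwer, but the metric itself is simple: WER is the word-level edit distance divided by the reference word count. An illustrative pure-Python reimplementation (not the jiwer call used in the evaluation); it lowercases both sides, matching the setup above:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # One-row dynamic program: d[j] = edit distance(ref[:i], hyp[:j]).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,               # delete reference word
                d[j - 1] + 1,           # insert hypothesis word
                prev_diag + (r != h),   # substitute (free if words match)
            )
    return d[-1] / max(len(ref), 1)
```

For example, one substitution in a four-word reference gives 0.25 (25% WER).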
Citation
@misc{cohere-transcribe-03-2026,
  author = {CohereLabs},
  title  = {cohere-transcribe-03-2026},
  year   = {2026},
  url    = {https://huggingface.co/CohereLabs/cohere-transcribe-03-2026}
}
License
This model inherits the license of the base model: CC BY-NC 4.0. Non-commercial use only.