cohere-transcribe-03-2026 — INT8 ONNX

INT8 dynamic quantization of CohereLabs/cohere-transcribe-03-2026 (2B-parameter multilingual ASR) exported to ONNX.

Runs on CPU, Apple Silicon, and GPU via ONNX Runtime; no PyTorch is required at inference time.

                       Original (safetensors)   FP32 ONNX      INT8 ONNX      INT4 ONNX
Size                   3.80 GB                  7.79 GB        2.69 GB        1.95 GB
vs Original            baseline                 +105%          -29%           -49%
Avg WER (5 datasets)   —                        15.55%         15.77%         21.81%
ΔWER                   —                        —              +0.22%         +3.55%
Speed (M-series CPU)   —                        0.8 s/sample   0.6 s/sample   1.4 s/sample
Runtime                PyTorch                  ONNX Runtime   ONNX Runtime   ONNX Runtime

Deployment Options

Platform                      Supported      Provider                                          Notes
CPU (x86/ARM)                 Yes            CPUExecutionProvider                              Best general-purpose option. INT8 runs ~25% faster than FP32.
Apple Silicon (M1/M2/M3/M4)   Yes            CoreMLExecutionProvider or CPUExecutionProvider   Native ARM NEON acceleration. Unified memory means no CPU-GPU transfer cost.
NVIDIA GPU (CUDA)             Yes            CUDAExecutionProvider                             Requires onnxruntime-gpu. Best for batch/server workloads.
NVIDIA TensorRT               Yes            TensorrtExecutionProvider                         Maximum GPU throughput. Requires TensorRT installed.
DirectML (Windows GPU)        Yes            DmlExecutionProvider                              AMD/Intel/NVIDIA GPUs on Windows via DirectX 12.
GGUF / llama.cpp              No             —                                                 Not applicable: this is a Conformer ASR model, not an LLM, and no GGUF conversion exists.
Web / WASM                    Experimental   WasmExecutionProvider                             Via onnxruntime-web. Model size may be prohibitive for browser use.
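
The provider strings in the table are exactly what ort.InferenceSession expects in its providers argument. As a sketch, a small helper (pick_providers is hypothetical, not part of any package) can build a fallback chain from whatever is available at runtime; in practice the available list comes from ort.get_available_providers():

```python
def pick_providers(preferred, available):
    """Keep the preferred providers that are actually available; CPU is always the fallback."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen

# In practice:
#   import onnxruntime as ort
#   providers = pick_providers(["CUDAExecutionProvider"], ort.get_available_providers())
print(pick_providers(["TensorrtExecutionProvider", "CUDAExecutionProvider"],
                     ["CUDAExecutionProvider", "CPUExecutionProvider"]))
# ['CUDAExecutionProvider', 'CPUExecutionProvider']
```

Listing a preferred provider that is not installed is not an error in ONNX Runtime, but filtering up front avoids the warning and makes the fallback explicit.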

Quick Start

Install

pip install onnxruntime numpy soundfile transformers

For GPU support:

pip install onnxruntime-gpu  # replaces onnxruntime

For Apple Silicon CoreML acceleration:

pip install onnxruntime  # CoreML provider is included by default on macOS

Download

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="vigneshlabs/cohere-transcribe-03-2026-int8-onnx",
    local_dir="./cohere-int8",
)

Or with Git LFS:

git lfs install
git clone https://huggingface.co/vigneshlabs/cohere-transcribe-03-2026-int8-onnx

Files

cohere-encoder.int8.onnx        101 MB   Conformer encoder graph (INT8)
cohere-encoder.int8.onnx.data   2.4 GB   Encoder weights (external data)
cohere-decoder.int8.onnx        569 KB   Transformer decoder graph (INT8)
cohere-decoder.int8.onnx.data   146 MB   Decoder weights (external data)
tokens.txt                      203 KB   Vocabulary (16,384 tokens)

Total: 2.69 GB
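
A quick post-download sanity check, sketched against the five filenames listed above (missing_files is a hypothetical helper, not part of the repo):

```python
from pathlib import Path

# The five files this repo ships, per the listing above.
EXPECTED = [
    "cohere-encoder.int8.onnx",
    "cohere-encoder.int8.onnx.data",
    "cohere-decoder.int8.onnx",
    "cohere-decoder.int8.onnx.data",
    "tokens.txt",
]

def missing_files(model_dir):
    """Return the expected filenames not present in model_dir."""
    d = Path(model_dir)
    return [name for name in EXPECTED if not (d / name).is_file()]

# e.g.: assert not missing_files("./cohere-int8"), "incomplete download"
```

The .onnx.data files are the large ones; a partial Git LFS pull typically leaves them as small pointer stubs, so checking sizes as well as presence is a reasonable extension.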


Usage Examples

1. CPU Inference (any platform)

import numpy as np
import onnxruntime as ort
import soundfile as sf
from transformers import AutoProcessor

# ── Config ──────────────────────────────────────────────────────────
MODEL_DIR        = "./cohere-int8"
DECODER_START_ID = 13764
EOS_ID           = 3
EN_LANG_ID       = 62    # see Language Tokens section below
MAX_NEW_TOKENS   = 256

# ── Tokenizer ──────────────────────────────────────────────────────
id_to_token = {}
with open(f"{MODEL_DIR}/tokens.txt") as f:
    for line in f:
        parts = line.rstrip("\n").split(" ")
        id_to_token[int(parts[-1])] = " ".join(parts[:-1])

def detokenize(ids):
    pieces = [id_to_token.get(i, "") for i in ids
              if not (id_to_token.get(i, "").startswith("<|")
                      and id_to_token.get(i, "").endswith("|>"))]
    return "".join(pieces).replace("▁", " ").strip()

# ── Load processor (for mel spectrogram extraction) ────────────────
processor = AutoProcessor.from_pretrained(
    "CohereLabs/cohere-transcribe-03-2026", trust_remote_code=True
)

# ── Load ONNX sessions on CPU ─────────────────────────────────────
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

encoder = ort.InferenceSession(
    f"{MODEL_DIR}/cohere-encoder.int8.onnx", opts,
    providers=["CPUExecutionProvider"]
)
decoder = ort.InferenceSession(
    f"{MODEL_DIR}/cohere-decoder.int8.onnx", opts,
    providers=["CPUExecutionProvider"]
)

# ── Load & preprocess audio ───────────────────────────────────────
audio, sr = sf.read("audio.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # stereo → mono
mel = processor(
    audio.astype(np.float32), sampling_rate=sr, return_tensors="np"
)["input_features"].astype(np.float32)

# ── Encode ────────────────────────────────────────────────────────
cross_k, cross_v = encoder.run(
    ["n_layer_cross_k", "n_layer_cross_v"], {"mel": mel}
)

# ── Decode (greedy) ──────────────────────────────────────────────
n_layers = cross_k.shape[0]  # 8
self_k = np.zeros((n_layers, 1, 8, 1024, 128), dtype=np.float32)
self_v = np.zeros((n_layers, 1, 8, 1024, 128), dtype=np.float32)
tokens = np.array([[DECODER_START_ID, EN_LANG_ID]], dtype=np.int64)
offset = np.array(0, dtype=np.int64)
generated = []

for _ in range(MAX_NEW_TOKENS):
    logits, self_k, self_v = decoder.run(
        ["logits", "out_n_layer_self_k_cache", "out_n_layer_self_v_cache"],
        {
            "tokens": tokens,
            "in_n_layer_self_k_cache": self_k,
            "in_n_layer_self_v_cache": self_v,
            "n_layer_cross_k": cross_k,
            "n_layer_cross_v": cross_v,
            "offset": offset,
        },
    )
    next_token = int(np.argmax(logits[0, -1, :]))
    if next_token == EOS_ID:
        break
    generated.append(next_token)
    offset = np.array(int(offset) + tokens.shape[1], dtype=np.int64)
    tokens = np.array([[next_token]], dtype=np.int64)

print(detokenize(generated))

2. Apple Silicon (CoreML acceleration)

import onnxruntime as ort

# Use the CoreML provider: accelerates via the Apple Neural Engine / GPU
encoder = ort.InferenceSession(
    "cohere-encoder.int8.onnx",
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"]
)
decoder = ort.InferenceSession(
    "cohere-decoder.int8.onnx",
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"]
)

# Rest of the inference code is identical to the CPU example above.
# CoreML will automatically accelerate supported ops on the Neural Engine.
# Unsupported ops fall back to CPU transparently.

Tip: On Apple Silicon, even CPUExecutionProvider is fast because ARM NEON vectorization handles INT8 ops natively. CoreML adds acceleration for the float ops via the Neural Engine.

3. NVIDIA GPU (CUDA)

pip install onnxruntime-gpu

import onnxruntime as ort

# Use CUDA provider
encoder = ort.InferenceSession(
    "cohere-encoder.int8.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)
decoder = ort.InferenceSession(
    "cohere-decoder.int8.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Rest of the inference code is identical.
# Tensors are transferred to GPU automatically by ONNX Runtime.

4. NVIDIA TensorRT (maximum GPU throughput)

pip install onnxruntime-gpu  # TensorRT must also be installed system-wide

import onnxruntime as ort

# TensorRT will compile an optimized engine on first run (cached afterward)
encoder = ort.InferenceSession(
    "cohere-encoder.int8.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
)
decoder = ort.InferenceSession(
    "cohere-decoder.int8.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Rest of the inference code is identical.
# First inference is slow (TensorRT engine compilation), subsequent runs are fast.

5. Windows GPU (DirectML — AMD / Intel / NVIDIA)

pip install onnxruntime-directml

import onnxruntime as ort

encoder = ort.InferenceSession(
    "cohere-encoder.int8.onnx",
    providers=["DmlExecutionProvider", "CPUExecutionProvider"]
)
decoder = ort.InferenceSession(
    "cohere-decoder.int8.onnx",
    providers=["DmlExecutionProvider", "CPUExecutionProvider"]
)

# Rest of the inference code is identical.

Evaluation Results

Tested on 30 samples each across 5 diverse English benchmarks (n=150 total).
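
WER here is the standard word-level edit distance divided by reference length. A minimal self-contained sketch (real evaluations typically use a library such as jiwer, plus text normalization not shown here):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference word count."""
    r, h = reference.split(), hypothesis.split()
    prev = list(range(len(h) + 1))          # distance of "" vs h[:j]
    for i, rw in enumerate(r, 1):
        cur = [i]                           # distance of r[:i] vs ""
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (rw != hw))) # substitution / match
        prev = cur
    return prev[-1] / max(len(r), 1)

print(wer("the cat sat", "the cat sit"))  # 1 substitution over 3 words: 0.333...
```

CER is the same computation over characters instead of words.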

FP32 vs INT8 — Word Error Rate

LibriSpeech test-clean (read speech, studio quality)
  FP32  █  0.99%
  INT8  █  0.99%   Δ  0.00%

VoxPopuli (parliament, formal speech)
  FP32  ██████  12.16%
  INT8  ██████  12.16%   Δ  0.00%

AMI IHM (meeting, conversational)
  FP32  ████████  16.24%
  INT8  ████████  15.38%   Δ -0.85%

People's Speech (diverse, crowd-sourced)
  FP32  ███████████  21.97%
  INT8  ███████████  22.42%   Δ +0.45%

Earnings22 (financial, spontaneous)
  FP32  █████████████  26.38%
  INT8  ██████████████  27.87%   Δ +1.49%

AVERAGE
  FP32  ████████  15.55%
  INT8  ████████  15.77%   Δ +0.22%

Full Metrics Table

Dataset                  Domain        FP32 WER   INT8 WER   ΔWER     FP32 CER   INT8 CER   ΔCER
LibriSpeech test-clean   Read speech   0.99%      0.99%      +0.00%   0.37%      0.37%      +0.00%
VoxPopuli                Parliament    12.16%     12.16%     +0.00%   5.21%      5.28%      +0.07%
AMI IHM                  Meeting       16.24%     15.38%     -0.85%   12.25%     11.33%     -0.91%
People's Speech          Diverse       21.97%     22.42%     +0.45%   9.21%      9.38%      +0.17%
Earnings22               Financial     26.38%     27.87%     +1.49%   16.23%     18.60%     +2.37%
Average                                15.55%     15.77%     +0.22%   8.65%      8.99%      +0.34%

Inference Speed (Apple M-series CPU)

Dataset           FP32 (s/sample)   INT8 (s/sample)   Speedup
VoxPopuli         1.1               0.9               1.2x
Earnings22        0.8               0.6               1.3x
AMI IHM           0.3               0.2               1.5x
LibriSpeech       1.0               0.8               1.3x
People's Speech   0.7               0.6               1.2x
Average           0.8               0.6               1.3x

Model Size Comparison

Original (safetensors)  ██████████████████████████████████████  3.80 GB  (baseline)
FP32 ONNX               ██████████████████████████████████████████████████████████████████████████████  7.79 GB  (+105%)
INT8 ONNX               ███████████████████████████  2.69 GB  (-29%)
INT4 ONNX               ████████████████████  1.95 GB  (-49%)

The FP32 ONNX export is larger than the original safetensors because it stores explicit cross-attention K/V projections for every decoder layer (following the sherpa-onnx Whisper tensor contract).
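
A back-of-the-envelope check of the size table, assuming roughly 2B parameters (the exact count is not published here): the 3.80 GB original is consistent with 16-bit weights, FP32 at 4 bytes/weight lands near the export size, and the INT8 file exceeds the pure 1 byte/weight estimate because the protected Conv/BatchNorm nodes stay in FP32:

```python
params = 2e9  # "2B-parameter" per the model card; the exact count is an assumption
bf16_gib = params * 2 / 2**30   # 2 bytes/weight: ~3.73 GiB, near the 3.80 GB original
fp32_gib = params * 4 / 2**30   # 4 bytes/weight: ~7.45 GiB, near the 7.79 GB FP32 export
int8_gib = params * 1 / 2**30   # 1 byte/weight: ~1.86 GiB; the 2.69 GB INT8 file is
                                # larger because the protected nodes remain in FP32
print(round(bf16_gib, 2), round(fp32_gib, 2), round(int8_gib, 2))  # 3.73 7.45 1.86
```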


Language Tokens

This model supports 15 languages. Set the language token ID when decoding:

Language     Code   Token ID
Arabic       ar     28
Chinese      zh     50
Dutch        nl     60
English      en     62
French       fr     69
German       de     76
Italian      it     97
Japanese     ja     98
Korean       ko     110
Polish       pl     148
Portuguese   pt     149
Russian      ru     155
Spanish      es     169
Turkish      tr     186
Vietnamese   vi     194

# Example: transcribe French audio
FR_LANG_ID = 69
tokens = np.array([[DECODER_START_ID, FR_LANG_ID]], dtype=np.int64)

The full token list is in tokens.txt (look for <|en|>, <|fr|>, etc.).
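
If the IDs above ever drift from the shipped vocabulary, they can be re-derived from tokens.txt itself. This sketch assumes the same "token<space>id" line format the detokenizer in Example 1 parses, and collects every <|…|> special token (language codes among them):

```python
def special_token_ids(lines):
    """Map each <|xxx|> special token to its ID, given tokens.txt-style lines."""
    ids = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        token, tok_id = " ".join(parts[:-1]), int(parts[-1])
        if token.startswith("<|") and token.endswith("|>"):
            ids[token[2:-2]] = tok_id  # strip the <| |> wrapper
    return ids

# In practice: with open("cohere-int8/tokens.txt") as f: LANG_IDS = special_token_ids(f)
print(special_token_ids(["<|en|> 62", "<|fr|> 69", "hello 1234"]))
# {'en': 62, 'fr': 69}
```

Note this also picks up non-language specials (e.g. a start-of-transcript token, if present); filter by two-letter keys if only language codes are wanted.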


Architecture

The base model is a custom cohere_asr architecture (not Whisper) with:

  • Encoder: 48-layer Conformer with ConvSubsampling frontend (1280 hidden dim, 8 heads)
  • Decoder: 8-layer Transformer with cross-attention (1024 hidden dim, 8 heads)
  • Vocab: 16,384 SentencePiece tokens
  • Audio: 16 kHz, 128 mel bins, max 35 seconds

The ONNX export splits the model into two graphs:

┌──────────────────────────────────────────────────────┐
│  Encoder                                             │
│  Input:  mel (batch, 128, T)                         │
│  Output: n_layer_cross_k (8, batch, T_enc, 1024)     │
│          n_layer_cross_v (8, batch, T_enc, 1024)     │
└───────────────────────┬──────────────────────────────┘
                        │ cross-attention K/V
┌───────────────────────▼──────────────────────────────┐
│  Decoder (autoregressive)                            │
│  Input:  tokens (batch, n_tokens)                    │
│          self_k/v caches (8, batch, 8, 1024, 128)    │
│          cross_k/v from encoder                      │
│          offset (scalar)                             │
│  Output: logits (batch, n_tokens, 16384)             │
│          updated self_k/v caches                     │
└──────────────────────────────────────────────────────┘
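
The self-attention cache shapes above imply a fixed memory footprint that the decoder example pre-allocates up front; a quick check of the arithmetic:

```python
# (layers, batch, heads, max_positions, head_dim), float32, per the diagram above
layers, batch, heads, max_pos, head_dim = 8, 1, 8, 1024, 128
cache_bytes = layers * batch * heads * max_pos * head_dim * 4  # 4 bytes per float32
print(cache_bytes // 2**20)  # 32 MiB per cache tensor; K + V = 64 MiB at batch 1
```

This is why the example zero-initializes self_k and self_v once and lets the decoder graph write into them via the offset input, rather than growing the cache per step.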

Quantization Details

Quantized using ONNX Runtime dynamic quantization:

  • Method: Dynamic INT8 (weights stored as INT8, activations computed in FP32)
  • No calibration data required (weight-only quantization)
  • 149 Conv/BatchNorm nodes protected: the Conformer encoder's audio frontend is excluded from quantization (highly sensitive to numeric precision)
  • External data format: encoder weights live in a separate .onnx.data file (protobuf's 2 GB limit)

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="cohere-encoder.onnx",
    model_output="cohere-encoder.int8.onnx",
    weight_type=QuantType.QInt8,
    nodes_to_exclude=conv_and_batchnorm_nodes,  # 149 nodes protected
    use_external_data_format=True,
)
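
The conv_and_batchnorm_nodes list above can be collected by op type. A sketch of the filter over (name, op_type) pairs; the actual graph walk would use onnx.load(...) and iterate model.graph.node:

```python
def nodes_to_protect(nodes):
    """Given (name, op_type) pairs, return names of precision-sensitive nodes."""
    sensitive = {"Conv", "BatchNormalization"}
    return [name for name, op in nodes if op in sensitive]

# In practice, with the onnx package:
#   m = onnx.load("cohere-encoder.onnx")
#   conv_and_batchnorm_nodes = nodes_to_protect((n.name, n.op_type) for n in m.graph.node)
print(nodes_to_protect([("conv_sub_0", "Conv"), ("attn_qk", "MatMul"),
                        ("bn_0", "BatchNormalization")]))
# ['conv_sub_0', 'bn_0']
```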

Limitations

  • Max audio length: 35 seconds per clip. Longer audio must be chunked.
  • Spontaneous speech: Higher WER on spontaneous/noisy speech (~27% on Earnings22). This reflects the base model's behavior, not a quantization artifact.
  • Non-English: Evaluation was English-only. Other languages may behave differently under INT8.
  • Greedy decoding: Examples use argmax decoding. Beam search is not implemented here.
  • No GGUF: This is a Conformer encoder-decoder ASR model. GGUF/llama.cpp is for autoregressive LLMs and does not support this architecture.
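
For audio past the 35-second cap, one common approach is transcribing overlapping chunks and stitching the texts. A minimal sketch of the span arithmetic only (the 1-second overlap is an arbitrary choice here, and merging the per-chunk transcripts is not shown):

```python
def chunk_spans(n_samples, sr=16000, max_s=35.0, overlap_s=1.0):
    """Split n_samples of audio into (start, end) sample spans of at most max_s seconds."""
    chunk = int(max_s * sr)
    step = int((max_s - overlap_s) * sr)  # advance leaves overlap_s of context
    spans, start = [], 0
    while start < n_samples:
        spans.append((start, min(start + chunk, n_samples)))
        if start + chunk >= n_samples:
            break
        start += step
    return spans

print(chunk_spans(16000 * 80))  # 80 s of 16 kHz audio → 3 overlapping spans
```

Each span would then be fed through the mel extraction and encode/decode loop from Example 1.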

Evaluation Setup

WER/CER were measured on 30 samples per dataset across the 5 English benchmarks listed above (n=150 total), using greedy (argmax) decoding. Speed numbers are per-sample wall-clock latency on an Apple M-series CPU.


Citation

@misc{cohere-transcribe-03-2026,
  author    = {CohereLabs},
  title     = {cohere-transcribe-03-2026},
  year      = {2026},
  url       = {https://huggingface.co/CohereLabs/cohere-transcribe-03-2026}
}

License

This model inherits the license of the base model: CC BY-NC 4.0. Non-commercial use only.
