# Cohere Transcribe – ONNX INT8 (weight-only, MatMulNBits)

Weight-only INT8 ONNX conversion of `CohereLabs/cohere-transcribe-03-2026`, the open-source 2B-parameter ASR model that ranks #1 on the Open ASR Leaderboard (avg WER 5.42, March 2026).

This is a sibling repo to `cstr/cohere-transcribe-onnx-int4`: same architecture, same baked-in feature extraction, just 8-bit weights instead of 4-bit. Use INT8 if you want a slightly larger model with marginally better numerical headroom; use INT4 if you want the smallest binary and the fastest cold load.

Both follow the architectural adaptations pioneered by Tristan Ripke in his original INT8 export, with one key difference: this build uses the streaming MatMulNBitsQuantizer (the same path the INT4 repo uses), which produces a smaller graph and a much faster cold load.

## What's included

| File | Size | Description |
|---|---|---|
| `cohere-encoder.int8.onnx` | 5.9 MB | Encoder graph |
| `cohere-encoder.int8.onnx.data` | 2.7 GB | Encoder weights (INT8, block_size=32, symmetric) |
| `cohere-decoder.int8.onnx` | 0.5 MB | Decoder graph |
| `cohere-decoder.int8.onnx.data` | 213 MB | Decoder weights (INT8, block_size=32, symmetric) |
| `tokens.txt` | 219 KB | 16,384-entry vocabulary |

Total: ~2.9 GB.

## How this differs from the existing INT8 exports

There are now three INT8 ONNX builds of this model in the wild:

| Build | Quantiser | Encoder weights | Cold load (5.4 s clip) | Inference (5.4 s clip) |
|---|---|---|---|---|
| Tristan's original | `quantize_dynamic` (full-graph) | 2.1 GB | 26.5 s | 9.8 s |
| This repo (MatMulNBits, bs=32) | `MatMulNBitsQuantizer` (bits=8) | 2.7 GB | 17.5 s | 9.4 s |
| onnx-final bs=128 (not published) | `MatMulNBitsQuantizer` (bits=8) | 2.6 GB | 22.2 s | 11.2 s |

Numbers measured with onnxruntime 1.23.2, CPU-only, 8 threads, on a Cascade Lake-class VPS. Audio: voxpopuli_test_en_demo.wav (5.4 s, 16 kHz mono).
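
To reproduce the cold-load/inference split on your own hardware, a minimal timing harness is enough; the helper below is a generic sketch, and the commented usage lines assume the file names and inputs from the Quick Start section.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0


# Hypothetical usage with the ONNX sessions from the Quick Start:
# enc, cold_load_s = timed(ort.InferenceSession, "cohere-encoder.int8.onnx")
# outs, infer_s = timed(enc.run, None, {"audio": audio[None, :].astype(np.float32)})
```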

Why the streaming quantiser is better: the original quantize_dynamic path materialises the entire encoder weight tensor (~7 GB at F32) in RAM, which OOMs on small machines. MatMulNBitsQuantizer streams via the external-data format and runs on the same 8 GB box that exports the INT4 build.

Why block_size=32 beats bs=128: smaller blocks mean more scale factors but better cache locality for the quantised weights. On x86 with VNNI the bs=32 layout is about 22 % faster end-to-end.
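
The size cost of the extra scales is easy to estimate: with symmetric INT8 and one fp32 scale per block, metadata overhead is `scale_bytes / block_size` per weight. A sketch (the weight count below is illustrative, and the real `.data` files also contain non-quantised tensors):

```python
def int8_blockwise_bytes(n_weights: int, block_size: int, scale_bytes: int = 4) -> int:
    """Bytes for n_weights INT8 values plus one scale per block
    (symmetric quantisation stores no zero-points)."""
    n_blocks = -(-n_weights // block_size)  # ceiling division
    return n_weights + n_blocks * scale_bytes


for bs in (32, 128):
    gb = int8_blockwise_bytes(2_400_000_000, bs) / 1e9  # illustrative weight count
    print(f"block_size={bs}: ~{gb:.2f} GB")
```

So bs=32 pays roughly 12.5 % metadata overhead where bs=128 pays about 3 %; the speed win of bs=32 comes from locality, not from size.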

## Comparison vs the INT4 sibling

| | INT4 (`cstr/cohere-transcribe-onnx-int4`) | INT8 (this repo) |
|---|---|---|
| Total size | 1.94 GB | 2.91 GB |
| Encoder weights | 1.8 GB | 2.7 GB |
| Cold load | 7.5 s | 17.5 s |
| Inference (5.4 s) | 9.7 s | 9.4 s |
| Total wall (5.4 s) | 17.2 s | 27.0 s |

On a CPU with VNNI (Cascade Lake / Ice Lake / Zen4), INT4 is actually slightly faster end-to-end than INT8 in the MatMulNBits format because the INT4 weights pack 2× more elements per cache line and the same number of vpdpbusd instructions process them. The main reason to ship INT8 alongside INT4 is numerical robustness: 8-bit symmetric quantisation has 16× as many representable levels as 4-bit, which matters for languages and accents where the 4-bit version drifts.
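
To see the headroom argument concretely, here is a toy round trip through symmetric quantisation at both widths. This is a simplification we wrote for illustration: it quantises per tensor with a restricted symmetric range, whereas the real exports quantise per 32-element block.

```python
import numpy as np


def fake_quant_symmetric(x: np.ndarray, bits: int) -> np.ndarray:
    """Quantise/dequantise round trip with one symmetric scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 127 for INT8
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
for bits in (4, 8):
    err = np.abs(w - fake_quant_symmetric(w, bits)).max()
    print(f"INT{bits} max round-trip error: {err:.4f}")
```

On the same tensor, the INT8 round-trip error is over an order of magnitude smaller than INT4's, which is the headroom that keeps drift-prone languages stable.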

## Quick start (Python)

```bash
pip install onnxruntime numpy soundfile librosa
```

```python
import onnxruntime as ort
import numpy as np
import librosa

# Load audio (must be 16 kHz mono)
audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)

# Load models (the .onnx.data weight files must sit next to the graphs)
enc = ort.InferenceSession("cohere-encoder.int8.onnx")
dec = ort.InferenceSession("cohere-decoder.int8.onnx")

# Load tokens
tokens = {}
with open("tokens.txt", "r", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().rsplit(" ", 1)
        if len(parts) == 2:
            tokens[int(parts[1])] = parts[0]
token_to_id = {v: k for k, v in tokens.items()}

# Build prompt
prompt_ids = [token_to_id[t] for t in [
    "<|startofcontext|>", "<|startoftranscript|>", "<|emo:undefined|>",
    "<|en|>", "<|en|>", "<|pnc|>", "<|noitn|>", "<|notimestamp|>", "<|nodiarize|>"
]]

# Run encoder
cross_k, cross_v = enc.run(None, {"audio": audio.reshape(1, -1).astype(np.float32)})

# Run decoder (autoregressive greedy decoding)
N_LAYERS, HEADS, HEAD_DIM, MAX_CTX = 8, 8, 128, 1024
self_k = np.zeros((N_LAYERS, 1, HEADS, MAX_CTX, HEAD_DIM), dtype=np.float32)
self_v = np.zeros((N_LAYERS, 1, HEADS, MAX_CTX, HEAD_DIM), dtype=np.float32)
eos_id = token_to_id["<|endoftext|>"]

generated = list(prompt_ids)
current = np.array([prompt_ids], dtype=np.int64)
offset = np.array(0, dtype=np.int64)

for _ in range(256):
    logits, self_k, self_v = dec.run(None, {
        "tokens": current, "in_n_layer_self_k_cache": self_k,
        "in_n_layer_self_v_cache": self_v, "n_layer_cross_k": cross_k,
        "n_layer_cross_v": cross_v, "offset": offset,
    })
    next_id = int(np.argmax(logits[0, -1, :]))
    if next_id == eos_id:
        break
    generated.append(next_id)
    offset = np.array(int(offset) + current.shape[1], dtype=np.int64)
    current = np.array([[next_id]], dtype=np.int64)

# Detokenise: drop special tokens, map "▁" (U+2581) back to spaces
text = "".join(
    tokens.get(t, "").replace("\u2581", " ")
    for t in generated[len(prompt_ids):]
    if not tokens.get(t, "").startswith("<|")
).strip()
print(text)
```

## How this was made

```python
from onnxruntime.quantization import matmul_nbits_quantizer, quant_utils

config = matmul_nbits_quantizer.DefaultWeightOnlyQuantConfig(
    block_size=32,
    is_symmetric=True,
    accuracy_level=4,
    quant_format=quant_utils.QuantFormat.QOperator,
    op_types_to_quantize=("MatMul",),
    bits=8,
)

model = quant_utils.load_model_with_shape_infer("cohere-encoder.onnx")
quantizer = matmul_nbits_quantizer.MatMulNBitsQuantizer(model, algo_config=config)
quantizer.process()
# True = write weights in the external-data format (the .onnx.data file)
quantizer.model.save_model_to_file("cohere-encoder.int8.onnx", True)
```

The decoder goes through the same pipeline with identical settings, pointed at `cohere-decoder.onnx`.

## Related

- C++ runtime (ggml + GGUF): `CrispStrobe/cohere-whisper.cpp`, pure C++ inference with VAD, word-level CTC forced alignment, and quantised GGUF models. On the same hardware as the benchmarks above, the ggml Q4_K runtime reaches ~15 s wall: roughly tied with the ONNX paths on raw inference, and faster end-to-end because mmap removes the cold-load penalty.
- Pre-converted GGUF weights: `cstr/cohere-transcribe-03-2026-GGUF`.

## License

Apache 2.0 (inherited from the base model).
