# Cohere Transcribe – ONNX INT8 (weight-only, MatMulNBits)

Weight-only INT8 ONNX conversion of `CohereLabs/cohere-transcribe-03-2026`, the open-source 2B-parameter ASR model that ranks #1 on the Open ASR Leaderboard (avg WER 5.42, March 2026).

This is a sibling repo to `cstr/cohere-transcribe-onnx-int4`: same architecture, same baked-in feature extraction, just 8-bit weights instead of 4-bit. Use INT8 if you want a slightly larger model with marginally better numerical headroom; use INT4 if you want the smallest binary and the fastest cold load.

Both follow the architectural adaptations pioneered by Tristan Ripke in his original INT8 export, with one key difference: this build uses the streaming MatMulNBitsQuantizer (the same path the INT4 repo uses), which produces a smaller graph and a much faster cold load.

## What's included

| File | Size | Description |
|---|---|---|
| `cohere-encoder.int8.onnx` | 5.9 MB | Encoder graph |
| `cohere-encoder.int8.onnx.data` | 2.7 GB | Encoder weights (INT8, block_size=32, symmetric) |
| `cohere-decoder.int8.onnx` | 0.5 MB | Decoder graph |
| `cohere-decoder.int8.onnx.data` | 213 MB | Decoder weights (INT8, block_size=32, symmetric) |
| `tokens.txt` | 219 KB | 16,384-entry vocabulary |

Total: ~2.9 GB.

## How this differs from the existing INT8 exports

There are now three INT8 ONNX builds of this model in the wild:

| Build | Quantiser | Encoder weights | Cold load (5.4 s clip) | Inference (5.4 s clip) |
|---|---|---|---|---|
| Tristan's original | `quantize_dynamic` (full-graph) | 2.1 GB | 26.5 s | 9.8 s |
| This repo (MatMulNBits, bs=32) | `MatMulNBitsQuantizer` (bits=8) | 2.7 GB | 17.5 s | 9.4 s |
| onnx-final bs=128 (not published) | `MatMulNBitsQuantizer` (bits=8) | 2.6 GB | 22.2 s | 11.2 s |

Numbers measured with onnxruntime 1.23.2, CPU-only, 8 threads, on a Cascade Lake-class VPS. Audio: voxpopuli_test_en_demo.wav (5.4 s, 16 kHz mono).
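
To reproduce the cold-load/inference split on your own hardware, a minimal timing harness is enough; the helper below is a generic sketch, and the commented usage lines assume the file names and inputs from the Quick Start section.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0


# Hypothetical usage with the ONNX sessions from the Quick Start:
# enc, cold_load_s = timed(ort.InferenceSession, "cohere-encoder.int8.onnx")
# outs, infer_s = timed(enc.run, None, {"audio": audio[None, :].astype(np.float32)})
```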

Why the streaming quantiser is better: the original quantize_dynamic path materialises the entire encoder weight tensor (~7 GB at F32) in RAM, which OOMs on small machines. MatMulNBitsQuantizer streams via the external-data format and runs on the same 8 GB box that exports the INT4 build.

Why block_size=32 beats bs=128: smaller blocks mean more scale factors but better cache locality for the quantised weights. On x86 with VNNI the bs=32 layout is about 22 % faster end-to-end.
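
The size cost of the extra scales is easy to estimate: with symmetric INT8 and one fp32 scale per block, metadata overhead is `scale_bytes / block_size` per weight. A sketch (the weight count below is illustrative, and the real `.data` files also contain non-quantised tensors):

```python
def int8_blockwise_bytes(n_weights: int, block_size: int, scale_bytes: int = 4) -> int:
    """Bytes for n_weights INT8 values plus one scale per block
    (symmetric quantisation stores no zero-points)."""
    n_blocks = -(-n_weights // block_size)  # ceiling division
    return n_weights + n_blocks * scale_bytes


for bs in (32, 128):
    gb = int8_blockwise_bytes(2_400_000_000, bs) / 1e9  # illustrative weight count
    print(f"block_size={bs}: ~{gb:.2f} GB")
```

So bs=32 pays roughly 12.5 % metadata overhead where bs=128 pays about 3 %; the speed win of bs=32 comes from locality, not from size.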

## Comparison vs the INT4 sibling

| | INT4 (`cstr/cohere-transcribe-onnx-int4`) | INT8 (this repo) |
|---|---|---|
| Total size | 1.94 GB | 2.91 GB |
| Encoder weights | 1.8 GB | 2.7 GB |
| Cold load | 7.5 s | 17.5 s |
| Inference (5.4 s) | 9.7 s | 9.4 s |
| Total wall (5.4 s) | 17.2 s | 27.0 s |

On a CPU with VNNI (Cascade Lake / Ice Lake / Zen4), INT4 is actually slightly faster end-to-end than INT8 in the MatMulNBits format because the INT4 weights pack 2× more elements per cache line and the same number of vpdpbusd instructions process them. The main reason to ship INT8 alongside INT4 is numerical robustness: 8-bit symmetric quantisation has 16× as many representable levels as 4-bit, which matters for languages and accents where the 4-bit version drifts.
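
To see the headroom argument concretely, here is a toy round trip through symmetric quantisation at both widths. This is a simplification we wrote for illustration: it quantises per tensor with a restricted symmetric range, whereas the real exports quantise per 32-element block.

```python
import numpy as np


def fake_quant_symmetric(x: np.ndarray, bits: int) -> np.ndarray:
    """Quantise/dequantise round trip with one symmetric scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 127 for INT8
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
for bits in (4, 8):
    err = np.abs(w - fake_quant_symmetric(w, bits)).max()
    print(f"INT{bits} max round-trip error: {err:.4f}")
```

On the same tensor, the INT8 round-trip error is over an order of magnitude smaller than INT4's, which is the headroom that keeps drift-prone languages stable.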

## Quick start (Python)

```bash
pip install onnxruntime numpy soundfile librosa
```

```python
import onnxruntime as ort
import numpy as np
import librosa

# Load audio (must be 16 kHz mono)
audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)

# Load models (the .onnx.data weight files must sit next to the graphs)
enc = ort.InferenceSession("cohere-encoder.int8.onnx")
dec = ort.InferenceSession("cohere-decoder.int8.onnx")

# Load tokens
tokens = {}
with open("tokens.txt", "r", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().rsplit(" ", 1)
        if len(parts) == 2:
            tokens[int(parts[1])] = parts[0]
token_to_id = {v: k for k, v in tokens.items()}

# Build prompt
prompt_ids = [token_to_id[t] for t in [
    "<|startofcontext|>", "<|startoftranscript|>", "<|emo:undefined|>",
    "<|en|>", "<|en|>", "<|pnc|>", "<|noitn|>", "<|notimestamp|>", "<|nodiarize|>"
]]

# Run encoder
cross_k, cross_v = enc.run(None, {"audio": audio.reshape(1, -1).astype(np.float32)})

# Run decoder (autoregressive greedy decoding)
N_LAYERS, HEADS, HEAD_DIM, MAX_CTX = 8, 8, 128, 1024
self_k = np.zeros((N_LAYERS, 1, HEADS, MAX_CTX, HEAD_DIM), dtype=np.float32)
self_v = np.zeros((N_LAYERS, 1, HEADS, MAX_CTX, HEAD_DIM), dtype=np.float32)
eos_id = token_to_id["<|endoftext|>"]

generated = list(prompt_ids)
current = np.array([prompt_ids], dtype=np.int64)
offset = np.array(0, dtype=np.int64)

for _ in range(256):
    logits, self_k, self_v = dec.run(None, {
        "tokens": current, "in_n_layer_self_k_cache": self_k,
        "in_n_layer_self_v_cache": self_v, "n_layer_cross_k": cross_k,
        "n_layer_cross_v": cross_v, "offset": offset,
    })
    next_id = int(np.argmax(logits[0, -1, :]))
    if next_id == eos_id:
        break
    generated.append(next_id)
    offset = np.array(int(offset) + current.shape[1], dtype=np.int64)
    current = np.array([[next_id]], dtype=np.int64)

# Detokenise: drop special tokens, map "▁" (U+2581) back to spaces
text = "".join(
    tokens.get(t, "").replace("\u2581", " ")
    for t in generated[len(prompt_ids):]
    if not tokens.get(t, "").startswith("<|")
).strip()
print(text)
```

## How this was made

```python
from onnxruntime.quantization import matmul_nbits_quantizer, quant_utils

config = matmul_nbits_quantizer.DefaultWeightOnlyQuantConfig(
    block_size=32,
    is_symmetric=True,
    accuracy_level=4,
    quant_format=quant_utils.QuantFormat.QOperator,
    op_types_to_quantize=("MatMul",),
    bits=8,
)

model = quant_utils.load_model_with_shape_infer("cohere-encoder.onnx")
quantizer = matmul_nbits_quantizer.MatMulNBitsQuantizer(model, algo_config=config)
quantizer.process()
# True = write weights in the external-data format (the .onnx.data file)
quantizer.model.save_model_to_file("cohere-encoder.int8.onnx", True)
```

The decoder goes through the same pipeline with identical settings, pointed at `cohere-decoder.onnx`.

## Related

- C++ runtime (ggml + GGUF): `CrispStrobe/cohere-whisper.cpp`, pure C++ inference with VAD, word-level CTC forced alignment, and quantised GGUF models. On the same hardware as the benchmarks above, the ggml Q4_K runtime reaches ~15 s wall: roughly tied with the ONNX paths on raw inference, and faster end-to-end because mmap removes the cold-load penalty.
- Pre-converted GGUF weights: `cstr/cohere-transcribe-03-2026-GGUF`.

## License

Apache 2.0 (inherited from the base model).
