Cohere Transcribe INT8 ONNX

INT8-quantised ONNX export of CohereLabs/cohere-transcribe-03-2026, a 2B-parameter Conformer-based encoder-decoder ASR model.

Origin

This ONNX export was originally created by the Handy STT application. The encoder and decoder are split into separate ONNX graphs with KV-cache support for efficient autoregressive decoding. The model weights are dynamically quantised to INT8.

Both loose files and the original tarball (cohere-int8.tar.gz) are provided so consumers can pick whichever is more convenient.

Model details

Property      Value
Base model    CohereLabs/cohere-transcribe-03-2026
Architecture  Conformer encoder + Transformer decoder (autoregressive)
Parameters    2B
Quantisation  Dynamic INT8
Licence       Apache 2.0
Languages     14 (en, fr, de, it, es, pt, el, nl, pl, zh, ja, ko, vi, ar)
Mean WER      5.42% (Open ASR Leaderboard)

Files

cohere-encoder.int8.onnx       (3 MB graph definition)
cohere-encoder.int8.onnx.data  (2.6 GB external weights)
cohere-decoder.int8.onnx       (147 MB, weights inline)
tokens.txt                     (203 KB, 16384 tokens)

Total size: ~2.7 GB. A compressed tarball (cohere-int8.tar.gz, 1.7 GB) is also provided for single-download convenience.

SHA-256 checksums

8372ca6c8ff4db8b916ca3592f5c757a715e691b9edec751ba19b29fc854baf9  cohere-decoder.int8.onnx
58386cad715aa0ab30aaa118a479e43115380c114bd180178a0d110434991a54  cohere-encoder.int8.onnx
013ede043ae2480e3a9205cc34550d9686100cc682bacc90f702facdfbb93035  tokens.txt
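After downloading, the files can be verified against these checksums with sha256sum (GNU coreutils; on macOS, shasum -a 256 is the equivalent). Run from the directory containing the model files:

```shell
# Verify the downloaded files against the published checksums.
# sha256sum -c exits non-zero if any file is missing or mismatched.
sha256sum -c <<'EOF'
8372ca6c8ff4db8b916ca3592f5c757a715e691b9edec751ba19b29fc854baf9  cohere-decoder.int8.onnx
58386cad715aa0ab30aaa118a479e43115380c114bd180178a0d110434991a54  cohere-encoder.int8.onnx
013ede043ae2480e3a9205cc34550d9686100cc682bacc90f702facdfbb93035  tokens.txt
EOF
```

Note that cohere-encoder.int8.onnx.data is not covered by a published checksum; size (2.6 GB) is the only sanity check available for it here.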

ONNX graph signatures

Encoder

Takes raw 16kHz mono f32 audio samples directly (mel spectrogram computation is baked into the graph).

Direction  Name             Type     Shape
Input      audio            float32  [batch, samples]
Output     n_layer_cross_k  float32  [8, batch, T_enc, 1024]
Output     n_layer_cross_v  float32  [8, batch, T_enc, 1024]

Decoder

Autoregressive token generation with self-attention KV cache.

Direction  Name                      Type     Shape
Input      tokens                    int64    [batch, n_tokens]
Input      in_n_layer_self_k_cache   float32  [8, batch, 8, 1024, 128]
Input      in_n_layer_self_v_cache   float32  [8, batch, 8, 1024, 128]
Input      n_layer_cross_k           float32  [8, batch, T_enc, 1024]
Input      n_layer_cross_v           float32  [8, batch, T_enc, 1024]
Input      offset                    int64    scalar
Output     logits                    float32  [batch, n_tokens, 16384]
Output     out_n_layer_self_k_cache  float32  [8, batch, 8, 1024, 128]
Output     out_n_layer_self_v_cache  float32  [8, batch, 8, 1024, 128]

Architecture: 8 decoder layers, 8 attention heads, 128 head dim (1024 hidden), 16384 vocab size, max 1024 output tokens.
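The tensor shapes in the tables above follow directly from these dimensions; a minimal sanity check in plain Python (batch size 1 assumed):

```python
# Decoder geometry as stated above.
n_layers, n_heads, head_dim, max_ctx, vocab = 8, 8, 128, 1024, 16384

# Hidden size is heads * head_dim.
hidden = n_heads * head_dim  # 8 * 128 = 1024

# Self-attention KV cache shape: [layers, batch, heads, max_ctx, head_dim].
self_cache_shape = (n_layers, 1, n_heads, max_ctx, head_dim)

print(hidden)            # 1024
print(self_cache_shape)  # (8, 1, 8, 1024, 128)
```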

Decoder prompt format

The decoder is seeded with a prompt sequence of special tokens:

[<|startoftranscript|>, <|lang|>, <|pnc|>/<|nopnc|>, <|notimestamp|>, <|itn|>/<|noitn|>]
Token                  ID     Purpose
<|startoftranscript|>  4      Start of sequence
<|en|>                 62     Language (ISO 639-1, required)
<|pnc|> / <|nopnc|>    5 / 6  Punctuation and capitalisation on/off
<|notimestamp|>        11     Timestamps disabled (model does not support timestamps)
<|itn|> / <|noitn|>    8 / 9  Inverse text normalisation on/off
<|endoftext|>          3      EOS (stop decoding when emitted)

Language tokens are ISO 639-1 two-letter codes wrapped in <|...|>. The full list is in tokens.txt (IDs 22-220).
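Assembling the prompt is then a matter of picking one token per slot. A minimal helper (a sketch; only the <|en|> ID is documented above, so the language map here has a single entry — the rest must be read from tokens.txt):

```python
# Special token IDs from the table above.
SOT, PNC, NOPNC, NOTIMESTAMP, ITN, NOITN, EOS = 4, 5, 6, 11, 8, 9, 3

# Language token IDs live in tokens.txt (IDs 22-220); only <|en|> is listed here.
LANG_IDS = {"en": 62}

def build_prompt(lang="en", punctuation=True, itn=True):
    """Build the 5-token decoder seed: [SOT, lang, pnc, notimestamp, itn]."""
    return [
        SOT,
        LANG_IDS[lang],
        PNC if punctuation else NOPNC,
        NOTIMESTAMP,
        ITN if itn else NOITN,
    ]

print(build_prompt())  # [4, 62, 5, 11, 8]
```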

Inference loop

  1. Feed raw 16kHz mono f32 audio to encoder, get cross-attention KV pairs
  2. Initialise empty self-attention KV cache (zeros, shape [8, 1, 8, 1024, 128])
  3. Feed prompt tokens to decoder, get first predicted token via argmax on logits
  4. Loop: feed predicted token + updated KV cache to decoder, argmax, repeat until <|endoftext|> or 1024 tokens
  5. Decode output token IDs using tokens.txt (SentencePiece format, \u2581 = word boundary)
  6. Filter out any special tokens (IDs < 256 or <|...|> patterns) from the output

Token format

tokens.txt contains one token_text token_id pair per line. Token text uses SentencePiece encoding, in which \u2581 (U+2581, lower one eighth block) marks a word boundary/space.

Limitations

From the base model card:

  • Requires explicit language selection (no auto-detection)
  • Does not support timestamps or speaker diarisation
  • Eager to transcribe non-speech sounds; benefits from a VAD/noise gate upstream
  • Best performance when audio is single-language (limited code-switching support)

Usage with ONNX Runtime (Python)

import onnxruntime as ort
import numpy as np

# The encoder's external weights (cohere-encoder.int8.onnx.data) must sit
# in the same directory as the graph file for ONNX Runtime to find them.
encoder = ort.InferenceSession("cohere-encoder.int8.onnx")
decoder = ort.InferenceSession("cohere-decoder.int8.onnx")

# audio: 16kHz mono f32 numpy array
enc_out = encoder.run(None, {"audio": audio.reshape(1, -1)})
cross_k, cross_v = enc_out

# Prompt: startoftranscript, en, pnc, notimestamp, itn
prompt = np.array([[4, 62, 5, 11, 8]], dtype=np.int64)
cache_k = np.zeros((8, 1, 8, 1024, 128), dtype=np.float32)
cache_v = np.zeros((8, 1, 8, 1024, 128), dtype=np.float32)

# First decoder step
out = decoder.run(None, {
    "tokens": prompt,
    "in_n_layer_self_k_cache": cache_k,
    "in_n_layer_self_v_cache": cache_v,
    "n_layer_cross_k": cross_k,
    "n_layer_cross_v": cross_v,
    "offset": np.array(0, dtype=np.int64),
})
logits, cache_k, cache_v = out
next_token = int(np.argmax(logits[0, -1, :]))

# Continue decoding until EOS (token 3) ...
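The elided continuation can be sketched as a greedy loop. This is a sketch, not the canonical implementation: the assumption here is that offset counts the tokens already present in the self-attention KV cache (0 on the first call, growing by the number of tokens fed at each step).

```python
import numpy as np

def greedy_decode(decoder, prompt, cross_k, cross_v,
                  n_layers=8, n_heads=8, max_ctx=1024, head_dim=128,
                  eos_id=3, max_tokens=1024):
    """Greedy autoregressive decoding against the decoder graph above.

    Assumption: `offset` is the count of tokens already in the KV cache.
    """
    cache_k = np.zeros((n_layers, 1, n_heads, max_ctx, head_dim), dtype=np.float32)
    cache_v = np.zeros_like(cache_k)
    tokens = np.asarray([prompt], dtype=np.int64)  # full prompt on the first step
    generated, offset = [], 0
    while len(generated) < max_tokens:
        logits, cache_k, cache_v = decoder.run(None, {
            "tokens": tokens,
            "in_n_layer_self_k_cache": cache_k,
            "in_n_layer_self_v_cache": cache_v,
            "n_layer_cross_k": cross_k,
            "n_layer_cross_v": cross_v,
            "offset": np.array(offset, dtype=np.int64),
        })
        next_tok = int(np.argmax(logits[0, -1, :]))
        if next_tok == eos_id:          # <|endoftext|>
            break
        generated.append(next_tok)
        offset += tokens.shape[1]       # advance past the tokens just fed
        tokens = np.asarray([[next_tok]], dtype=np.int64)  # one token per step after the prompt
    return generated
```

Map the resulting IDs back to text via tokens.txt as described in the token format section.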