Cohere Transcribe INT8 ONNX

INT8-quantised ONNX export of CohereLabs/cohere-transcribe-03-2026, a 2B-parameter Conformer-based encoder-decoder ASR model.

Origin

This ONNX export was originally created by the Handy STT application. The encoder and decoder are split into separate ONNX graphs with KV-cache support for efficient autoregressive decoding. The model weights are dynamically quantised to INT8.

Both loose files and the original tarball (cohere-int8.tar.gz) are provided so consumers can pick whichever is more convenient.

Model details

Property      Value
Base model    CohereLabs/cohere-transcribe-03-2026
Architecture  Conformer encoder + Transformer decoder (autoregressive)
Parameters    2B
Quantisation  Dynamic INT8
Licence       Apache 2.0
Languages     14 (en, fr, de, it, es, pt, el, nl, pl, zh, ja, ko, vi, ar)
Mean WER      5.42% (Open ASR Leaderboard)

Files

cohere-encoder.int8.onnx       (3 MB graph definition)
cohere-encoder.int8.onnx.data  (2.6 GB external weights)
cohere-decoder.int8.onnx       (147 MB, weights inline)
tokens.txt                     (203 KB, 16384 tokens)

Total size: ~2.7 GB. A compressed tarball (cohere-int8.tar.gz, 1.7 GB) is also provided for single-download convenience.

SHA-256 checksums

8372ca6c8ff4db8b916ca3592f5c757a715e691b9edec751ba19b29fc854baf9  cohere-decoder.int8.onnx
58386cad715aa0ab30aaa118a479e43115380c114bd180178a0d110434991a54  cohere-encoder.int8.onnx
013ede043ae2480e3a9205cc34550d9686100cc682bacc90f702facdfbb93035  tokens.txt
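After downloading, the files can be verified against these checksums with sha256sum (GNU coreutils; on macOS, shasum -a 256 is the equivalent). Run from the directory containing the model files:

```shell
# Verify the downloaded files against the published checksums.
# sha256sum -c exits non-zero if any file is missing or mismatched.
sha256sum -c <<'EOF'
8372ca6c8ff4db8b916ca3592f5c757a715e691b9edec751ba19b29fc854baf9  cohere-decoder.int8.onnx
58386cad715aa0ab30aaa118a479e43115380c114bd180178a0d110434991a54  cohere-encoder.int8.onnx
013ede043ae2480e3a9205cc34550d9686100cc682bacc90f702facdfbb93035  tokens.txt
EOF
```

Note that cohere-encoder.int8.onnx.data is not covered by a published checksum; size (2.6 GB) is the only sanity check available for it here.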

ONNX graph signatures

Encoder

Takes raw 16kHz mono f32 audio samples directly (mel spectrogram computation is baked into the graph).

Direction  Name             Type     Shape
Input      audio            float32  [batch, samples]
Output     n_layer_cross_k  float32  [8, batch, T_enc, 1024]
Output     n_layer_cross_v  float32  [8, batch, T_enc, 1024]

Decoder

Autoregressive token generation with self-attention KV cache.

Direction  Name                      Type     Shape
Input      tokens                    int64    [batch, n_tokens]
Input      in_n_layer_self_k_cache   float32  [8, batch, 8, 1024, 128]
Input      in_n_layer_self_v_cache   float32  [8, batch, 8, 1024, 128]
Input      n_layer_cross_k           float32  [8, batch, T_enc, 1024]
Input      n_layer_cross_v           float32  [8, batch, T_enc, 1024]
Input      offset                    int64    scalar
Output     logits                    float32  [batch, n_tokens, 16384]
Output     out_n_layer_self_k_cache  float32  [8, batch, 8, 1024, 128]
Output     out_n_layer_self_v_cache  float32  [8, batch, 8, 1024, 128]

Architecture: 8 decoder layers, 8 attention heads, 128 head dim (1024 hidden), 16384 vocab size, max 1024 output tokens.
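The tensor shapes in the tables above follow directly from these dimensions; a minimal sanity check in plain Python (batch size 1 assumed):

```python
# Decoder geometry as stated above.
n_layers, n_heads, head_dim, max_ctx, vocab = 8, 8, 128, 1024, 16384

# Hidden size is heads * head_dim.
hidden = n_heads * head_dim  # 8 * 128 = 1024

# Self-attention KV cache shape: [layers, batch, heads, max_ctx, head_dim].
self_cache_shape = (n_layers, 1, n_heads, max_ctx, head_dim)

print(hidden)            # 1024
print(self_cache_shape)  # (8, 1, 8, 1024, 128)
```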

Decoder prompt format

The decoder is seeded with a prompt sequence of special tokens:

[<|startoftranscript|>, <|lang|>, <|pnc|>/<|nopnc|>, <|notimestamp|>, <|itn|>/<|noitn|>]
Token                  ID     Purpose
<|startoftranscript|>  4      Start of sequence
<|en|>                 62     Language (ISO 639-1, required)
<|pnc|> / <|nopnc|>    5 / 6  Punctuation and capitalisation on/off
<|notimestamp|>        11     Timestamps disabled (model does not support timestamps)
<|itn|> / <|noitn|>    8 / 9  Inverse text normalisation on/off
<|endoftext|>          3      EOS (stop decoding when emitted)

Language tokens are ISO 639-1 two-letter codes wrapped in <|...|>. The full list is in tokens.txt (IDs 22-220).
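Assembling the prompt is then a matter of picking one token per slot. A minimal helper (a sketch; only the <|en|> ID is documented above, so the language map here has a single entry — the rest must be read from tokens.txt):

```python
# Special token IDs from the table above.
SOT, PNC, NOPNC, NOTIMESTAMP, ITN, NOITN, EOS = 4, 5, 6, 11, 8, 9, 3

# Language token IDs live in tokens.txt (IDs 22-220); only <|en|> is listed here.
LANG_IDS = {"en": 62}

def build_prompt(lang="en", punctuation=True, itn=True):
    """Build the 5-token decoder seed: [SOT, lang, pnc, notimestamp, itn]."""
    return [
        SOT,
        LANG_IDS[lang],
        PNC if punctuation else NOPNC,
        NOTIMESTAMP,
        ITN if itn else NOITN,
    ]

print(build_prompt())  # [4, 62, 5, 11, 8]
```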

Inference loop

  1. Feed raw 16kHz mono f32 audio to encoder, get cross-attention KV pairs
  2. Initialise empty self-attention KV cache (zeros, shape [8, 1, 8, 1024, 128])
  3. Feed prompt tokens to decoder, get first predicted token via argmax on logits
  4. Loop: feed predicted token + updated KV cache to decoder, argmax, repeat until <|endoftext|> or 1024 tokens
  5. Decode output token IDs using tokens.txt (SentencePiece format, \u2581 = word boundary)
  6. Filter out any special tokens (IDs < 256 or <|...|> patterns) from the output

Token format

tokens.txt contains one token_text token_id pair per line. Token text uses SentencePiece encoding, in which \u2581 (U+2581, lower one eighth block) marks a word boundary/space.

Limitations

From the base model card:

  • Requires explicit language selection (no auto-detection)
  • Does not support timestamps or speaker diarisation
  • Eager to transcribe non-speech sounds; benefits from a VAD/noise gate upstream
  • Best performance when audio is single-language (limited code-switching support)

Usage with ONNX Runtime (Python)

import onnxruntime as ort
import numpy as np

# The encoder's external weights (cohere-encoder.int8.onnx.data) must sit
# in the same directory as the graph file for ONNX Runtime to find them.
encoder = ort.InferenceSession("cohere-encoder.int8.onnx")
decoder = ort.InferenceSession("cohere-decoder.int8.onnx")

# audio: 16kHz mono f32 numpy array
enc_out = encoder.run(None, {"audio": audio.reshape(1, -1)})
cross_k, cross_v = enc_out

# Prompt: startoftranscript, en, pnc, notimestamp, itn
prompt = np.array([[4, 62, 5, 11, 8]], dtype=np.int64)
cache_k = np.zeros((8, 1, 8, 1024, 128), dtype=np.float32)
cache_v = np.zeros((8, 1, 8, 1024, 128), dtype=np.float32)

# First decoder step
out = decoder.run(None, {
    "tokens": prompt,
    "in_n_layer_self_k_cache": cache_k,
    "in_n_layer_self_v_cache": cache_v,
    "n_layer_cross_k": cross_k,
    "n_layer_cross_v": cross_v,
    "offset": np.array(0, dtype=np.int64),
})
logits, cache_k, cache_v = out
next_token = int(np.argmax(logits[0, -1, :]))

# Continue decoding until EOS (token 3) ...
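The elided continuation can be sketched as a greedy loop. This is a sketch, not the canonical implementation: the assumption here is that offset counts the tokens already present in the self-attention KV cache (0 on the first call, growing by the number of tokens fed at each step).

```python
import numpy as np

def greedy_decode(decoder, prompt, cross_k, cross_v,
                  n_layers=8, n_heads=8, max_ctx=1024, head_dim=128,
                  eos_id=3, max_tokens=1024):
    """Greedy autoregressive decoding against the decoder graph above.

    Assumption: `offset` is the count of tokens already in the KV cache.
    """
    cache_k = np.zeros((n_layers, 1, n_heads, max_ctx, head_dim), dtype=np.float32)
    cache_v = np.zeros_like(cache_k)
    tokens = np.asarray([prompt], dtype=np.int64)  # full prompt on the first step
    generated, offset = [], 0
    while len(generated) < max_tokens:
        logits, cache_k, cache_v = decoder.run(None, {
            "tokens": tokens,
            "in_n_layer_self_k_cache": cache_k,
            "in_n_layer_self_v_cache": cache_v,
            "n_layer_cross_k": cross_k,
            "n_layer_cross_v": cross_v,
            "offset": np.array(offset, dtype=np.int64),
        })
        next_tok = int(np.argmax(logits[0, -1, :]))
        if next_tok == eos_id:          # <|endoftext|>
            break
        generated.append(next_tok)
        offset += tokens.shape[1]       # advance past the tokens just fed
        tokens = np.asarray([[next_tok]], dtype=np.int64)  # one token per step after the prompt
    return generated
```

Map the resulting IDs back to text via tokens.txt as described in the token format section.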