Cohere Transcribe INT8 ONNX
INT8-quantised ONNX export of CohereLabs/cohere-transcribe-03-2026, a 2B parameter conformer-based encoder-decoder ASR model.
Origin
This ONNX export was originally created by the Handy STT application. The encoder and decoder are split into separate ONNX graphs with KV-cache support for efficient autoregressive decoding. The model weights are dynamically quantised to INT8.
Both loose files and the original tarball (cohere-int8.tar.gz) are provided so consumers can pick whichever is more convenient.
Model details
| Property | Value |
|---|---|
| Base model | CohereLabs/cohere-transcribe-03-2026 |
| Architecture | Conformer encoder + Transformer decoder (autoregressive) |
| Parameters | 2B |
| Quantisation | Dynamic INT8 |
| Licence | Apache 2.0 |
| Languages | 14 (en, fr, de, it, es, pt, el, nl, pl, zh, ja, ko, vi, ar) |
| Mean WER | 5.42% (Open ASR Leaderboard) |
Files
- `cohere-encoder.int8.onnx` (3 MB graph definition)
- `cohere-encoder.int8.onnx.data` (2.6 GB external weights)
- `cohere-decoder.int8.onnx` (147 MB, weights inline)
- `tokens.txt` (203 KB, 16384 tokens)
Total size: ~2.7 GB. A compressed tarball (cohere-int8.tar.gz, 1.7 GB) is also provided for single-download convenience.
SHA-256 checksums
```
8372ca6c8ff4db8b916ca3592f5c757a715e691b9edec751ba19b29fc854baf9  cohere-decoder.int8.onnx
58386cad715aa0ab30aaa118a479e43115380c114bd180178a0d110434991a54  cohere-encoder.int8.onnx
013ede043ae2480e3a9205cc34550d9686100cc682bacc90f702facdfbb93035  tokens.txt
```
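The digests can be checked from Python without shelling out. A minimal sketch (`sha256_of` is an illustrative helper, not part of this repo); it streams in chunks because the external-weights file is 2.6 GB:

```python
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()
```

Compare, e.g., `sha256_of("cohere-decoder.int8.onnx")` against the value above.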
ONNX graph signatures
Encoder
Takes raw 16kHz mono f32 audio samples directly (mel spectrogram computation is baked into the graph).
| Direction | Name | Type | Shape |
|---|---|---|---|
| Input | `audio` | float32 | [batch, samples] |
| Output | `n_layer_cross_k` | float32 | [8, batch, T_enc, 1024] |
| Output | `n_layer_cross_v` | float32 | [8, batch, T_enc, 1024] |
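Since the encoder consumes raw samples, any loader that yields 16kHz mono float32 will do. A standard-library sketch (assumes the WAV is already 16kHz mono 16-bit PCM; resampling is out of scope, and `load_wav_f32` is an illustrative name):

```python
import wave
import numpy as np

def load_wav_f32(path):
    """Load a 16kHz mono 16-bit PCM WAV as float32 samples in [-1, 1]."""
    with wave.open(path, "rb") as w:
        assert w.getframerate() == 16000 and w.getnchannels() == 1
        pcm = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    return pcm.astype(np.float32) / 32768.0
```

The result can be fed to the encoder as `audio.reshape(1, -1)` to add the batch dimension.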
Decoder
Autoregressive token generation with self-attention KV cache.
| Direction | Name | Type | Shape |
|---|---|---|---|
| Input | `tokens` | int64 | [batch, n_tokens] |
| Input | `in_n_layer_self_k_cache` | float32 | [8, batch, 8, 1024, 128] |
| Input | `in_n_layer_self_v_cache` | float32 | [8, batch, 8, 1024, 128] |
| Input | `n_layer_cross_k` | float32 | [8, batch, T_enc, 1024] |
| Input | `n_layer_cross_v` | float32 | [8, batch, T_enc, 1024] |
| Input | `offset` | int64 | scalar |
| Output | `logits` | float32 | [batch, n_tokens, 16384] |
| Output | `out_n_layer_self_k_cache` | float32 | [8, batch, 8, 1024, 128] |
| Output | `out_n_layer_self_v_cache` | float32 | [8, batch, 8, 1024, 128] |
Architecture: 8 decoder layers, 8 attention heads, 128 head dim (1024 hidden), 16384 vocab size, max 1024 output tokens.
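Those dimensions fix the self-attention cache footprint: each of K and V holds 8 layers × 8 heads × 1024 positions × 128 dims of float32 per batch item. A quick sanity check:

```python
import numpy as np

# Self-attention cache shape per the decoder signature: [8, batch, 8, 1024, 128]
layers, heads, max_pos, head_dim = 8, 8, 1024, 128
elems = layers * heads * max_pos * head_dim            # per batch item, per cache
bytes_per_cache = elems * np.dtype(np.float32).itemsize
print(bytes_per_cache // 2**20, "MiB per cache")       # 32 MiB (K or V)
print(2 * bytes_per_cache // 2**20, "MiB total")       # 64 MiB (K + V)
```

So each batch item carries a fixed 64 MiB of decoder cache regardless of how many tokens have actually been generated.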
Decoder prompt format
The decoder is seeded with a prompt sequence of special tokens:
`[<|startoftranscript|>, <|lang|>, <|pnc|>/<|nopnc|>, <|notimestamp|>, <|itn|>/<|noitn|>]`
| Token | ID | Purpose |
|---|---|---|
| `<\|startoftranscript\|>` | 4 | Start of sequence |
| `<\|en\|>` | 62 | Language (ISO 639-1, required) |
| `<\|pnc\|>` / `<\|nopnc\|>` | 5 / 6 | Punctuation and capitalisation on/off |
| `<\|notimestamp\|>` | 11 | Timestamps disabled (model does not support timestamps) |
| `<\|itn\|>` / `<\|noitn\|>` | 8 / 9 | Inverse text normalisation on/off |
| `<\|endoftext\|>` | 3 | EOS (stop decoding when emitted) |
Language tokens are ISO 639-1 two-letter codes wrapped in <|...|>. The full list is in tokens.txt (IDs 22-220).
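As a sketch, the prompt can be assembled from the IDs above (`build_prompt` is a hypothetical helper, not part of the export; only `<|en|>` = 62 is listed here, other language IDs come from `tokens.txt`):

```python
# Special-token IDs from the table above.
SOT, PNC, NOPNC, ITN, NOITN, NOTIMESTAMP = 4, 5, 6, 8, 9, 11

def build_prompt(lang_id, pnc=True, itn=True):
    """Assemble the five-token decoder prompt in the required order."""
    return [SOT, lang_id, PNC if pnc else NOPNC, NOTIMESTAMP, ITN if itn else NOITN]

print(build_prompt(62))  # English, punctuation and ITN on: [4, 62, 5, 11, 8]
```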
Inference loop
- Feed raw 16kHz mono f32 audio to the encoder to get the cross-attention KV pairs
- Initialise an empty self-attention KV cache (zeros, shape `[8, 1, 8, 1024, 128]`)
- Feed the prompt tokens to the decoder and take the first predicted token via argmax on the logits
- Loop: feed the predicted token and the updated KV cache back to the decoder, argmax, repeat until `<|endoftext|>` or 1024 tokens
- Decode the output token IDs using `tokens.txt` (SentencePiece format, `▁` = word boundary)
- Filter out any special tokens (IDs < 256 or `<|...|>` patterns) from the output
Token format
`tokens.txt` uses the format `token_text token_id`, one token per line. Tokens use SentencePiece encoding, where `▁` (U+2581, "lower one eighth block") represents a word boundary/space.
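The decode-and-filter steps can be sketched directly from that format (`load_tokens` and `detokenise` are illustrative names, not part of the export):

```python
def load_tokens(path):
    """Parse tokens.txt: one `token_text token_id` pair per line."""
    id_to_tok = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            text, tid = line.rstrip("\n").rsplit(" ", 1)
            id_to_tok[int(tid)] = text
    return id_to_tok

def detokenise(ids, id_to_tok):
    """Map IDs back to text, dropping special tokens and restoring spaces."""
    parts = []
    for i in ids:
        tok = id_to_tok.get(i, "")
        if i < 256 or (tok.startswith("<|") and tok.endswith("|>")):
            continue  # special token, per the filtering rule above
        parts.append(tok.replace("\u2581", " "))
    return "".join(parts).strip()
```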
Limitations
From the base model card:
- Requires explicit language selection (no auto-detection)
- Does not support timestamps or speaker diarisation
- Eager to transcribe non-speech sounds; benefits from a VAD/noise gate upstream
- Best performance when audio is single-language (limited code-switching support)
Usage with ONNX Runtime (Python)
```python
import onnxruntime as ort
import numpy as np

encoder = ort.InferenceSession("cohere-encoder.int8.onnx")
decoder = ort.InferenceSession("cohere-decoder.int8.onnx")

# audio: 16kHz mono f32 numpy array
enc_out = encoder.run(None, {"audio": audio.reshape(1, -1)})
cross_k, cross_v = enc_out

# Prompt: startoftranscript, en, pnc, notimestamp, itn
prompt = np.array([[4, 62, 5, 11, 8]], dtype=np.int64)
cache_k = np.zeros((8, 1, 8, 1024, 128), dtype=np.float32)
cache_v = np.zeros((8, 1, 8, 1024, 128), dtype=np.float32)

# First decoder step
out = decoder.run(None, {
    "tokens": prompt,
    "in_n_layer_self_k_cache": cache_k,
    "in_n_layer_self_v_cache": cache_v,
    "n_layer_cross_k": cross_k,
    "n_layer_cross_v": cross_v,
    "offset": np.array(0, dtype=np.int64),
})
logits, cache_k, cache_v = out
next_token = int(np.argmax(logits[0, -1, :]))
# Continue decoding until EOS (token 3) ...
```
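The trailing comment can be filled in with a loop like the following. Here `step_fn` is a hypothetical wrapper that calls `decoder.run` with the new token, the current caches, and the running `offset`, and returns just the logits; factoring it out keeps the loop itself model-agnostic:

```python
import numpy as np

def greedy_decode(step_fn, prompt, eos=3, max_tokens=1024):
    """Greedy autoregressive loop: step_fn(tokens, offset) -> logits [1, n, vocab]."""
    logits = step_fn(prompt, 0)                      # prompt pass at offset 0
    next_tok = int(np.argmax(logits[0, -1, :]))
    out, offset = [next_tok], prompt.shape[1]
    while next_tok != eos and len(out) < max_tokens:
        logits = step_fn(np.array([[next_tok]], dtype=np.int64), offset)
        next_tok = int(np.argmax(logits[0, -1, :]))
        out.append(next_tok)
        offset += 1
    return out                                       # includes the trailing EOS
```

In the sketch above, `step_fn` would also reassign `cache_k`/`cache_v` from the decoder outputs on each call, mirroring the first step shown earlier.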