Cohere Transcribe - ONNX

ONNX conversion of CohereLabs/cohere-transcribe-03-2026, a 2B parameter Conformer-based speech recognition model supporting 14 languages.

Runs locally with ONNX Runtime from Python, C#, C++, Java, JavaScript, or any other ONNX-compatible runtime. No cloud APIs and no GPU required -- and Python itself is optional, needed only if you use it as your host language.

What's Included

INT8 dynamically quantized ONNX model (2.75 GB total):

File                           Size    Description
cohere-encoder.int8.onnx       3 MB    Encoder graph (weights in .data file)
cohere-encoder.int8.onnx.data  2.6 GB  Encoder weights
cohere-decoder.int8.onnx       147 MB  Decoder (self-contained)
tokens.txt                     219 KB  16,384-entry vocabulary

Plus:

  • Export and quantization scripts (reproduce FP32 or re-quantize from the original model)
  • C# inference example
  • Complete porting guide with all concepts explained

Quick Start (Python)

pip install onnxruntime numpy soundfile librosa
import onnxruntime as ort
import numpy as np
import librosa

# Load audio (must be 16kHz mono)
audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)

# Load models
enc = ort.InferenceSession("cohere-encoder.int8.onnx")
dec = ort.InferenceSession("cohere-decoder.int8.onnx")

# Load tokens
tokens = {}
with open("tokens.txt", "r", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().rsplit(" ", 1)
        if len(parts) == 2:
            tokens[int(parts[1])] = parts[0]
token_to_id = {v: k for k, v in tokens.items()}

# Build prompt (change <|en|> to <|de|>, <|fr|>, etc.)
prompt_ids = [token_to_id[t] for t in [
    "<|startofcontext|>", "<|startoftranscript|>", "<|emo:undefined|>",
    "<|en|>", "<|en|>", "<|pnc|>", "<|noitn|>", "<|notimestamp|>", "<|nodiarize|>"
]]

# Run encoder (raw audio in, encoder representations out)
cross_k, cross_v = enc.run(None, {"audio": audio.reshape(1, -1).astype(np.float32)})

# Run decoder (autoregressive greedy decoding)
N_LAYERS, HEADS, HEAD_DIM, MAX_CTX = 8, 8, 128, 1024
self_k = np.zeros((N_LAYERS, 1, HEADS, MAX_CTX, HEAD_DIM), dtype=np.float32)
self_v = np.zeros((N_LAYERS, 1, HEADS, MAX_CTX, HEAD_DIM), dtype=np.float32)
eos_id = token_to_id["<|endoftext|>"]

generated = list(prompt_ids)
current = np.array([prompt_ids], dtype=np.int64)
offset = np.array(0, dtype=np.int64)

for _ in range(256):
    logits, self_k, self_v = dec.run(None, {
        "tokens": current, "in_n_layer_self_k_cache": self_k,
        "in_n_layer_self_v_cache": self_v, "n_layer_cross_k": cross_k,
        "n_layer_cross_v": cross_v, "offset": offset,
    })
    next_id = int(np.argmax(logits[0, -1, :]))
    if next_id == eos_id:
        break
    generated.append(next_id)
    offset = np.array(int(offset) + current.shape[1], dtype=np.int64)
    current = np.array([[next_id]], dtype=np.int64)

# Decode to text
text = "".join(
    tokens.get(t, "").replace("\u2581", " ")
    for t in generated[len(prompt_ids):]
    if not tokens.get(t, "").startswith("<|")
).strip()
print(text)

Quick Start (C#)

dotnet add package Microsoft.ML.OnnxRuntime

See CohereTranscribe.cs for a complete example. For GPU acceleration on Windows:

dotnet add package Microsoft.ML.OnnxRuntime.DirectML

Supported Languages

Code  Language     Code  Language
en    English      nl    Dutch
de    German       pl    Polish
fr    French       el    Greek
es    Spanish      ar    Arabic
it    Italian      ja    Japanese
pt    Portuguese   zh    Chinese
vi    Vietnamese   ko    Korean

Change language by swapping <|en|> in the prompt to the desired language code.
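A small helper (not part of the repo, just a convenience wrapper around the Quick Start prompt) makes the swap explicit -- both language tokens in the prompt must change together:

```python
def build_prompt(lang: str) -> list:
    """Control-token prompt from the Quick Start, parameterized by language code."""
    return [
        "<|startofcontext|>", "<|startoftranscript|>", "<|emo:undefined|>",
        f"<|{lang}|>", f"<|{lang}|>", "<|pnc|>", "<|noitn|>",
        "<|notimestamp|>", "<|nodiarize|>",
    ]

print(build_prompt("de")[3:5])  # ['<|de|>', '<|de|>']
```

Map these strings to IDs with the `token_to_id` dictionary from the Quick Start before feeding them to the decoder.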

Performance (INT8)

All benchmarks run on a Windows 11 machine with CPU only (no dedicated GPU), 24 GB shared RAM, using the INT8 variant. No special optimization or batching -- just sequential chunk-by-chunk processing.

Audio                                             Language  Duration  Processing Time  Speed          Tokens Generated
Voice memo                                        English   30.5 s    5.4 s            5.6x realtime  61
Interview podcast (Mike Agugliaro / Jay Abraham)  English   57.1 min  14 min 53 s      3.8x realtime  ~14,500
Language learning podcast (Easy German)           German    35.0 min  9 min 24 s       3.7x realtime  ~8,900

What the Numbers Mean

  • 5.6x realtime means 1 second of audio is transcribed in ~0.18 seconds
  • 3.7-3.8x realtime on longer files (the slight slowdown vs short clips is because longer audio produces more decoder tokens per chunk)
  • Encoder takes ~3.5-7s per 30s chunk (the heavy lifting -- 48 Conformer layers)
  • Decoder takes ~1.5-3s per chunk depending on how many words are spoken
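The realtime factor is simply audio duration divided by processing time. Checking the table rows (durations converted to seconds):

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Speed multiplier: seconds of audio transcribed per wall-clock second."""
    return audio_seconds / processing_seconds

# Rows from the benchmark table above
print(round(realtime_factor(30.5, 5.4), 1))                # 5.6 (voice memo)
print(round(realtime_factor(57.1 * 60, 14 * 60 + 53), 1))  # 3.8 (interview podcast)
print(round(realtime_factor(35.0 * 60, 9 * 60 + 24), 1))   # 3.7 (German podcast)
```

Inverting the first row gives the per-second cost quoted above: 1 / 5.6 ≈ 0.18 seconds of processing per second of audio.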

Memory Usage

Component                      RAM
Encoder session (loaded once)  ~2.6 GB
Decoder session (loaded once)  ~146 MB
Per-chunk inference overhead   ~300-500 MB
Decoder KV cache               ~256 MB
Peak total                     ~3.5-4 GB

Peak RAM stays constant regardless of audio length -- only one chunk is in memory at a time. A 5-minute file and a 5-hour file use the same amount of RAM.

GPU Acceleration

These benchmarks are CPU-only. For GPU acceleration on Windows, use DirectML (works with any DirectX 12 GPU -- AMD, NVIDIA, Intel). Expected 3-5x additional speedup, bringing throughput to 15-25x realtime.

Long Audio

The encoder handles up to ~35 seconds per call. For longer audio, split into overlapping chunks:

  • Chunk size: 30 seconds
  • Overlap: 5 seconds
  • Stride: 25 seconds

Transcribe each chunk independently, then join the results with spaces.
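The chunking scheme above can be sketched as a generator over the raw samples. `transcribe` here is a hypothetical helper wrapping the encoder/decoder steps from the Quick Start, not something shipped with the model:

```python
import numpy as np

SR = 16000
CHUNK = 30 * SR   # 30-second window
STRIDE = 25 * SR  # advance 25 s per step -> 5 s overlap between windows

def chunk_audio(audio: np.ndarray):
    """Yield overlapping 30 s windows over a 16 kHz mono signal."""
    start = 0
    while start < len(audio):
        yield audio[start:start + CHUNK]
        if start + CHUNK >= len(audio):
            break  # this window already covers the end of the file
        start += STRIDE

# Usage: text = " ".join(transcribe(c) for c in chunk_audio(audio))
demo = np.zeros(80 * SR, dtype=np.float32)  # 80 s of silence
print([len(c) // SR for c in chunk_audio(demo)])  # [30, 30, 30]
```

Because each window is processed and discarded before the next one is loaded into the decoder, this loop is what keeps peak RAM constant regardless of file length.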

How This Was Made

The original PyTorch model was converted to ONNX with several engineering adaptations:

  1. Feature extraction baked in -- The encoder takes raw 16kHz audio, not mel spectrograms. STFT is implemented via Conv1d DFT filters (since torch.stft doesn't export to ONNX).

  2. Cross-attention K/V pre-computed -- The encoder pre-computes Key/Value projections for all 8 decoder layers, matching the Whisper-style encoder-decoder ONNX pattern.

  3. Selective quantization -- the convolution and batch normalization layers in the audio frontend are excluded from INT8 quantization, keeping the feature extraction path at full precision to preserve accuracy.
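The Conv1d-DFT trick from point 1 can be illustrated in plain NumPy (a sketch of the idea, not the exported graph -- windowing is omitted, and `n_fft`/`hop` are placeholder values, not the model's actual frontend parameters). Each DFT bin is a fixed cosine/sine filter, so a strided correlation with those filter banks reproduces the STFT magnitudes:

```python
import numpy as np

n_fft, hop = 512, 160          # placeholder STFT parameters
n_bins = n_fft // 2 + 1

# Fixed "convolution" kernels: one cosine and one sine filter per frequency bin
t = np.arange(n_fft)
k = np.arange(n_bins)[:, None]
cos_kernels = np.cos(2 * np.pi * k * t / n_fft)    # (n_bins, n_fft)
sin_kernels = -np.sin(2 * np.pi * k * t / n_fft)

def stft_via_filters(x: np.ndarray) -> np.ndarray:
    """Magnitude STFT computed as strided dot products with DFT filter banks."""
    frames = np.stack([x[i:i + n_fft]
                       for i in range(0, len(x) - n_fft + 1, hop)])
    real = frames @ cos_kernels.T  # equivalent to Conv1d with stride=hop
    imag = frames @ sin_kernels.T
    return np.sqrt(real**2 + imag**2)

# Matches the FFT-based reference exactly (up to float error)
x = np.random.default_rng(0).standard_normal(16000)
ref = np.abs(np.array([np.fft.rfft(x[i:i + n_fft])
                       for i in range(0, len(x) - n_fft + 1, hop)]))
print(np.allclose(stft_via_filters(x), ref, atol=1e-6))  # True
```

Since the kernels are constant weights, an exporter can express this as an ordinary Conv1d node, which is why the ONNX encoder can accept raw audio even though `torch.stft` itself has no ONNX export.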

See PORTING_GUIDE.md for the complete technical writeup with every concept explained.

ONNX Tensor Contract

Encoder

Input:   audio              (batch, samples)       float32, raw 16kHz mono
Output:  n_layer_cross_k    (8, batch, T', 1024)   float32
         n_layer_cross_v    (8, batch, T', 1024)   float32

Decoder

Input:   tokens                    (batch, n_tokens)              int64
         in_n_layer_self_k_cache   (8, batch, 8, 1024, 128)      float32
         in_n_layer_self_v_cache   (8, batch, 8, 1024, 128)      float32
         n_layer_cross_k           (8, batch, T', 1024)           float32
         n_layer_cross_v           (8, batch, T', 1024)           float32
         offset                    ()                              int64 scalar
Output:  logits                    (batch, n_tokens, 16384)       float32
         out_n_layer_self_k_cache  (8, batch, 8, 1024, 128)      float32
         out_n_layer_self_v_cache  (8, batch, 8, 1024, 128)      float32

Attribution

Original model: CohereLabs/cohere-transcribe-03-2026 by Cohere and Cohere Labs.

ONNX conversion, quantization, and documentation: community contribution.

License: Apache 2.0
