Qwen3-ASR-1.7B ONNX

ONNX export of Qwen/Qwen3-ASR-1.7B for use with ONNX Runtime.

Includes FP32 and int4 (MatMulNBits) variants. Quantized files carry an .int4. infix in their names (for example, decoder_init.int4.onnx) and coexist with the FP32 files in the same directory. The transcribe-rs engine auto-detects quantized models by file presence.
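The file-presence detection can be illustrated with a short Python sketch. This is not the transcribe-rs implementation (that engine is written in Rust); the function name `select_model_files` and the rule that the complete int4 set must be present before it is chosen are assumptions made for illustration.

```python
from pathlib import Path

# Graph names taken from the file tables in this README.
GRAPHS = ("encoder", "decoder_init", "decoder_step")


def select_model_files(model_dir, prefer_int4=True):
    """Return a {graph name: path} map, preferring the int4 files.

    Assumption (not confirmed by the source): the int4 set is used only
    when every int4 graph and decoder_weights.int4.data are present;
    otherwise the FP32 files are returned.
    """
    root = Path(model_dir)
    int4 = {name: root / f"{name}.int4.onnx" for name in GRAPHS}
    fp32 = {name: root / f"{name}.onnx" for name in GRAPHS}
    int4_ready = (
        all(path.is_file() for path in int4.values())
        and (root / "decoder_weights.int4.data").is_file()
    )
    return int4 if (prefer_int4 and int4_ready) else fp32
```

Checking for the whole set rather than a single file avoids mixing an int4 graph with FP32 external weights after a partial download.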

Files

FP32

File	Description
`encoder.onnx`	Audio encoder — mel spectrogram to features (weights inlined)
`decoder_init.onnx`	Decoder prefill — accepts `input_ids`, outputs logits + KV cache
`decoder_step.onnx`	Decoder autoregressive step — accepts `input_embeds` + KV cache
`decoder_weights.data`	Shared external weights for both FP32 decoders (loaded once)

int4 (RTN, accuracy_level=4)

File	Description
`encoder.int4.onnx`	Encoder (FP32 weights — same as encoder.onnx)
`decoder_init.int4.onnx`	int4 decoder prefill
`decoder_step.int4.onnx`	int4 decoder step
`decoder_weights.int4.data`	Shared external weights for both int4 decoders (loaded once)

Shared

File	Description
`embed_tokens.bin`	Token embedding matrix [151936, 2048], float16
`tokenizer.json`	HuggingFace tokenizer
`config.json`	Architecture config, special tokens, mel params
`preprocessor_config.json`	Mel spectrogram parameters (WhisperFeatureExtractor format)

Architecture

Encoder: mel → windowed Conv2D + windowed attention → audio features
Decoder: Qwen3 (28 layers, GQA, SwiGLU, MRoPE) in split init/step format

The two decoder models use different input strategies:

decoder_init (prefill) accepts input_ids and has the embedding table in its graph — handles audio feature scatter internally
decoder_step (autoregressive) accepts pre-looked-up input_embeds — keeps the embedding table out of its graph

The consumer loads embed_tokens.bin once at startup and performs the embedding lookup per token before calling decoder_step.
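A minimal sketch of that per-token lookup, using only the Python standard library. It assumes embed_tokens.bin is a raw, headerless, row-major, little-endian float16 matrix; the README states only the shape and dtype, so the layout is an assumption, and `lookup_embedding` is a hypothetical helper name.

```python
import struct

VOCAB_SIZE = 151936   # rows, per the Files table
HIDDEN_SIZE = 2048    # columns, per the Files table
BYTES_PER_VALUE = 2   # float16


def lookup_embedding(buf, token_id, hidden_size=HIDDEN_SIZE):
    """Return the embedding row for token_id as a list of Python floats.

    Assumption: buf holds the matrix as raw row-major little-endian
    float16 values with no header.
    """
    row_bytes = hidden_size * BYTES_PER_VALUE
    start = token_id * row_bytes
    if token_id < 0 or start + row_bytes > len(buf):
        raise IndexError(f"token id {token_id} is out of range")
    # "e" is the struct format code for IEEE 754 half precision.
    return list(struct.unpack_from(f"<{hidden_size}e", buf, start))
```

Under the same layout assumption the file would be 151936 × 2048 × 2 = 622,329,856 bytes, which gives a quick sanity check after download. The buffer can be any bytes-like object, for example the result of reading the file once at startup or a memory map of it.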

Mel Spectrogram Parameters

Identical to Whisper's 128-bin configuration (as used by large-v3): 16 kHz sample rate, 128 mel bins, n_fft=400, hop_length=160, Hann window, Slaney mel scale, 0-8 kHz frequency range. The same values are recorded in preprocessor_config.json (WhisperFeatureExtractor format).
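To make the timing implied by these parameters concrete, the sketch below derives the window and hop durations and an approximate frame count. The `MelParams` class is hypothetical, and the frame-count rule (`num_samples // hop_length`, as produced by a centered STFT with the final frame dropped) is an assumption based on Whisper-style feature extraction, not something this README states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MelParams:
    """Values copied from the parameter list above."""
    sample_rate: int = 16_000
    n_mels: int = 128
    n_fft: int = 400
    hop_length: int = 160
    f_min: float = 0.0
    f_max: float = 8_000.0

    @property
    def window_ms(self) -> float:
        # Duration of one analysis window in milliseconds.
        return 1000.0 * self.n_fft / self.sample_rate

    @property
    def hop_ms(self) -> float:
        # Time step between consecutive frames in milliseconds.
        return 1000.0 * self.hop_length / self.sample_rate

    def num_frames(self, num_samples: int) -> int:
        # Assumption: centered STFT with the last frame dropped.
        return num_samples // self.hop_length
```

With the defaults this gives a 25 ms window and a 10 ms hop, so about 100 mel frames per second of audio.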

Inference Pipeline

Compute log-mel spectrogram from 16kHz audio
Run encoder.onnx (or encoder.int4.onnx): mel → audio_features
Build prompt token IDs: <|im_start|>system<|im_end|><|im_start|>user<|audio_start|><|audio_pad|>...<|audio_end|><|im_end|><|im_start|>assistant
Run decoder_init.onnx (or decoder_init.int4.onnx): input_ids + audio_features + audio_offset → logits + KV cache
Greedy decode with decoder_step.onnx (or decoder_step.int4.onnx): look up the token's row in embed_tokens.bin → input_embeds + KV cache → logits, repeating until EOS
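The greedy loop in the last step can be sketched independently of ONNX Runtime by passing the model calls in as functions. Here `step_fn` and `embed_fn` are hypothetical wrappers around decoder_step and the embed_tokens.bin lookup, `max_new_tokens` is an assumed safety cap, and the logits are assumed to be a flat sequence over the vocabulary for the last position.

```python
# EOS IDs from the Special Token IDs table: <|im_end|>, <|endoftext|>.
EOS_IDS = frozenset({151645, 151643})


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def greedy_decode(init_logits, kv_cache, step_fn, embed_fn,
                  eos_ids=EOS_IDS, max_new_tokens=256):
    """Greedy decoding after prefill.

    init_logits: last-position logits returned by decoder_init.
    step_fn(input_embeds, kv_cache) -> (logits, kv_cache): hypothetical
        wrapper around one decoder_step call.
    embed_fn(token_id) -> input_embeds: hypothetical embedding lookup.
    """
    tokens = []
    logits = init_logits
    for _ in range(max_new_tokens):
        next_id = argmax(logits)
        if next_id in eos_ids:
            break
        tokens.append(next_id)
        # Feed the chosen token's embedding back with the updated cache.
        logits, kv_cache = step_fn(embed_fn(next_id), kv_cache)
    return tokens
```

Keeping the model calls behind plain callables makes the loop easy to test with stubs before wiring it to real inference sessions.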

Special Token IDs

Token	ID
`<|audio_start|>`	151669
`<|audio_end|>`	151670
`<|audio_pad|>`	151676
`<|im_end|>` (EOS)	151645
`<|endoftext|>` (EOS)	151643
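As an illustration of how these IDs might be used when building the prompt, the sketch below assembles the audio portion of the token sequence. Two points are assumptions rather than facts from this README: that one `<|audio_pad|>` placeholder is emitted per encoder feature frame, and that audio_offset refers to the position of the first placeholder in input_ids. Both helper names are hypothetical.

```python
# IDs from the table above.
AUDIO_START = 151669
AUDIO_END = 151670
AUDIO_PAD = 151676


def audio_span(num_audio_features):
    """Token IDs for <|audio_start|>, the pad placeholders, <|audio_end|>.

    Assumption: one <|audio_pad|> per encoder output frame.
    """
    if num_audio_features < 0:
        raise ValueError("num_audio_features must be non-negative")
    return [AUDIO_START] + [AUDIO_PAD] * num_audio_features + [AUDIO_END]


def first_pad_index(input_ids):
    """Index of the first <|audio_pad|> in the prompt.

    Assumption: this is one plausible meaning of audio_offset; the
    README does not define it.
    """
    return input_ids.index(AUDIO_PAD)
```

The chat-template tokens around the audio span (`<|im_start|>`, role names and so on) would come from tokenizer.json, since their IDs are not listed here.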

Export Tool

Exported with qwen3-asr-onnx.
