Qwen3-ASR-1.7B ONNX

ONNX export of Qwen/Qwen3-ASR-1.7B for use with ONNX Runtime.

Includes FP32 and int4 (MatMulNBits) variants. Quantized files carry an .int4. infix in their names (for example, decoder_init.int4.onnx) and coexist with the FP32 files in the same directory. The transcribe-rs engine auto-detects quantized models by file presence.
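To make the naming convention concrete, here is a minimal Python sketch of file-presence detection. It is not the transcribe-rs implementation, whose exact rules are not described here; the function name `select_model_files` and the "all three graphs must exist" rule are assumptions made for illustration.

```python
from pathlib import Path

# File stems taken from the file tables in this README.
STEMS = ("encoder", "decoder_init", "decoder_step")


def select_model_files(model_dir, prefer_int4=True):
    """Return {stem: Path}, choosing the int4 set only if all three graphs exist."""
    model_dir = Path(model_dir)
    int4 = {stem: model_dir / f"{stem}.int4.onnx" for stem in STEMS}
    fp32 = {stem: model_dir / f"{stem}.onnx" for stem in STEMS}
    if prefer_int4 and all(p.is_file() for p in int4.values()):
        return int4
    if all(p.is_file() for p in fp32.values()):
        return fp32
    raise FileNotFoundError(f"no complete model set found in {model_dir}")
```

Requiring the complete int4 set before selecting it avoids mixing an int4 decoder graph with the FP32 weights file, since each decoder pair shares its own external .data file.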

Files

FP32

File                  Description
encoder.onnx          Audio encoder: mel spectrogram to features (weights inlined)
decoder_init.onnx     Decoder prefill: accepts input_ids, outputs logits + KV cache
decoder_step.onnx     Decoder autoregressive step: accepts input_embeds + KV cache
decoder_weights.data  Shared external weights for both FP32 decoders (loaded once)

int4 (RTN, accuracy_level=4)

File                       Description
encoder.int4.onnx          Encoder (FP32 weights, same as encoder.onnx)
decoder_init.int4.onnx     int4 decoder prefill
decoder_step.int4.onnx     int4 decoder step
decoder_weights.int4.data  Shared external weights for both int4 decoders (loaded once)

Shared

File                      Description
embed_tokens.bin          Token embedding matrix [151936, 2048], float16
tokenizer.json            HuggingFace tokenizer
config.json               Architecture config, special tokens, mel params
preprocessor_config.json  Mel spectrogram parameters (WhisperFeatureExtractor format)

Architecture

  • Encoder: mel → windowed Conv2D + windowed attention → audio features
  • Decoder: Qwen3 (28 layers, GQA, SwiGLU, MRoPE) in split init/step format

The two decoder models use different input strategies:

  • decoder_init (prefill) accepts input_ids and has the embedding table in its graph; it handles the audio feature scatter internally
  • decoder_step (autoregressive) accepts pre-looked-up input_embeds, which keeps the embedding table out of its graph

The consumer loads embed_tokens.bin once at startup and performs the embedding lookup per token before calling decoder_step.
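The lookup can be sketched in Python with NumPy. This assumes embed_tokens.bin is a raw, headerless, row-major float16 dump of the [151936, 2048] matrix, and that decoder_step takes float32 input_embeds shaped [batch, seq, hidden]; neither detail is confirmed here, so check the actual model inputs before relying on it. The helper names are hypothetical.

```python
import numpy as np

VOCAB_SIZE, HIDDEN_SIZE = 151936, 2048  # shape stated in the Shared files table


def load_embeddings(path, vocab_size=VOCAB_SIZE, hidden_size=HIDDEN_SIZE):
    """Memory-map the table; assumes raw row-major float16 with no header."""
    return np.memmap(path, dtype=np.float16, mode="r", shape=(vocab_size, hidden_size))


def embed_token(table, token_id):
    """Return one token's embedding as float32 with shape [1, 1, hidden] (assumed layout)."""
    return np.asarray(table[token_id], dtype=np.float32)[None, None, :]
```

Memory-mapping keeps startup cheap: the float16 table is about 151936 × 2048 × 2 bytes (roughly 622 MB), and only the rows actually looked up are read from disk.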

Mel Spectrogram Parameters

Identical to Whisper: 16 kHz sample rate, 128 mel bins, n_fft=400, hop_length=160, Hann window, Slaney mel scale, 0-8 kHz frequency range. The same values are recorded in preprocessor_config.json (WhisperFeatureExtractor format).
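As a quick sanity check, the quantities implied by these parameters can be worked out in a few lines of Python. The constant names are illustrative, and `approx_num_frames` is only an approximation, since the exact padding and trimming behaviour depends on the feature extractor in use.

```python
SAMPLE_RATE = 16_000
N_FFT = 400
HOP_LENGTH = 160
N_MELS = 128

window_ms = 1000 * N_FFT / SAMPLE_RATE         # 25.0 ms analysis window
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE       # 10.0 ms between frames
frames_per_second = SAMPLE_RATE // HOP_LENGTH  # 100 mel frames per second
fft_bins = N_FFT // 2 + 1                      # 201 one-sided frequency bins
nyquist_hz = SAMPLE_RATE // 2                  # 8000 Hz, the top of the mel range


def approx_num_frames(num_samples):
    """Approximate frame count; exact padding/trimming depends on the extractor."""
    return num_samples // HOP_LENGTH
```

Under these values the mel filterbank maps 201 frequency bins to 128 mel bins, and one second of audio yields about 100 frames.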

Inference Pipeline

  1. Compute log-mel spectrogram from 16kHz audio
  2. Run encoder.onnx (or encoder.int4.onnx): mel → audio_features
  3. Build prompt token IDs: <|im_start|>system<|im_end|><|im_start|>user<|audio_start|><|audio_pad|>...<|audio_end|><|im_end|><|im_start|>assistant
  4. Run decoder_init.onnx (or .int4): input_ids + audio_features + audio_offset → logits + KV cache
  5. Greedy decode with decoder_step.onnx (or .int4): look up embed_tokens.bin → input_embeds + KV cache → logits, until EOS
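Steps 4 and 5 can be sketched as a small greedy loop in Python. To avoid guessing the ONNX graph input and output names, the sketch takes three hypothetical callables (`init_fn`, `step_fn`, `embed_fn`) that would wrap the ONNX Runtime sessions and the embedding lookup; the [batch, seq, vocab] logits layout is also an assumption.

```python
import numpy as np

EOS_IDS = {151645, 151643}  # <|im_end|>, <|endoftext|>


def greedy_decode(init_fn, step_fn, embed_fn, max_new_tokens=256, eos_ids=EOS_IDS):
    """Greedy loop over hypothetical wrappers around the two decoder sessions.

    init_fn()              -> (logits, kv_cache) from the prefill run
    step_fn(embeds, cache) -> (logits, kv_cache) for one new token
    embed_fn(token_id)     -> input_embeds for that token
    Logits are assumed to be shaped [batch, seq, vocab].
    """
    logits, cache = init_fn()
    generated = []
    for _ in range(max_new_tokens):
        # Pick the most likely token at the last sequence position.
        token_id = int(np.argmax(logits[0, -1]))
        if token_id in eos_ids:
            break
        generated.append(token_id)
        logits, cache = step_fn(embed_fn(token_id), cache)
    return generated
```

The `max_new_tokens` cap is a safety limit added for the sketch so that decoding terminates even if no EOS token is produced.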

Special Token IDs

Token                ID
<|audio_start|>      151669
<|audio_end|>        151670
<|audio_pad|>        151676
<|im_end|> (EOS)     151645
<|endoftext|> (EOS)  151643
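A short Python sketch of how these IDs might be used when assembling the audio part of the prompt. Emitting one <|audio_pad|> per encoder output frame, and treating the index of the first pad as the audio_offset input, are assumptions rather than documented behaviour; the function names are hypothetical.

```python
AUDIO_START, AUDIO_END, AUDIO_PAD = 151669, 151670, 151676
EOS_IDS = frozenset({151645, 151643})  # <|im_end|>, <|endoftext|>


def audio_span(num_audio_frames):
    """IDs for <|audio_start|>, one <|audio_pad|> per frame (assumed), <|audio_end|>."""
    return [AUDIO_START] + [AUDIO_PAD] * num_audio_frames + [AUDIO_END]


def first_pad_index(input_ids):
    """Position of the first <|audio_pad|>; a plausible audio_offset value (assumption)."""
    return input_ids.index(AUDIO_PAD)
```

The remaining prompt tokens (the system, user, and assistant markers) would come from tokenizer.json rather than from hard-coded IDs.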

Export Tool

Exported with qwen3-asr-onnx.
