Qwen3-ASR-1.7B ONNX
ONNX export of Qwen/Qwen3-ASR-1.7B for use with ONNX Runtime.
Includes FP32 and int4 MatMulNBits variants. Quantized files carry a `.int4.` infix in their names (for example `decoder_init.int4.onnx`) and coexist with the FP32 files in the same directory. The transcribe-rs engine auto-detects quantized models by file presence.
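Detection by file presence could look roughly like the sketch below. It is not taken from transcribe-rs: the helper name `select_model_files` and the rule "prefer int4 only when every int4 file is present" are assumptions made for illustration. The file names are the ones listed in the Files section.

```python
from pathlib import Path

# File names as listed in the Files section of this README.
FP32_FILES = {
    "encoder": "encoder.onnx",
    "decoder_init": "decoder_init.onnx",
    "decoder_step": "decoder_step.onnx",
    "weights": "decoder_weights.data",
}
INT4_FILES = {
    "encoder": "encoder.int4.onnx",
    "decoder_init": "decoder_init.int4.onnx",
    "decoder_step": "decoder_step.int4.onnx",
    "weights": "decoder_weights.int4.data",
}


def select_model_files(model_dir):
    """Return (variant, paths) for the first complete file set found.

    Hypothetical helper: int4 is tried first and used only if all of its
    files exist; otherwise the FP32 set is used.
    """
    root = Path(model_dir)
    for variant, names in (("int4", INT4_FILES), ("fp32", FP32_FILES)):
        paths = {key: root / name for key, name in names.items()}
        if all(path.is_file() for path in paths.values()):
            return variant, paths
    raise FileNotFoundError(f"no complete FP32 or int4 file set found in {root}")
```

Requiring the whole set avoids mixing an int4 graph with the FP32 weight file when a download is incomplete.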
Files
FP32
| File | Description |
|---|---|
| `encoder.onnx` | Audio encoder: mel spectrogram to features (weights inlined) |
| `decoder_init.onnx` | Decoder prefill: accepts `input_ids`, outputs logits + KV cache |
| `decoder_step.onnx` | Decoder autoregressive step: accepts `input_embeds` + KV cache |
| `decoder_weights.data` | Shared external weights for both FP32 decoders (loaded once) |
int4 (RTN, accuracy_level=4)
| File | Description |
|---|---|
| `encoder.int4.onnx` | Encoder (FP32 weights, same as `encoder.onnx`) |
| `decoder_init.int4.onnx` | int4 decoder prefill |
| `decoder_step.int4.onnx` | int4 decoder step |
| `decoder_weights.int4.data` | Shared external weights for both int4 decoders (loaded once) |
Shared
| File | Description |
|---|---|
| `embed_tokens.bin` | Token embedding matrix [151936, 2048], float16 |
| `tokenizer.json` | HuggingFace tokenizer |
| `config.json` | Architecture config, special tokens, mel params |
| `preprocessor_config.json` | Mel spectrogram parameters (WhisperFeatureExtractor format) |
Architecture
- Encoder: mel → windowed Conv2D + windowed attention → audio features
- Decoder: Qwen3 (28 layers, GQA, SwiGLU, MRoPE) in split init/step format
The two decoder models use different input strategies:
- `decoder_init` (prefill) accepts `input_ids` and has the embedding table in its graph; it handles the audio feature scatter internally.
- `decoder_step` (autoregressive) accepts pre-looked-up `input_embeds`, which keeps the embedding table out of its graph.
The consumer loads embed_tokens.bin once at startup and performs the embedding lookup per token before calling decoder_step.
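A minimal sketch of the per-token lookup, assuming `embed_tokens.bin` is a raw, row-major, little-endian float16 dump with no header. The README states only the shape and dtype, so that layout is an assumption, and `read_embedding` is a hypothetical helper name. For simplicity it seeks into the file on each call instead of keeping the matrix in memory.

```python
import struct

VOCAB_SIZE = 151936   # rows, per the Files table
HIDDEN_SIZE = 2048    # float16 values per row
BYTES_PER_VALUE = 2   # size of one float16


def expected_file_size(vocab=VOCAB_SIZE, hidden=HIDDEN_SIZE):
    """Byte size of a headerless [vocab, hidden] float16 matrix."""
    return vocab * hidden * BYTES_PER_VALUE


def read_embedding(path, token_id, hidden=HIDDEN_SIZE):
    """Read one embedding row as a list of Python floats.

    Assumes a raw row-major little-endian float16 layout with no header.
    """
    row_bytes = hidden * BYTES_PER_VALUE
    with open(path, "rb") as f:
        f.seek(token_id * row_bytes)
        data = f.read(row_bytes)
    if len(data) != row_bytes:
        raise IndexError(f"token_id {token_id} is outside the embedding file")
    # "e" is the struct format code for IEEE 754 half precision.
    return list(struct.unpack(f"<{hidden}e", data))
```

Under the same layout assumption the file would be 151936 × 2048 × 2 = 622,329,856 bytes, which makes a cheap sanity check before use. A real consumer would more likely load or memory-map the matrix once and slice rows from it.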
Mel Spectrogram Parameters
Identical to Whisper: 16kHz, 128 bins, n_fft=400, hop_length=160, Hann window, Slaney mel scale, 0-8kHz.
Also documented in preprocessor_config.json (WhisperFeatureExtractor format).
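To make these numbers concrete, the sketch below derives the window and hop durations and shows one common formulation of the Slaney mel scale (linear below 1 kHz, logarithmic above). The function names are illustrative, and whether the export uses exactly this variant is an assumption; `preprocessor_config.json` remains the authoritative source.

```python
import math

SAMPLE_RATE = 16000
N_FFT = 400
HOP_LENGTH = 160
N_MELS = 128


def frame_timing():
    """Return (window length in ms, hop in ms, number of FFT bins)."""
    window_ms = 1000 * N_FFT / SAMPLE_RATE
    hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE
    fft_bins = N_FFT // 2 + 1
    return window_ms, hop_ms, fft_bins


def hz_to_mel_slaney(freq_hz):
    """Slaney-style mel scale: linear below 1 kHz, logarithmic above."""
    f_sp = 200.0 / 3.0                 # Hz per mel in the linear region
    min_log_hz = 1000.0                # start of the logarithmic region
    min_log_mel = min_log_hz / f_sp    # mel value at 1 kHz
    logstep = math.log(6.4) / 27.0     # step size in the logarithmic region
    if freq_hz < min_log_hz:
        return freq_hz / f_sp
    return min_log_mel + math.log(freq_hz / min_log_hz) / logstep
```

With a 160-sample hop at 16 kHz, each mel frame advances by 10 ms, so one second of audio corresponds to roughly 100 frames before any downsampling in the encoder.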
Inference Pipeline
- Compute log-mel spectrogram from 16kHz audio
- Run `encoder.onnx` (or `encoder.int4.onnx`): mel → audio_features
- Build prompt token IDs: `<|im_start|>system<|im_end|><|im_start|>user<|audio_start|><|audio_pad|>...<|audio_end|><|im_end|><|im_start|>assistant`
- Run `decoder_init.onnx` (or `.int4`): `input_ids` + `audio_features` + `audio_offset` → logits + KV cache
- Greedy decode with `decoder_step.onnx` (or `.int4`): look up `embed_tokens.bin` → `input_embeds` + KV cache → logits, until EOS
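The final greedy step can be sketched independently of ONNX Runtime. Here `step_fn` is a hypothetical callable that wraps the embedding lookup and the `decoder_step` call; the README does not list the ONNX input and output names, so they are deliberately not shown. The `max_new_tokens` cap is also an assumption. The EOS IDs come from the Special Token IDs table.

```python
# <|im_end|> and <|endoftext|>, from the Special Token IDs table.
EOS_TOKEN_IDS = frozenset({151645, 151643})


def argmax(values):
    """Index of the largest value; the first index wins on ties."""
    return max(range(len(values)), key=values.__getitem__)


def greedy_decode(first_logits, kv_cache, step_fn,
                  eos_ids=EOS_TOKEN_IDS, max_new_tokens=256):
    """Greedy decoding loop.

    first_logits: logits for the last prompt position, from decoder_init.
    step_fn(token_id, kv_cache) -> (logits, kv_cache): hypothetical wrapper
    around the embedding lookup plus one decoder_step call.
    """
    tokens = []
    logits = first_logits
    for _ in range(max_new_tokens):
        token_id = argmax(logits)
        if token_id in eos_ids:
            break
        tokens.append(token_id)
        logits, kv_cache = step_fn(token_id, kv_cache)
    return tokens
```

Passing the decoder in as a callable keeps the loop testable with a stub and leaves the KV cache representation up to the caller.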
Special Token IDs
| Token | ID |
|---|---|
| `<\|audio_start\|>` | 151669 |
| `<\|audio_end\|>` | 151670 |
| `<\|audio_pad\|>` | 151676 |
| `<\|im_end\|>` (EOS) | 151645 |
| `<\|endoftext\|>` (EOS) | 151643 |
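
As an illustration of how these IDs fit into the prompt, the sketch below assembles the audio placeholder span. It assumes one `<|audio_pad|>` per encoder output frame and that `audio_offset` is the index of the first pad token; neither detail is spelled out in this README. `prefix_ids` and `suffix_ids` stand for the tokenized chat text around the audio span (obtained from `tokenizer.json`) and are hypothetical parameters.

```python
AUDIO_START = 151669  # <|audio_start|>
AUDIO_END = 151670    # <|audio_end|>
AUDIO_PAD = 151676    # <|audio_pad|>


def audio_placeholder(num_audio_frames):
    """<|audio_start|>, one <|audio_pad|> per frame (assumed), <|audio_end|>."""
    return [AUDIO_START] + [AUDIO_PAD] * num_audio_frames + [AUDIO_END]


def build_prompt_ids(prefix_ids, num_audio_frames, suffix_ids):
    """Return (input_ids, offset of the first <|audio_pad|>).

    The offset is what this sketch assumes audio_offset to mean.
    """
    input_ids = list(prefix_ids) + audio_placeholder(num_audio_frames) + list(suffix_ids)
    # +1 skips the <|audio_start|> token that precedes the pads.
    audio_offset = len(prefix_ids) + 1
    return input_ids, audio_offset
```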
Export Tool
Exported with qwen3-asr-onnx.