---
license: apache-2.0
language:
  - zh
  - yue
  - en
  - multilingual
tags:
  - automatic-speech-recognition
  - qwen
  - qwen3
  - chinese
  - cantonese
  - litert
  - tflite
  - on-device
  - soniqo
  - speech-cloud
  - speech-core
base_model: Qwen/Qwen3-ASR-0.6B
library_name: litert
pipeline_tag: automatic-speech-recognition
---

# Qwen3 ASR 0.6B Encoder - LiteRT (INT8)

Qwen3-ASR audio encoder (zh / yue / en). INT8 weight-only.

> Part of the [**soniqo.audio**](https://soniqo.audio) speech toolkit,
> an open, runtime-portable stack for speech AI. This bundle is the
> **LiteRT** export, designed to plug into the abstract interfaces in
> [`speech-core`](https://github.com/soniqo/speech-core) (C++ voice-agent
> orchestration library). Browse all LiteRT bundles in the
> [**soniqo LiteRT collection**](https://huggingface.co/collections/soniqo/litert-6a08268e11d5a47d7aacc02b).

## Use cases on soniqo.audio

- [Multilingual transcription](https://soniqo.audio/transcription/)

Audio encoder of Qwen3-ASR-0.6B, specialized for Chinese (including 22
Chinese dialects) and 30 additional languages. Exported to LiteRT for
Android. The text decoder is a Qwen3-0.6B LLM and is intended to run
through LiteRT-LM as a separate runtime.

## Model

| Property | Value |
|---|---|
| Component | Audio encoder only |
| Parameters | ~180 M (encoder), decoder is a separate 0.6B LLM |
| Format | LiteRT (TFLite) |
| Quantization | INT8 dynamic weights (fp32 activations) |
| Sample rate | 16 000 Hz |
| Input | 128-bin log mel, 1000 frames (10 s, fixed) |
| Output | 125 audio embedding tokens, 1024-dim each |
| Languages | 30 + 22 Chinese dialects (Cantonese, Shanghainese, Sichuan, …) |
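
Because the input is fixed at 10 s, longer recordings have to be cut into windows before they reach the encoder. The sketch below shows one simple way to do that, along with the time span each output token covers. The non-overlapping split and the zero padding of the last window are assumptions; this card does not specify how long audio should be handled.

```python
SAMPLE_RATE = 16_000
WINDOW_SECONDS = 10
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS  # 160 000 samples per window
TOKENS_PER_WINDOW = 125                        # from the table above


def split_into_windows(samples):
    """Split a mono sample sequence into 10 s windows.

    The last window is zero-padded to full length (an assumption,
    not something this card prescribes).
    """
    windows = []
    for start in range(0, len(samples), WINDOW_SAMPLES):
        chunk = list(samples[start:start + WINDOW_SAMPLES])
        chunk.extend([0.0] * (WINDOW_SAMPLES - len(chunk)))
        windows.append(chunk)
    return windows


def token_duration_ms():
    """Audio time covered by one embedding token, in milliseconds."""
    return WINDOW_SECONDS * 1000 / TOKENS_PER_WINDOW
```

With 125 tokens per 10 s window, each token covers 80 ms of audio, so a 25 s clip becomes three windows and 375 tokens.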

## Files

| File | Size | Description |
|---|---|---|
| `qwen3-asr-encoder.tflite` | 180.5 MB | Audio encoder, INT8 |
| `config.json` | 1 KB | Architecture + I/O specs |

## Signature

```
Inputs:
  mel               [1, 128, 1000]   float32   10 s log mel spectrogram

Outputs:
  audio_embeddings  [1, 125, 1024]   float32   For cross-attention into the decoder
```
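
A minimal sketch of calling this signature from Python, assuming the `ai-edge-litert` package is installed and the model file sits in the working directory. The helper names `load_interpreter` and `encode` are illustrative and not part of this bundle; on Android the equivalent calls go through the LiteRT runtime instead.

```python
import numpy as np

MEL_SHAPE = (1, 128, 1000)
EMBEDDING_SHAPE = (1, 125, 1024)


def load_interpreter(model_path="qwen3-asr-encoder.tflite"):
    """Create a LiteRT interpreter (assumes ai-edge-litert is installed)."""
    from ai_edge_litert.interpreter import Interpreter  # imported lazily

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    return interpreter


def encode(interpreter, mel):
    """Run one 10 s log-mel window through the encoder."""
    mel = np.asarray(mel, dtype=np.float32)
    if mel.shape != MEL_SHAPE:
        raise ValueError(f"expected mel of shape {MEL_SHAPE}, got {mel.shape}")
    # The model has a single input and a single output tensor.
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]
    interpreter.set_tensor(input_index, mel)
    interpreter.invoke()
    return interpreter.get_tensor(output_index)
```

Typical use would be `embeddings = encode(load_interpreter(), mel)`, with `embeddings` expected to have shape `[1, 125, 1024]`.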

## Architecture

```
mel [1, 128, 1000]
  └── 3× Conv2d(stride=2) + GELU          → [1, 480, 16, 125]
  └── reshape → Linear(7680→896)          → [1, 125, 896]
  └── + sinusoidal pos embed
  └── 18× pre-norm Transformer            → [1, 125, 896]
  └── LayerNorm → Linear(896) → GELU
  └── Linear(896→1024)                    → [1, 125, 1024]
```
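
The shapes in the diagram can be checked with a few lines of arithmetic. The sketch below assumes each convolution uses a 3×3 kernel with padding 1; only the stride is stated above, so those two values are guesses that happen to reproduce the listed shapes.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of one conv layer along a single axis."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(n_mels=128, n_frames=1000, channels=480,
                   d_model=896, d_out=1024):
    """Trace tensor shapes through the encoder stages."""
    freq, time = n_mels, n_frames
    for _ in range(3):  # three stride-2 convolutions halve both axes
        freq, time = conv_out(freq), conv_out(time)
    return {
        "conv": (1, channels, freq, time),
        "flattened": channels * freq,  # input width of the first Linear
        "transformer": (1, time, d_model),
        "output": (1, time, d_out),
    }
```

The frequency axis shrinks 128 → 64 → 32 → 16 and the time axis 1000 → 500 → 250 → 125, which gives the `480 × 16 = 7680` features fed to the first linear layer.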

## Why encoder only

The text decoder is a full Qwen3-0.6B language model with GQA, RoPE,
SwiGLU and RMSNorm. It doesn't fit cleanly into a single `.tflite`; the
right runtime for LLM decoders on Android is
[LiteRT-LM](https://github.com/google-ai-edge/litert-lm) or a comparable
LLM executor, with the audio embeddings from this encoder wired in as
cross-attention context.

For ASR-only (no LLM), pair this encoder with a CTC or transducer head
fine-tuned on your target languages.
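
If you take the CTC route, decoding can be as simple as a greedy pass over the head's per-frame scores. The sketch below assumes a hypothetical head that maps the 125 embeddings to `[125, vocab_size]` logits with blank id 0; no such head ships with this bundle.

```python
def ctc_greedy_decode(logits, blank_id=0):
    """Greedy CTC decoding: best class per frame, merge repeats, drop blanks.

    `logits` is a sequence of per-frame score lists from a hypothetical
    CTC head; returns the decoded token ids.
    """
    decoded = []
    previous = None
    for frame in logits:
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != previous and best != blank_id:
            decoded.append(best)
        previous = best
    return decoded
```

A blank between two identical ids keeps them as separate tokens, which is how CTC represents repeated characters.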

## Audio preprocessing

- 16 kHz mono, float32
- 128 log mel bins
- `n_fft=400`, `hop_length=160`, `win_length=400`, `pad_mode="reflect"`
- log mel, mean/std normalization per utterance

The exact reference is in the upstream Qwen3-ASR tokenizer config.
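
The parameters above can be turned into a rough NumPy sketch. Several details are assumptions not stated in this card: the Hann window, the HTK mel scale, `log10` with a `1e-10` floor, and trimming or zero-padding to exactly 10 s. With this mel scale some low-frequency filters are narrower than one FFT bin and stay empty, so treat the output as illustrative and use the upstream feature extractor when you need matching values.

```python
import numpy as np

SAMPLE_RATE, N_FFT, HOP, N_MELS, N_FRAMES = 16_000, 400, 160, 128, 1000


def mel_filterbank(sr=SAMPLE_RATE, n_fft=N_FFT, n_mels=N_MELS):
    """Triangular filters on the HTK mel scale (assumed, see above)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    lower, center, upper = hz[:-2, None], hz[1:-1, None], hz[2:, None]
    rising = (fft_freqs[None, :] - lower) / (center - lower)
    falling = (upper - fft_freqs[None, :]) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))  # [n_mels, n_fft//2+1]


def log_mel(audio):
    """16 kHz mono float audio -> normalized log mel of shape [1, 128, 1000]."""
    audio = np.asarray(audio, dtype=np.float32)
    target = N_FRAMES * HOP  # 160 000 samples = 10 s
    audio = np.pad(audio[:target], (0, max(0, target - len(audio))))
    padded = np.pad(audio, N_FFT // 2, mode="reflect")
    window = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window
    # Gather 1000 overlapping frames of 400 samples, 160 samples apart.
    index = np.arange(N_FFT)[None, :] + HOP * np.arange(N_FRAMES)[:, None]
    power = np.abs(np.fft.rfft(padded[index] * window, axis=1)) ** 2
    mel = mel_filterbank() @ power.T  # [128, 1000]
    log_spec = np.log10(np.maximum(mel, 1e-10))
    # Per-utterance mean/std normalization.
    log_spec = (log_spec - log_spec.mean()) / (log_spec.std() + 1e-5)
    return log_spec[None].astype(np.float32)
```

The result has the `[1, 128, 1000]` float32 layout that the encoder's `mel` input expects.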

## Source

Upstream: [Qwen/Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B)
(Apache 2.0). Released January 2026 as part of the Qwen3 audio family.

## Links

- [speech-android](https://github.com/soniqo/speech-android) - Android SDK
- [soniqo.audio](https://soniqo.audio) - website
- [blog](https://soniqo.audio/blog) - blog

## Ecosystem

- [**soniqo.audio**](https://soniqo.audio) - use-case explorer (transcription, voice cloning, live ASR, voice agents).
- [**speech-core**](https://github.com/soniqo/speech-core) - C++ orchestration library for voice agents. Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces.
- [**speech-swift**](https://github.com/soniqo/speech-swift) - Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
- [**speech-android**](https://github.com/soniqo/speech-android) - Android SDK consuming on-device LiteRT bundles.

## Other LiteRT models in this collection

**ASR / Transcription**

- [Parakeet TDT 0.6B v3 - LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8)
- [Nemotron Speech Streaming 0.6B - LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT)
- [Omnilingual ASR CTC 300M - LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT)
- [Omnilingual ASR CTC 300M - LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8)

**VAD / Diarization**

- [Silero VAD v5 - LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT)
- [Pyannote Segmentation 3.0 - LiteRT](https://huggingface.co/soniqo/Pyannote-Segmentation-LiteRT)
- [WeSpeaker ResNet34-LM - LiteRT](https://huggingface.co/soniqo/WeSpeaker-ResNet34-LM-LiteRT)

**TTS / Voice Cloning**

- [VoxCPM2 - LiteRT (INT8)](https://huggingface.co/soniqo/VoxCPM2-LiteRT-INT8)

## License

This bundle inherits the upstream model license (**apache-2.0**). See the
linked `base_model` repository for the full terms.