Commit 80df339 by aufklarer · verified · 1 Parent(s): fc61827

Initial LiteRT upload

README.md ADDED
@@ -0,0 +1,102 @@
+ ---
+ license: apache-2.0
+ language:
+ - zh
+ - yue
+ - en
+ - multilingual
+ tags:
+ - automatic-speech-recognition
+ - qwen
+ - qwen3
+ - chinese
+ - cantonese
+ - litert
+ - tflite
+ - on-device
+ - android
+ base_model: Qwen/Qwen3-ASR-0.6B
+ library_name: litert
+ pipeline_tag: automatic-speech-recognition
+ ---
+
+ # Qwen3-ASR-0.6B Audio Encoder - LiteRT (INT8)
+
+ Audio encoder of Qwen3-ASR-0.6B, specialized for Chinese (including 22
+ Chinese dialects) and 30 additional languages. Exported to LiteRT for
+ Android. The text decoder is a Qwen3-0.6B LLM and is intended to run
+ through LiteRT-LM as a separate runtime.
+
+ ## Model
+
+ | Property | Value |
+ |---|---|
+ | Component | Audio encoder only |
+ | Parameters | ~180 M (encoder), decoder is a separate 0.6B LLM |
+ | Format | LiteRT (TFLite) |
+ | Quantization | INT8 dynamic weights (fp32 activations) |
+ | Sample rate | 16 000 Hz |
+ | Input | 128-bin log mel, 1000 frames (10 s, fixed) |
+ | Output | 125 audio embedding tokens, 1024-dim each |
+ | Languages | 30 + 22 Chinese dialects (Cantonese, Shanghainese, Sichuan, ...) |
+
+ ## Files
+
+ | File | Size | Description |
+ |---|---|---|
+ | `qwen3-asr-encoder.tflite` | 180.5 MB | Audio encoder, INT8 |
+ | `config.json` | 1 KB | Architecture + I/O specs |
+
+ ## Signature
+
+ ```
+ Inputs:
+   mel [1, 128, 1000] float32 10 s log mel spectrogram
+
+ Outputs:
+   audio_embeddings [1, 125, 1024] float32 For cross-attention into the decoder
+ ```
+
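The signature above can be exercised from Python before wiring the model into an Android app. The following is a minimal sketch, not the author's code: it assumes the `ai-edge-litert` package is installed, that the model exposes exactly one input and one output tensor, and that `mel` is a float32 NumPy array already shaped `[1, 128, 1000]`. The helper names `check_signature` and `run_encoder` are hypothetical.

```python
# Shapes taken from the Signature section of the README.
EXPECTED_INPUT_SHAPE = (1, 128, 1000)    # mel
EXPECTED_OUTPUT_SHAPE = (1, 125, 1024)   # audio_embeddings


def check_signature(input_details, output_details):
    """Raise ValueError if the tensor shapes differ from the README signature.

    Both arguments are lists of dicts with a "shape" entry, as returned by
    Interpreter.get_input_details() / get_output_details().
    """
    in_shape = tuple(int(d) for d in input_details[0]["shape"])
    out_shape = tuple(int(d) for d in output_details[0]["shape"])
    if in_shape != EXPECTED_INPUT_SHAPE:
        raise ValueError(f"unexpected input shape {in_shape}")
    if out_shape != EXPECTED_OUTPUT_SHAPE:
        raise ValueError(f"unexpected output shape {out_shape}")


def run_encoder(model_path, mel):
    """Run one 10 s window through the encoder and return the embeddings.

    Assumption: `mel` is a float32 NumPy array of shape [1, 128, 1000].
    """
    # Imported lazily so that check_signature works without the runtime.
    from ai_edge_litert.interpreter import Interpreter

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inputs = interpreter.get_input_details()
    outputs = interpreter.get_output_details()
    check_signature(inputs, outputs)

    interpreter.set_tensor(inputs[0]["index"], mel)
    interpreter.invoke()
    return interpreter.get_tensor(outputs[0]["index"])
```

Calling `run_encoder("qwen3-asr-encoder.tflite", mel)` should return an array shaped `[1, 125, 1024]` if the file matches the documented signature; the shape check fails early otherwise.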
+ ## Architecture
+
+ ```
+ mel [1, 128, 1000]
+   └── 3× Conv2d(stride=2) + GELU → [1, 480, 16, 125]
+   └── reshape → Linear(7680→896) → [1, 125, 896]
+   └── + sinusoidal pos embed
+   └── 18× pre-norm Transformer → [1, 125, 896]
+   └── LayerNorm → Linear(896) → GELU
+   └── Linear(896→1024) → [1, 125, 1024]
+ ```
+
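The shapes in the diagram above follow from simple arithmetic: three stride-2 convolutions divide both the mel axis and the time axis by 8, and the 480 channels times 16 remaining mel bins give the 7680 inputs of the first linear layer. A short sketch of that calculation, assuming a kernel size of 3 and padding of 1 (the README states only the stride, so those two values are assumptions):

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of one convolution along a single axis."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(mel_bins=128, frames=1000, channels=480, num_convs=3):
    """Trace the mel and time axes through the stride-2 conv stack."""
    freq, time = mel_bins, frames
    for _ in range(num_convs):
        freq, time = conv_out(freq), conv_out(time)
    return {
        "conv_out": (1, channels, freq, time),  # after the last Conv2d
        "flattened": channels * freq,           # input width of the Linear
        "tokens": time,                         # sequence length for the Transformer
    }


print(encoder_shapes())
# {'conv_out': (1, 480, 16, 125), 'flattened': 7680, 'tokens': 125}
```

Each output token therefore covers 8 mel frames, which is 80 ms of audio at 100 frames per second.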
+ ## Why encoder only
+
+ The text decoder is a full Qwen3-0.6B language model with GQA, RoPE,
+ SwiGLU and RMSNorm. It doesn't fit cleanly into a single `.tflite`; the
+ right runtime for LLM decoders on Android is
+ [LiteRT-LM](https://github.com/google-ai-edge/litert-lm) or a comparable
+ LLM executor, with the audio embeddings from this encoder wired in as
+ cross-attention context.
+
+ For ASR-only (no LLM), pair this encoder with a CTC or transducer head
+ fine-tuned on your target languages.
+
+ ## Audio preprocessing
+
+ - 16 kHz mono, float32
+ - 128 log mel bins
+ - `n_fft=400`, `hop_length=160`, `win_length=400`, `pad_mode="reflect"`
+ - log mel, mean/std normalization per utterance
+
+ The exact reference is in the upstream Qwen3-ASR tokenizer config.
+
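Because the encoder input is fixed at 1000 frames, audio has to be padded or trimmed to exactly 10 s before the mel spectrogram is computed. The sketch below covers only that bookkeeping and the per-utterance normalization step; it does not compute the STFT or mel filterbank. It assumes zero padding for short clips and a single global mean and standard deviation over the whole log mel matrix, neither of which is confirmed by the README. All function names are hypothetical.

```python
import math

SAMPLE_RATE = 16_000   # Hz, from the README
HOP_LENGTH = 160       # samples between frames, from the README
CLIP_SECONDS = 10      # fixed encoder window, from the Model table


def pad_or_trim(samples, target=SAMPLE_RATE * CLIP_SECONDS):
    """Return exactly `target` samples: trim long clips, zero-pad short ones."""
    clipped = list(samples[:target])
    return clipped + [0.0] * (target - len(clipped))


def frame_count(num_samples, hop_length=HOP_LENGTH):
    """Number of mel frames, assuming one frame per hop."""
    return num_samples // hop_length


def normalize_utterance(log_mel):
    """Mean/std normalization over every value of one utterance.

    `log_mel` is a list of rows (one per mel bin). Assumption: statistics
    are global, not per bin.
    """
    values = [v for row in log_mel for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        std = 1.0  # avoid dividing by zero on constant input
    return [[(v - mean) / std for v in row] for row in log_mel]


audio = pad_or_trim([0.1] * 48_000)          # 3 s of audio, padded to 10 s
print(len(audio), frame_count(len(audio)))   # 160000 1000
```

In a real pipeline the normalized `[128, 1000]` matrix would get a leading batch axis to match the `[1, 128, 1000]` input tensor.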
+ ## Source
+
+ Upstream: [Qwen/Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B)
+ (Apache 2.0). Released January 2026 as part of the Qwen3 audio family.
+
+ ## Links
+
+ - [speech-android](https://github.com/soniqo/speech-android) - Android SDK
+ - [soniqo.audio](https://soniqo.audio) - website
+ - [blog](https://soniqo.audio/blog) - blog
config.json ADDED
@@ -0,0 +1,39 @@
+ {
+   "model": "Qwen3-ASR-0.6B",
+   "component": "audio_encoder",
+   "format": "tflite",
+   "quantization": "int8",
+   "sample_rate": 16000,
+   "mel_frames_per_second": 100,
+   "input_mel_frames": 1000,
+   "input_mel_bins": 128,
+   "output_tokens": 125,
+   "output_dim": 1024,
+   "encoder": {
+     "num_layers": 18,
+     "d_model": 896,
+     "num_heads": 14,
+     "ffn_dim": 3584
+   },
+   "inputs": {
+     "mel": {
+       "shape": [
+         1,
+         128,
+         1000
+       ],
+       "dtype": "float32"
+     }
+   },
+   "outputs": {
+     "audio_embeddings": {
+       "shape": [
+         1,
+         125,
+         1024
+       ],
+       "dtype": "float32"
+     }
+   },
+   "note": "This is the audio encoder only. The text decoder is a Qwen3-0.6B LLM; run it through LiteRT-LM (separate runtime) with the encoder outputs as cross-attention context. Supports 30 languages + 22 Chinese dialects."
+ }
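The fields in this config are redundant in a useful way: the clip length, hop length and downsampling factor can all be derived from it and cross-checked against the README. A small sketch of such a check follows; the values are copied from the config above (the `note` and descriptive string fields are left out), while the `derive` and `check` helpers are hypothetical and not part of the repository.

```python
import json

# Numeric fields copied from config.json in this commit.
CONFIG_TEXT = """
{
  "sample_rate": 16000,
  "mel_frames_per_second": 100,
  "input_mel_frames": 1000,
  "input_mel_bins": 128,
  "output_tokens": 125,
  "output_dim": 1024,
  "encoder": {"num_layers": 18, "d_model": 896, "num_heads": 14, "ffn_dim": 3584},
  "inputs": {"mel": {"shape": [1, 128, 1000], "dtype": "float32"}},
  "outputs": {"audio_embeddings": {"shape": [1, 125, 1024], "dtype": "float32"}}
}
"""


def derive(cfg):
    """Quantities implied by the config but not stored in it."""
    enc = cfg["encoder"]
    return {
        "clip_seconds": cfg["input_mel_frames"] / cfg["mel_frames_per_second"],
        "frames_per_token": cfg["input_mel_frames"] / cfg["output_tokens"],
        "hop_length": cfg["sample_rate"] // cfg["mel_frames_per_second"],
        "head_dim": enc["d_model"] // enc["num_heads"],
        "ffn_ratio": enc["ffn_dim"] / enc["d_model"],
    }


def check(cfg):
    """Return a list of inconsistencies between scalar fields and tensor shapes."""
    problems = []
    mel_shape = [1, cfg["input_mel_bins"], cfg["input_mel_frames"]]
    out_shape = [1, cfg["output_tokens"], cfg["output_dim"]]
    if cfg["inputs"]["mel"]["shape"] != mel_shape:
        problems.append("mel shape does not match input_mel_bins/frames")
    if cfg["outputs"]["audio_embeddings"]["shape"] != out_shape:
        problems.append("output shape does not match output_tokens/dim")
    if cfg["encoder"]["d_model"] % cfg["encoder"]["num_heads"] != 0:
        problems.append("d_model is not divisible by num_heads")
    return problems


cfg = json.loads(CONFIG_TEXT)
derived = derive(cfg)
problems = check(cfg)
print(derived)
print(problems or "config is self-consistent")
```

The derived hop length of 160 samples and clip length of 10 s agree with the preprocessing parameters listed in the README.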
qwen3-asr-encoder.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:55a38764a35d189b24845d7ce52e0139ee706a1275e4f3efae83f95bae62a4ad
+ size 189283568
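The pointer above carries enough information to verify a downloaded copy of the model. As a hedged sketch, the code below parses the three pointer fields and compares a local file's size and SHA-256 digest against them; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names. It also shows that the 189283568 bytes correspond to the "180.5" in the README when counted in MiB.

```python
import hashlib

# Pointer contents copied from this commit.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:55a38764a35d189b24845d7ce52e0139ee706a1275e4f3efae83f95bae62a4ad
size 189283568
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer into version, hash algorithm, digest and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Stream a file from disk and compare its size and SHA-256 to the pointer."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and digest.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
size_mib = pointer["size"] / 2**20
size_mb = pointer["size"] / 10**6
print(f"{size_mib:.1f} MiB, {size_mb:.1f} MB")  # 180.5 MiB, 189.3 MB
```

`matches_pointer("qwen3-asr-encoder.tflite", pointer)` would return `True` only for a byte-identical download.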
qwen3-asr-encoder_recipe.json ADDED
@@ -0,0 +1 @@
+ [{"regex": ".*", "operation": "*", "algorithm_key": "min_max_uniform_quantize", "op_config": {"weight_tensor_config": {"num_bits": 8, "symmetric": true, "granularity": "CHANNELWISE", "dtype": "INT"}, "compute_precision": "INTEGER", "explicit_dequantize": false, "skip_checks": false, "min_weight_elements": 0}}]
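The recipe above asks for 8-bit, symmetric, channelwise min-max quantization of every weight tensor. To illustrate what those settings mean numerically, here is a plain-Python sketch of the technique. It is not the quantizer's own implementation, and it assumes the narrow symmetric range of -127 to 127 with one scale per output channel (one row of the weight matrix).

```python
def quantize_channelwise(weights, num_bits=8):
    """Symmetric min-max quantization with one scale per row (output channel).

    Returns (quantized rows of ints, list of per-row scales).
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits, assumed narrow range
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q_row = [max(-qmax, min(qmax, round(w / scale))) for w in row]
        quantized.append(q_row)
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Map the integers back to floats using each row's scale."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


weights = [
    [1.0, -0.3, 0.2],     # large-magnitude channel
    [0.02, -0.04, 0.0],   # small-magnitude channel keeps its own scale
]
quantized, scales = quantize_channelwise(weights)
restored = dequantize(quantized, scales)
print(quantized)
print(scales)
```

Per-channel scales are the point of `"granularity": "CHANNELWISE"`: the second row above would lose almost all of its resolution if it had to share the first row's scale.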