Initial LiteRT upload

Files changed (4)

README.md +102 -0
config.json +39 -0
qwen3-asr-encoder.tflite +3 -0
qwen3-asr-encoder_recipe.json +1 -0

README.md ADDED

	@@ -0,0 +1,102 @@

+---
+license: apache-2.0
+language:
+  - zh
+  - yue
+  - en
+  - multilingual
+tags:
+  - automatic-speech-recognition
+  - qwen
+  - qwen3
+  - chinese
+  - cantonese
+  - litert
+  - tflite
+  - on-device
+  - android
+base_model: Qwen/Qwen3-ASR-0.6B
+library_name: litert
+pipeline_tag: automatic-speech-recognition
+---
+# Qwen3-ASR-0.6B Audio Encoder — LiteRT (INT8)
+Audio encoder of Qwen3-ASR-0.6B, which supports 30 languages plus 22
+Chinese dialects and is specialized for Chinese. Exported to LiteRT for
+Android. The text decoder is a Qwen3-0.6B LLM and is intended to run
+through LiteRT-LM as a separate runtime.
+## Model
+| Property | Value |
+|---|---|
+| Component | Audio encoder only |
+| Parameters | ~180 M (encoder), decoder is a separate 0.6B LLM |
+| Format | LiteRT (TFLite) |
+| Quantization | INT8 dynamic-range (channel-wise symmetric weights, fp32 activations) |
+| Sample rate | 16 000 Hz |
+| Input | 128-bin log mel, 1000 frames (10 s, fixed) |
+| Output | 125 audio embedding tokens, 1024-dim each |
+| Languages | 30 languages + 22 Chinese dialects (Cantonese, Shanghainese, Sichuan, …) |
+## Files
+| File | Size | Description |
+|---|---|---|
+| `qwen3-asr-encoder.tflite` | 180.5 MiB (189,283,568 bytes) | Audio encoder, INT8 |
+| `config.json` | 1 KB | Architecture + I/O specs |
+| `qwen3-asr-encoder_recipe.json` | < 1 KB | Quantization recipe (INT8, channel-wise) |
+## Signature
+```
+Inputs:
+  mel               [1, 128, 1000]   float32   10 s log mel spectrogram
+Outputs:
+  audio_embeddings  [1, 125, 1024]   float32   For cross-attention into the decoder
+```
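The signature above can be exercised from Python before wiring the model into an Android app. The following is a minimal sketch, assuming the `ai-edge-litert` and `numpy` packages are installed and that the model exposes a single input and a single output tensor; `check_shape` and `run_encoder` are hypothetical helper names, not part of this repository.

```python
# Shapes copied from the Signature block.
MEL_SHAPE = (1, 128, 1000)
EMBEDDING_SHAPE = (1, 125, 1024)


def check_shape(name, actual, expected):
    """Raise ValueError if a tensor shape does not match the documented one."""
    if tuple(actual) != tuple(expected):
        raise ValueError(f"{name}: expected {tuple(expected)}, got {tuple(actual)}")


def run_encoder(model_path, mel):
    """Run one fixed 10 s window through the encoder and return the embeddings.

    Assumption: the .tflite file has exactly one input and one output tensor,
    so index 0 of the details lists is the right one.
    """
    # Imported lazily so the shape helpers work without the runtime installed.
    import numpy as np
    from ai_edge_litert.interpreter import Interpreter

    mel = np.asarray(mel, dtype=np.float32)
    check_shape("mel", mel.shape, MEL_SHAPE)

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]

    interpreter.set_tensor(input_index, mel)
    interpreter.invoke()
    embeddings = interpreter.get_tensor(output_index)

    check_shape("audio_embeddings", embeddings.shape, EMBEDDING_SHAPE)
    return embeddings
```

Calling `run_encoder("qwen3-asr-encoder.tflite", mel)` with a correctly shaped log mel array should return a `[1, 125, 1024]` float32 array; shorter clips would need padding to 1000 frames first, since the input length is fixed.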
+## Architecture
+```
+mel [1, 128, 1000]
+  └── 3× Conv2d(stride=2) + GELU          → [1, 480, 16, 125]
+  └── reshape → Linear(7680→896)          → [1, 125, 896]
+  └── + sinusoidal pos embed
+  └── 18× pre-norm Transformer            → [1, 125, 896]
+  └── LayerNorm → Linear(896) → GELU
+  └── Linear(896→1024)                    → [1, 125, 1024]
+```
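The shapes in the diagram above can be checked with a few lines of arithmetic. This sketch assumes each convolution uses kernel size 3 with padding 1, which the README does not state; only the stride of 2, the 480 channels, and the 7680, 896, and 1024 dimensions come from the diagram.

```python
def conv_out(n, kernel=3, stride=2, padding=1):
    """Output length of one conv along one axis (kernel/padding are assumed)."""
    return (n + 2 * padding - kernel) // stride + 1


def encoder_shapes(mel_bins=128, frames=1000, channels=480,
                   d_model=896, out_dim=1024, num_convs=3):
    """Trace tensor shapes from the mel input to the audio embeddings."""
    freq, time = mel_bins, frames
    for _ in range(num_convs):
        freq, time = conv_out(freq), conv_out(time)
    return {
        "conv": (1, channels, freq, time),
        # Channels and frequency are flattened into one feature axis per token.
        "flattened_features": channels * freq,
        "transformer": (1, time, d_model),
        "output": (1, time, out_dim),
    }


shapes = encoder_shapes()
print(shapes)
```

Under these assumptions the three stride-2 convolutions downsample both axes by 8, so 1000 mel frames become 125 tokens and 128 mel bins become 16, giving the 480 × 16 = 7680 features that feed `Linear(7680→896)`.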
+## Why encoder only
+The text decoder is a full Qwen3-0.6B language model with GQA, RoPE,
+SwiGLU and RMSNorm. It doesn't fit cleanly into a single `.tflite`; the
+right runtime for LLM decoders on Android is
+[LiteRT-LM](https://github.com/google-ai-edge/litert-lm) or a comparable
+LLM executor, with the audio embeddings from this encoder wired in as
+cross-attention context.
+For ASR-only (no LLM), pair this encoder with a CTC or transducer head
+fine-tuned on your target languages.
+## Audio preprocessing
+- 16 kHz mono, float32
+- 128 log mel bins
+- `n_fft=400`, `hop_length=160`, `win_length=400`, `pad_mode="reflect"`
+- log mel, mean/std normalization per utterance
+The exact reference is the feature-extractor (preprocessor) config in the upstream Qwen3-ASR repository.
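Two of the preprocessing steps listed above can be sketched without any audio library: the frame count implied by `hop_length=160`, and the per-utterance normalization. The sketch assumes the frame count is `num_samples // hop_length` and that mean and standard deviation are taken over all bins and frames of the utterance together; the README does not specify either detail, so the upstream feature extractor remains the reference.

```python
import math

SAMPLE_RATE = 16_000
HOP_LENGTH = 160


def expected_frames(num_samples, hop_length=HOP_LENGTH):
    """Number of mel frames for a clip (assumes one frame per full hop)."""
    return num_samples // hop_length


def normalize_utterance(log_mel, eps=1e-5):
    """Mean/std normalize a [bins][frames] log mel matrix.

    Assumption: a single global mean and std for the whole utterance,
    with eps added to the std to avoid division by zero on silence.
    """
    values = [v for row in log_mel for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [[(v - mean) / (std + eps) for v in row] for row in log_mel]


print(expected_frames(10 * SAMPLE_RATE))  # frames in a 10 s clip
```

A 10 s clip at 16 kHz has 160 000 samples, which matches the fixed 1000-frame input under this frame-count assumption.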
+## Source
+Upstream: [Qwen/Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B)
+(Apache 2.0). Released January 2026 as part of the Qwen3 audio family.
+## Links
+- [speech-android](https://github.com/soniqo/speech-android) — Android SDK
+- [soniqo.audio](https://soniqo.audio) — website
+- [blog](https://soniqo.audio/blog) — blog

config.json ADDED

	@@ -0,0 +1,39 @@

+{
+  "model": "Qwen3-ASR-0.6B",
+  "component": "audio_encoder",
+  "format": "tflite",
+  "quantization": "int8",
+  "sample_rate": 16000,
+  "mel_frames_per_second": 100,
+  "input_mel_frames": 1000,
+  "input_mel_bins": 128,
+  "output_tokens": 125,
+  "output_dim": 1024,
+  "encoder": {
+    "num_layers": 18,
+    "d_model": 896,
+    "num_heads": 14,
+    "ffn_dim": 3584
+  },
+  "inputs": {
+    "mel": {
+      "shape": [
+        1,
+        128,
+        1000
+      ],
+      "dtype": "float32"
+    }
+  },
+  "outputs": {
+    "audio_embeddings": {
+      "shape": [
+        1,
+        125,
+        1024
+      ],
+      "dtype": "float32"
+    }
+  },
+  "note": "This is the audio encoder only. The text decoder is a Qwen3-0.6B LLM; run it through LiteRT-LM (separate runtime) with the encoder outputs as cross-attention context. Supports 30 languages + 22 Chinese dialects."
+}
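The fields in this config are related to each other, and a small consistency check can catch mistakes when the file is edited. The sketch below inlines a subset of the values from the diff; `derived` is a hypothetical helper, and the quantities it computes (window length, token rate, head dimension, FFN ratio) are derived here rather than stated in the config.

```python
import json

# Subset of config.json, copied from the diff above.
CONFIG = json.loads("""
{
  "sample_rate": 16000,
  "mel_frames_per_second": 100,
  "input_mel_frames": 1000,
  "output_tokens": 125,
  "encoder": {"num_layers": 18, "d_model": 896, "num_heads": 14, "ffn_dim": 3584}
}
""")


def derived(config):
    """Compute quantities implied by the config fields."""
    enc = config["encoder"]
    frames = config["input_mel_frames"]
    fps = config["mel_frames_per_second"]
    tokens = config["output_tokens"]
    return {
        "window_seconds": frames / fps,
        "downsample_factor": frames // tokens,
        "tokens_per_second": tokens * fps / frames,
        "head_dim": enc["d_model"] // enc["num_heads"],
        "ffn_ratio": enc["ffn_dim"] / enc["d_model"],
    }


info = derived(CONFIG)
# d_model must split evenly across attention heads.
assert CONFIG["encoder"]["d_model"] % CONFIG["encoder"]["num_heads"] == 0
print(info)
```

With the values in this commit, the window is 10 s, the encoder emits 12.5 tokens per second of audio, and each of the 14 heads has dimension 64.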

qwen3-asr-encoder.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:55a38764a35d189b24845d7ce52e0139ee706a1275e4f3efae83f95bae62a4ad
+size 189283568
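The pointer above carries enough information to verify a downloaded copy of the model. This is a minimal sketch using only the standard library; `parse_lfs_pointer` and `verify_file` are hypothetical helper names, and the pointer text is copied from the diff.

```python
import hashlib

POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:55a38764a35d189b24845d7ce52e0139ee706a1275e4f3efae83f95bae62a4ad
size 189283568
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into its version, hash, and size fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def verify_file(path, pointer):
    """Return True if the file's size and hash both match the pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so the whole model never sits in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER)
print(f"{pointer['size'] / 2**20:.1f} MiB")
```

The size field works out to about 180.5 MiB (roughly 189.3 MB in decimal units), which is the figure quoted in the README's Files table.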

qwen3-asr-encoder_recipe.json ADDED

	@@ -0,0 +1 @@


+[{"regex": ".*", "operation": "*", "algorithm_key": "min_max_uniform_quantize", "op_config": {"weight_tensor_config": {"num_bits": 8, "symmetric": true, "granularity": "CHANNELWISE", "dtype": "INT"}, "compute_precision": "INTEGER", "explicit_dequantize": false, "skip_checks": false, "min_weight_elements": 0}}]
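The recipe asks for 8-bit, symmetric, channel-wise min/max quantization of weights. The sketch below illustrates that scheme on a plain list-of-lists weight matrix; it is not the quantizer's actual implementation, and it assumes each row is one output channel and that the integer range is the narrow `[-127, 127]`.

```python
def quantize_channelwise(weights, num_bits=8):
    """Symmetric per-channel quantization: one scale per row, zero point 0.

    Assumptions: rows are output channels; range is [-qmax, qmax].
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # An all-zero channel gets scale 1.0 to avoid dividing by zero.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        quantized.append([max(-qmax, min(qmax, round(w / scale))) for w in row])
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Map integers back to floats using each channel's scale."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


weights = [[1.27, -0.64, 0.0], [0.0, 0.0, 0.0]]
q, scales = quantize_channelwise(weights)
print(q, scales)
```

Because each channel has its own scale, a channel with small weights keeps more resolution than it would under a single per-tensor scale, and the round-trip error per weight stays within half a quantization step.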