---
license: apache-2.0
language:
- zh
- yue
- en
- multilingual
tags:
- automatic-speech-recognition
- qwen
- qwen3
- chinese
- cantonese
- litert
- tflite
- on-device
- soniqo
- speech-cloud
- speech-core
base_model: Qwen/Qwen3-ASR-0.6B
library_name: litert
pipeline_tag: automatic-speech-recognition
---

# Qwen3 ASR 0.6B Encoder — LiteRT (INT8)

Qwen3-ASR audio encoder (zh / yue / en). INT8 weight-only.

> Part of the [**soniqo.audio**](https://soniqo.audio) speech toolkit,
> an open, runtime-portable stack for speech AI. This bundle is the
> **LiteRT** export, designed to plug into the abstract interfaces in
> [`speech-core`](https://github.com/soniqo/speech-core) (C++ voice-agent
> orchestration library). Browse all LiteRT bundles in the
> [**soniqo LiteRT collection**](https://huggingface.co/collections/soniqo/litert-6a08268e11d5a47d7aacc02b).

## Use cases on soniqo.audio

- [Multilingual transcription](https://soniqo.audio/transcription/)

Audio encoder of Qwen3-ASR-0.6B, specialized for Chinese (including 22 Chinese dialects) and 30 additional languages. Exported to LiteRT for Android. The text decoder is a Qwen3-0.6B LLM and is intended to run through LiteRT-LM as a separate runtime.

## Model

| Property | Value |
|---|---|
| Component | Audio encoder only |
| Parameters | ~180 M (encoder), decoder is a separate 0.6B LLM |
| Format | LiteRT (TFLite) |
| Quantization | INT8 dynamic weights (fp32 activations) |
| Sample rate | 16 000 Hz |
| Input | 128-bin log mel, 1000 frames (10 s, fixed) |
| Output | 125 audio embedding tokens, 1024-dim each |
| Languages | 30 + 22 Chinese dialects (Cantonese, Shanghainese, Sichuan, …) |

## Files

| File | Size | Description |
|---|---|---|
| `qwen3-asr-encoder.tflite` | 180.5 MB | Audio encoder, INT8 |
| `config.json` | 1 KB | Architecture + I/O specs |

## Signature

```
Inputs:
  mel               [1, 128, 1000]  float32   10 s log mel spectrogram

Outputs:
  audio_embeddings  [1, 125, 1024]  float32   For cross-attention into the decoder
```

## Architecture

```
mel [1, 128, 1000]
 └── 3× Conv2d(stride=2) + GELU       → [1, 480, 16, 125]
 └── reshape → Linear(7680→896)       → [1, 125, 896]
 └── + sinusoidal pos embed
 └── 18× pre-norm Transformer         → [1, 125, 896]
 └── LayerNorm → Linear(896) → GELU
 └── Linear(896→1024)                 → [1, 125, 1024]
```

## Why encoder only

The text decoder is a full Qwen3-0.6B language model with GQA, RoPE, SwiGLU and RMSNorm. It doesn't fit cleanly into a single `.tflite`; the right runtime for LLM decoders on Android is [LiteRT-LM](https://github.com/google-ai-edge/litert-lm) or a comparable LLM executor, with the audio embeddings from this encoder wired in as cross-attention context.

For ASR-only (no LLM), pair this encoder with a CTC or transducer head fine-tuned on your target languages.

## Audio preprocessing

- 16 kHz mono, float32
- 128 log mel bins
- `n_fft=400`, `hop_length=160`, `win_length=400`, `pad_mode="reflect"`
- log mel, mean/std normalization per utterance

The exact reference is in the upstream Qwen3-ASR tokenizer config.

## Source

Upstream: [Qwen/Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B) (Apache 2.0). Released January 2026 as part of the Qwen3 audio family.

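
## Shape check (sketch)

The shapes quoted in the Signature, Architecture and Audio preprocessing sections can be sanity-checked with a little arithmetic. The snippet below is a minimal sketch, not code from the bundle. The function names are hypothetical. The convolution kernel size (3) and padding (1) are assumptions, since this card only states `stride=2`. The frame count assumes one mel frame per hop with no extra trailing frame, which may differ from the convention of your feature extractor.

```python
# Sketch only: kernel=3 and padding=1 are ASSUMED; the card states stride=2 only.
SAMPLE_RATE = 16_000   # Hz, from the Model table
HOP_LENGTH = 160       # samples between frames, from Audio preprocessing
N_MELS = 128           # mel bins, from the Signature
CONV_CHANNELS = 480    # channels after the conv stack, from Architecture
NUM_CONVS = 3          # number of stride-2 convolutions, from Architecture


def mel_frames(num_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Frame count for a clip, assuming exactly one frame per hop."""
    return num_samples // hop_length


def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Standard convolution output-size formula (kernel/padding are assumed)."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(n_mels: int = N_MELS, frames: int = 1000) -> dict:
    """Trace the frequency and time axes through the three stride-2 convolutions."""
    freq, time = n_mels, frames
    for _ in range(NUM_CONVS):
        freq, time = conv_out(freq), conv_out(time)
    return {
        "conv_out": (1, CONV_CHANNELS, freq, time),
        "linear_in": CONV_CHANNELS * freq,  # channels x frequency, flattened per time step
        "tokens": time,
        "ms_per_token": 1000 * HOP_LENGTH * (frames // time) / SAMPLE_RATE,
    }


if __name__ == "__main__":
    print(mel_frames(10 * SAMPLE_RATE))
    print(encoder_shapes())
```

Under these assumptions a 10 s clip gives 1000 mel frames, the conv stack yields `[1, 480, 16, 125]`, the flattened width is 480 × 16 = 7680 (matching `Linear(7680→896)`), and each of the 125 output tokens spans 8 mel frames, about 80 ms of audio.
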
## Links

- [speech-android](https://github.com/soniqo/speech-android): Android SDK
- [soniqo.audio](https://soniqo.audio): website
- [blog](https://soniqo.audio/blog): blog

## Ecosystem

- [**soniqo.audio**](https://soniqo.audio): use-case explorer (transcription, voice cloning, live ASR, voice agents).
- [**speech-core**](https://github.com/soniqo/speech-core): C++ orchestration library for voice agents. Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces.
- [**speech-swift**](https://github.com/soniqo/speech-swift): Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
- [**speech-android**](https://github.com/soniqo/speech-android): Android SDK consuming on-device LiteRT bundles.

## Other LiteRT models in this collection

**ASR / Transcription**

- [Parakeet TDT 0.6B v3 — LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8)
- [Nemotron Speech Streaming 0.6B — LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT)
- [Omnilingual ASR CTC 300M — LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT)
- [Omnilingual ASR CTC 300M — LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8)

**VAD / Diarization**

- [Silero VAD v5 — LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT)
- [Pyannote Segmentation 3.0 — LiteRT](https://huggingface.co/soniqo/Pyannote-Segmentation-LiteRT)
- [WeSpeaker ResNet34-LM — LiteRT](https://huggingface.co/soniqo/WeSpeaker-ResNet34-LM-LiteRT)

**TTS / Voice Cloning**

- [VoxCPM2 — LiteRT (INT8)](https://huggingface.co/soniqo/VoxCPM2-LiteRT-INT8)

## License

This bundle inherits the upstream model license (**apache-2.0**). See the linked `base_model` repository for the full terms.