---
license: apache-2.0
language:
- zh
- yue
- en
- multilingual
tags:
- automatic-speech-recognition
- qwen
- qwen3
- chinese
- cantonese
- litert
- tflite
- on-device
- soniqo
- speech-cloud
- speech-core
base_model: Qwen/Qwen3-ASR-0.6B
library_name: litert
pipeline_tag: automatic-speech-recognition
---
# Qwen3 ASR 0.6B Encoder - LiteRT (INT8)
Qwen3-ASR audio encoder (zh / yue / en). INT8 weight-only.
> Part of the [**soniqo.audio**](https://soniqo.audio) speech toolkit,
> an open, runtime-portable stack for speech AI. This bundle is the
> **LiteRT** export, designed to plug into the abstract interfaces in
> [`speech-core`](https://github.com/soniqo/speech-core) (C++ voice-agent
> orchestration library). Browse all LiteRT bundles in the
> [**soniqo LiteRT collection**](https://huggingface.co/collections/soniqo/litert-6a08268e11d5a47d7aacc02b).
## Use cases on soniqo.audio
- [Multilingual transcription](https://soniqo.audio/transcription/)
Audio encoder of Qwen3-ASR-0.6B, specialized for Chinese (including 22
Chinese dialects) and 30 additional languages. Exported to LiteRT for
Android. The text decoder is a Qwen3-0.6B LLM and is intended to run
through LiteRT-LM as a separate runtime.
## Model
| Property | Value |
|---|---|
| Component | Audio encoder only |
| Parameters | ~180 M (encoder), decoder is a separate 0.6B LLM |
| Format | LiteRT (TFLite) |
| Quantization | INT8 dynamic weights (fp32 activations) |
| Sample rate | 16 000 Hz |
| Input | 128-bin log mel, 1000 frames (10 s, fixed) |
| Output | 125 audio embedding tokens, 1024-dim each |
| Languages | 30 + 22 Chinese dialects (Cantonese, Shanghainese, Sichuan, …) |
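
Because the input is fixed at 1000 frames (10 s), longer recordings have to be split into windows and shorter ones padded. The sketch below only works through the arithmetic implied by the table. `plan_windows` is a hypothetical helper, and the choice of non-overlapping windows with a padded final window is an assumption, not something this bundle prescribes.

```python
SAMPLE_RATE = 16_000        # Hz, from the table above
WINDOW_SECONDS = 10         # fixed encoder input length
TOKENS_PER_WINDOW = 125     # audio embedding tokens per window

WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS
MS_PER_TOKEN = WINDOW_SECONDS * 1000 / TOKENS_PER_WINDOW


def plan_windows(num_samples):
    """Return (start, end, pad) sample triples covering the recording.

    Hypothetical helper: windows do not overlap, and `pad` is the number
    of samples the caller must append to fill that window.
    """
    windows = []
    for start in range(0, max(num_samples, 1), WINDOW_SAMPLES):
        end = min(start + WINDOW_SAMPLES, num_samples)
        windows.append((start, end, WINDOW_SAMPLES - (end - start)))
    return windows


if __name__ == "__main__":
    for window in plan_windows(25 * SAMPLE_RATE):   # a 25 s recording
        print(window)
    print(f"{MS_PER_TOKEN:.0f} ms of audio per embedding token")
```

A 25 s recording therefore needs three encoder calls, with the last window half padding. Overlapping windows may give better results at chunk boundaries, at the cost of extra calls.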
## Files
| File | Size | Description |
|---|---|---|
| `qwen3-asr-encoder.tflite` | 180.5 MB | Audio encoder, INT8 |
| `config.json` | 1 KB | Architecture + I/O specs |
## Signature
```
Inputs:
  mel               [1, 128, 1000]   float32   10 s log mel spectrogram
Outputs:
  audio_embeddings  [1, 125, 1024]   float32   For cross-attention into the decoder
```
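
As a rough illustration of the signature above, the following sketch runs one window through the encoder with the Python LiteRT interpreter from the `ai-edge-litert` package. `fit_to_window`, `encode` and `load_encoder` are hypothetical helpers, and padding short inputs with `0.0` is an assumption; the model file name is taken from the Files table.

```python
import numpy as np

MEL_BINS, MEL_FRAMES = 128, 1000


def fit_to_window(mel, pad_value=0.0):
    """Pad or trim a [128, T] log mel array to [1, 128, 1000] float32.

    pad_value=0.0 is an assumption; match whatever the upstream
    preprocessing uses for silence.
    """
    mel = np.asarray(mel, dtype=np.float32)
    if mel.ndim != 2 or mel.shape[0] != MEL_BINS:
        raise ValueError(f"expected [128, T], got {mel.shape}")
    out = np.full((1, MEL_BINS, MEL_FRAMES), pad_value, dtype=np.float32)
    frames = min(mel.shape[1], MEL_FRAMES)
    out[0, :, :frames] = mel[:, :frames]
    return out


def encode(interpreter, mel):
    """Run one window through an already allocated interpreter."""
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"], fit_to_window(mel))
    interpreter.invoke()
    return interpreter.get_tensor(out["index"])   # expected [1, 125, 1024]


def load_encoder(path="qwen3-asr-encoder.tflite"):
    # Imported lazily so the helpers above work without the runtime installed.
    from ai_edge_litert.interpreter import Interpreter
    interpreter = Interpreter(model_path=path)
    interpreter.allocate_tensors()
    return interpreter
```

Typical use would be `embeddings = encode(load_encoder(), mel)`, where `mel` is a `[128, T]` log mel array produced by the preprocessing described in this card.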
## Architecture
```
mel [1, 128, 1000]
  └── 3× Conv2d(stride=2) + GELU         → [1, 480, 16, 125]
  └── reshape → Linear(7680→896)         → [1, 125, 896]
  └── + sinusoidal pos embed
  └── 18× pre-norm Transformer           → [1, 125, 896]
  └── LayerNorm → Linear(896) → GELU
  └── Linear(896→1024)                   → [1, 125, 1024]
```
```
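
The shapes in the diagram can be checked with a few lines of arithmetic. This is a sketch only: kernel size 3 and padding 1 are assumptions (the card states only the stride), chosen because they reproduce the listed shapes.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of one strided convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(mel_bins=128, frames=1000, channels=480, layers=3):
    """Trace the mel axes through the conv stack and the reshape."""
    freq, time = mel_bins, frames
    for _ in range(layers):
        freq, time = conv_out(freq), conv_out(time)
    return {
        "conv_out": (1, channels, freq, time),
        "flattened": channels * freq,   # per-token input to Linear(7680 -> 896)
        "tokens": time,
    }


if __name__ == "__main__":
    print(encoder_shapes())
```

Each stride-2 layer halves both axes, so 128 mel bins become 16 and 1000 frames become 125; flattening 480 channels over 16 frequency bins gives the 7680-dimensional input to the first linear layer.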
## Why encoder only
The text decoder is a full Qwen3-0.6B language model with GQA, RoPE,
SwiGLU and RMSNorm. It doesn't fit cleanly into a single `.tflite`; the
right runtime for LLM decoders on Android is
[LiteRT-LM](https://github.com/google-ai-edge/litert-lm) or a comparable
LLM executor, with the audio embeddings from this encoder wired in as
cross-attention context.
For ASR-only (no LLM), pair this encoder with a CTC or transducer head
fine-tuned on your target languages.
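
For the CTC option, decoding can be as simple as a greedy pass over per-token predictions. The sketch below assumes a hypothetical linear head that maps each of the 125 embedding tokens to vocabulary logits, and that `frame_ids` holds the per-token argmax; using id `0` as the blank symbol is also an assumption.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse repeated ids, then drop blanks (the usual CTC greedy rule)."""
    decoded, previous = [], None
    for token in frame_ids:
        if token != previous and token != blank:
            decoded.append(token)
        previous = token
    return decoded


if __name__ == "__main__":
    # Two separate 3s survive because a blank sits between them.
    print(ctc_greedy_decode([0, 3, 3, 0, 3, 5, 5, 0]))
```

The resulting ids still have to be mapped to text with whatever vocabulary the head was trained on.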
## Audio preprocessing
- 16 kHz mono, float32
- 128 log mel bins
- `n_fft=400`, `hop_length=160`, `win_length=400`, `pad_mode="reflect"`
- log mel, mean/std normalization per utterance
The exact reference is in the upstream Qwen3-ASR tokenizer config.
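
A NumPy-only sketch of the listed parameters follows. It is not the upstream implementation: the HTK mel scale, the periodic Hann window, `log10` with a `1e-10` floor, normalization over the whole utterance rather than per bin, and dropping the trailing frame so that 10 s yields exactly 1000 frames are all assumptions. Check them against the upstream reference before relying on the output.

```python
import numpy as np

SAMPLE_RATE, N_FFT, HOP, N_MELS = 16_000, 400, 160, 128


def hz_to_mel(hz):
    # HTK mel scale (assumption; the upstream extractor may differ).
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SAMPLE_RATE):
    """Triangular filters, shape [n_mels, n_fft // 2 + 1]."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    top = float(hz_to_mel(sr / 2))
    edges = mel_to_hz(np.linspace(0.0, top, n_mels + 2))
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (fft_freqs[None, :] - lower) / (center - lower)
    falling = (upper - fft_freqs[None, :]) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))


def log_mel(audio):
    """16 kHz mono float audio -> normalized log mel, shape [128, T]."""
    audio = np.asarray(audio, dtype=np.float32)
    padded = np.pad(audio, N_FFT // 2, mode="reflect")
    count = 1 + (len(padded) - N_FFT) // HOP
    idx = np.arange(N_FFT)[None, :] + HOP * np.arange(count)[:, None]
    window = np.hanning(N_FFT + 1)[:-1]                    # periodic Hann
    power = np.abs(np.fft.rfft(padded[idx] * window, axis=1)) ** 2
    mel = mel_filterbank() @ power.T                        # [128, count]
    mel = mel[:, :-1]                # drop trailing frame (assumption)
    logmel = np.log10(np.maximum(mel, 1e-10))
    logmel = (logmel - logmel.mean()) / (logmel.std() + 1e-8)
    return logmel.astype(np.float32)


if __name__ == "__main__":
    t = np.arange(10 * SAMPLE_RATE) / SAMPLE_RATE
    print(log_mel(np.sin(2 * np.pi * 440.0 * t)).shape)
```

With 128 filters over only 201 FFT bins, some of the lowest triangular filters contain no FFT bin and stay at the log floor; a production extractor would normally come from the upstream feature-extraction code instead.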
## Source
Upstream: [Qwen/Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B)
(Apache 2.0). Released January 2026 as part of the Qwen3 audio family.
## Links
- [speech-android](https://github.com/soniqo/speech-android): Android SDK
- [soniqo.audio](https://soniqo.audio): website
- [blog](https://soniqo.audio/blog): blog
## Ecosystem
- [**soniqo.audio**](https://soniqo.audio): use-case explorer (transcription, voice cloning, live ASR, voice agents).
- [**speech-core**](https://github.com/soniqo/speech-core): C++ orchestration library for voice agents. Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces.
- [**speech-swift**](https://github.com/soniqo/speech-swift): Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
- [**speech-android**](https://github.com/soniqo/speech-android): Android SDK consuming on-device LiteRT bundles.
## Other LiteRT models in this collection
**ASR / Transcription**
- [Parakeet TDT 0.6B v3 - LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8)
- [Nemotron Speech Streaming 0.6B - LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT)
- [Omnilingual ASR CTC 300M - LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT)
- [Omnilingual ASR CTC 300M - LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8)
**VAD / Diarization**
- [Silero VAD v5 - LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT)
- [Pyannote Segmentation 3.0 - LiteRT](https://huggingface.co/soniqo/Pyannote-Segmentation-LiteRT)
- [WeSpeaker ResNet34-LM - LiteRT](https://huggingface.co/soniqo/WeSpeaker-ResNet34-LM-LiteRT)
**TTS / Voice Cloning**
- [VoxCPM2 - LiteRT (INT8)](https://huggingface.co/soniqo/VoxCPM2-LiteRT-INT8)
## License
This bundle inherits the upstream model license (**apache-2.0**). See the
linked `base_model` repository for the full terms.