---
license: apache-2.0
language:
  - zh
  - yue
  - en
  - multilingual
tags:
  - automatic-speech-recognition
  - qwen
  - qwen3
  - chinese
  - cantonese
  - litert
  - tflite
  - on-device
  - soniqo
  - speech-cloud
  - speech-core
base_model: Qwen/Qwen3-ASR-0.6B
library_name: litert
pipeline_tag: automatic-speech-recognition
---

# Qwen3 ASR 0.6B Encoder — LiteRT (INT8)

Audio encoder of Qwen3-ASR-0.6B (zh / yue / en), exported to LiteRT with INT8 weight-only quantization.

> Part of the [**soniqo.audio**](https://soniqo.audio) speech toolkit —
> an open, runtime-portable stack for speech AI. This bundle is the
> **LiteRT** export, designed to plug into the abstract interfaces in
> [`speech-core`](https://github.com/soniqo/speech-core) (C++ voice-agent
> orchestration library). Browse all LiteRT bundles in the
> [**soniqo LiteRT collection**](https://huggingface.co/collections/soniqo/litert-6a08268e11d5a47d7aacc02b).

## Use cases on soniqo.audio

- [Multilingual transcription](https://soniqo.audio/transcription/)

This bundle contains the audio encoder of Qwen3-ASR-0.6B, which is
specialized for Chinese (including 22 Chinese dialects) and 30 additional
languages, exported to LiteRT for Android. The text decoder is a
Qwen3-0.6B LLM; it is not part of this bundle and is intended to run
through LiteRT-LM as a separate runtime.

## Model

| Property | Value |
|---|---|
| Component | Audio encoder only |
| Parameters | ~180 M (encoder); the decoder is a separate 0.6B LLM |
| Format | LiteRT (TFLite) |
| Quantization | INT8 dynamic weights (fp32 activations) |
| Sample rate | 16 000 Hz |
| Input | 128-bin log mel, 1000 frames (10 s, fixed) |
| Output | 125 audio embedding tokens, 1024-dim each |
| Languages | 30 languages + 22 Chinese dialects (Cantonese, Shanghainese, Sichuan, …) |

## Files

| File | Size | Description |
|---|---|---|
| `qwen3-asr-encoder.tflite` | 180.5 MB | Audio encoder, INT8 |
| `config.json` | 1 KB | Architecture + I/O specs |

## Signature

```
Inputs:
  mel               [1, 128, 1000]   float32   10 s log mel spectrogram

Outputs:
  audio_embeddings  [1, 125, 1024]   float32   For cross-attention into the decoder
```
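A minimal desktop sketch for checking the signature above before wiring the model into an app. It assumes the `ai-edge-litert` Python package and NumPy are installed, and that the model exposes exactly one input and one output tensor. The helper names `check_signature` and `run_encoder` are hypothetical and not part of any soniqo API.

```python
EXPECTED_INPUT = [1, 128, 1000]    # mel: [batch, mel bins, frames]
EXPECTED_OUTPUT = [1, 125, 1024]   # audio_embeddings: [batch, tokens, dim]


def check_signature(input_details, output_details):
    """Raise ValueError if the model's I/O shapes differ from this card."""
    in_shape = [int(d) for d in input_details[0]["shape"]]
    out_shape = [int(d) for d in output_details[0]["shape"]]
    if in_shape != EXPECTED_INPUT:
        raise ValueError(f"unexpected input shape {in_shape}")
    if out_shape != EXPECTED_OUTPUT:
        raise ValueError(f"unexpected output shape {out_shape}")


def run_encoder(model_path, mel):
    """Encode one 10 s window; `mel` must have shape [1, 128, 1000]."""
    # Imported lazily so check_signature stays usable without these packages.
    import numpy as np
    from ai_edge_litert.interpreter import Interpreter

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inputs = interpreter.get_input_details()
    outputs = interpreter.get_output_details()
    check_signature(inputs, outputs)

    # Assumption: tensor 0 is `mel` and output 0 is `audio_embeddings`.
    interpreter.set_tensor(inputs[0]["index"], np.asarray(mel, dtype=np.float32))
    interpreter.invoke()
    return interpreter.get_tensor(outputs[0]["index"])
```

Calling `run_encoder("qwen3-asr-encoder.tflite", mel)` should return an array of shape `[1, 125, 1024]` if the assumptions hold; a shape mismatch raises before inference runs.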

## Architecture

```
mel [1, 128, 1000]
  └── 3× Conv2d(stride=2) + GELU          → [1, 480, 16, 125]
  └── reshape → Linear(7680→896)          → [1, 125, 896]
  └── + sinusoidal pos embed
  └── 18× pre-norm Transformer            → [1, 125, 896]
  └── LayerNorm → Linear(896) → GELU
  └── Linear(896→1024)                    → [1, 125, 1024]
```
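The shapes in the diagram can be reproduced with plain arithmetic. The sketch below assumes each convolution uses `kernel=3` and `padding=1`, which the diagram does not state; only `stride=2`, the 480 channels, and the resulting shapes come from this card. The 10 ms frame step follows from `hop_length=160` at 16 kHz.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of one conv layer along one axis (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(mel_bins=128, frames=1000, channels=480, layers=3, hop_ms=10.0):
    """Trace the mel input through the strided conv stack."""
    freq, time = mel_bins, frames
    for _ in range(layers):
        # Each stride-2 layer halves both the frequency and the time axis.
        freq, time = conv_out(freq), conv_out(time)
    return {
        "conv_out": [1, channels, freq, time],
        "flattened": channels * freq,           # input width of the Linear layer
        "tokens": time,                         # one embedding per time step
        "ms_per_token": frames * hop_ms / time,
    }


shapes = encoder_shapes()
print(shapes)
```

Under these assumptions the three layers reduce both axes by 8×, so each of the 125 output tokens covers 80 ms of audio and the flattened width matches the `Linear(7680→896)` projection.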

## Why encoder only

The text decoder is a full Qwen3-0.6B language model with GQA, RoPE,
SwiGLU and RMSNorm. It doesn't fit cleanly into a single `.tflite`; the
right runtime for LLM decoders on Android is
[LiteRT-LM](https://github.com/google-ai-edge/litert-lm) or a comparable
LLM executor, with the audio embeddings from this encoder wired in as
cross-attention context.

For ASR without an LLM, the encoder would have to be paired with a CTC or
transducer head trained on your target languages; no such head ships in
this bundle.

## Audio preprocessing

- 16 kHz mono, float32
- 128 log mel bins
- `n_fft=400`, `hop_length=160`, `win_length=400`, `pad_mode="reflect"`
- log compression, then mean/std normalization per utterance
- pad or trim to exactly 1000 frames (10 s), since the input shape is fixed

The exact reference is in the upstream Qwen3-ASR tokenizer config.
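Because the encoder input is fixed at 10 s, longer recordings have to be split and shorter ones padded before feature extraction. The sketch below shows one way to do that on raw samples. Zero padding, non-overlapping windows, and one mel frame per hop are assumptions, not something the upstream reference is known to prescribe, and `split_into_windows` is a hypothetical helper.

```python
SAMPLE_RATE = 16_000
WINDOW_SAMPLES = 10 * SAMPLE_RATE   # 160 000 samples per 10 s window
HOP_LENGTH = 160                    # from the preprocessing parameters above


def split_into_windows(samples):
    """Split mono samples into 10 s windows, zero-padding the last one."""
    windows = []
    for start in range(0, len(samples), WINDOW_SAMPLES):
        chunk = list(samples[start:start + WINDOW_SAMPLES])
        # Pad the final, shorter chunk with silence up to the fixed length.
        chunk += [0.0] * (WINDOW_SAMPLES - len(chunk))
        windows.append(chunk)
    return windows


def frames_per_window():
    """Mel frames per window, assuming one frame per hop."""
    return WINDOW_SAMPLES // HOP_LENGTH


windows = split_into_windows([0.5] * 200_000)   # 12.5 s of dummy audio
print(len(windows), len(windows[0]), frames_per_window())
```

Non-overlapping windows can cut a word at a boundary; overlapping windows or splitting on VAD segments are common alternatives, at the cost of merging the resulting transcripts.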

## Source

Upstream: [Qwen/Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B)
(Apache 2.0). Released January 2026 as part of the Qwen3 audio family.

## Links

- [speech-android](https://github.com/soniqo/speech-android) — Android SDK
- [soniqo.audio](https://soniqo.audio) — website
- [blog](https://soniqo.audio/blog) — blog

## Ecosystem

- [**soniqo.audio**](https://soniqo.audio) — use-case explorer (transcription, voice cloning, live ASR, voice agents).
- [**speech-core**](https://github.com/soniqo/speech-core) — C++ orchestration library for voice agents. Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces.
- [**speech-swift**](https://github.com/soniqo/speech-swift) — Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
- [**speech-android**](https://github.com/soniqo/speech-android) — Android SDK consuming on-device LiteRT bundles.

## Other LiteRT models in this collection

**ASR / Transcription**

- [Parakeet TDT 0.6B v3 — LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8)
- [Nemotron Speech Streaming 0.6B — LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT)
- [Omnilingual ASR CTC 300M — LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT)
- [Omnilingual ASR CTC 300M — LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8)

**VAD / Diarization**

- [Silero VAD v5 — LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT)
- [Pyannote Segmentation 3.0 — LiteRT](https://huggingface.co/soniqo/Pyannote-Segmentation-LiteRT)
- [WeSpeaker ResNet34-LM — LiteRT](https://huggingface.co/soniqo/WeSpeaker-ResNet34-LM-LiteRT)

**TTS / Voice Cloning**

- [VoxCPM2 — LiteRT (INT8)](https://huggingface.co/soniqo/VoxCPM2-LiteRT-INT8)

## License

This bundle inherits the upstream model license (**apache-2.0**). See the
linked `base_model` repository for the full terms.