	---
	license: apache-2.0
	language:
	- zh
	- yue
	- en
	- multilingual
	tags:
	- automatic-speech-recognition
	- qwen
	- qwen3
	- chinese
	- cantonese
	- litert
	- tflite
	- on-device
	- soniqo
	- speech-cloud
	- speech-core
	base_model: Qwen/Qwen3-ASR-0.6B
	library_name: litert
	pipeline_tag: automatic-speech-recognition
	---

	# Qwen3 ASR 0.6B Encoder — LiteRT (INT8)

	Audio encoder from Qwen3-ASR-0.6B (zh / yue / en), exported to LiteRT with INT8 weight-only quantization.

	> Part of the [soniqo.audio](https://soniqo.audio) speech toolkit —
	> an open, runtime-portable stack for speech AI. This bundle is the
	> LiteRT export, designed to plug into the abstract interfaces in
	> [`speech-core`](https://github.com/soniqo/speech-core) (C++ voice-agent
	> orchestration library). Browse all LiteRT bundles in the
	> [soniqo LiteRT collection](https://huggingface.co/collections/soniqo/litert-6a08268e11d5a47d7aacc02b).

	## Use cases on soniqo.audio

	- [Multilingual transcription](https://soniqo.audio/transcription/)

	Audio encoder of Qwen3-ASR-0.6B, specialized for Chinese (including 22
	Chinese dialects) and 30 additional languages. Exported to LiteRT for
	Android. The text decoder is a Qwen3-0.6B LLM and is intended to run
	through LiteRT-LM as a separate runtime.

	## Model

	| Property | Value |
	|---|---|
	| Component | Audio encoder only |
	| Parameters | ~180 M (encoder); the decoder is a separate 0.6B LLM |
	| Format | LiteRT (TFLite) |
	| Quantization | INT8 dynamic weights (fp32 activations) |
	| Sample rate | 16 000 Hz |
	| Input | 128-bin log mel, 1000 frames (10 s, fixed) |
	| Output | 125 audio embedding tokens, 1024-dim each |
	| Languages | 30 + 22 Chinese dialects (Cantonese, Shanghainese, Sichuan, …) |
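
Because the input is fixed at 1000 frames (10 s at 16 kHz), longer recordings have to be split and shorter ones padded before feature extraction. The following is a minimal sketch of that bookkeeping, assuming non-overlapping windows and zero padding of the final window. `plan_chunks` is a hypothetical helper written for this card, not part of any soniqo SDK.

```python
SAMPLE_RATE = 16_000                          # Hz, from the model table
CHUNK_SECONDS = 10                            # fixed encoder window
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 160 000 samples


def plan_chunks(num_samples: int) -> list[tuple[int, int, int]]:
    """Return (start, end, pad) triples that cover num_samples.

    start/end index into the waveform; pad is the number of zero samples
    to append so that every chunk is exactly CHUNK_SAMPLES long.
    """
    if num_samples <= 0:
        return []
    chunks = []
    for start in range(0, num_samples, CHUNK_SAMPLES):
        end = min(start + CHUNK_SAMPLES, num_samples)
        chunks.append((start, end, CHUNK_SAMPLES - (end - start)))
    return chunks
```

For a 25 s recording (400 000 samples) this gives three windows, the last one padded with 80 000 zeros (5 s). Overlapping windows or VAD-based segmentation may work better at chunk boundaries; that choice is outside what this card specifies.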

	## Files

	| File | Size | Description |
	|---|---|---|
	| `qwen3-asr-encoder.tflite` | 180.5 MB | Audio encoder, INT8 |
	| `config.json` | 1 KB | Architecture + I/O specs |
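
The encoder file size is roughly what weight-only INT8 predicts: about one byte per parameter, plus scales and any tensors kept in float. The sketch below illustrates the idea with symmetric per-tensor quantization. The granularity the exporter actually used (per-tensor or per-channel) is not stated in this card, so treat that as an assumption. All function names are illustrative.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric INT8 quantization: w ≈ q * scale, with q in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover float weights at run time; activations stay in fp32."""
    return [v * scale for v in q]


def estimated_size_mb(num_params: int, bytes_per_param: int = 1) -> float:
    """Rough storage estimate that ignores scales and metadata."""
    return num_params * bytes_per_param / 1e6
```

With roughly 180 M parameters, `estimated_size_mb(180_000_000)` gives 180.0, close to the 180.5 MB listed above.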

	## Signature

	```
	Inputs:
	mel [1, 128, 1000] float32 10 s log mel spectrogram

	Outputs:
	audio_embeddings [1, 125, 1024] float32 For cross-attention into the decoder
	```
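
A minimal sketch of calling this signature from Python, useful for checking the bundle on a desktop before wiring it into an Android app. It assumes the `ai-edge-litert` pip package (`tf.lite.Interpreter` exposes the same methods) and that the model has a single input and a single output, as listed above. `load_encoder` and `encode` are illustrative names.

```python
def load_encoder(model_path: str = "qwen3-asr-encoder.tflite"):
    """Create a LiteRT interpreter for the encoder file."""
    # Imported lazily so the module loads even without the package installed.
    from ai_edge_litert.interpreter import Interpreter

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    return interpreter


def encode(interpreter, mel):
    """Run one fixed-size window through the encoder.

    mel: float32 array shaped [1, 128, 1000].
    Returns the audio_embeddings tensor, expected shape [1, 125, 1024].
    """
    input_info = interpreter.get_input_details()[0]
    output_info = interpreter.get_output_details()[0]
    interpreter.set_tensor(input_info["index"], mel)
    interpreter.invoke()
    return interpreter.get_tensor(output_info["index"])
```

Typical use would be `embeddings = encode(load_encoder(), mel)`, with `mel` produced by the preprocessing described under "Audio preprocessing".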

	## Architecture

	```
	mel [1, 128, 1000]
	└── 3× Conv2d(stride=2) + GELU → [1, 480, 16, 125]
	└── reshape → Linear(7680→896) → [1, 125, 896]
	└── + sinusoidal pos embed
	└── 18× pre-norm Transformer → [1, 125, 896]
	└── LayerNorm → Linear(896) → GELU
	└── Linear(896→1024) → [1, 125, 1024]
	```
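
The shapes in the diagram can be checked with a few lines of arithmetic. This sketch assumes each convolution uses kernel size 3 and padding 1, which the card does not state. Those values are chosen because they reproduce the listed shapes with stride 2.

```python
def conv_out(n: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length along one axis of a convolution (standard floor formula)."""
    return (n + 2 * padding - kernel) // stride + 1


def trace_shapes(mel_bins: int = 128, frames: int = 1000,
                 channels: int = 480, layers: int = 3) -> dict:
    """Follow the mel input through the stride-2 conv stack."""
    freq, time = mel_bins, frames
    for _ in range(layers):
        freq, time = conv_out(freq), conv_out(time)
    return {
        "conv_out": (1, channels, freq, time),
        "flattened": channels * freq,  # input width of Linear(7680→896)
        "tokens": time,                # sequence length seen by the Transformer
    }
```

Three stride-2 layers downsample both axes by 8: 128 mel bins become 16 and 1000 frames become 125, so each output token covers 80 ms of audio and the flattened feature width is 480 × 16 = 7680.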

	## Why encoder only

	The text decoder is a full Qwen3-0.6B language model with GQA, RoPE,
	SwiGLU and RMSNorm. It doesn't fit cleanly into a single `.tflite`; the
	right runtime for LLM decoders on Android is
	[LiteRT-LM](https://github.com/google-ai-edge/litert-lm) or a comparable
	LLM executor, with the audio embeddings from this encoder wired in as
	cross-attention context.

	To transcribe without the LLM decoder, pair this encoder with a CTC or
	transducer head fine-tuned on your target languages.

	## Audio preprocessing

	- 16 kHz mono, float32
	- 128 log mel bins
	- `n_fft=400`, `hop_length=160`, `win_length=400`, `pad_mode="reflect"`
	- log mel, mean/std normalization per utterance

	The exact reference is in the upstream Qwen3-ASR tokenizer config.
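
The parameters above can be sketched in NumPy as follows. This is an approximation for orientation, not the upstream feature extractor: the Slaney-style mel scale, the periodic Hann window, the unnormalized triangular filters, and the small epsilons are all assumptions. Compare against the upstream reference before relying on the numbers.

```python
import numpy as np

SAMPLE_RATE, N_FFT, HOP, N_MELS = 16_000, 400, 160, 128


def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SAMPLE_RATE):
    """Triangular filters on a Slaney-style mel scale (assumed)."""
    # Linear below 1 kHz, logarithmic above.
    f_sp, min_log_hz, logstep = 200.0 / 3, 1000.0, np.log(6.4) / 27.0
    min_log_mel = min_log_hz / f_sp
    max_mel = min_log_mel + np.log((sr / 2) / min_log_hz) / logstep
    mels = np.linspace(0.0, max_mel, n_mels + 2)
    edges = np.where(mels < min_log_mel, mels * f_sp,
                     min_log_hz * np.exp(logstep * (mels - min_log_mel)))
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (fft_freqs - lower) / (center - lower)
    falling = (upper - fft_freqs) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))  # [n_mels, n_fft//2 + 1]


def log_mel(audio, n_frames=1000):
    """audio: 1-D float waveform at 16 kHz. Returns float32 [1, 128, n_frames]."""
    audio = np.asarray(audio, dtype=np.float32)
    target = n_frames * HOP                                # 160 000 samples
    audio = np.pad(audio[:target], (0, max(0, target - len(audio))))
    padded = np.pad(audio, N_FFT // 2, mode="reflect")     # centered frames
    window = np.hanning(N_FFT + 1)[:-1]                    # periodic Hann
    starts = np.arange(n_frames) * HOP
    frames = padded[starts[:, None] + np.arange(N_FFT)] * window
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # [n_frames, 201]
    mel = mel_filterbank() @ power.T                       # [128, n_frames]
    log_spec = np.log(np.maximum(mel, 1e-10))
    # Per-utterance mean/std normalization, as listed above.
    log_spec = (log_spec - log_spec.mean()) / (log_spec.std() + 1e-5)
    return log_spec[None].astype(np.float32)
```

Because the features are mean/std normalized afterwards, the choice of logarithm base only matters through the clamping floor. Padding silence is included in the statistics here; whether the upstream extractor does the same is not stated in this card.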

	## Source

	Upstream: [Qwen/Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B)
	(Apache 2.0). Released January 2026 as part of the Qwen3 audio family.

	## Links

	- [speech-android](https://github.com/soniqo/speech-android) — Android SDK
	- [soniqo.audio](https://soniqo.audio) — website
	- [blog](https://soniqo.audio/blog) — blog

	## Ecosystem

	- [soniqo.audio](https://soniqo.audio) — use-case explorer (transcription, voice cloning, live ASR, voice agents).
	- [speech-core](https://github.com/soniqo/speech-core) — C++ orchestration library for voice agents. Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces.
	- [speech-swift](https://github.com/soniqo/speech-swift) — Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
	- [speech-android](https://github.com/soniqo/speech-android) — Android SDK consuming on-device LiteRT bundles.

	## Other LiteRT models in this collection

	ASR / Transcription

	- [Parakeet TDT 0.6B v3 — LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8)
	- [Nemotron Speech Streaming 0.6B — LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT)
	- [Omnilingual ASR CTC 300M — LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT)
	- [Omnilingual ASR CTC 300M — LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8)

	VAD / Diarization

	- [Silero VAD v5 — LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT)
	- [Pyannote Segmentation 3.0 — LiteRT](https://huggingface.co/soniqo/Pyannote-Segmentation-LiteRT)
	- [WeSpeaker ResNet34-LM — LiteRT](https://huggingface.co/soniqo/WeSpeaker-ResNet34-LM-LiteRT)

	TTS / Voice Cloning

	- [VoxCPM2 — LiteRT (INT8)](https://huggingface.co/soniqo/VoxCPM2-LiteRT-INT8)

	## License

	This bundle inherits the upstream model license (apache-2.0). See the
	linked `base_model` repository for the full terms.