Omnilingual ASR CTC 300M — LiteRT (INT8)

Meta's Omnilingual ASR: a Wav2Vec2 CTC speech recognizer covering 1600+ languages, exported with INT8 weight-only quantization.

Part of the soniqo.audio speech toolkit — an open, runtime-portable stack for speech AI. This bundle is the LiteRT export, designed to plug into the abstract interfaces in speech-core (C++ voice-agent orchestration library). Browse all LiteRT bundles in the soniqo LiteRT collection.

Use cases on soniqo.audio

Multilingual transcription

Meta's Omnilingual ASR (Wav2Vec2-CTC, 300 M parameters) exported to LiteRT for Android. Supports 1600+ languages and doubles as a forced-alignment model via standard CTC Viterbi decoding on the output logits.

Model

Property	Value
Architecture	Wav2Vec2 temporal CNN frontend + 24-layer Transformer + CTC head
Parameters	~300 M
Format	LiteRT (TFLite)
Quantization	INT8 dynamic weights (fp32 activations)
Sample rate	16 000 Hz
Input length	160 000 samples (10 s, fixed)
Frame rate	50 Hz (320× downsample)
Vocab size	10 288 (SentencePiece)
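
The numbers in the table are linked by simple arithmetic: 160 000 samples divided by the 320× downsample gives the 500 output frames, and each frame covers 320 / 16 000 s = 20 ms. A minimal Kotlin sketch of that arithmetic follows; the constant and function names are illustrative, and the non-overlapping windowing in `windowsNeeded` is an assumption about how a longer clip might be split, not something the bundle prescribes.

```kotlin
// Values copied from the model table; the names are hypothetical.
const val SAMPLE_RATE_HZ = 16_000
const val WINDOW_SAMPLES = 160_000
const val DOWNSAMPLE_FACTOR = 320

/** Output frames for one fixed window: 160000 / 320 = 500. */
fun framesPerWindow(): Int = WINDOW_SAMPLES / DOWNSAMPLE_FACTOR

/** Start time in seconds of an output frame (one frame = 20 ms). */
fun frameToSeconds(frame: Int): Double =
    frame.toDouble() * DOWNSAMPLE_FACTOR / SAMPLE_RATE_HZ

/**
 * Number of fixed 10 s windows needed to cover a clip.
 * Assumption: windows are laid back to back with no overlap,
 * and the last one is zero-padded.
 */
fun windowsNeeded(totalSamples: Int): Int =
    (totalSamples + WINDOW_SAMPLES - 1) / WINDOW_SAMPLES
```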

Files

File	Size	Description
`omnilingual-ctc-300m.tflite`	315.2 MB	Full model, INT8
`tokenizer.model`	89 KB	SentencePiece tokenizer
`config.json`	1 KB	Model + fbank specs

Signature

Inputs:
  audio        [1, 160000]  float32   z-score normalized 10 s @ 16 kHz

Outputs:
  logits       [1, 500, 10288]  float32   per-frame CTC logits (50 Hz)
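
The input contract above (fixed length, z-score normalized) can be met with a small preprocessing step. The sketch below is one plausible way to do it in Kotlin. `prepareAudio` is a hypothetical helper, and two details are assumptions that should be checked against `config.json` and the upstream preprocessing: the mean and variance are computed over the real samples only, and shorter clips are padded with zeros after normalization.

```kotlin
import kotlin.math.sqrt

/**
 * Truncates or zero-pads `samples` to `targetLen`, z-score normalizing the real samples.
 * Assumption: statistics come from the unpadded audio and the padding stays at 0.
 */
fun prepareAudio(
    samples: FloatArray,
    targetLen: Int = 160_000,
    eps: Double = 1e-7,            // guards against division by zero on silent input
): FloatArray {
    val n = minOf(samples.size, targetLen)
    val out = FloatArray(targetLen)   // zero-initialized, so the tail is already padded
    if (n == 0) return out

    var sum = 0.0
    for (i in 0 until n) sum += samples[i]
    val mean = sum / n

    var sq = 0.0
    for (i in 0 until n) {
        val d = samples[i] - mean
        sq += d * d
    }
    val std = sqrt(sq / n + eps)

    for (i in 0 until n) out[i] = ((samples[i] - mean) / std).toFloat()
    return out
}
```

The result is a flat `FloatArray`; wrap it as `arrayOf(prepared)` to match the `[1, 160000]` input shape.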

Capabilities

1. Greedy / beam ASR

Standard CTC decoding: argmax per frame, collapse repeats, remove blanks, decode with SentencePiece. Works across 1600+ languages.
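
The greedy steps above can be sketched in a few lines of Kotlin. `greedyCtcDecode` is an illustrative name, and the default `blankId = 0` is an assumption; the actual blank index should be taken from the tokenizer or `config.json`. The returned ids are what you would pass to a SentencePiece decoder.

```kotlin
/**
 * Greedy CTC decoding over per-frame logits of shape [T][vocab]:
 * argmax per frame, collapse consecutive repeats, drop blanks.
 * `blankId` defaults to 0 as an assumption; verify it for this model.
 */
fun greedyCtcDecode(logits: Array<FloatArray>, blankId: Int = 0): List<Int> {
    val ids = ArrayList<Int>()
    var prev = -1                       // argmax of the previous frame
    for (frame in logits) {
        var best = 0
        for (v in 1 until frame.size) {
            if (frame[v] > frame[best]) best = v
        }
        // Emit only when the label changes and is not the blank symbol.
        if (best != prev && best != blankId) ids.add(best)
        prev = best
    }
    return ids
}
```

Because repeats are collapsed before blanks are removed, a token that genuinely occurs twice in a row survives only if a blank frame separates the two occurrences, which is the standard CTC convention.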

2. Forced alignment

Given a target transcription, run the CTC Viterbi forced-alignment algorithm over the [T, vocab_size] log-posteriors (the log-softmax of the output logits) to recover per-token start/end frame positions. Convert frame indices to seconds with frame_rate = 50 Hz (1 frame = 20 ms).

Reference implementations you can port directly:

torchaudio.functional.forced_align
ctc-forced-aligner (MahmoudAshraf97)
The CTC alignment pseudocode in the Wav2Vec2 paper

The full DP is roughly 100 lines of straightforward code. Its cost grows as O(T × L) for T = 500 frames and L target tokens, so it is negligible next to model inference on mobile CPUs.
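
As a rough illustration of the forced-alignment DP, here is a minimal Kotlin sketch of Viterbi alignment over the blank-interleaved target sequence. It is not the bundle's reference implementation: `TokenSpan`, `logSumExp`, and this `ctcForcedAlign` signature are hypothetical, `blankId = 0` is an assumption, and the 0.02 s frame duration comes from the 50 Hz frame rate stated above.

```kotlin
import kotlin.math.exp
import kotlin.math.ln

/** One aligned token covering frames [startFrame, endFrame). */
data class TokenSpan(val tokenId: Int, val startFrame: Int, val endFrame: Int) {
    fun startSeconds(frameSeconds: Double = 0.02) = startFrame * frameSeconds
    fun endSeconds(frameSeconds: Double = 0.02) = endFrame * frameSeconds
}

/** Numerically stable log(sum(exp(x))), used to turn logits into log-probabilities. */
fun logSumExp(x: FloatArray): Double {
    var m = Double.NEGATIVE_INFINITY
    for (v in x) if (v > m) m = v.toDouble()
    var s = 0.0
    for (v in x) s += exp(v - m)
    return m + ln(s)
}

/** Viterbi CTC alignment of `targets` against logits of shape [T][vocab]. */
fun ctcForcedAlign(logits: Array<FloatArray>, targets: IntArray, blankId: Int = 0): List<TokenSpan> {
    val numFrames = logits.size
    if (targets.isEmpty()) return emptyList()
    val numStates = 2 * targets.size + 1
    // Extended sequence: blank, t1, blank, t2, ..., tL, blank
    val ext = IntArray(numStates) { s -> if (s % 2 == 0) blankId else targets[s / 2] }
    val negInf = Double.NEGATIVE_INFINITY
    var prev = DoubleArray(numStates) { negInf }
    val back = Array(numFrames) { IntArray(numStates) }  // best predecessor state per frame

    for (t in 0 until numFrames) {
        val lse = logSumExp(logits[t])
        val cur = DoubleArray(numStates) { negInf }
        for (s in 0 until numStates) {
            val emit = logits[t][ext[s]] - lse           // log-softmax of the needed label
            if (t == 0) {
                if (s <= 1) cur[s] = emit                // start in leading blank or first token
                continue
            }
            var best = prev[s]
            var from = s
            if (s >= 1 && prev[s - 1] > best) { best = prev[s - 1]; from = s - 1 }
            // Skipping the blank is allowed only between two different tokens.
            if (s >= 2 && ext[s] != blankId && ext[s] != ext[s - 2] && prev[s - 2] > best) {
                best = prev[s - 2]; from = s - 2
            }
            cur[s] = best + emit
            back[t][s] = from
        }
        prev = cur
    }

    // A valid path ends in the last token or the trailing blank.
    var state = if (prev[numStates - 1] >= prev[numStates - 2]) numStates - 1 else numStates - 2
    require(prev[state] > negInf) { "Too few frames to align ${targets.size} tokens" }

    val path = IntArray(numFrames)
    for (t in numFrames - 1 downTo 0) {
        path[t] = state
        if (t > 0) state = back[t][state]
    }

    // Merge runs of identical token states into spans; blank states are skipped.
    val spans = ArrayList<TokenSpan>()
    var t = 0
    while (t < numFrames) {
        val st = path[t]
        var end = t + 1
        while (end < numFrames && path[end] == st) end++
        if (st % 2 == 1) spans.add(TokenSpan(targets[st / 2], t, end))
        t = end
    }
    return spans
}
```

Only the target labels and the blank are read from each frame, so the per-frame work beyond the log-sum-exp over the vocabulary is proportional to the number of target tokens. For word-level timestamps, merge consecutive spans whose SentencePiece pieces belong to the same word.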

Usage

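import org.tensorflow.lite.Interpreter

// loadModelFile is an app-side helper that memory-maps the .tflite asset
val model = Interpreter(loadModelFile("omnilingual-ctc-300m.tflite"))

// 10 s of z-score normalized audio at 16 kHz
val audio = FloatArray(160_000)
val logits = Array(1) { Array(500) { FloatArray(10_288) } }
// Wrap the samples in a batch dimension to match the [1, 160000] input
model.run(arrayOf(audio), logits)

// Greedy CTC: per-frame argmax; still collapse repeats, drop blanks, then decode with SentencePiece
val tokens = logits[0].map { frame -> frame.indices.maxByOrNull { frame[it] }!! }

// Or: CTC forced alignment for timestamps (ctcForcedAlign and tokenize are app-side helpers)
val alignment = ctcForcedAlign(logits[0], tokenize("hello world"))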

Source

Upstream: Meta Omnilingual ASR project (CC BY-NC 4.0). Paper: Omnilingual ASR Technical Report (2026).

Ecosystem

soniqo.audio — use-case explorer (transcription, voice cloning, live ASR, voice agents).
speech-core — C++ orchestration library for voice agents. Abstract STTInterface / TTSInterface / VADInterface / EnhancerInterface; LiteRT implementations plug straight into the interfaces.
speech-swift — Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
speech-android — Android SDK consuming on-device LiteRT bundles.

Other LiteRT models in this collection

ASR / Transcription

VAD / Diarization

TTS / Voice Cloning

VoxCPM2 — LiteRT (INT8)

License

This bundle inherits the upstream model license (cc-by-nc-4.0). See the linked base_model repository for the full terms.
