Omnilingual ASR CTC 300M - LiteRT

Meta's 1600-language wav2vec2 CTC ASR. INT8 dynamic-weight bundle with FP32 activations.

Part of the soniqo.audio speech toolkit, an open, runtime-portable stack for speech AI. This bundle is the LiteRT export, designed to plug into the abstract interfaces in speech-core (C++ voice-agent orchestration library). Browse all LiteRT bundles in the soniqo LiteRT collection.

Use cases on soniqo.audio

Meta's Omnilingual ASR (Wav2Vec2-CTC, 300 M parameters) exported to LiteRT for Android. Supports 1600+ languages and doubles as a forced-alignment model via standard CTC Viterbi decoding on the output logits.

Model

Property       Value
Architecture   Wav2Vec2 temporal CNN frontend + 24-layer Transformer + CTC head
Parameters     ~300 M
Format         LiteRT (TFLite)
Quantization   INT8 dynamic weights (fp32 activations)
Sample rate    16 000 Hz
Input length   160 000 samples (10 s, fixed)
Frame rate     50 Hz (320× downsample)
Vocab size     10 288 (SentencePiece)

Files

File                          Size       Description
omnilingual-ctc-300m.tflite   315.2 MB   Full model, INT8
tokenizer.model               89 KB      SentencePiece tokenizer
config.json                   1 KB       Model + fbank specs

Signature

Inputs:
  audio        [1, 160000]  float32   z-score normalized 10 s @ 16 kHz

Outputs:
  logits       [1, 500, 10288]  float32   per-frame CTC logits (50 Hz)
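
The input contract above (z-score normalized, exactly 160 000 samples) can be met with a small preprocessing step. The following is a minimal sketch, not the bundle's reference preprocessing: `prepareAudio` is a hypothetical helper, and it assumes per-clip normalization over the real samples only, zero padding applied after normalization, truncation of clips longer than 10 s, and a small epsilon (1e-7) for numerical stability. None of these details are stated in this card, so verify them against config.json or the upstream project.

```kotlin
import kotlin.math.sqrt

/**
 * Hypothetical helper: z-score normalizes a mono 16 kHz clip and fits it
 * to the fixed model input length.
 * Assumptions (not confirmed by this card): statistics are computed over the
 * real samples only, padding is zeros added after normalization, and clips
 * longer than targetLength are truncated.
 */
fun prepareAudio(samples: FloatArray, targetLength: Int = 160_000): FloatArray {
    val n = minOf(samples.size, targetLength)
    val out = FloatArray(targetLength) // zero-initialized, so padding is already in place
    if (n == 0) return out

    // Mean over the real (unpadded) samples
    var sum = 0.0
    for (i in 0 until n) sum += samples[i]
    val mean = sum / n

    // Population variance over the same samples
    var sq = 0.0
    for (i in 0 until n) {
        val d = samples[i] - mean
        sq += d * d
    }
    val std = sqrt(sq / n + 1e-7) // epsilon guards against silent clips

    for (i in 0 until n) out[i] = ((samples[i] - mean) / std).toFloat()
    return out
}
```

The result can be wrapped as `arrayOf(prepareAudio(pcm))` to obtain the `[1, 160000]` batch shape listed in the signature.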

Capabilities

1. Greedy / beam ASR

Standard CTC decoding: argmax per frame, collapse repeats, remove blanks, decode with SentencePiece. Works across 1600+ languages.
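
The greedy steps above can be sketched as a single pass over the frames. This is an illustrative sketch: `ctcGreedyDecode` is a hypothetical name, and the default `blankId = 0` is an assumption, since this card does not state the blank index (check config.json or the tokenizer). The returned IDs still need to be decoded to text with the SentencePiece tokenizer.

```kotlin
/**
 * Greedy CTC decoding over [T, vocab] logits.
 * blankId = 0 is an assumption; confirm the blank index for this model.
 */
fun ctcGreedyDecode(logits: Array<FloatArray>, blankId: Int = 0): List<Int> {
    val ids = ArrayList<Int>()
    var previous = -1 // no previous frame yet
    for (frame in logits) {
        // Argmax over the vocabulary for this frame
        var best = 0
        for (v in 1 until frame.size) {
            if (frame[v] > frame[best]) best = v
        }
        // Collapse repeats first, then drop blanks
        if (best != previous && best != blankId) ids.add(best)
        previous = best
    }
    return ids
}
```

Because repeats are collapsed before blanks are removed, a blank between two identical tokens keeps both of them, which is how CTC represents doubled tokens.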

2. Forced alignment

Given a target transcription, run the CTC Viterbi forced-alignment algorithm over the [T, vocab_size] posteriors to recover per-token start/end frame positions. Convert frame indices to seconds with frame_rate = 50 Hz (1 frame = 20 ms).

Reference implementations you can port directly:

  • torchaudio.functional.forced_align
  • ctc-forced-aligner (MahmoudAshraf97)
  • The CTC alignment pseudocode in the Wav2Vec2 paper

The full DP is ~100 lines of straightforward code. Its cost is O(T × S) for T frames and S = 2L + 1 states (L target tokens), so on a 500-frame window it is negligible next to the model's forward pass.
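
As a rough illustration of that DP, here is a compact Viterbi sketch over the blank-interleaved target sequence. It is not a port of any of the references listed above: `TokenSpan`, `logSoftmax`, and `ctcForcedAlign` are hypothetical names, `blankId = 0` is an assumption, and the 50.0 divisor is the frame rate from the Model table. It returns one frame span per target token and keeps the full backpointer table in memory for simplicity.

```kotlin
import kotlin.math.exp
import kotlin.math.ln

/** Frames [startFrame, endFrame) assigned to one target token; 50 frames per second. */
data class TokenSpan(val tokenId: Int, val startFrame: Int, val endFrame: Int) {
    val startSec: Double get() = startFrame / 50.0
    val endSec: Double get() = endFrame / 50.0
}

/** Numerically stable log-softmax over one frame of logits. */
fun logSoftmax(frame: FloatArray): FloatArray {
    val max = frame.maxOrNull() ?: 0f
    var sum = 0.0
    for (v in frame) sum += exp((v - max).toDouble())
    val logZ = max + ln(sum).toFloat()
    return FloatArray(frame.size) { frame[it] - logZ }
}

/** CTC Viterbi alignment of target token IDs to [T, vocab] logits (blankId is an assumption). */
fun ctcForcedAlign(logits: Array<FloatArray>, target: IntArray, blankId: Int = 0): List<TokenSpan> {
    require(target.isNotEmpty()) { "target must not be empty" }
    val numFrames = logits.size
    val numStates = 2 * target.size + 1
    // Extended sequence: blank, y1, blank, y2, ..., yL, blank
    val ext = IntArray(numStates) { s -> if (s % 2 == 0) blankId else target[s / 2] }
    val negInf = Double.NEGATIVE_INFINITY
    var prev = DoubleArray(numStates) { negInf }
    val back = Array(numFrames) { IntArray(numStates) }

    for (t in 0 until numFrames) {
        val lp = logSoftmax(logits[t])
        val cur = DoubleArray(numStates) { negInf }
        for (s in 0 until numStates) {
            var best: Double
            var from: Int
            if (t == 0) {
                if (s > 1) continue // a path may only start in the first blank or first token
                best = 0.0
                from = s
            } else {
                best = prev[s] // stay in the same state
                from = s
                if (s >= 1 && prev[s - 1] > best) { best = prev[s - 1]; from = s - 1 }
                // Skipping a blank is allowed only between two different tokens
                if (s >= 2 && ext[s] != blankId && ext[s] != ext[s - 2] && prev[s - 2] > best) {
                    best = prev[s - 2]; from = s - 2
                }
                if (best == negInf) continue // state not reachable yet
            }
            cur[s] = best + lp[ext[s]]
            back[t][s] = from
        }
        prev = cur
    }

    // A valid path ends in the final blank or the final token
    var state = if (prev[numStates - 2] > prev[numStates - 1]) numStates - 2 else numStates - 1
    check(prev[state] > negInf) { "audio is too short for this target" }

    // Backtrack, recording the first and last frame of each token state
    val first = IntArray(target.size) { -1 }
    val last = IntArray(target.size) { -1 }
    for (t in numFrames - 1 downTo 0) {
        if (state % 2 == 1) {
            val i = state / 2
            if (last[i] < 0) last[i] = t
            first[i] = t
        }
        state = back[t][state]
    }
    return target.indices.map { TokenSpan(target[it], first[it], last[it] + 1) }
}
```

For a span covering frames 25 to 40, `startSec` and `endSec` would be 0.5 s and 0.8 s. Word-level timestamps follow by merging the spans of the tokens that make up each word.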

Usage

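import org.tensorflow.lite.Interpreter

// loadModelFile is an app-side helper (not shown) that returns the model as a File or ByteBuffer
val model = Interpreter(loadModelFile("omnilingual-ctc-300m.tflite"))

// 10 s of z-score normalized audio at 16 kHz, wrapped to match the [1, 160000] input shape
val audio = FloatArray(160_000)
val input = arrayOf(audio)
val logits = Array(1) { Array(500) { FloatArray(10_288) } }
model.run(input, logits)

// Greedy CTC: per-frame argmax, then collapse repeats and drop blanks
val blankId = 0 // assumption: confirm the blank index against config.json
val best = logits[0].map { frame -> frame.indices.maxByOrNull { frame[it] }!! }
val tokens = best.filterIndexed { t, id -> id != blankId && (t == 0 || id != best[t - 1]) }

// Or: CTC forced alignment for timestamps
// (ctcForcedAlign and tokenize are not part of LiteRT; supply your own)
val alignment = ctcForcedAlign(logits[0], tokenize("hello world"))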

Source

Upstream: Meta Omnilingual ASR project (CC BY-NC 4.0). Paper: Omnilingual ASR Technical Report (2026).

Links

Ecosystem

  • soniqo.audio: use-case explorer (transcription, voice cloning, live ASR, voice agents).
  • speech-core: C++ orchestration library for voice agents. Abstract STTInterface / TTSInterface / VADInterface / EnhancerInterface; LiteRT implementations plug straight into the interfaces.
  • speech-swift: Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
  • speech-android: Android SDK consuming on-device LiteRT bundles.

License

This bundle inherits the upstream model license (cc-by-nc-4.0). See the linked base_model repository for the full terms.
