---
license: cc-by-4.0
language:
- multilingual
tags:
- speaker-embedding
- speaker-recognition
- diarization
- litert
- tflite
- on-device
- soniqo
- speech-cloud
- speech-core
base_model: pyannote/wespeaker-voxceleb-resnet34-LM
library_name: litert
pipeline_tag: audio-classification
---
# WeSpeaker ResNet34-LM - LiteRT
Speaker embedding for speaker identification and diarization clustering.
> Part of the [**soniqo.audio**](https://soniqo.audio) speech toolkit,
> an open, runtime-portable stack for speech AI. This bundle is the
> **LiteRT** export, designed to plug into the abstract interfaces in
> [`speech-core`](https://github.com/soniqo/speech-core) (C++ voice-agent
> orchestration library). Browse all LiteRT bundles in the
> [**soniqo LiteRT collection**](https://huggingface.co/collections/soniqo/litert-6a08268e11d5a47d7aacc02b).
## Use cases on soniqo.audio
- [Meeting transcription](https://soniqo.audio/transcription/)

256-dim speaker embedding network for Android, ported from
`pyannote/wespeaker-voxceleb-resnet34-LM`.
## Model
| Property | Value |
|---|---|
| Architecture | ResNet34 + stats pooling + linear projection |
| Parameters | ~6.6 M |
| Format | LiteRT (TFLite) |
| Quantization | float32 |
| Sample rate | 16 000 Hz |
| Input | 80-bin kaldi-style mel fbank features (T frames) |
| Output | L2-normalized 256-dim embedding |
## Files
| File | Size | Description |
|---|---|---|
| `wespeaker-resnet34.tflite` | 25.4 MB | Full model, FP32 |
| `config.json` | 1 KB | Fbank spec + I/O signature |
## Why fbank-as-input
pyannote's kaldi fbank implementation uses `torch.hamming_window` and
`aten._fft_r2c`, neither of which has a lowering in litert-torch. We
export only the ResNet34 portion; the caller computes the 80-bin fbank
features on-device. This matches the standard mobile speaker-embedding
pattern and keeps the tflite graph free of FFT ops.
### Fbank parameters
| Parameter | Value |
|---|---|
| `num_mel_bins` | 80 |
| `frame_length` | 25 ms |
| `frame_shift` | 10 ms |
| `window_type` | hamming |
| `dither` | 0.0 |
| `use_energy` | false |
The reference implementation is `torchaudio.compliance.kaldi.fbank` with
the arguments above. The model internally centers the features over the
time axis (`features - mean(features, dim=1)`), so the caller may pass raw
(uncentered) fbank output.
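To illustrate what that centering step computes, here is a minimal pure-Python sketch. It assumes `dim=1` is the time axis of the `[1, T, 80]` input, and the helper name `center_over_time` is hypothetical; the exported graph already does this, so the caller does not need to run anything like it.

```python
def center_over_time(features):
    """Subtract each mel bin's mean over the T frames from a [T][num_bins] list."""
    num_frames = len(features)
    num_bins = len(features[0])
    means = [sum(frame[b] for frame in features) / num_frames for b in range(num_bins)]
    return [[frame[b] - means[b] for b in range(num_bins)] for frame in features]


# Toy input: 3 frames x 2 bins instead of T x 80.
raw = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
shifted = [[v + 5.0 for v in frame] for frame in raw]  # same features plus a constant offset

print(center_over_time(raw))      # [[-1.0, -10.0], [0.0, 0.0], [1.0, 10.0]]
print(center_over_time(shifted))  # same result: the constant offset cancels
```

Because the mean is removed per bin, any offset that is constant across frames drops out before the ResNet34 layers see the features.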
## Signature
```
Inputs:
  fbank      [1, T, 80]  float32  Kaldi mel fbank, T=298 for 3 s @ 16 kHz
Outputs:
  embedding  [1, 256]    float32  L2-normalized speaker embedding
```
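The value `T=298` follows from the frame length and shift listed above. The sketch below works through the arithmetic, assuming only complete windows are kept (the `snip_edges=True` default in `torchaudio.compliance.kaldi.fbank`, which this card does not state explicitly). The function name `num_fbank_frames` is hypothetical.

```python
def num_fbank_frames(num_samples: int,
                     sample_rate: int = 16_000,
                     frame_length_ms: float = 25.0,
                     frame_shift_ms: float = 10.0) -> int:
    """Frames produced when only complete windows are kept (assumed snip_edges=True)."""
    window = int(sample_rate * frame_length_ms / 1000)  # 400 samples at 16 kHz
    shift = int(sample_rate * frame_shift_ms / 1000)    # 160 samples at 16 kHz
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


three_seconds = 3 * 16_000
print(num_fbank_frames(three_seconds))  # 298
```

Clips of other durations give a different `T`, which can be estimated the same way before sizing the input buffer.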
## Parity
Verified `max diff = 4.2e-07` vs the upstream pyannote model's full forward
on a random 3-second waveform (with kaldi fbank features computed
externally).
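A parity figure of this kind can be reproduced with a simple element-wise comparison. The sketch below assumes "max diff" means the largest absolute difference between the two embeddings; the helper name, the toy values, and the `1e-5` tolerance are illustrative and not taken from the export pipeline.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(reference) != len(candidate):
        raise ValueError("embeddings must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))


# Toy values standing in for the upstream and LiteRT 256-dim embeddings.
upstream = [0.10, -0.20, 0.30]
litert = [0.10, -0.20, 0.30 + 4e-7]

diff = max_abs_diff(upstream, litert)
print(diff < 1e-5)  # True; the 1e-5 tolerance is an illustrative choice
```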
## Usage
```kotlin
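import org.tensorflow.lite.Interpreter

// kaldiFbank and loadModelFile stand in for your own helpers.
// kaldiFbank is expected to return T frames of 80 raw fbank values each.
val frames: Array<FloatArray> = kaldiFbank(audio, melBins = 80, frameLengthMs = 25, frameShiftMs = 10)
val fbank = arrayOf(frames)                // input shape [1, T, 80]
val model = Interpreter(loadModelFile("wespeaker-resnet34.tflite"))
val output = Array(1) { FloatArray(256) }  // output shape [1, 256]
model.run(fbank, output)
val embedding = output[0]                  // L2-normalized 256-dim vector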
```
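Once two embeddings are available, speaker identification usually comes down to comparing them. The sketch below shows one common approach, cosine similarity, which reduces to a dot product for L2-normalized vectors such as this model's output. The function names and the `0.5` threshold are hypothetical; a usable threshold has to be tuned on your own data.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity; for unit-length vectors this equals the plain dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(a, b, threshold=0.5):
    """threshold is a hypothetical value; tune it on your own data."""
    return cosine_similarity(a, b) >= threshold


# Toy 2-dim unit vectors standing in for 256-dim embeddings.
enrolled = [1.0, 0.0]
probe_close = [0.8, 0.6]
probe_far = [0.0, 1.0]

print(same_speaker(enrolled, probe_close))  # True  (similarity about 0.8)
print(same_speaker(enrolled, probe_far))    # False (similarity 0.0)
```

The same similarity can serve as the affinity measure when clustering embeddings for diarization.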
## Source
Upstream: [pyannote/wespeaker-voxceleb-resnet34-LM](https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM)
(CC BY 4.0, gated; accept the license on the upstream page).
## Links
- [speech-android](https://github.com/soniqo/speech-android): Android SDK
- [soniqo.audio](https://soniqo.audio): website
- [blog](https://soniqo.audio/blog)
## Ecosystem
- [**soniqo.audio**](https://soniqo.audio): use-case explorer (transcription, voice cloning, live ASR, voice agents).
- [**speech-core**](https://github.com/soniqo/speech-core): C++ orchestration library for voice agents. Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces.
- [**speech-swift**](https://github.com/soniqo/speech-swift): Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
- [**speech-android**](https://github.com/soniqo/speech-android): Android SDK consuming on-device LiteRT bundles.
## Other LiteRT models in this collection
**ASR / Transcription**
- [Parakeet TDT 0.6B v3 - LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8)
- [Nemotron Speech Streaming 0.6B - LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT)
- [Omnilingual ASR CTC 300M - LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT)
- [Omnilingual ASR CTC 300M - LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8)
- [Qwen3 ASR 0.6B Encoder - LiteRT (INT8)](https://huggingface.co/soniqo/Qwen3-ASR-0.6B-Encoder-LiteRT-INT8)
**VAD / Diarization**
- [Silero VAD v5 - LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT)
- [Pyannote Segmentation 3.0 - LiteRT](https://huggingface.co/soniqo/Pyannote-Segmentation-LiteRT)
**TTS / Voice Cloning**
- [VoxCPM2 - LiteRT (INT8)](https://huggingface.co/soniqo/VoxCPM2-LiteRT-INT8)
## License
This bundle inherits the upstream model license (**cc-by-4.0**). See the
linked `base_model` repository for the full terms.