---
license: cc-by-4.0
language:
- multilingual
tags:
- speaker-embedding
- speaker-recognition
- diarization
- litert
- tflite
- on-device
- soniqo
- speech-cloud
- speech-core
base_model: pyannote/wespeaker-voxceleb-resnet34-LM
library_name: litert
pipeline_tag: audio-classification
---

# WeSpeaker ResNet34-LM — LiteRT

Speaker embedding for speaker identification and diarization clustering.

> Part of the [**soniqo.audio**](https://soniqo.audio) speech toolkit,
> an open, runtime-portable stack for speech AI. This bundle is the
> **LiteRT** export, designed to plug into the abstract interfaces in
> [`speech-core`](https://github.com/soniqo/speech-core) (C++ voice-agent
> orchestration library). Browse all LiteRT bundles in the
> [**soniqo LiteRT collection**](https://huggingface.co/collections/soniqo/litert-6a08268e11d5a47d7aacc02b).

## Use cases on soniqo.audio

- [Meeting transcription](https://soniqo.audio/transcription/)

256-dim speaker embedding network for Android, ported from `pyannote/wespeaker-voxceleb-resnet34-LM`.

## Model

| Property | Value |
|---|---|
| Architecture | ResNet34 + stats pooling + linear projection |
| Parameters | ~6.6 M |
| Format | LiteRT (TFLite) |
| Quantization | float32 |
| Sample rate | 16 000 Hz |
| Input | 80-bin kaldi-style mel fbank features (T frames) |
| Output | L2-normalized 256-dim embedding |

## Files

| File | Size | Description |
|---|---|---|
| `wespeaker-resnet34.tflite` | 25.4 MB | Full model, FP32 |
| `config.json` | 1 KB | Fbank spec + I/O signature |

## Why fbank-as-input

pyannote's kaldi fbank implementation uses `torch.hamming_window` and `aten._fft_r2c`, neither of which has a lowering in litert-torch. We therefore export only the ResNet34 portion, and the caller computes the 80-bin fbank features on-device. This matches the standard mobile speaker-embedding pattern and keeps the tflite graph free of FFT ops.
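Since the caller owns the feature extraction, it helps to see the framing step in isolation. The following is a minimal sketch, not the bundle's own code: it assumes the 25 ms / 10 ms frame settings from the Fbank parameters table, Kaldi's `snip_edges = true` behaviour (the default in `torchaudio.compliance.kaldi.fbank`), and a symmetric Hamming window. The names `numFrames` and `hammingWindow` are hypothetical helpers. It covers only frame counting and windowing, not the FFT or mel filterbank.

```kotlin
import kotlin.math.PI
import kotlin.math.cos

// Assumed framing constants at 16 kHz (25 ms window, 10 ms shift).
const val SAMPLE_RATE = 16_000
const val FRAME_LENGTH = SAMPLE_RATE * 25 / 1000  // 400 samples
const val FRAME_SHIFT = SAMPLE_RATE * 10 / 1000   // 160 samples

// Number of whole frames when the signal is not padded at the edges
// (assumption: snip_edges = true).
fun numFrames(numSamples: Int): Int =
    if (numSamples < FRAME_LENGTH) 0 else 1 + (numSamples - FRAME_LENGTH) / FRAME_SHIFT

// Symmetric Hamming window: 0.54 - 0.46 * cos(2*pi*n / (N - 1)).
fun hammingWindow(size: Int = FRAME_LENGTH): FloatArray =
    FloatArray(size) { n -> (0.54 - 0.46 * cos(2.0 * PI * n / (size - 1))).toFloat() }

fun main() {
    // 3 s of audio at 16 kHz is 48 000 samples.
    println(numFrames(3 * SAMPLE_RATE)) // prints 298
    println(hammingWindow().size)       // prints 400
}
```

Under these assumptions a 3-second clip yields 298 frames, which is the `T` the model signature quotes for 3 s at 16 kHz.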
### Fbank parameters

| Parameter | Value |
|---|---|
| `num_mel_bins` | 80 |
| `frame_length` | 25 ms |
| `frame_shift` | 10 ms |
| `window_type` | hamming |
| `dither` | 0.0 |
| `use_energy` | false |

The reference implementation is `torchaudio.compliance.kaldi.fbank` with those arguments. The model internally applies `features - mean(features, dim=1)` centering, so the caller may pass raw (uncentered) fbank output.

## Signature

```
Inputs:
  fbank      [1, T, 80]  float32   Kaldi mel fbank, T=298 for 3 s @ 16 kHz
Outputs:
  embedding  [1, 256]    float32   L2-normalized speaker embedding
```

## Parity

Verified: `max diff = 4.2e-07` against the upstream pyannote model's full forward pass on a random 3-second waveform (with kaldi fbank features computed externally).

## Usage

```kotlin
import org.tensorflow.lite.Interpreter

// Compute 80-bin kaldi fbank features on-device with your preferred library.
// kaldiFbank and loadModelFile are placeholders for your own helpers;
// fbank is assumed to be an Array<FloatArray> of shape [T, 80].
val fbank = kaldiFbank(audio, melBins = 80, frameLengthMs = 25, frameShiftMs = 10)

val model = Interpreter(loadModelFile("wespeaker-resnet34.tflite"))
val output = Array(1) { FloatArray(256) }  // matches the [1, 256] output tensor
model.run(arrayOf(fbank), output)          // adds the batch dimension: [1, T, 80]
val embedding = output[0]
```

## Source

Upstream: [pyannote/wespeaker-voxceleb-resnet34-LM](https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM) (CC BY 4.0, gated; accept the license on the upstream page).

## Links

- [speech-android](https://github.com/soniqo/speech-android): Android SDK
- [soniqo.audio](https://soniqo.audio): website
- [blog](https://soniqo.audio/blog): blog

## Ecosystem

- [**soniqo.audio**](https://soniqo.audio): use-case explorer (transcription, voice cloning, live ASR, voice agents).
- [**speech-core**](https://github.com/soniqo/speech-core): C++ orchestration library for voice agents. Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces.
- [**speech-swift**](https://github.com/soniqo/speech-swift): Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
- [**speech-android**](https://github.com/soniqo/speech-android): Android SDK consuming on-device LiteRT bundles.

## Other LiteRT models in this collection

**ASR / Transcription**

- [Parakeet TDT 0.6B v3 — LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8)
- [Nemotron Speech Streaming 0.6B — LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT)
- [Omnilingual ASR CTC 300M — LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT)
- [Omnilingual ASR CTC 300M — LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8)
- [Qwen3 ASR 0.6B Encoder — LiteRT (INT8)](https://huggingface.co/soniqo/Qwen3-ASR-0.6B-Encoder-LiteRT-INT8)

**VAD / Diarization**

- [Silero VAD v5 — LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT)
- [Pyannote Segmentation 3.0 — LiteRT](https://huggingface.co/soniqo/Pyannote-Segmentation-LiteRT)

**TTS / Voice Cloning**

- [VoxCPM2 — LiteRT (INT8)](https://huggingface.co/soniqo/VoxCPM2-LiteRT-INT8)

## License

This bundle inherits the upstream model license (**cc-by-4.0**). See the linked `base_model` repository for the full terms.