---
license: other
license_name: nvidia-open-model-license
license_link: >-
https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
language:
- en
metrics:
- wer
library_name: nemo
tags:
- speech-recognition
- FastConformer
- end-of-utterance
- voice agent
pipeline_tag: automatic-speech-recognition
base_model:
- nvidia/parakeet_realtime_eou_120m-v1
base_model_relation: finetune
---
# Parakeet Realtime EOU 120M - CoreML
CoreML conversion of [nvidia/parakeet-realtime-eou-120m-v1](https://huggingface.co/nvidia/parakeet-realtime-eou-120m-v1) for streaming speech recognition with end-of-utterance (EOU) detection on Apple Silicon.
Used by [FluidAudio](https://github.com/FluidInference/FluidAudio) for real-time transcription.
## Models
The RNNT pipeline is split into three CoreML models, exported at two chunk sizes:
| Model | Description |
|-------|-------------|
| `streaming_encoder.mlmodelc` | FastConformer encoder with loopback state caching |
| `decoder.mlmodelc` | 1-layer LSTM decoder (640 hidden units) |
| `joint_decision.mlmodelc` | Joint network for token prediction + EOU detection |
### Chunk Size Variants
| Variant | Latency | WER (test-clean) | RTFx (M2) |
|---------|---------|-------------------|------------|
| `160ms/` | 160ms | 8.29% | 4.78x |
| `320ms/` | 320ms | 4.87% | 12.48x |
Benchmarked on LibriSpeech test-clean (2620 files, 5.40h audio) on Apple M2.
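
To make the RTFx column concrete, here is a small sketch that turns the figures above into wall-clock processing time. It assumes the usual definition, RTFx = audio duration / processing time, which this card does not spell out; `processingSeconds` is a hypothetical helper, not part of FluidAudio.

```swift
import Foundation

/// Wall-clock seconds needed to process `audioSeconds` of audio at a given RTFx,
/// assuming RTFx = audio duration / processing time.
func processingSeconds(audioSeconds: Double, rtfx: Double) -> Double {
    precondition(rtfx > 0, "RTFx must be positive")
    return audioSeconds / rtfx
}

let testCleanSeconds = 5.40 * 3600.0  // 5.40 h of LibriSpeech test-clean audio

for (variant, rtfx) in [("160ms", 4.78), ("320ms", 12.48)] {
    let seconds = processingSeconds(audioSeconds: testCleanSeconds, rtfx: rtfx)
    print("\(variant): about \(Int((seconds / 60).rounded())) min to process the test set")
}
```

Under that assumption the 160ms variant needs roughly 68 minutes for the full test set and the 320ms variant roughly 26 minutes.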
## Usage with FluidAudio
```swift
import FluidAudio
let manager = StreamingEouAsrManager()
await manager.initialize()
// Transcribe with EOU detection
await manager.startStreaming(
    eouCallback: { transcript in
        print("Utterance complete: \(transcript)")
    },
    partialCallback: { partial in
        print("Partial: \(partial)")
    }
)
// Feed audio chunks as they arrive
await manager.feedAudio(samples)
```
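
The `samples` value above has to come from somewhere. As a minimal sketch, the helper below slices a buffer into chunks that match the 160 ms or 320 ms variants. The 16 kHz sample rate is an assumption (it is not stated in this card), `chunkSamples` is a hypothetical helper rather than a FluidAudio API, and whether `feedAudio` needs chunk-aligned input is not documented here.

```swift
/// Splits a sample buffer into fixed-duration chunks.
/// The 16 kHz default is an assumption, not taken from this model card.
func chunkSamples(_ samples: [Float], chunkMilliseconds: Int, sampleRate: Int = 16_000) -> [[Float]] {
    let chunkLength = sampleRate * chunkMilliseconds / 1000
    precondition(chunkLength > 0, "chunk must contain at least one sample")
    return stride(from: 0, to: samples.count, by: chunkLength).map { start in
        // The final chunk may be shorter than chunkLength.
        Array(samples[start..<min(start + chunkLength, samples.count)])
    }
}

let oneSecond = [Float](repeating: 0, count: 16_000)  // 1 s of silence at the assumed rate
let chunks = chunkSamples(oneSecond, chunkMilliseconds: 320)
print(chunks.count, chunks[0].count, chunks.last!.count)  // prints "4 5120 640"
```

Each chunk could then be passed to `manager.feedAudio(_:)` in arrival order.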
## CLI
```bash
# Transcribe a file
swift run fluidaudio parakeet-eou --input audio.wav
# Benchmark
swift run -c release fluidaudio parakeet-eou --benchmark --chunk-size 320
```
## Architecture
120M-parameter RNNT (Recurrent Neural Network Transducer) with:
- Encoder: 17-layer FastConformer with cache-aware streaming
- Decoder: 1-layer LSTM, 640 hidden size
- Joint: Linear projection with 1027 output classes (1024 tokens + EOU token + SOS + blank)
- EOU token: ID 1024 signals end-of-utterance
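
The role of the EOU token can be sketched as follows: a decoder loop collects token IDs and closes an utterance whenever ID 1024 appears. Only the EOU ID comes from the list above. The blank ID is passed in by the caller because this card does not state the blank or SOS IDs (the `1026` in the example is purely illustrative), and `splitUtterances` is a hypothetical helper, not FluidAudio's decoder.

```swift
let eouTokenId = 1024  // end-of-utterance ID, from the architecture list

/// Groups decoded token IDs into utterances, closing one each time the EOU token appears.
/// `blankId` is a caller-supplied assumption; this card does not state the blank or SOS IDs.
func splitUtterances(_ tokenIds: [Int], blankId: Int) -> (complete: [[Int]], pending: [Int]) {
    var complete: [[Int]] = []
    var current: [Int] = []
    for id in tokenIds {
        if id == blankId { continue }  // blank emits nothing
        if id == eouTokenId {
            if !current.isEmpty { complete.append(current) }
            current = []
        } else {
            current.append(id)
        }
    }
    return (complete: complete, pending: current)
}

// 1026 stands in for the blank ID here (illustrative only).
let result = splitUtterances([5, 9, 1026, 1024, 7, 1026, 3], blankId: 1026)
print(result.complete, result.pending)  // prints "[[5, 9]] [7, 3]"
```

Tokens after the last EOU stay in `pending`, which corresponds to the partial transcript of an utterance still in progress.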
### Streaming State
The encoder maintains loopback state between chunks:

| State | Shape | Description |
|-------|-------|-------------|
| `preCache` | [1, 128, N] | Mel-level context |
| `cacheLastChannel` | [17, 1, 70, 512] | Conformer layer cache |
| `cacheLastTime` | [17, 1, 512, 8] | Temporal cache |
| `cacheLastChannelLen` | [1] | Cache length tracking |
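
As a rough sizing sketch for the state that must be carried between chunks, the snippet below multiplies out the fixed shapes from the table. `preCache` is left out because `N` is not given here, and the Float32 storage type (4 bytes per element) is an assumption for illustration; the exported models may use a different precision.

```swift
/// Element count of a tensor with the given shape.
func elementCount(_ shape: [Int]) -> Int {
    shape.reduce(1, *)
}

// Fixed-shape states from the table; preCache is omitted because N is not given here.
let stateShapes: [(name: String, shape: [Int])] = [
    ("cacheLastChannel", [17, 1, 70, 512]),
    ("cacheLastTime", [17, 1, 512, 8]),
    ("cacheLastChannelLen", [1]),
]

for state in stateShapes {
    let count = elementCount(state.shape)
    // Float32 storage (4 bytes per element) is an assumption for illustration.
    print("\(state.name): \(count) elements, \(count * 4) bytes as Float32")
}
```

The leading dimension of 17 in both caches matches the 17 encoder layers, so each layer keeps its own slice of cached context.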
## Export
Converted from PyTorch using coremltools. To re-export:
```bash
python3 Scripts/ParakeetEOU/Conversion/convert_split_encoder.py \
  --output-dir Models/ParakeetEOU \
  --model-id nvidia/parakeet-realtime-eou-120m-v1
```
## License
Released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).

Original model: https://huggingface.co/nvidia/parakeet-realtime-eou-120m-v1