---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
language:
- en
metrics:
- wer
library_name: nemo
tags:
- speech-recognition
- FastConformer
- end-of-utterance
- voice agent
pipeline_tag: automatic-speech-recognition
base_model:
- nvidia/parakeet_realtime_eou_120m-v1
base_model_relation: finetune
---


# Parakeet Realtime EOU 120M - CoreML

CoreML conversion of [nvidia/parakeet-realtime-eou-120m-v1](https://huggingface.co/nvidia/parakeet-realtime-eou-120m-v1) for streaming speech recognition with end-of-utterance detection on Apple Silicon.

Used by [FluidAudio](https://github.com/FluidInference/FluidAudio) for real-time transcription.

## Models

The RNNT pipeline is split into three CoreML models, exported at two chunk sizes:

| Model | Description |
|-------|-------------|
| `streaming_encoder.mlmodelc` | FastConformer encoder with loopback state caching |
| `decoder.mlmodelc` | 1-layer LSTM decoder (640 hidden units) |
| `joint_decision.mlmodelc` | Joint network for token prediction + EOU detection |

### Chunk Size Variants

| Variant | Latency | WER (test-clean) | RTFx (M2) |
|---------|---------|-------------------|------------|
| `160ms/` | 160ms | 8.29% | 4.78x |
| `320ms/` | 320ms | 4.87% | 12.48x |

Benchmarked on LibriSpeech test-clean (2620 files, 5.40h audio) on Apple M2.
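
As a rough reading of the table, the sketch below estimates the wall-clock time needed to process the full test set for each variant. It assumes RTFx means audio duration divided by processing time (the usual inverse real-time factor; this README does not define it), and the function name is illustrative.

```swift
import Foundation

/// Estimated wall-clock seconds needed to process `audioSeconds` of audio
/// at a given RTFx (assumed here to be audio duration / processing time).
func processingSeconds(audioSeconds: Double, rtfx: Double) -> Double {
    precondition(rtfx > 0, "RTFx must be positive")
    return audioSeconds / rtfx
}

// LibriSpeech test-clean as benchmarked above: 5.40 hours of audio.
let audioSeconds = 5.40 * 3600.0

for (variant, rtfx) in [("160ms", 4.78), ("320ms", 12.48)] {
    let minutes = processingSeconds(audioSeconds: audioSeconds, rtfx: rtfx) / 60.0
    print("\(variant): about \(String(format: "%.1f", minutes)) min")
}
```

Under that assumption, the 160ms variant needs roughly 68 minutes for the whole set and the 320ms variant roughly 26 minutes.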

## Usage with FluidAudio

```swift
import FluidAudio

let manager = StreamingEouAsrManager()
await manager.initialize()

// Transcribe with EOU detection
await manager.startStreaming(
    eouCallback: { transcript in
        print("Utterance complete: \(transcript)")
    },
    partialCallback: { partial in
        print("Partial: \(partial)")
    }
)

// Feed audio chunks as they arrive
await manager.feedAudio(samples)
```
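
If audio arrives as one large buffer, it may help to slice it into chunk-sized pieces before feeding it. The sketch below assumes 16 kHz mono input (the sample rate is not stated in this README) and uses a hypothetical helper, `splitIntoChunks`, that is not part of FluidAudio. Whether `feedAudio` needs pre-chunked input is not specified here.

```swift
/// Splits `samples` into consecutive chunks of `chunkMs` milliseconds.
/// The final chunk may be shorter than the others.
/// The 16 kHz default sample rate is an assumption, not taken from the README.
func splitIntoChunks(_ samples: [Float], chunkMs: Int, sampleRate: Int = 16_000) -> [[Float]] {
    precondition(chunkMs > 0 && sampleRate > 0, "chunk size and sample rate must be positive")
    let chunkLength = sampleRate * chunkMs / 1000   // samples per chunk
    precondition(chunkLength > 0, "chunk too short for this sample rate")
    return stride(from: 0, to: samples.count, by: chunkLength).map { start in
        Array(samples[start..<min(start + chunkLength, samples.count)])
    }
}

// One second of silence, split for the 320ms variant.
let oneSecond = [Float](repeating: 0, count: 16_000)
let chunks = splitIntoChunks(oneSecond, chunkMs: 320)
print(chunks.count, chunks.first?.count ?? 0, chunks.last?.count ?? 0)
```

At 16 kHz a 320ms chunk holds 5120 samples, so one second of audio yields three full chunks and a shorter remainder of 640 samples.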

## CLI

```bash
# Transcribe a file
swift run fluidaudio parakeet-eou --input audio.wav

# Benchmark
swift run -c release fluidaudio parakeet-eou --benchmark --chunk-size 320
```

## Architecture

120M parameter RNNT (Recurrent Neural Network Transducer) with:
- Encoder: 17-layer FastConformer with cache-aware streaming
- Decoder: 1-layer LSTM, 640 hidden size
- Joint: Linear projection with 1027 output classes (1024 tokens + EOU token + SOS + blank)
- EOU token: ID 1024 signals end-of-utterance
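
To illustrate how the EOU token can be used, here is a sketch of a greedy decision over one joint-network output vector. Only the class count (1027) and the EOU ID (1024) come from the list above. The function name, the argmax rule, and the plain `[Float]` input are assumptions. The IDs of SOS and blank are not stated here, so the sketch does not rely on them.

```swift
let jointClassCount = 1027   // 1024 tokens + EOU + SOS + blank
let eouTokenId = 1024        // from the architecture notes above

/// Returns the highest-scoring class ID and whether it is the EOU token.
/// Greedy argmax is an illustrative choice, not necessarily what FluidAudio does.
func greedyDecision(logits: [Float]) -> (tokenId: Int, isEndOfUtterance: Bool) {
    precondition(logits.count == jointClassCount, "expected one score per output class")
    var best = 0
    for id in 1..<logits.count where logits[id] > logits[best] {
        best = id
    }
    return (best, best == eouTokenId)
}

// Example: a score vector in which the EOU class wins.
var logits = [Float](repeating: 0, count: jointClassCount)
logits[eouTokenId] = 5.0
let decision = greedyDecision(logits: logits)
print(decision.tokenId, decision.isEndOfUtterance)
```

A real decoder would also need to handle the blank symbol in the RNNT loop, which this sketch leaves out.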

## Streaming State

The encoder maintains loopback state between chunks:

| State | Shape | Description |
|-------|-------|-------------|
| `preCache` | `[1, 128, N]` | Mel-level context |
| `cacheLastChannel` | `[17, 1, 70, 512]` | Conformer layer cache |
| `cacheLastTime` | `[17, 1, 512, 8]` | Temporal cache |
| `cacheLastChannelLen` | `[1]` | Cache length tracking |

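
The state shapes listed in this section determine how much data is carried from one chunk to the next. Below is a rough sketch of bookkeeping for them, using plain Swift values rather than CoreML types. The struct and property names are illustrative, and `N` (the preCache length) is left as a parameter because its value is not given here.

```swift
/// Shapes of the encoder's loopback state, as listed in this section.
struct EncoderStateShapes {
    let preCache: [Int]
    let cacheLastChannel = [17, 1, 70, 512]
    let cacheLastTime = [17, 1, 512, 8]
    let cacheLastChannelLen = [1]

    /// `preCacheFrames` is the unspecified `N` dimension of preCache.
    init(preCacheFrames: Int) {
        preCache = [1, 128, preCacheFrames]
    }

    /// Number of scalar elements in a tensor of the given shape.
    static func elementCount(_ shape: [Int]) -> Int {
        shape.reduce(1, *)
    }
}

let shapes = EncoderStateShapes(preCacheFrames: 16)  // 16 is a placeholder, not a real value
print(EncoderStateShapes.elementCount(shapes.cacheLastChannel))  // 17 * 1 * 70 * 512
print(EncoderStateShapes.elementCount(shapes.cacheLastTime))     // 17 * 1 * 512 * 8
```

The two per-layer caches dominate the state: 609,280 and 69,632 elements respectively, independent of the chunk-size variant as far as the shapes listed here indicate.
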
## Export

Converted from PyTorch using coremltools. To re-export:

```bash
python3 Scripts/ParakeetEOU/Conversion/convert_split_encoder.py \
  --output-dir Models/ParakeetEOU \
  --model-id nvidia/parakeet-realtime-eou-120m-v1
```

## License

NVIDIA Open Model License: see
https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/.

Original model: https://huggingface.co/nvidia/parakeet-realtime-eou-120m-v1
