parakeet-realtime-eou-120m-coreml

Duplicate from FluidInference/parakeet-realtime-eou-120m-coreml


	---
	license: other
	license_name: nvidia-open-model-license
	license_link: >-
	  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
	language:
	- en
	metrics:
	- wer
	library_name: nemo
	tags:
	- speech-recognition
	- FastConformer
	- end-of-utterance
	- voice agent
	pipeline_tag: automatic-speech-recognition
	base_model:
	- nvidia/parakeet_realtime_eou_120m-v1
	base_model_relation: finetune
	---


	# Parakeet Realtime EOU 120M — CoreML

	CoreML conversion of [nvidia/parakeet-realtime-eou-120m-v1](https://huggingface.co/nvidia/parakeet-realtime-eou-120m-v1) for streaming speech recognition with end-of-utterance detection on Apple Silicon.

	Used by [FluidAudio](https://github.com/FluidInference/FluidAudio) for real-time transcription.

	## Models

	The RNNT pipeline is split into three CoreML models, exported at two chunk sizes:

	| Model | Description |
	|-------|-------------|
	| `streaming_encoder.mlmodelc` | FastConformer encoder with loopback state caching |
	| `decoder.mlmodelc` | 1-layer LSTM decoder (640 hidden units) |
	| `joint_decision.mlmodelc` | Joint network for token prediction + EOU detection |

	### Chunk Size Variants

	| Variant | Latency | WER (test-clean) | RTFx (M2) |
	|---------|---------|------------------|-----------|
	| `160ms/` | 160 ms | 8.29% | 4.78x |
	| `320ms/` | 320 ms | 4.87% | 12.48x |

	Benchmarked on LibriSpeech test-clean (2620 files, 5.40h audio) on Apple M2.
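
As a rough illustration of what the RTFx column implies, the sketch below estimates how long each variant would take to process the 5.40h benchmark set. It assumes RTFx means audio duration divided by processing time, which is the usual definition but is not stated explicitly here; the function name `processingSeconds` is hypothetical.

```swift
/// Estimated wall-clock seconds needed to process `audioHours` of audio.
/// Assumption: RTFx = audio duration / processing time.
func processingSeconds(audioHours: Double, rtfx: Double) -> Double {
    return audioHours * 3600.0 / rtfx
}

let benchmarkHours = 5.40
let variants: [(name: String, rtfx: Double)] = [("160ms", 4.78), ("320ms", 12.48)]

for variant in variants {
    let minutes = processingSeconds(audioHours: benchmarkHours, rtfx: variant.rtfx) / 60.0
    print("\(variant.name): about \(Int(minutes.rounded())) minutes")
}
```

Under that assumption, the 160 ms variant needs roughly 68 minutes and the 320 ms variant roughly 26 minutes for the full set.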

	## Usage with FluidAudio

	```swift
	import FluidAudio

	let manager = StreamingEouAsrManager()
	await manager.initialize()

	// Transcribe with EOU detection
	await manager.startStreaming(
	    eouCallback: { transcript in
	        print("Utterance complete: \(transcript)")
	    },
	    partialCallback: { partial in
	        print("Partial: \(partial)")
	    }
	)

	// Feed audio chunks as they arrive
	await manager.feedAudio(samples)
	```
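
The snippet above leaves `samples` undefined. As a minimal sketch of one way to produce chunks that match the 160 ms and 320 ms variants, the helper below splits a mono buffer into fixed-duration pieces. The 16 kHz sample rate and both helper names (`samplesPerChunk`, `splitIntoChunks`) are assumptions for illustration, not part of FluidAudio; `feedAudio` may well accept buffers of any length.

```swift
/// Number of samples in one chunk of `chunkMs` milliseconds.
/// Assumption: 16 kHz mono input.
func samplesPerChunk(chunkMs: Int, sampleRate: Int = 16_000) -> Int {
    return sampleRate * chunkMs / 1000
}

/// Splits a buffer into consecutive chunks; the final chunk may be shorter.
func splitIntoChunks(_ samples: [Float], chunkMs: Int, sampleRate: Int = 16_000) -> [[Float]] {
    let size = samplesPerChunk(chunkMs: chunkMs, sampleRate: sampleRate)
    precondition(size > 0, "chunk size must be positive")
    return stride(from: 0, to: samples.count, by: size).map { start in
        Array(samples[start..<min(start + size, samples.count)])
    }
}

// One second of silence split into 320 ms chunks.
let oneSecond = [Float](repeating: 0, count: 16_000)
let chunks = splitIntoChunks(oneSecond, chunkMs: 320)
print(chunks.count, chunks.last?.count ?? 0)
```

At 16 kHz a 160 ms chunk is 2560 samples and a 320 ms chunk is 5120 samples, so one second of audio yields three full 320 ms chunks plus a 640-sample remainder.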

	## CLI

	```bash
	# Transcribe a file
	swift run fluidaudio parakeet-eou --input audio.wav

	# Benchmark
	swift run -c release fluidaudio parakeet-eou --benchmark --chunk-size 320
	```

	## Architecture

	120M parameter RNNT (Recurrent Neural Network Transducer) with:
	- Encoder: 17-layer FastConformer with cache-aware streaming
	- Decoder: 1-layer LSTM, 640 hidden size
	- Joint: Linear projection with 1027 output classes (1024 tokens + EOU token + SOS + blank)
	- EOU token: ID 1024 signals end-of-utterance
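
To make the output layout concrete, here is a minimal sketch of a greedy decision over one step of joint-network output. Only the class count (1027) and the EOU ID (1024) come from the list above; the greedy arg-max strategy and the name `greedyDecision` are assumptions, and the exported `joint_decision.mlmodelc` may already perform this step internally.

```swift
/// Output layout from the architecture list: 1027 classes, EOU at ID 1024.
let outputClassCount = 1027
let eouTokenId = 1024

/// Picks the highest-scoring class for one step and reports whether it is EOU.
/// Assumption: plain greedy arg-max over raw logits.
func greedyDecision(logits: [Float]) -> (tokenId: Int, isEou: Bool) {
    precondition(logits.count == outputClassCount, "expected 1027 logits")
    var best = 0
    for i in 1..<logits.count where logits[i] > logits[best] {
        best = i
    }
    return (best, best == eouTokenId)
}

// Synthetic logits in which the EOU class has the highest score.
var logits = [Float](repeating: 0, count: outputClassCount)
logits[eouTokenId] = 5.0
let decision = greedyDecision(logits: logits)
print(decision.tokenId, decision.isEou)
```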

	## Streaming State

	The encoder maintains loopback state between chunks:
	| State | Shape | Description |
	|-------|-------|-------------|
	| `preCache` | `[1, 128, N]` | Mel-level context |
	| `cacheLastChannel` | `[17, 1, 70, 512]` | Conformer layer cache |
	| `cacheLastTime` | `[17, 1, 512, 8]` | Temporal cache |
	| `cacheLastChannelLen` | `[1]` | Cache length tracking |
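
For a sense of how much state is carried between chunks, the sketch below computes element counts from the shapes in the table. `preCache` is left out because its last dimension `N` is not given. The 4-byte element size is an assumption (the stored data types are not stated), and `elementCount` is a hypothetical helper.

```swift
/// Shapes copied from the streaming-state table; preCache omitted (N unknown).
let stateShapes: [(name: String, shape: [Int])] = [
    ("cacheLastChannel", [17, 1, 70, 512]),
    ("cacheLastTime", [17, 1, 512, 8]),
    ("cacheLastChannelLen", [1]),
]

/// Total number of elements in a tensor of the given shape.
func elementCount(_ shape: [Int]) -> Int {
    return shape.reduce(1, *)
}

for state in stateShapes {
    let count = elementCount(state.shape)
    // Assumption: 4 bytes per element.
    print("\(state.name): \(count) elements, \(count * 4) bytes at 4 bytes each")
}
```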

	## Export

	Converted from PyTorch using coremltools. To re-export:

	```bash
	python3 Scripts/ParakeetEOU/Conversion/convert_split_encoder.py \
	  --output-dir Models/ParakeetEOU \
	  --model-id nvidia/parakeet-realtime-eou-120m-v1
	```

	## License

	NVIDIA Open Model License: see https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/.

	Original model: https://huggingface.co/nvidia/parakeet-realtime-eou-120m-v1
