---
language:
  - en
library_name: nemo
datasets:
  - librispeech_asr
  - fisher_corpus
  - mozilla-foundation/common_voice_8_0
  - National-Singapore-Corpus-Part-1
  - vctk
  - voxpopuli
  - europarl
  - multilingual_librispeech
thumbnail: null
tags:
  - automatic-speech-recognition
  - speech
  - audio
  - Transducer
  - TDT
  - FastConformer
  - Conformer
  - pytorch
  - NeMo
  - hf-asr-leaderboard
license: cc-by-4.0
widget:
  - example_title: Librispeech sample 1
    src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
  - example_title: Librispeech sample 2
    src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
  - name: parakeet-tdt_ctc-110m
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: AMI (Meetings test)
          type: edinburghcstr/ami
          config: ihm
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 15.88
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Earnings-22
          type: revdotcom/earnings22
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 12.42
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: GigaSpeech
          type: speechcolab/gigaspeech
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 10.52
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: LibriSpeech (clean)
          type: librispeech_asr
          config: clean
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 2.4
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: LibriSpeech (other)
          type: librispeech_asr
          config: other
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 5.2
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: SPGI Speech
          type: kensho/spgispeech
          config: test
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 2.54
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: tedlium-v3
          type: LIUM/tedlium
          config: release1
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 4.16
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Vox Populi
          type: facebook/voxpopuli
          config: en
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 6.91
metrics:
  - wer
pipeline_tag: automatic-speech-recognition
base_model:
  - nvidia/parakeet-tdt_ctc-110m
---

# Parakeet-TDT-CTC-110M CoreML

NVIDIA's Parakeet-TDT-CTC-110M model converted to CoreML format for efficient inference on Apple Silicon.

## Model Description

This is a hybrid ASR model with a shared Conformer encoder and two decoder heads:

- **CTC Head**: fast greedy decoding, ideal for keyword spotting
- **TDT Head**: Token-and-Duration Transducer decoding for higher-quality transcription

## Architecture

| Component | Description | Size |
|---|---|---|
| Preprocessor | Mel spectrogram extraction | ~1 MB |
| Encoder | Conformer encoder (shared) | ~400 MB |
| CTCHead | CTC output projection | ~4 MB |
| Decoder | TDT prediction network (LSTM) | ~25 MB |
| JointDecision | TDT joint network | ~6 MB |

**Total size:** ~436 MB

## Performance

Benchmarked on the Earnings-22 dataset (772 audio files):

| Metric | Value |
|---|---|
| Keyword recall | 100% (1309/1309) |
| WER | 17.97% |
| RTFx (M4 Pro) | 358x real time |
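
The RTFx figure above is the inverse real-time factor: seconds of audio processed per second of wall-clock time. A minimal sketch of the metric (the durations below are illustrative, not taken from the benchmark):

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Inverse real-time factor: audio duration / processing time."""
    return audio_seconds / wall_seconds

# Illustrative only: an hour of audio transcribed in ten seconds is 360x real time
print(rtfx(3600.0, 10.0))  # 360.0
```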

## Requirements

- macOS 13+ (Ventura or later)
- Apple Silicon (M1/M2/M3/M4)
- Python 3.10+

## Installation

```bash
# Using uv (recommended)
uv sync

# Or using pip
pip install -e .

# For audio file support (WAV, MP3, etc.)
pip install -e ".[audio]"
```

## Usage

### Python Inference

```python
from scripts.inference import ParakeetCoreML

# Load model (from current directory with .mlpackage files)
model = ParakeetCoreML(".")

# Transcribe with TDT (higher quality)
text = model.transcribe("audio.wav", mode="tdt")
print(text)

# Or use CTC for faster keyword spotting
text = model.transcribe("audio.wav", mode="ctc")
print(text)
```

### Command Line

```bash
# TDT decoding (default, higher quality)
uv run scripts/inference.py --audio audio.wav

# CTC decoding (faster, good for keyword spotting)
uv run scripts/inference.py --audio audio.wav --mode ctc
```

## Model Conversion

To convert from the original NeMo model:

```bash
# Install conversion dependencies
uv sync --extra convert

# Run conversion
uv run scripts/convert_nemo_to_coreml.py --output-dir ./model
```

This will:

  1. Download the original model from NVIDIA (nvidia/parakeet-tdt_ctc-110m)
  2. Convert each component to CoreML format
  3. Extract vocabulary and create metadata

## File Structure

```
./
├── Preprocessor.mlpackage    # Audio → Mel spectrogram
├── Encoder.mlpackage         # Mel → Encoder features
├── CTCHead.mlpackage         # Encoder → CTC log probs
├── Decoder.mlpackage         # TDT prediction network
├── JointDecision.mlpackage   # TDT joint network
├── vocab.json                # Token vocabulary (1024 tokens)
├── metadata.json             # Model configuration
├── pyproject.toml            # Python dependencies
├── uv.lock                   # Locked dependencies
└── scripts/                  # Inference & conversion scripts
```
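
`vocab.json` maps token IDs to subword pieces. A minimal sketch of detokenizing decoder output, assuming a SentencePiece-style vocabulary in which `"▁"` marks a word boundary (the exact file schema and the toy pieces below are assumptions, not taken from the repository):

```python
# Real file would be loaded with: json.load(open("vocab.json"))
# Hypothetical pieces standing in for the actual 1024-token vocabulary:
id_to_token = {5: "\u2581he", 9: "llo", 12: "\u2581world"}

def ids_to_text(ids):
    """Join SentencePiece-style pieces; '\u2581' marks a word boundary."""
    return "".join(id_to_token[i] for i in ids).replace("\u2581", " ").strip()

print(ids_to_text([5, 9, 12]))  # hello world
```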

## Decoding Modes

### TDT Mode (Recommended for Transcription)

- Uses Token-and-Duration Transducer decoding
- Higher accuracy (17.97% WER on Earnings-22)
- Predicts both tokens and durations
- Best for full transcription tasks
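
Because the joint network predicts a duration alongside each token, TDT decoding can skip ahead several frames per step instead of scoring every frame. A simplified greedy-loop sketch (the stub below stands in for the real Decoder + JointDecision models, and real TDT also permits zero-duration emissions, which this sketch omits):

```python
BLANK = 0

def tdt_greedy_decode(joint_fn, num_frames):
    """Greedy TDT loop: joint_fn(t, prev_token) -> (best_token, best_duration)."""
    t, prev, out = 0, BLANK, []
    while t < num_frames:
        token, duration = joint_fn(t, prev)
        if token != BLANK:
            out.append(token)
            prev = token
        # Advance the time pointer by the predicted duration (clamped to >= 1 here)
        t += max(duration, 1)
    return out

# Hypothetical deterministic stub in place of the neural joint network
def stub_joint(t, prev):
    table = {0: (5, 2), 2: (0, 1), 3: (7, 3)}  # frame -> (token, duration)
    return table.get(t, (0, 1))

print(tdt_greedy_decode(stub_joint, 6))  # [5, 7]
```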

### CTC Mode (Recommended for Keyword Spotting)

- Greedy CTC decoding
- Faster inference
- 100% keyword recall on Earnings-22
- Best for detecting specific words/phrases
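
Greedy CTC decoding picks the argmax token per frame, then collapses repeats and removes blanks. A minimal sketch of the collapse rule (the blank ID value is an assumption; for a 1024-token vocabulary the blank is typically the extra last index):

```python
def ctc_collapse(token_ids, blank_id=1024):
    """Standard CTC greedy rule: drop consecutive repeats, then drop blanks."""
    out, prev = [], None
    for t in token_ids:
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out

# Repeats merge, blanks separate genuine repeats
print(ctc_collapse([7, 7, 1024, 7, 3, 3, 1024]))  # [7, 7, 3]
```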

## Custom Vocabulary / Keyword Spotting

For keyword spotting, CTC mode with custom-vocabulary boosting achieves 100% recall:

```python
import json

def is_subsequence(needle, haystack):
    """True if `needle` occurs in `haystack` in order (not necessarily contiguously)."""
    it = iter(haystack)
    return all(t in it for t in needle)

# Load custom vocabulary with token IDs
with open("custom_vocab.json") as f:
    keywords = json.load(f)  # {"keyword": [token_ids], ...}

# Run CTC decoding (encoder_output comes from running the Encoder model)
tokens = model.decode_ctc(encoder_output)

# Check for keyword matches
for keyword, expected_ids in keywords.items():
    if is_subsequence(expected_ids, tokens):
        print(f"Found keyword: {keyword}")
```

## License

This model conversion is released under the CC-BY-4.0 license, the same license as the original NVIDIA model.

## Citation

If you use this model, please cite the original NVIDIA work:

```bibtex
@misc{nvidia_parakeet_tdt_ctc,
  title={Parakeet-TDT-CTC-110M},
  author={NVIDIA},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/nvidia/parakeet-tdt_ctc-110m}
}
```

## Acknowledgments