Parakeet TDT+CTC 110M β€” GGUF (ggml-quantised)

GGUF / ggml conversions of nvidia/parakeet-tdt_ctc-110m for use with the crispasr CLI from CrispStrobe/CrispASR.

The smallest Parakeet variant: 110 M parameters, 17-layer FastConformer encoder with both TDT and CTC heads. Designed for low-RAM hosts and high-throughput batch transcription.

  • English-only, mixed-case + punctuation output
  • Hybrid TDT+CTC head β€” runtime defaults to CTC decode (the TDT path needs a 2-LSTM predictor which this model doesn't have; single-LSTM predictor + CTC head is a deliberate trade for size)
  • ~45Γ— realtime on M1 Metal with Q4_K β€” fastest in the family
  • CC-BY-4.0 licence

This repo provides three quantisations, all converted from the same .nemo checkpoint via the convert-parakeet-to-gguf.py script and quantised with crispasr-quantize.

Files

File Size Notes
parakeet-tdt_ctc-110m.gguf 230 MB F16, full precision
parakeet-tdt_ctc-110m-q8_0.gguf 139 MB Q8_0, near-lossless
parakeet-tdt_ctc-110m-q4_k.gguf 91 MB Q4_K β€” recommended default

Smoke test on samples/jfk.wav (11 s clip, M1 Metal):

Quant Time Realtime Output
F16 0.24 s 45.7Γ— "And so, my fellow Americans, askk not what your country can do for you. Ask what you can do for your country."
Q8_0 0.25 s 44.6Γ— (identical)
Q4_K 0.25 s 43.8Γ— (identical)

Note: the model emits "askk" instead of "ask" on this clip β€” a small-model artifact, not a runtime bug. Larger Parakeet variants (v2 / tdt-1.1b / tdt_ctc-1.1b) get it right.

Quick Start

# 1. Build the runtime
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc) --target crispasr

# 2a. Auto-download via the registry key
./build/bin/crispasr -m parakeet-tdt_ctc-110m --auto-download -f your-audio.wav

# 2b. Or explicit download + load
hf download cstr/parakeet-tdt_ctc-110m-GGUF \
    parakeet-tdt_ctc-110m-q4_k.gguf --local-dir .
./build/bin/crispasr -m parakeet-tdt_ctc-110m-q4_k.gguf -f your-audio.wav

The runtime detects pred_layers=1 + has_ctc=True at load time and automatically flips to CTC decode β€” no flag needed. See parakeet_init_from_file in src/parakeet.cpp.

When to pick this over the other Parakeet variants

Scenario Pick
English, tightest RAM (mobile / edge / embedded) 110m (this repo)
English, best WER, ~600 M params cstr/parakeet-tdt-0.6b-v2-GGUF
Multilingual (25 EU languages) cstr/parakeet-tdt-0.6b-v3-GGUF
English, long-tail vocab cstr/parakeet-tdt-1.1b-GGUF

Model architecture

Component Details
Encoder 17-layer FastConformer, d=512, 8 heads, head_dim=64, FFN=2048, conv kernel=9
Subsampling Conv2d dw_striding stack, 8Γ— temporal (100 β†’ 12.5 fps)
Predictor 1-layer LSTM, hidden 640 (smaller than the standard 2-LSTM)
Joint head enc(512 β†’ 640) + pred(640 β†’ 640) β†’ ReLU β†’ linear(640 β†’ 1029) β€” TDT, 5 durations
CTC head linear(512 β†’ 1025) β€” used by default at runtime
Vocab 1024 SentencePiece tokens (English) + blank
Audio 16 kHz mono, 80 mel bins, n_fft=512, hop=160, win=400
Parameters ~110 M

Why the runtime auto-flips to CTC

Hybrid TDT+CTC models normally let the user choose between the two decoders. But this checkpoint ships with pred_layers=1 β€” a single LSTM rather than the usual two β€” which is too shallow for the TDT prediction network to produce useful joint scores. Upstream's recommended decode is CTC for this checkpoint. The C++ runtime detects this at load (pred_layers < 2 && has_ctc) and sets decode_ctc=true automatically. You can opt back into TDT (not recommended) by removing that auto-flip in parakeet_init_from_file.

Attribution

License

CC-BY-4.0, inherited from the base model.

Downloads last month
164
GGUF
Model size
0.1B params
Architecture
parakeet
Hardware compatibility
Log In to add your hardware

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for cstr/parakeet-tdt_ctc-110m-GGUF

Quantized
(15)
this model