Parakeet-DE-Med — GGUF (ggml-quantised)

GGUF / ggml conversions of johannhartmann/parakeet_de_med for use with the parakeet-main CLI from CrispStrobe/CrispASR@parakeet.

parakeet_de_med is Johann Hartmann's PEFT fine-tune of the decoder and joint head of nvidia/parakeet-tdt-0.6b-v3, specialised for German medical documentation (Arztbriefe, i.e. physicians' letters). On the German medical test set it scores 3.28% WER against the base model's 11.73%, a 72% relative reduction.
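The 72% figure follows directly from the two WER values quoted above. A minimal sketch of the arithmetic (the helper name `rel_wer_reduction` is illustrative, not part of any tool in this repo):

```shell
# Relative WER reduction in percent: (baseline - fine-tuned) / baseline * 100
rel_wer_reduction() {
    # $1 = baseline WER in percent, $2 = fine-tuned WER in percent
    awk -v base="$1" -v ft="$2" 'BEGIN { printf "%.1f\n", (base - ft) / base * 100 }'
}

rel_wer_reduction 11.73 3.28   # prints 72.0
```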

The fine-tune freezes the encoder and trains only the TDT decoder + joint head (18.1M out of 627M parameters, 2.89%). This means:

  • The architecture is identical to parakeet-tdt-0.6b-v3 (24-layer FastConformer encoder, 2-layer LSTM predictor, 8198-class TDT joint head)
  • The same GGUF converter, runtime, and CLI work as-is
  • The frozen encoder still uses the base model's automatic language detection. This works well for clean German speech; for accented or noisy audio you may want to fall back to a different runtime (see the table under "Which runtime should I use?")

Files

| File | Size | Notes |
|---|---|---|
| parakeet_de_med.gguf | 1.26 GB | F16 (half precision), unquantised |
| parakeet_de_med-q8_0.gguf | 711 MB | Q8_0, near-lossless |
| parakeet_de_med-q5_0.gguf | 516 MB | Q5_0 |
| parakeet_de_med-q4_k.gguf | 467 MB | Q4_K, recommended default |
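As a rough sanity check on the file sizes, the average storage cost per weight can be estimated from the size and the 627M parameter count. This sketch assumes the sizes are decimal megabytes and ignores GGUF metadata, so it likely overstates the nominal bit width of each format; `bits_per_weight` is a hypothetical helper.

```shell
# Average bits per weight = file size in bits / parameter count
bits_per_weight() {
    # $1 = file size in MB (decimal, assumed), $2 = parameters in millions
    awk -v mb="$1" -v mparams="$2" 'BEGIN { printf "%.2f\n", mb * 8 / mparams }'
}

bits_per_weight 1260 627   # F16:  prints 16.08
bits_per_weight 711 627    # Q8_0: prints 9.07
bits_per_weight 516 627    # Q5_0: prints 6.58
bits_per_weight 467 627    # Q4_K: prints 5.96
```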

All quantisations produce the same text on the German verification clip:

Leider zu spät. Leider zu spät.
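To repeat this agreement check on your own audio, something like the following could be used. It is a sketch under two assumptions: the binary path matches the Quick start build, and the runtime writes the transcript to stdout and its logs to stderr. `compare_quants` and `PARAKEET_BIN` are illustrative names.

```shell
# Path to the runtime; override PARAKEET_BIN if your build lives elsewhere.
PARAKEET_BIN="${PARAKEET_BIN:-./build/bin/parakeet-main}"

# Transcribe one clip with each model and stop at the first output that
# differs from the first model's output.
compare_quants() {
    clip="$1"; shift
    first=1
    for model in "$@"; do
        out="$("$PARAKEET_BIN" -m "$model" -f "$clip" -t 8 2>/dev/null)"
        if [ "$first" -eq 1 ]; then
            ref="$out"; first=0
        elif [ "$out" != "$ref" ]; then
            echo "MISMATCH: $model"
            return 1
        fi
    done
    echo "all outputs identical"
}

# Usage (left commented so the snippet can be sourced without the models):
# compare_quants german_audio.wav parakeet_de_med.gguf parakeet_de_med-q8_0.gguf \
#     parakeet_de_med-q5_0.gguf parakeet_de_med-q4_k.gguf
```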

Quick start

# 1. Build the runtime
git clone -b parakeet https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc) --target parakeet-main

# 2. Download a quantisation
huggingface-cli download cstr/parakeet_de_med-GGUF \
    parakeet_de_med-q4_k.gguf --local-dir .

# 3. Transcribe German audio
./build/bin/parakeet-main \
    -m parakeet_de_med-q4_k.gguf \
    -f german_audio.wav -t 8

The runtime is the same parakeet-main binary used for the base parakeet-tdt-0.6b-v3. All the usual flags work: -vad-model for Silero VAD slicing, -ck N for fixed chunking, -ml N for max chars per line, -osrt/-ovtt/-ot for subtitle output, -v for per-token timestamps via the TDT duration head.
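The Architecture table lists 16 kHz mono as the audio format. If your recordings are in another format, a conversion step such as the one below may help. This is a sketch that assumes ffmpeg is installed and that the runtime expects 16-bit PCM WAV input, which this card does not state explicitly; `to_wav16k` is a hypothetical helper.

```shell
# ffmpeg executable; override FFMPEG to point at a specific build.
FFMPEG="${FFMPEG:-ffmpeg}"

# Resample to 16 kHz, downmix to mono, and write 16-bit PCM WAV.
to_wav16k() {
    # $1 = input audio file, $2 = output WAV path
    "$FFMPEG" -i "$1" -ar 16000 -ac 1 -c:a pcm_s16le "$2"
}

# Usage (commented out so sourcing the snippet has no side effects):
# to_wav16k dictation.mp3 german_audio.wav
```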

Word-level timestamps

Like the base parakeet model, this fine-tune emits TDT durations as part of decoding, so word-level timestamps come for free at a granularity of one encoder frame (80 ms). No separate forced-alignment model is needed:

$ ./build/bin/parakeet-main -m parakeet_de_med-q4_k.gguf -f german.wav -t 8 -v
[ 0.32s →  0.64s]  Der
[ 0.64s →  1.04s]  Patient
[ 1.04s →  1.32s]  klagt
[ 1.32s →  1.92s]  über
...

Which runtime should I use?

For German speech specifically:

| Use case | Right tool |
|---|---|
| German medical documentation | parakeet_de_med-q4_k.gguf (this repo) |
| General German ASR with explicit language control | canary-1b-v2-q4_k.gguf (-sl de -tl de) |
| German → English translation | canary-1b-v2-q4_k.gguf (-sl de -tl en) |
| General multilingual ASR (auto-detect) | parakeet-tdt-0.6b-v3-q4_k.gguf |
| Lowest English WER | cohere-transcribe-q4_k.gguf |

Architecture (inherited from base)

| Component | Details |
|---|---|
| Encoder | 24-layer FastConformer (frozen), d=1024, 8 heads, head_dim=128, FFN=4096, conv kernel=9 |
| Subsampling | Conv2d dw_striding stack, 8× temporal (100 → 12.5 fps) |
| Predictor | 2-layer LSTM, hidden 640, embed 8193 × 640 (fine-tuned) |
| Joint head | enc(1024 → 640) + pred(640 → 640) → ReLU → linear(640 → 8198) (fine-tuned) |
| Vocab | 8192 SentencePiece tokens (multilingual, but generation biased toward German medical) |
| Audio | 16 kHz mono, 128 mel bins, n_fft=512, hop=160, win=400 |
| Parameters | 627M total, 18.1M trained (2.89%) |
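The three size figures in the table (8192 tokens, 8193 embedding rows, 8198 joint outputs) can be reconciled with simple arithmetic. The interpretation in the comments, one blank symbol plus five TDT duration bins, is an assumption based on how TDT models are usually laid out and is not stated on this card.

```shell
vocab=8192        # SentencePiece tokens (Vocab row)
embed_rows=8193   # predictor embedding rows (Predictor row)
joint_out=8198    # joint head outputs (Joint head row)

# One extra embedding row beyond the vocabulary, presumably the blank symbol.
blank=$((embed_rows - vocab))
# Remaining joint outputs, presumably the TDT duration bins.
durations=$((joint_out - embed_rows))

echo "tokens=$vocab blank=$blank durations=$durations"
# prints tokens=8192 blank=1 durations=5
```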

Attribution

Related

License

CC-BY-4.0, inherited from both the base model and the fine-tune. Use of these GGUF files must comply with the CC-BY-4.0 license, including attribution to the NVIDIA NeMo team and Johann Hartmann.
