| --- |
| license: cc-by-4.0 |
| language: |
| - en |
| tags: |
| - kenlm |
| - n-gram |
| - language-model |
| - speech-recognition |
| - parakeet |
| - tdt |
| - asr |
| - shallow-fusion |
| - vernacula |
| library_name: kenlm |
| --- |
| |
| # KenLM n-gram LMs for Parakeet TDT shallow fusion |
|
|
| Subword-level KenLM ARPAs built over the [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) |
| tokenizer, intended for use with [Vernacula](https://github.com/christopherthompson81/vernacula)'s |
| Parakeet beam decoder via shallow LM fusion. Each file's "words" are Parakeet |
| subword IDs (integers), not natural-language words — this lets the decoder |
| score the LM at every beam expansion without any subword-to-word round-trip. |
|
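To make the "words are subword IDs" convention concrete, here is a minimal sketch of turning text into the space-separated ID string the LM is keyed on. The `encode` callable stands in for the Parakeet tokenizer and is a hypothetical placeholder; the toy vocabulary and its IDs are invented for illustration and are not the real Parakeet vocabulary.

```python
from typing import Callable, Iterable, List


def ids_to_lm_string(ids: Iterable[int]) -> str:
    """Render subword IDs as the space-separated 'words' the LM expects."""
    return " ".join(str(i) for i in ids)


def text_to_lm_string(text: str, encode: Callable[[str], List[int]]) -> str:
    """Encode text with a tokenizer callable, then stringify the IDs."""
    return ids_to_lm_string(encode(text))


# Toy stand-in for the real tokenizer (hypothetical IDs, illustration only).
_TOY_VOCAB = {"uh": 42, "huh": 17, "yeah": 5}


def toy_encode(text: str) -> List[int]:
    return [_TOY_VOCAB[word] for word in text.lower().split()]


print(text_to_lm_string("uh huh yeah", toy_encode))  # prints: 42 17 5
```

In real use, `encode` would wrap whatever tokenizer loads the `nvidia/parakeet-tdt-0.6b-v3` vocabulary.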
|
| Other Parakeet checkpoints share vocabulary layouts, so these LMs may also |
| work with them, but only v3 has been validated. |
|
|
| ## Files |
|
|
| | File | Corpus | Order | Size (gz) | Target register | |
| |---|---|---|---|---| |
| | `en-general.arpa.gz` | 3× GigaSpeech `s` + 1× People's Speech `clean` (~22 M effective words) | 4 | 67 MB | Conversational English with cased + punctuated output | |
| | `en-medical.arpa.gz` | 5× MTSamples + 2× synthetic clinical dialogue + 2× class-aware drug dialogue (~77 M effective words) | 3 | **17 MB** | Medical English — clinical dictation + patient↔doctor dialogue + specialty drug names | |
|
|
| More languages / domains coming as they're validated. |
|
|
| ## Design: speech-register per-domain corpus |
|
|
| Early iterations of `en-medical` mixed MTSamples (spoken dictation) with |
| HealthCareMagic (patient-written forum Q&A) and PubMed abstracts (formal |
| written prose), then layered that over an `en-general` base. This |
| performed worse than plain `en-general` on medical-entity F1. |
|
|
| Investigation showed shallow fusion is a **style predictor, not a |
| knowledge predictor**: the LM biases the decoder toward sequences it's |
| seen, and written-prose medical text pushes the decoder into patterns |
| that don't match conversational clinical audio. The "general base" layer |
| was helping purely because GigaSpeech and People's Speech are spoken |
| transcripts — *not* because of their domain coverage. |
|
|
| The current `en-medical` is therefore built entirely from spoken-register |
| medical content: |
|
|
| - **MTSamples** (~2.2 M words, natural clinical dictation: H&P, SOAP notes, |
| op reports) — upweighted 5× to compensate for its small raw size. |
| - **CodCodingCode/cleaned-clinical-conversations** (~25 M deduped words of |
| synthetic doctor↔patient dialogue spanning dozens of specialties and |
| presentations) — upweighted 2×. |
|
|
| No `en-general` base, no forum text, no journal abstracts. The result |
| matches earlier layered variants on fluency and disease recall while |
| being 3–4× smaller. |
|
|
| ## Specialty drug coverage via class-aware template dialogue |
|
|
| MTSamples + the synthetic-dialogue corpus use mostly lay drug names |
| ("paracetamol", "ibuprofen"); specialty drugs (`sertraline`, |
| `olanzapine`, `apixaban`, `paclitaxel`) appear rarely. Earlier log-prob |
| probes showed weak per-token priors on these (−2.0 to −3.0, vs |
| −0.7 to −1.0 for well-covered phrases). |
|
|
| The current build adds 2× of template-generated speech-register drug |
| dialogue, filled from a curated catalog of ~330 generic drug names |
| across 35 WHO-ATC-style classes paired with class-appropriate |
| conditions: |
|
|
| - "started on lithium for bipolar disorder" |
| - "I've been taking apixaban for my atrial fibrillation" |
| - "patient is on paclitaxel for breast cancer" |
| - "can I get a refill of my sertraline" |
| |
| Class-aware pairing means we don't produce implausible combinations |
| like "lithium for sinusitis". This lifts specialty-drug log-probs to |
| the −0.7 to −1.2 range (comparable to well-covered general phrases) |
| without degrading general fluency. |
|
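The class-aware pairing can be sketched roughly as below. The catalog layout, class names, and templates are assumptions made for this example; the real `drug_classes.json` may be structured differently, and this is not the actual generator script.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical miniature catalog; the real drug_classes.json layout may differ.
CATALOG: Dict[str, Dict[str, List[str]]] = {
    "mood_stabilizer": {"drugs": ["lithium"], "conditions": ["bipolar disorder"]},
    "anticoagulant": {"drugs": ["apixaban"], "conditions": ["atrial fibrillation"]},
    "antineoplastic": {"drugs": ["paclitaxel"], "conditions": ["breast cancer"]},
}

TEMPLATES = [
    "started on {drug} for {condition}",
    "I've been taking {drug} for my {condition}",
    "patient is on {drug} for {condition}",
]


def class_aware_pairs(catalog) -> List[Tuple[str, str]]:
    """Pair each drug only with conditions listed for its own class."""
    return [
        (drug, condition)
        for entry in catalog.values()
        for drug in entry["drugs"]
        for condition in entry["conditions"]
    ]


def generate(catalog, templates, n: int, seed: int = 0) -> List[str]:
    """Sample n template sentences from the class-aware (drug, condition) pairs."""
    rng = random.Random(seed)
    pairs = class_aware_pairs(catalog)
    sentences = []
    for _ in range(n):
        drug, condition = rng.choice(pairs)
        sentences.append(rng.choice(templates).format(drug=drug, condition=condition))
    return sentences


for line in generate(CATALOG, TEMPLATES, n=3):
    print(line)
```

Because pairs are drawn within a class, a combination such as "lithium for sinusitis" cannot be produced unless the catalog itself lists it.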
|
| On PriMock-57, this build ties the earlier variant on entity F1, because |
| PriMock audio contains no specialty-drug mentions that would exercise the |
| improved priors. Users with dictated psychiatry / oncology / |
| cardiology content should see the benefit directly. |
|
|
| ## Usage |
|
|
| ### In Vernacula (recommended) |
|
|
| Settings → Speech Recognition → Language model → pick "English (General)". |
| The app downloads the file automatically and wires it into the beam decoder. |
|
|
| ### From the Vernacula CLI |
|
|
| ```bash |
| vernacula --audio sample.wav --model <parakeet-dir> \ |
| --lm en-general.arpa.gz \ |
| --lm-weight 0.15 |
| ``` |
|
|
| Passing `--lm` auto-bumps beam width to 4 (fusion has no effect in greedy |
| mode). Typical fusion weight is 0.1–0.3; 0.15 is the default used by |
| Vernacula's Settings picker. |
|
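As a rough illustration of what the fusion weight does, the sketch below ranks candidate beam expansions by acoustic log-probability plus a weighted LM log-probability. This is a generic shallow-fusion formulation, not Vernacula's decoder code; the function names, the candidate scores, the toy LM, and the log10-to-natural-log conversion are all assumptions made for the example.

```python
import math
from typing import Callable, Dict, List, Tuple

LN10 = math.log(10.0)  # KenLM reports log10 scores; convert to natural log


def fuse(acoustic_logprob: float, lm_log10: float, lm_weight: float) -> float:
    """Shallow fusion: acoustic score plus weighted LM score (natural log)."""
    return acoustic_logprob + lm_weight * lm_log10 * LN10


def rank_expansions(
    candidates: Dict[int, float],
    lm_log10: Callable[[int], float],
    lm_weight: float = 0.15,
) -> List[Tuple[int, float]]:
    """Sort candidate token IDs by fused score, best first."""
    scored = [
        (tok, fuse(acoustic, lm_log10(tok), lm_weight))
        for tok, acoustic in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical numbers: token 7 is slightly preferred acoustically,
# but the toy LM strongly prefers token 9.
acoustic_scores = {7: math.log(0.52), 9: math.log(0.48)}


def toy_lm(tok: int) -> float:
    return {7: -3.0, 9: -0.8}[tok]


print(rank_expansions(acoustic_scores, toy_lm, lm_weight=0.0)[0][0])   # acoustic only
print(rank_expansions(acoustic_scores, toy_lm, lm_weight=0.15)[0][0])  # with fusion
```

With a weight of 0 the acoustic preference wins; at 0.15 the LM prior is enough to flip a near-tie, which is the behaviour fusion relies on.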
|
| ### Directly with the KenLM Python bindings |
|
|
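| ```python |
| import kenlm |
| |
| # Load the gzipped ARPA (KenLM must be built with zlib to read .gz files). |
| lm = kenlm.Model("en-general.arpa.gz") |
| |
| # Tokens must be Parakeet subword IDs, stringified and space-separated. |
| # score() returns a log10 probability; bos=False / eos=False scores the |
| # sequence as a fragment, without sentence-boundary tokens. |
| score = lm.score("42 17 5", bos=False, eos=False) |
| ``` |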
|
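Per-token log-prob probes like those mentioned in the drug-coverage section can be approximated as follows. The sketch relies on the `full_scores` method of the KenLM Python bindings, which yields one `(log10 prob, n-gram length, OOV flag)` tuple per token; `StubLM` is a hypothetical stand-in with made-up scores so the snippet runs without the model file.

```python
from typing import Iterator, Tuple


def mean_token_log10(lm, id_string: str) -> float:
    """Average per-token log10 probability of a space-separated ID string."""
    scores = [
        log10_prob
        for log10_prob, _ngram_len, _is_oov in lm.full_scores(
            id_string, bos=False, eos=False
        )
    ]
    if not scores:
        raise ValueError("empty token sequence")
    return sum(scores) / len(scores)


class StubLM:
    """Hypothetical stand-in mimicking kenlm.Model.full_scores (made-up scores)."""

    def full_scores(
        self, sentence: str, bos: bool = True, eos: bool = True
    ) -> Iterator[Tuple[float, int, bool]]:
        for tok in sentence.split():
            yield (-1.0 if tok == "42" else -2.5, 1, False)


# Swap StubLM() for kenlm.Model("en-medical.arpa.gz") to probe the real LM.
print(mean_token_log10(StubLM(), "42 17 5"))
```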
|
| ## How `en-medical.arpa.gz` was built |
|
|
| ```bash |
| # 1. Extract MTSamples (galileo-ai/medical_transcription_40, text column) |
| #    into mtsamples.txt. |
| # 2. Extract CodCodingCode/cleaned-clinical-conversations into |
| #    synthetic-dialogue.txt, stripping DOCTOR:/PATIENT: prefixes and |
| #    deduping turns across overlapping rows. |
| # 3. Generate ~8M words of class-aware drug dialogue into drug-dialogue.txt |
| #    using the curated drug_classes.json catalog |
| #    (see scripts/kenlm_build/ in Vernacula). |
| # 4. Layer the corpora with their upweights (5x, 2x, 2x): |
| { |
|   for _ in 1 2 3 4 5; do cat mtsamples.txt; done |
|   for _ in 1 2; do cat synthetic-dialogue.txt; done |
|   for _ in 1 2; do cat drug-dialogue.txt; done |
| } > en-medical.corpus.txt |
| # 5. Tokenise with the nvidia/parakeet-tdt-0.6b-v3 tokenizer.json, then |
| #    build the ARPA (--prune takes one unquoted threshold per order): |
| #      lmplz --order 3 --prune 0 0 1 --discount_fallback |
| # 6. gzip the ARPA. |
| ``` |
|
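The "~77 M effective words" figure in the Files table is consistent with the raw sizes and upweights quoted in this card. A quick check, using the approximate word counts stated above (2.2 M, 25 M, and 8 M):

```python
# (raw words in millions, repeat count), as quoted in this card
corpora = {
    "mtsamples": (2.2, 5),
    "synthetic-dialogue": (25.0, 2),
    "drug-dialogue": (8.0, 2),
}

# Effective size is raw size times the number of times the corpus is repeated.
effective = {name: raw * reps for name, (raw, reps) in corpora.items()}
total = sum(effective.values())

for name, words in effective.items():
    print(f"{name}: {words:.0f} M effective words ({words / total:.0%} of corpus)")
print(f"total: {total:.0f} M")
```

Despite the 5× upweight, MTSamples still makes up the smallest share of the layered corpus.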
|
| Upstream corpora have public-redistribution precedent in the medical NLP |
| literature; attribution is required under the CC-BY-4.0 umbrella of this |
| derivative. |
|
|
| ## How `en-general.arpa.gz` was built |
|
|
| ``` |
| # Corpus: |
| # 3x GigaSpeech `s` subset (Apache 2.0, cased + punctuation-tagged) |
| # 1x People's Speech `clean` subset (CC-BY-4.0, lowercase, no punct) |
| # Rationale: People's Speech carries conversational-register priors |
| # (backchannels, disfluencies). GigaSpeech carries the case + punctuation |
| # style Parakeet's output expects. 3x upweight on GigaSpeech balances the |
| # raw size asymmetry so the case signal survives without being drowned out. |
| |
| # Tokenised with nvidia/parakeet-tdt-0.6b-v3 tokenizer.json: |
| # ~27M subwords across ~1M sentences. |
| |
| # Built with KenLM's lmplz: |
| lmplz -o 4 --prune 0 0 1 1 --discount_fallback \ |
| --vocab_estimate 8193 \ |
| --text en-mixed.tok --arpa en-general.arpa |
| gzip en-general.arpa |
| ``` |
|
|
| See Vernacula's [`scripts/kenlm_build/`](https://github.com/christopherthompson81/vernacula/tree/main/scripts/kenlm_build) for the exact scripts. |
|
|
| ## Validation |
|
|
| On a 600 s en-US conversational sample (157 VAD segments, held-out from |
| the corpus), Vernacula's Parakeet decoder exhibits a known beam=4 |
| multilingual-drift regression where an English backchannel "Uh uh. Ya." |
| transcribes as Spanish "ajá, ya". With this LM fused at weight 0.15 the |
| line recovers to "Uh uh, yeah." and ~500 other lines stay unchanged vs |
| greedy decoding — proper nouns preserved, punctuation preserved. |
|
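A comparison of this kind, counting lines that differ between two decoding runs, can be sketched as below. The helper name and the toy transcripts are assumptions for illustration; they are not the actual validation data or tooling.

```python
from typing import List, Tuple


def diff_lines(
    reference: List[str], candidate: List[str]
) -> Tuple[int, List[Tuple[int, str, str]]]:
    """Return (number of unchanged lines, list of (index, ref, cand) changes)."""
    if len(reference) != len(candidate):
        raise ValueError("transcripts must have the same number of lines")
    changed = [
        (i, ref, cand)
        for i, (ref, cand) in enumerate(zip(reference, candidate))
        if ref != cand
    ]
    return len(reference) - len(changed), changed


# Toy transcripts (illustration only).
greedy = ["Hello there.", "Uh uh. Ya.", "See you in Boston."]
fused = ["Hello there.", "Uh uh, yeah.", "See you in Boston."]

unchanged, changes = diff_lines(greedy, fused)
print(f"{unchanged} unchanged, {len(changes)} changed")
for index, ref, cand in changes:
    print(f"line {index}: {ref!r} -> {cand!r}")
```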
|
| ## License & attribution |
|
|
| These LMs are derivatives of their training corpora. They are released |
| under **CC-BY-4.0** (the union of their sources' terms). |
|
|
| Upstream corpora (attribution required): |
|
|
| - [MLCommons/peoples_speech](https://huggingface.co/datasets/MLCommons/peoples_speech) |
| — transcripts portion, CC-BY-4.0. (en-general) |
| - [speechcolab/gigaspeech](https://huggingface.co/datasets/speechcolab/gigaspeech) |
| — transcripts portion, Apache 2.0. (en-general) |
| - [galileo-ai/medical_transcription_40](https://huggingface.co/datasets/galileo-ai/medical_transcription_40) |
| — MTSamples mirror, clinical dictation. (en-medical) |
| - [CodCodingCode/cleaned-clinical-conversations](https://huggingface.co/datasets/CodCodingCode/cleaned-clinical-conversations) |
| — synthetic (LLM-generated) doctor↔patient dialogues across a broad |
| range of conditions and specialties. (en-medical) |
|
|
| ## Caveats |
|
|
| - Subword-ID-keyed. Not a word-level LM — you can't load it in a word-level |
| ASR decoder without first mapping through Parakeet's tokenizer. |
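| - `en-general` is a 4-gram built with `--prune 0 0 1 1` (drops 3-grams and |
| 4-grams seen exactly once); `en-medical` is a 3-gram built with |
| `--prune 0 0 1` (drops 3-grams seen exactly once). If you need tighter |
| priors, rebuild `en-general` with `--prune 0 0 0 0` at the cost of |
| roughly 3× file size. |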
| - Best at capturing local lexical choice (e.g. "uh huh" vs "ajá"). Doesn't |
| carry sentence-level priors that might rescue wholly-ambiguous short |
| utterances (e.g. a standalone "ajá" with no context). |
|
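To illustrate what count-threshold pruning does, the sketch below counts n-grams in a toy ID sequence and drops those whose count is at or below a per-order threshold. It is a simplified illustration only: `lmplz` applies pruning inside its full estimation pipeline, and the helper names and toy sequence here are invented.

```python
from collections import Counter
from typing import Dict, List, Sequence


def count_ngrams(tokens: Sequence[str], order: int) -> Counter:
    """Count all contiguous n-grams of the given order."""
    return Counter(
        tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)
    )


def prune(tokens: Sequence[str], thresholds: List[int]) -> Dict[int, Counter]:
    """Keep n-grams whose count exceeds the threshold for their order."""
    kept = {}
    for order, threshold in enumerate(thresholds, start=1):
        counts = count_ngrams(tokens, order)
        kept[order] = Counter(
            {gram: c for gram, c in counts.items() if c > threshold}
        )
    return kept


tokens = "42 17 5 42 17 5 42 17 9".split()
kept = prune(tokens, thresholds=[0, 0, 1, 1])  # mirrors --prune 0 0 1 1
for order, grams in kept.items():
    print(f"order {order}: {len(grams)} n-grams kept")
```

With thresholds `[0, 0, 1, 1]`, all unigrams and bigrams survive, while trigrams and 4-grams seen only once (such as `42 17 9`) are dropped.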
|