---
license: cc-by-4.0
language:
- en
tags:
- kenlm
- n-gram
- language-model
- speech-recognition
- parakeet
- tdt
- asr
- shallow-fusion
- vernacula
library_name: kenlm
---
# KenLM n-gram LMs for Parakeet TDT shallow fusion
Subword-level KenLM ARPAs built over the [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
tokenizer, intended for use with [Vernacula](https://github.com/christopherthompson81/vernacula)'s
Parakeet beam decoder via shallow LM fusion. Each file's "words" are Parakeet
subword IDs (integers), not natural-language words — this lets the decoder
score the LM at every beam expansion without any subword-to-word round-trip.
Other Parakeet checkpoints share vocabulary layouts so these LMs may work
against them too, but only v3 has been validated.
## Files
| File | Corpus | Order | Size (gz) | Target register |
|---|---|---|---|---|
| `en-general.arpa.gz` | 3× GigaSpeech `s` + 1× People's Speech `clean` (~22 M effective words) | 4 | 67 MB | Conversational English with cased + punctuated output |
| `en-medical.arpa.gz` | 5× MTSamples + 2× synthetic clinical dialogue + 2× class-aware drug dialogue (~77 M effective words) | 3 | **17 MB** | Medical English — clinical dictation + patient↔doctor dialogue + specialty drug names |

More languages / domains coming as they're validated.
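To sanity-check a downloaded file, the per-order n-gram counts can be read from the ARPA header without loading the whole model. The following is a minimal sketch using only the Python standard library. It assumes the usual ARPA layout (a `\data\` section made of `ngram N=count` lines), and `arpa_ngram_counts` is a hypothetical helper name, not part of KenLM or Vernacula.

```python
import gzip


def arpa_ngram_counts(path):
    """Return {order: count} from the \\data\\ header of a gzipped ARPA file."""
    counts = {}
    in_header = False
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line == "\\data\\":
                in_header = True
            elif in_header and line.startswith("ngram "):
                order, count = line[len("ngram "):].split("=")
                counts[int(order)] = int(count)
            elif in_header and line.startswith("\\"):
                break  # reached the first "\1-grams:" section, header is done
    return counts
```

Calling `arpa_ngram_counts("en-general.arpa.gz")` should return one entry per order, four for `en-general` and three for `en-medical` if the files match the table above.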
## Design: speech-register per-domain corpus
Early iterations of `en-medical` mixed MTSamples (spoken dictation) with
HealthCareMagic (patient-written forum Q&A) and PubMed abstracts (formal
written prose), then layered that over an `en-general` base. This
performed worse than plain `en-general` on medical-entity F1.

Investigation showed shallow fusion is a **style predictor, not a
knowledge predictor**: the LM biases the decoder toward sequences it's
seen, and written-prose medical text pushes the decoder into patterns
that don't match conversational clinical audio. The "general base" layer
was helping purely because GigaSpeech and People's Speech are spoken
transcripts, *not* because of their domain coverage.

The current `en-medical` is therefore built entirely from spoken-register
medical content:

- **MTSamples** (~2.2 M words, natural clinical dictation: H&P, SOAP notes,
  op reports), upweighted 5× to compensate for its small raw size.
- **CodCodingCode/cleaned-clinical-conversations** (~25 M deduped words of
  synthetic doctor↔patient dialogue spanning dozens of specialties and
  presentations), upweighted 2×.
- **Template-generated drug dialogue** (~8 M words, described in the next
  section), upweighted 2×.

No `en-general` base, no forum text, no journal abstracts. The result
matches earlier layered variants on fluency and disease recall while
being 3–4× smaller.
## Specialty drug coverage via class-aware template dialogue
MTSamples and the synthetic-dialogue corpus mostly mention common,
everyday drugs ("paracetamol", "ibuprofen"); specialty drugs (`sertraline`,
`olanzapine`, `apixaban`, `paclitaxel`) appear rarely. Earlier log-prob
probes showed weak per-token priors on these (−2.0 to −3.0, vs
−0.7 to −1.0 for well-covered phrases).

The current build adds 2× of template-generated speech-register drug
dialogue, filled from a curated catalog of ~330 generic drug names
across 35 WHO-ATC-style classes paired with class-appropriate
conditions:

- "started on lithium for bipolar disorder"
- "I've been taking apixaban for my atrial fibrillation"
- "patient is on paclitaxel for breast cancer"
- "can I get a refill of my sertraline"

Class-aware pairing means we don't produce implausible combinations
like "lithium for sinusitis". This lifts specialty-drug log-probs to
the −0.7 to −1.2 range (comparable to well-covered general phrases)
without degrading general fluency.

On PriMock-57, the current build ties the earlier variant on entity F1,
because PriMock audio doesn't contain specialty drug mentions to exercise
the improved priors. Users with dictated psychiatry / oncology /
cardiology content should see the benefit directly.
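The class-aware pairing described above can be sketched in a few lines. This is an illustration only: the real catalog lives in Vernacula's `drug_classes.json`, whose schema is not shown in this card, so the two-class `CATALOG` dictionary and the `generate` function below are hypothetical. The four templates are the example sentences quoted above.

```python
import random

# Hypothetical two-class catalog; the real drug_classes.json schema may differ.
CATALOG = {
    "mood_stabiliser": {"drugs": ["lithium"], "conditions": ["bipolar disorder"]},
    "anticoagulant": {"drugs": ["apixaban"], "conditions": ["atrial fibrillation"]},
}

TEMPLATES = [
    "started on {drug} for {condition}",
    "I've been taking {drug} for my {condition}",
    "patient is on {drug} for {condition}",
    "can I get a refill of my {drug}",
]


def generate(n, seed=0):
    """Yield n sentences, always drawing drug and condition from the same class."""
    rng = random.Random(seed)
    classes = sorted(CATALOG)
    for _ in range(n):
        entry = CATALOG[rng.choice(classes)]
        # str.format ignores the unused `condition` for the refill template.
        yield rng.choice(TEMPLATES).format(
            drug=rng.choice(entry["drugs"]),
            condition=rng.choice(entry["conditions"]),
        )
```

Because the drug and the condition are always drawn from the same class entry, a cross-class sentence such as "lithium for atrial fibrillation" cannot be produced.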
## Usage
### In Vernacula (recommended)
Settings → Speech Recognition → Language model → pick "English (General)".
The app downloads the file automatically and wires it into the beam decoder.
### From the Vernacula CLI
```bash
vernacula --audio sample.wav --model <parakeet-dir> \
--lm en-general.arpa.gz \
--lm-weight 0.15
```
Passing `--lm` auto-bumps beam width to 4 (fusion has no effect in greedy
mode). Typical fusion weight is 0.1–0.3; 0.15 is the default used by
Vernacula's Settings picker.
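For intuition about what the fusion weight does, here is a minimal sketch of textbook shallow fusion: the ASR log-probability plus a weighted LM log-probability. Vernacula's decoder internals are not shown in this card, so the exact combination rule, the conversion of KenLM's log10 scores to natural log, and the names `fused_score` and `rerank` are all assumptions made for illustration. The numbers in `beams` are made up.

```python
import math

LN10 = math.log(10.0)  # KenLM reports log10; convert to nats before mixing


def fused_score(asr_logp, lm_log10, lm_weight=0.15):
    """Generic shallow fusion: ASR log-prob plus weighted LM log-prob, in nats."""
    return asr_logp + lm_weight * lm_log10 * LN10


def rerank(candidates, lm_weight=0.15):
    """Sort (text, asr_logp, lm_log10) candidates by fused score, best first."""
    return sorted(
        candidates,
        key=lambda c: fused_score(c[1], c[2], lm_weight),
        reverse=True,
    )


# Made-up numbers for illustration only: the ASR slightly prefers the Spanish
# reading, while an English LM strongly prefers the English one.
beams = [("ajá, ya", -1.00, -6.0), ("Uh uh, yeah.", -1.20, -2.0)]
```

With these toy values, `rerank(beams, lm_weight=0.0)` keeps the Spanish candidate first, while the default weight of 0.15 moves the English candidate to the top.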
### Directly with the KenLM Python bindings
```python
import kenlm

lm = kenlm.Model("en-general.arpa.gz")
# Tokens must be Parakeet subword IDs, stringified and space-separated.
# score() returns the total log10 probability of the sequence.
score = lm.score("42 17 5", bos=False, eos=False)
```
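The drug-coverage section quotes per-token log-probs. One plausible way to compute such a number with the bindings is sketched below; whether the original probes were normalised exactly this way is an assumption. The helper names are hypothetical, and `lm` can be any object with a `score(sentence, bos=..., eos=...)` method, which `kenlm.Model` provides.

```python
def ids_to_sentence(token_ids):
    """Join Parakeet subword IDs into the space-separated string KenLM expects."""
    return " ".join(str(i) for i in token_ids)


def per_token_log10(lm, token_ids):
    """Average log10 probability per subword, without <s> / </s> context."""
    if not token_ids:
        raise ValueError("need at least one token ID")
    total = lm.score(ids_to_sentence(token_ids), bos=False, eos=False)
    return total / len(token_ids)


# Usage with the real bindings (requires the kenlm package and the LM file):
#   import kenlm
#   lm = kenlm.Model("en-general.arpa.gz")
#   per_token_log10(lm, [42, 17, 5])
```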
## How `en-medical.arpa.gz` was built
```
# Extract MTSamples (galileo-ai/medical_transcription_40, text column).
# Extract CodCodingCode/cleaned-clinical-conversations, stripping
# DOCTOR:/PATIENT: prefixes and deduping turns across overlapping rows.
# Generate ~8M words of class-aware drug dialogue using the curated
# drug_classes.json catalog (see scripts/kenlm_build/ in Vernacula).
# Layer (bash):
{ for _ in 1 2 3 4 5; do cat mtsamples.txt; done
  for _ in 1 2; do cat synthetic-dialogue.txt; done
  for _ in 1 2; do cat drug-dialogue.txt; done; } > en-medical.corpus.txt
# Tokenise with nvidia/parakeet-tdt-0.6b-v3 tokenizer.json, then
#   lmplz --order 3 --prune 0 0 1 --discount_fallback
# gzip the ARPA.
```
Upstream corpora have public-redistribution precedent in the medical NLP
literature; attribution is required under the CC-BY-4.0 umbrella of this
derivative.
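The "~77 M effective words" figure in the Files table follows from the copy counts and approximate raw sizes quoted in this card. A small worked check, with the sizes rounded as stated above and the variable names chosen here for illustration:

```python
# (copies, approximate raw size in millions of words) as quoted in this card
COMPONENTS = {
    "mtsamples": (5, 2.2),
    "synthetic-dialogue": (2, 25.0),
    "drug-dialogue": (2, 8.0),
}


def effective_words(components):
    """Sum copies x raw size: repeating a file k times multiplies its counts by k."""
    return sum(copies * size for copies, size in components.values())


total = effective_words(COMPONENTS)
# Fraction of the layered corpus contributed by each component.
shares = {name: copies * size / total for name, (copies, size) in COMPONENTS.items()}
```

Here `total` comes to 77 (11 + 50 + 16). After the 5× upweight, MTSamples makes up about 14% of the layered corpus, compared with about 6% of the raw, unweighted text.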
## How `en-general.arpa.gz` was built
```
# Corpus:
# 3x GigaSpeech `s` subset (Apache 2.0, cased + punctuation-tagged)
# 1x People's Speech `clean` subset (CC-BY-4.0, lowercase, no punct)
# Rationale: People's Speech carries conversational-register priors
# (backchannels, disfluencies). GigaSpeech carries the case + punctuation
# style Parakeet's output expects. 3x upweight on GigaSpeech balances the
# raw size asymmetry so the case signal survives without being drowned out.
# Tokenised with nvidia/parakeet-tdt-0.6b-v3 tokenizer.json:
# ~27M subwords across ~1M sentences.
# Built with KenLM's lmplz:
lmplz -o 4 --prune 0 0 1 1 --discount_fallback \
--vocab_estimate 8193 \
--text en-mixed.tok --arpa en-general.arpa
gzip en-general.arpa
```
See Vernacula's [`scripts/kenlm_build/`](https://github.com/christopherthompson81/vernacula/tree/main/scripts/kenlm_build) for the exact scripts.
## Validation
On a 600 s en-US conversational sample (157 VAD segments, held-out from
the corpus), Vernacula's Parakeet decoder exhibits a known beam=4
multilingual-drift regression where an English backchannel "Uh uh. Ya."
transcribes as Spanish "ajá, ya". With this LM fused at weight 0.15 the
line recovers to "Uh uh, yeah." and ~500 other lines stay unchanged vs
greedy decoding — proper nouns preserved, punctuation preserved.
## License & attribution
This LM is a derivative of its training corpora. It's released under
**CC-BY-4.0** (the union of its sources' terms).
Upstream corpora (attribution required):
- [MLCommons/peoples_speech](https://huggingface.co/datasets/MLCommons/peoples_speech)
— transcripts portion, CC-BY-4.0. (en-general)
- [speechcolab/gigaspeech](https://huggingface.co/datasets/speechcolab/gigaspeech)
— transcripts portion, Apache 2.0. (en-general)
- [galileo-ai/medical_transcription_40](https://huggingface.co/datasets/galileo-ai/medical_transcription_40)
— MTSamples mirror, clinical dictation. (en-medical)
- [CodCodingCode/cleaned-clinical-conversations](https://huggingface.co/datasets/CodCodingCode/cleaned-clinical-conversations)
— synthetic (LLM-generated) doctor↔patient dialogues across a broad
range of conditions and specialties. (en-medical)
## Caveats
- Subword-ID-keyed. Not a word-level LM — you can't load it in a word-level
ASR decoder without first mapping through Parakeet's tokenizer.
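- `en-general` is a 4-gram built with `--prune 0 0 1 1` (drops 3-grams and
4-grams seen exactly once); `en-medical` is a 3-gram built with
`--prune 0 0 1` (drops 3-grams seen exactly once). If you need tighter
priors, rebuild `en-general` with `--prune 0 0 0 0` at the cost of roughly
3× file size.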
- Best at capturing local lexical choice (e.g. "uh huh" vs "ajá"). Doesn't
carry sentence-level priors that might rescue wholly-ambiguous short
utterances (e.g. a standalone "ajá" with no context).
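To make the pruning caveat above concrete, the sketch below applies the same threshold rule to raw n-gram counts: an n-gram of order n is dropped when its count is less than or equal to the n-th threshold. This is a simplified illustration, not lmplz's actual pipeline (which also handles smoothing and backoff), and the function names are hypothetical.

```python
from collections import Counter


def count_ngrams(tokens, max_order):
    """Count all n-grams of order 1..max_order in one token sequence."""
    counts = Counter()
    for order in range(1, max_order + 1):
        for start in range(len(tokens) - order + 1):
            counts[tuple(tokens[start:start + order])] += 1
    return counts


def prune(counts, thresholds):
    """Keep n-grams whose count exceeds the threshold for their order."""
    return Counter({
        ngram: c for ngram, c in counts.items()
        if c > thresholds[len(ngram) - 1]
    })


ids = "42 17 5 42 17 5 9".split()
kept = prune(count_ngrams(ids, 3), thresholds=[0, 0, 1])
```

In this toy sequence the trigram `42 17 5` occurs twice and survives, while the three singleton trigrams are dropped. All unigrams and bigrams are kept because their thresholds are 0.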