---
license: cc-by-4.0
language:
  - en
tags:
  - kenlm
  - n-gram
  - language-model
  - speech-recognition
  - parakeet
  - tdt
  - asr
  - shallow-fusion
  - vernacula
library_name: kenlm
---

# KenLM n-gram LMs for Parakeet TDT shallow fusion

Subword-level KenLM ARPAs built over the [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
tokenizer, intended for use with [Vernacula](https://github.com/christopherthompson81/vernacula)'s
Parakeet beam decoder via shallow LM fusion. Each file's "words" are Parakeet
subword IDs (integers), not natural-language words — this lets the decoder
score the LM at every beam expansion without any subword-to-word round-trip.

Other Parakeet checkpoints that share this tokenizer's vocabulary layout may
work with these LMs too, but only v3 has been validated.

## Files

| File | Corpus | Order | Size (gz) | Target register |
|---|---|---|---|---|
| `en-general.arpa.gz` | 3× GigaSpeech `s` + 1× People's Speech `clean` (~22 M effective words) | 4 | 67 MB | Conversational English with cased + punctuated output |
| `en-medical.arpa.gz` | 5× MTSamples + 2× synthetic clinical dialogue + 2× class-aware drug dialogue (~77 M effective words) | 3 | **17 MB** | Medical English — clinical dictation + patient↔doctor dialogue + specialty drug names |

More languages / domains coming as they're validated.

## Design: speech-register per-domain corpus

Early iterations of `en-medical` mixed MTSamples (spoken dictation) with
HealthCareMagic (patient-written forum Q&A) and PubMed abstracts (formal
written prose), then layered that over an `en-general` base. This
performed worse than plain `en-general` on medical-entity F1.

Investigation showed shallow fusion is a **style predictor, not a
knowledge predictor**: the LM biases the decoder toward sequences it's
seen, and written-prose medical text pushes the decoder into patterns
that don't match conversational clinical audio. The "general base" layer
was helping purely because GigaSpeech and People's Speech are spoken
transcripts — *not* because of their domain coverage.

The current `en-medical` is therefore built entirely from spoken-register
medical content:

- **MTSamples** (~2.2 M words, natural clinical dictation: H&P, SOAP notes,
  op reports) — upweighted 5× to compensate for its small raw size.
- **CodCodingCode/cleaned-clinical-conversations** (~25 M deduped words of
  synthetic doctor↔patient dialogue spanning dozens of specialties and
  presentations) — upweighted 2×.

No `en-general` base, no forum text, no journal abstracts. The result
matches earlier layered variants on fluency and disease recall while
being 3–4× smaller.

## Specialty drug coverage via class-aware template dialogue

MTSamples + the synthetic-dialogue corpus use mostly lay drug names
("paracetamol", "ibuprofen"); specialty drugs (`sertraline`,
`olanzapine`, `apixaban`, `paclitaxel`) appear rarely. Earlier log-prob
probes showed weak per-token priors on these (−2.0 to −3.0, vs
−0.7 to −1.0 for well-covered phrases).

The current build adds 2× of template-generated speech-register drug
dialogue, filled from a curated catalog of ~330 generic drug names
across 35 WHO-ATC-style classes paired with class-appropriate
conditions:

    "started on lithium for bipolar disorder"
    "I've been taking apixaban for my atrial fibrillation"
    "patient is on paclitaxel for breast cancer"
    "can I get a refill of my sertraline"

Class-aware pairing means we don't produce implausible combinations
like "lithium for sinusitis". This lifts specialty-drug log-probs to
the −0.7 to −1.2 range (comparable to well-covered general phrases)
without degrading general fluency.
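The class-aware pairing can be sketched as follows. This is a minimal illustration, not the actual generator: the `CATALOG` layout, the class names, the "depression" condition, and `generate_lines` are all hypothetical, and the real `drug_classes.json` schema may differ. The drug names and templates are taken from the examples above.

```python
import random

# Hypothetical miniature catalog: each class lists its drugs and the
# conditions they may be paired with. The real catalog has ~330 drugs.
CATALOG = {
    "mood_stabiliser": {"drugs": ["lithium"], "conditions": ["bipolar disorder"]},
    "anticoagulant": {"drugs": ["apixaban"], "conditions": ["atrial fibrillation"]},
    "antineoplastic": {"drugs": ["paclitaxel"], "conditions": ["breast cancer"]},
    "ssri": {"drugs": ["sertraline"], "conditions": ["depression"]},
}

TEMPLATES = [
    "started on {drug} for {condition}",
    "I've been taking {drug} for my {condition}",
    "patient is on {drug} for {condition}",
    "can I get a refill of my {drug}",  # no condition slot
]


def generate_lines(n, seed=0):
    """Return n template sentences, drawing drug and condition from one class."""
    rng = random.Random(seed)
    class_names = sorted(CATALOG)
    lines = []
    for _ in range(n):
        entry = CATALOG[rng.choice(class_names)]
        drug = rng.choice(entry["drugs"])
        condition = rng.choice(entry["conditions"])
        # str.format ignores unused keyword arguments, so templates
        # without a {condition} slot work unchanged.
        lines.append(rng.choice(TEMPLATES).format(drug=drug, condition=condition))
    return lines
```

Because the condition is always drawn from the same class entry as the drug, a cross-class pairing such as "lithium for sinusitis" cannot be produced by construction.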

On PriMock-57, this build ties the earlier variant on entity F1. That is
expected: PriMock audio doesn't contain specialty drug mentions, so it
can't exercise the improved priors. Users with dictated psychiatry /
oncology / cardiology content should see the benefit directly.

## Usage

### In Vernacula (recommended)

Settings → Speech Recognition → Language model → pick "English (General)".
The app downloads the file automatically and wires it into the beam decoder.

### From the Vernacula CLI

```bash
vernacula --audio sample.wav --model <parakeet-dir> \
  --lm en-general.arpa.gz \
  --lm-weight 0.15
```

Passing `--lm` auto-bumps beam width to 4 (fusion has no effect in greedy
mode). Typical fusion weight is 0.1–0.3; 0.15 is the default used by
Vernacula's Settings picker.
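To make the role of the fusion weight concrete, here is a minimal sketch of shallow fusion at a single beam expansion. It assumes the common formulation (acoustic log-probability plus weight times LM log-probability) and converts KenLM's log10 scores to natural log; Vernacula's exact combination may differ. The function names and the candidate scores are made up for illustration.

```python
import math

LN10 = math.log(10.0)  # KenLM reports log10 probabilities


def fused_score(am_logprob, lm_log10, lm_weight=0.15):
    """Acoustic log-prob plus weighted LM log-prob, both in natural log."""
    return am_logprob + lm_weight * lm_log10 * LN10


def pick_best(candidates, lm_weight):
    """candidates maps a label to (am_logprob, lm_log10); return the top label."""
    return max(candidates, key=lambda k: fused_score(*candidates[k], lm_weight))


# Illustrative numbers only: the acoustic model slightly prefers the
# Spanish backchannel, while the English LM strongly prefers "uh huh".
CANDIDATES = {"ajá": (-1.0, -3.0), "uh huh": (-1.2, -0.8)}
```

With `lm_weight=0.0` the acoustic score alone decides and "ajá" wins; at 0.15 the LM term is large enough to flip the ranking to "uh huh".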

### Directly with the KenLM Python bindings

```python
import kenlm
lm = kenlm.Model("en-general.arpa.gz")
# Tokens must be Parakeet subword IDs, stringified and space-separated.
# score() returns the total log10 probability of the sequence.
score = lm.score("42 17 5", bos=False, eos=False)
```
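For probing per-token priors (as in the drug-name log-prob checks described earlier), a small helper along these lines may be useful. It is a sketch: `ids_to_lm_string` and `mean_token_logprob` are hypothetical names, and `model` is assumed to be a `kenlm.Model`, whose `full_scores` method yields one `(log10 prob, n-gram length, oov)` tuple per token.

```python
from typing import Sequence


def ids_to_lm_string(ids: Sequence[int]) -> str:
    """Join Parakeet subword IDs into the space-separated string KenLM expects."""
    return " ".join(str(i) for i in ids)


def mean_token_logprob(model, ids: Sequence[int]) -> float:
    """Average log10 probability per subword, without <s>/</s> context."""
    if not ids:
        raise ValueError("need at least one subword ID")
    scores = model.full_scores(ids_to_lm_string(ids), bos=False, eos=False)
    logprobs = [logprob for logprob, _ngram_len, _oov in scores]
    return sum(logprobs) / len(logprobs)


# Usage, assuming kenlm is installed and the file is present:
#   import kenlm
#   lm = kenlm.Model("en-medical.arpa.gz")
#   print(mean_token_logprob(lm, [42, 17, 5]))
```

Obtaining the IDs for a given phrase still requires running it through the Parakeet tokenizer first; that step is not shown here.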

## How `en-medical.arpa.gz` was built

```
# Extract MTSamples (galileo-ai/medical_transcription_40, text column).
# Extract CodCodingCode/cleaned-clinical-conversations, stripping
#   DOCTOR:/PATIENT: prefixes and deduping turns across overlapping rows.
# Generate ~8M words of class-aware drug dialogue using the curated
#   drug_classes.json catalog (see scripts/kenlm_build/ in Vernacula).
# Layer the three sources with their upweights (5x, 2x, 2x):
(for _ in 1 2 3 4 5; do cat mtsamples.txt; done
 for _ in 1 2; do cat synthetic-dialogue.txt; done
 for _ in 1 2; do cat drug-dialogue.txt; done) > en-medical.corpus.txt
# Tokenise with nvidia/parakeet-tdt-0.6b-v3 tokenizer.json, then
#   lmplz --order 3 --prune 0 0 1 --discount_fallback
# gzip the ARPA.
```
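The "~77 M effective words" figure in the Files table follows from the raw sizes and upweights quoted above. A quick check, with the variable names chosen here for illustration:

```python
# (raw words, upweight) per source, using the approximate sizes quoted above.
SOURCES = {
    "mtsamples": (2_200_000, 5),
    "synthetic-dialogue": (25_000_000, 2),
    "drug-dialogue": (8_000_000, 2),
}

effective = {name: words * weight for name, (words, weight) in SOURCES.items()}
total = sum(effective.values())

for name, words in effective.items():
    print(f"{name:20s} {words / 1e6:5.1f} M  ({100 * words / total:.1f}%)")
print(f"{'total':20s} {total / 1e6:5.1f} M")
```

Even at 5× upweighting, MTSamples contributes about 11 M of the 77 M effective words; the synthetic dialogue remains the largest share.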

Upstream corpora have public-redistribution precedent in the medical NLP
literature; attribution is required under the CC-BY-4.0 umbrella of this
derivative.

## How `en-general.arpa.gz` was built

```
# Corpus:
#   3x GigaSpeech `s` subset  (Apache 2.0, cased + punctuation-tagged)
#   1x People's Speech `clean` subset  (CC-BY-4.0, lowercase, no punct)
# Rationale: People's Speech carries conversational-register priors
# (backchannels, disfluencies). GigaSpeech carries the case + punctuation
# style Parakeet's output expects. 3x upweight on GigaSpeech balances the
# raw size asymmetry so the case signal survives without being drowned out.

# Tokenised with nvidia/parakeet-tdt-0.6b-v3 tokenizer.json:
#   ~27M subwords across ~1M sentences.

# Built with KenLM's lmplz:
lmplz -o 4 --prune 0 0 1 1 --discount_fallback \
      --vocab_estimate 8193 \
      --text en-mixed.tok --arpa en-general.arpa
gzip en-general.arpa
```

See Vernacula's [`scripts/kenlm_build/`](https://github.com/christopherthompson81/vernacula/tree/main/scripts/kenlm_build) for the exact scripts.

## Validation

On a 600 s en-US conversational sample (157 VAD segments, held-out from
the corpus), Vernacula's Parakeet decoder exhibits a known beam=4
multilingual-drift regression where an English backchannel "Uh uh. Ya."
transcribes as Spanish "ajá, ya". With this LM fused at weight 0.15 the
line recovers to "Uh uh, yeah." and ~500 other lines stay unchanged vs
greedy decoding — proper nouns preserved, punctuation preserved.

## License & attribution

These LMs are derivatives of their training corpora. They are released under
**CC-BY-4.0** (the union of their sources' terms).

Upstream corpora (attribution required):

- [MLCommons/peoples_speech](https://huggingface.co/datasets/MLCommons/peoples_speech)
  — transcripts portion, CC-BY-4.0. (en-general)
- [speechcolab/gigaspeech](https://huggingface.co/datasets/speechcolab/gigaspeech)
  — transcripts portion, Apache 2.0. (en-general)
- [galileo-ai/medical_transcription_40](https://huggingface.co/datasets/galileo-ai/medical_transcription_40)
  — MTSamples mirror, clinical dictation. (en-medical)
- [CodCodingCode/cleaned-clinical-conversations](https://huggingface.co/datasets/CodCodingCode/cleaned-clinical-conversations)
  — synthetic (LLM-generated) doctor↔patient dialogues across a broad
  range of conditions and specialties. (en-medical)

## Caveats

- Subword-ID-keyed. Not a word-level LM — you can't load it in a word-level
  ASR decoder without first mapping through Parakeet's tokenizer.
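- `en-general` is a 4-gram built with `--prune 0 0 1 1` (drops 3-grams and
  4-grams seen exactly once); `en-medical` is a 3-gram built with
  `--prune 0 0 1` (drops 3-grams seen exactly once). If you need tighter
  priors, rebuild `en-general` with `--prune 0 0 0 0` at the cost of roughly
  3× file size.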
- Best at capturing local lexical choice (e.g. "uh huh" vs "ajá"). Doesn't
  carry sentence-level priors that might rescue wholly-ambiguous short
  utterances (e.g. a standalone "ajá" with no context).
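
To illustrate what a pruning threshold of 1 means, the sketch below counts trigrams over a toy subword-ID sequence and drops those seen only once. It shows the thresholding idea only and is not how `lmplz` is implemented internally; the function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, order):
    """Count every contiguous n-gram of the given order."""
    return Counter(
        tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)
    )


def prune(counts, threshold):
    """Keep only n-grams whose count is strictly above the threshold."""
    return {ngram: c for ngram, c in counts.items() if c > threshold}


tokens = "42 17 5 42 17 5 9".split()
trigrams = ngram_counts(tokens, 3)
kept = prune(trigrams, 1)  # threshold 1 removes singletons
```

Here only the trigram `42 17 5` appears twice, so it is the only one that survives a threshold of 1; with a threshold of 0 all four distinct trigrams are kept.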