Document class-aware drug dialogue addition

cc8a084 verified about 1 month ago

8.35 kB

	---
	license: cc-by-4.0
	language:
	- en
	tags:
	- kenlm
	- n-gram
	- language-model
	- speech-recognition
	- parakeet
	- tdt
	- asr
	- shallow-fusion
	- vernacula
	library_name: kenlm
	---

	# KenLM n-gram LMs for Parakeet TDT shallow fusion

	Subword-level KenLM ARPAs built over the [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
	tokenizer, intended for use with [Vernacula](https://github.com/christopherthompson81/vernacula)'s
	Parakeet beam decoder via shallow LM fusion. Each file's "words" are Parakeet
	subword IDs (integers), not natural-language words — this lets the decoder
	score the LM at every beam expansion without any subword-to-word round-trip.

	Other Parakeet checkpoints share vocabulary layouts so these LMs may work
	against them too, but only v3 has been validated.

	## Files

	\| File \| Corpus \| Order \| Size (gz) \| Target register \|
	\|---\|---\|---\|---\|---\|
	\| `en-general.arpa.gz` \| 3× GigaSpeech `s` + 1× People's Speech `clean` (~22 M effective words) \| 4 \| 67 MB \| Conversational English with cased + punctuated output \|
	\| `en-medical.arpa.gz` \| 5× MTSamples + 2× synthetic clinical dialogue + 2× class-aware drug dialogue (~77 M effective words) \| 3 \| 17 MB \| Medical English — clinical dictation + patient↔doctor dialogue + specialty drug names \|

	More languages / domains coming as they're validated.

	## Design: speech-register per-domain corpus

	Early iterations of `en-medical` mixed MTSamples (spoken dictation) with
	HealthCareMagic (patient-written forum Q&A) and PubMed abstracts (formal
	written prose), then layered that over an `en-general` base. This
	performed worse than plain `en-general` on medical-entity F1.

	Investigation showed shallow fusion is a **style predictor, not a
	knowledge predictor**: the LM biases the decoder toward sequences it's
	seen, and written-prose medical text pushes the decoder into patterns
	that don't match conversational clinical audio. The "general base" layer
	was helping purely because GigaSpeech and People's Speech are spoken
	transcripts — not because of their domain coverage.

	The current `en-medical` is therefore built entirely from spoken-register
	medical content:

	- MTSamples (~2.2 M words, natural clinical dictation: H&P, SOAP notes,
	op reports) — upweighted 5× to compensate for its small raw size.
	- CodCodingCode/cleaned-clinical-conversations (~25 M deduped words of
	synthetic doctor↔patient dialogue spanning dozens of specialties and
	presentations) — upweighted 2×.

	No `en-general` base, no forum text, no journal abstracts. The result
	matches earlier layered variants on fluency and disease recall while
	being 3–4× smaller.

	## Specialty drug coverage via class-aware template dialogue

	MTSamples + the synthetic-dialogue corpus use mostly lay drug names
	("paracetamol", "ibuprofen"); specialty drugs (`sertraline`,
	`olanzapine`, `apixaban`, `paclitaxel`) appear rarely. Earlier log-prob
	probes showed weak per-token priors on these (−2.0 to −3.0, vs
	−0.7 to −1.0 for well-covered phrases).

	The current build adds 2× of template-generated speech-register drug
	dialogue, filled from a curated catalog of ~330 generic drug names
	across 35 WHO-ATC-style classes paired with class-appropriate
	conditions:

	"started on lithium for bipolar disorder"
	"I've been taking apixaban for my atrial fibrillation"
	"patient is on paclitaxel for breast cancer"
	"can I get a refill of my sertraline"

	Class-aware pairing means we don't produce implausible combinations
	like "lithium for sinusitis". This lifts specialty-drug log-probs to
	the -0.7 to -1.2 range (comparable to well-covered general phrases)
	without degrading general fluency.

	PriMock-57 validation ties the earlier variant on entity F1 (since
	PriMock audio doesn't contain specialty drug mentions to exercise the
	improved priors). Users with dictated psychiatry / oncology /
	cardiology content should see the benefit directly.

	## Usage

	### In Vernacula (recommended)

	Settings → Speech Recognition → Language model → pick "English (General)".
	The app downloads the file automatically and wires it into the beam decoder.

	### From the Vernacula CLI

	```bash
	vernacula --audio sample.wav --model <parakeet-dir> \
	--lm en-general.arpa.gz \
	--lm-weight 0.15
	```

	Passing `--lm` auto-bumps beam width to 4 (fusion has no effect in greedy
	mode). Typical fusion weight is 0.1–0.3; 0.15 is the default used by
	Vernacula's Settings picker.

	### Directly with the KenLM Python bindings

	```python
	import kenlm
	lm = kenlm.Model("en-general.arpa.gz")
	# tokens must be Parakeet subword IDs stringified:
	score = lm.score("42 17 5", bos=False, eos=False)
	```

	## How `en-medical.arpa.gz` was built

	```
	# Extract MTSamples (galileo-ai/medical_transcription_40, text column).
	# Extract CodCodingCode/cleaned-clinical-conversations, stripping
	# DOCTOR:/PATIENT: prefixes and deduping turns across overlapping rows.
	# Generate ~8M words of class-aware drug dialogue using the curated
	# drug_classes.json catalog (see scripts/kenlm_build/ in Vernacula).
	# Layer:
	# (for _ in 1..5; cat mtsamples.txt; end
	# for _ in 1..2; cat synthetic-dialogue.txt; end
	# for _ in 1..2; cat drug-dialogue.txt; end) > en-medical.corpus.txt
	# Tokenise with nvidia/parakeet-tdt-0.6b-v3 tokenizer.json, then
	# lmplz --order 3 --prune "0 0 1" --discount_fallback
	# gzip the ARPA.
	```

	Upstream corpora have public-redistribution precedent in the medical NLP
	literature; attribution is required under the CC-BY-4.0 umbrella of this
	derivative.

	## How `en-general.arpa.gz` was built

	```
	# Corpus:
	# 3x GigaSpeech `s` subset (Apache 2.0, cased + punctuation-tagged)
	# 1x People's Speech `clean` subset (CC-BY-4.0, lowercase, no punct)
	# Rationale: People's Speech carries conversational-register priors
	# (backchannels, disfluencies). GigaSpeech carries the case + punctuation
	# style Parakeet's output expects. 3x upweight on GigaSpeech balances the
	# raw size asymmetry so the case signal survives without being drowned out.

	# Tokenised with nvidia/parakeet-tdt-0.6b-v3 tokenizer.json:
	# ~27M subwords across ~1M sentences.

	# Built with KenLM's lmplz:
	lmplz -o 4 --prune 0 0 1 1 --discount_fallback \
	--vocab_estimate 8193 \
	--text en-mixed.tok --arpa en-general.arpa
	gzip en-general.arpa
	```

	See Vernacula's [`scripts/kenlm_build/`](https://github.com/christopherthompson81/vernacula/tree/main/scripts/kenlm_build) for the exact scripts.

	## Validation

	On a 600 s en-US conversational sample (157 VAD segments, held-out from
	the corpus), Vernacula's Parakeet decoder exhibits a known beam=4
	multilingual-drift regression where an English backchannel "Uh uh. Ya."
	transcribes as Spanish "ajá, ya". With this LM fused at weight 0.15 the
	line recovers to "Uh uh, yeah." and ~500 other lines stay unchanged vs
	greedy decoding — proper nouns preserved, punctuation preserved.

	## License & attribution

	This LM is a derivative of its training corpora. It's released under
	CC-BY-4.0 (the union of its sources' terms).

	Upstream corpora (attribution required):

	- [MLCommons/peoples_speech](https://huggingface.co/datasets/MLCommons/peoples_speech)
	— transcripts portion, CC-BY-4.0. (en-general)
	- [speechcolab/gigaspeech](https://huggingface.co/datasets/speechcolab/gigaspeech)
	— transcripts portion, Apache 2.0. (en-general)
	- [galileo-ai/medical_transcription_40](https://huggingface.co/datasets/galileo-ai/medical_transcription_40)
	— MTSamples mirror, clinical dictation. (en-medical)
	- [CodCodingCode/cleaned-clinical-conversations](https://huggingface.co/datasets/CodCodingCode/cleaned-clinical-conversations)
	— synthetic (LLM-generated) doctor↔patient dialogues across a broad
	range of conditions and specialties. (en-medical)

	## Caveats

	- Subword-ID-keyed. Not a word-level LM — you can't load it in a word-level
	ASR decoder without first mapping through Parakeet's tokenizer.
	- 4-gram with `--prune 0 0 1 1` (drops 3-grams and 4-grams seen exactly
	once). If you need tighter priors, rebuild with `--prune 0 0 0 0` at the
	cost of roughly 3× file size.
	- Best at capturing local lexical choice (e.g. "uh huh" vs "ajá"). Doesn't
	carry sentence-level priors that might rescue wholly-ambiguous short
	utterances (e.g. a standalone "ajá" with no context).