Document class-aware drug dialogue addition
README.md
CHANGED
|
@@ -31,7 +31,7 @@ against them too, but only v3 has been validated.
|
|
| 31 |
| File | Corpus | Order | Size (gz) | Target register |
|
| 32 |
|---|---|---|---|---|
|
| 33 |
| `en-general.arpa.gz` | 3× GigaSpeech `s` + 1× People's Speech `clean` (~22 M effective words) | 4 | 67 MB | Conversational English with cased + punctuated output |
|
| 34 |
-
| `en-medical.arpa.gz` | 5× MTSamples clinical
|
| 35 |
|
| 36 |
More languages / domains coming as they're validated.
|
| 37 |
|
|
@@ -62,11 +62,33 @@ No `en-general` base, no forum text, no journal abstracts. The result
|
|
| 62 |
matches earlier layered variants on fluency and disease recall while
|
| 63 |
being 3–4× smaller.
|
| 64 |
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
(
|
| 69 |
-
|
| 70 |
|
| 71 |
## Usage
|
| 72 |
|
|
@@ -101,11 +123,13 @@ score = lm.score("42 17 5", bos=False, eos=False)
|
|
| 101 |
```
|
| 102 |
# Extract MTSamples (galileo-ai/medical_transcription_40, text column).
|
| 103 |
# Extract CodCodingCode/cleaned-clinical-conversations, stripping
|
| 104 |
-
# DOCTOR:/PATIENT: prefixes and deduping turns across overlapping rows
|
| 105 |
-
#
|
| 106 |
# Layer:
|
| 107 |
# (for _ in 1..5; cat mtsamples.txt; end
|
| 108 |
-
# for _ in 1..2; cat synthetic-dialogue.txt; end
|
| 109 |
# Tokenise with nvidia/parakeet-tdt-0.6b-v3 tokenizer.json, then
|
| 110 |
# lmplz --order 3 --prune "0 0 1" --discount_fallback
|
| 111 |
# gzip the ARPA.
|
|
|
|
| 31 |
| File | Corpus | Order | Size (gz) | Target register |
|
| 32 |
|---|---|---|---|---|
|
| 33 |
| `en-general.arpa.gz` | 3× GigaSpeech `s` + 1× People's Speech `clean` (~22 M effective words) | 4 | 67 MB | Conversational English with cased + punctuated output |
|
| 34 |
+
| `en-medical.arpa.gz` | 5× MTSamples + 2× synthetic clinical dialogue + 2× class-aware drug dialogue (~77 M effective words) | 3 | **17 MB** | Medical English — clinical dictation + patient↔doctor dialogue + specialty drug names |
|
| 35 |
|
| 36 |
More languages / domains coming as they're validated.
|
| 37 |
|
|
|
|
| 62 |
matches earlier layered variants on fluency and disease recall while
|
| 63 |
being 3–4× smaller.
|
| 64 |
|
| 65 |
+
## Specialty drug coverage via class-aware template dialogue
|
| 66 |
+
|
| 67 |
+
MTSamples and the synthetic-dialogue corpus mostly mention everyday drugs
|
| 68 |
+
("paracetamol", "ibuprofen"); specialty drugs (`sertraline`,
|
| 69 |
+
`olanzapine`, `apixaban`, `paclitaxel`) appear rarely. Earlier log-prob
|
| 70 |
+
probes showed weak per-token priors on these (−2.0 to −3.0, vs
|
| 71 |
+
−0.7 to −1.0 for well-covered phrases).
|
| 72 |
+
|
| 73 |
+
The current build adds 2× template-generated speech-register drug
|
| 74 |
+
dialogue, filled from a curated catalog of ~330 generic drug names
|
| 75 |
+
across 35 WHO-ATC-style classes paired with class-appropriate
|
| 76 |
+
conditions:
|
| 77 |
+
|
| 78 |
+
"started on lithium for bipolar disorder"
|
| 79 |
+
"I've been taking apixaban for my atrial fibrillation"
|
| 80 |
+
"patient is on paclitaxel for breast cancer"
|
| 81 |
+
"can I get a refill of my sertraline"
|
| 82 |
+
|
| 83 |
+
Class-aware pairing means we don't produce implausible combinations
|
| 84 |
+
like "lithium for sinusitis". This lifts specialty-drug log-probs to
|
| 85 |
+
the −0.7 to −1.2 range (comparable to well-covered general phrases)
|
| 86 |
+
without degrading general fluency.
|
| 87 |
+
|
| 88 |
+
PriMock-57 entity F1 ties the earlier variant, because PriMock audio
|
| 89 |
+
contains no specialty drug mentions that would exercise the improved
|
| 90 |
+
priors. Users with dictated psychiatry / oncology /
|
| 91 |
+
cardiology content should see the benefit directly.
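
The class-aware pairing described above can be sketched roughly as follows. This is a minimal illustration, not the generator used for the build: the catalog layout, the class names, and the `generate` helper are all assumptions, since the schema of `drug_classes.json` is not shown here. The example phrases are the ones quoted in this section.

```python
import random

# Hypothetical miniature catalog. The real drug_classes.json schema is not
# shown in this README, so this class -> {drugs, conditions} layout is assumed.
CATALOG = {
    "mood_stabilizers": {
        "drugs": ["lithium"],
        "conditions": ["bipolar disorder"],
    },
    "anticoagulants": {
        "drugs": ["apixaban"],
        "conditions": ["atrial fibrillation"],
    },
    "taxanes": {
        "drugs": ["paclitaxel"],
        "conditions": ["breast cancer"],
    },
}

# Speech-register templates taken from the examples quoted above.
TEMPLATES = [
    "started on {drug} for {condition}",
    "I've been taking {drug} for my {condition}",
    "patient is on {drug} for {condition}",
    "can I get a refill of my {drug}",
]


def generate(catalog, templates, n, seed=0):
    """Return n template lines; drug and condition always share a class."""
    rng = random.Random(seed)
    classes = sorted(catalog)
    lines = []
    for _ in range(n):
        # Pick the class first, then draw both slots from that class only,
        # so cross-class pairs such as "lithium for sinusitis" cannot occur.
        entry = catalog[rng.choice(classes)]
        drug = rng.choice(entry["drugs"])
        condition = rng.choice(entry["conditions"])
        template = rng.choice(templates)
        # Templates without a {condition} slot simply ignore that argument.
        lines.append(template.format(drug=drug, condition=condition))
    return lines
```

Choosing the class before the drug and the condition is what keeps the pairing plausible; sampling the two slots independently from flat lists would not.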
|
| 92 |
|
| 93 |
## Usage
|
| 94 |
|
|
|
|
| 123 |
```
|
| 124 |
# Extract MTSamples (galileo-ai/medical_transcription_40, text column).
|
| 125 |
# Extract CodCodingCode/cleaned-clinical-conversations, stripping
|
| 126 |
+
# DOCTOR:/PATIENT: prefixes and deduping turns across overlapping rows.
|
| 127 |
+
# Generate ~8 M words of class-aware drug dialogue using the curated
|
| 128 |
+
# drug_classes.json catalog (see scripts/kenlm_build/ in Vernacula).
|
| 129 |
# Layer:
|
| 130 |
# (for _ in 1..5; cat mtsamples.txt; end
|
| 131 |
+
# for _ in 1..2; cat synthetic-dialogue.txt; end
|
| 132 |
+
# for _ in 1..2; cat drug-dialogue.txt; end) > en-medical.corpus.txt
|
| 133 |
# Tokenise with nvidia/parakeet-tdt-0.6b-v3 tokenizer.json, then
|
| 134 |
# lmplz --order 3 --prune "0 0 1" --discount_fallback
|
| 135 |
# gzip the ARPA.
|
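
The layering step in the recipe above is written as shell-style pseudocode. A small Python equivalent might look like the sketch below; the `layer_corpus` helper is hypothetical and only covers the concatenation step, not tokenisation or the `lmplz` run.

```python
from pathlib import Path


def layer_corpus(sources, out_path):
    """Concatenate each source file `repeat` times, in order, into out_path.

    `sources` is a list of (path, repeat) pairs.
    """
    with open(out_path, "w", encoding="utf-8") as out:
        for path, repeat in sources:
            text = Path(path).read_text(encoding="utf-8")
            if text and not text.endswith("\n"):
                # Keep the last line from fusing with the next copy.
                text += "\n"
            for _ in range(repeat):
                out.write(text)
```

With the file names from the recipe, the call would be `layer_corpus([("mtsamples.txt", 5), ("synthetic-dialogue.txt", 2), ("drug-dialogue.txt", 2)], "en-medical.corpus.txt")`. Repeating a file this way scales its n-gram counts relative to the other sources, which is how the 5×/2×/2× weighting takes effect.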