christopherthompson81 committed on
Commit cc8a084 · verified · 1 Parent(s): 80ee86f

Document class-aware drug dialogue addition

Files changed (1)
  1. README.md +33 -9
README.md CHANGED
@@ -31,7 +31,7 @@ against them too, but only v3 has been validated.
  | File | Corpus | Order | Size (gz) | Target register |
  |---|---|---|---|---|
  | `en-general.arpa.gz` | 3× GigaSpeech `s` + 1× People's Speech `clean` (~22 M effective words) | 4 | 67 MB | Conversational English with cased + punctuated output |
- | `en-medical.arpa.gz` | 5× MTSamples clinical dictation + 2× synthetic clinical dialogue (~61 M effective words) | 3 | **17 MB** | Medical English — clinical dictation + patient↔doctor dialogue register |
 
  More languages / domains coming as they're validated.
 
@@ -62,11 +62,33 @@ No `en-general` base, no forum text, no journal abstracts. The result
  matches earlier layered variants on fluency and disease recall while
  being 3–4× smaller.
 
- Remaining gap: specialty-drug vocabulary. MTSamples doesn't emphasise
- drug names, and synthetic dialogue uses mostly lay names. A future
- iteration will inject template-generated drug-dialogue text
- (e.g. "I was prescribed {drug} for my {condition}" with RxNorm +
- ICD-10 fills) to close this gap without corrupting the speech register.
 
  ## Usage
 
@@ -101,11 +123,13 @@ score = lm.score("42 17 5", bos=False, eos=False)
  ```
  # Extract MTSamples (galileo-ai/medical_transcription_40, text column).
  # Extract CodCodingCode/cleaned-clinical-conversations, stripping
- # DOCTOR:/PATIENT: prefixes and deduping turns across overlapping rows
- # (scripts/kenlm_build/extract_synthetic_dialogue.py in the Vernacula repo).
  # Layer:
  # (for _ in 1..5; cat mtsamples.txt; end
- # for _ in 1..2; cat synthetic-dialogue.txt; end) > en-medical.corpus.txt
  # Tokenise with nvidia/parakeet-tdt-0.6b-v3 tokenizer.json, then
  # lmplz --order 3 --prune "0 0 1" --discount_fallback
  # gzip the ARPA.
 
  | File | Corpus | Order | Size (gz) | Target register |
  |---|---|---|---|---|
  | `en-general.arpa.gz` | 3× GigaSpeech `s` + 1× People's Speech `clean` (~22 M effective words) | 4 | 67 MB | Conversational English with cased + punctuated output |
+ | `en-medical.arpa.gz` | 5× MTSamples + 2× synthetic clinical dialogue + 2× class-aware drug dialogue (~77 M effective words) | 3 | **17 MB** | Medical English — clinical dictation + patient↔doctor dialogue + specialty drug names |
 
  More languages / domains coming as they're validated.
 
 
  matches earlier layered variants on fluency and disease recall while
  being 3–4× smaller.
 
+ ## Specialty drug coverage via class-aware template dialogue
+
+ MTSamples + the synthetic-dialogue corpus use mostly lay drug names
+ ("paracetamol", "ibuprofen"); specialty drugs (`sertraline`,
+ `olanzapine`, `apixaban`, `paclitaxel`) appear rarely. Earlier log-prob
+ probes showed weak per-token priors on these (−2.0 to −3.0, vs
+ −0.7 to −1.0 for well-covered phrases).
+
+ The current build adds 2× of template-generated speech-register drug
+ dialogue, filled from a curated catalog of ~330 generic drug names
+ across 35 WHO-ATC-style classes paired with class-appropriate
+ conditions:
+
+ "started on lithium for bipolar disorder"
+ "I've been taking apixaban for my atrial fibrillation"
+ "patient is on paclitaxel for breast cancer"
+ "can I get a refill of my sertraline"
+
+ Class-aware pairing means we don't produce implausible combinations
+ like "lithium for sinusitis". This lifts specialty-drug log-probs to
+ the -0.7 to -1.2 range (comparable to well-covered general phrases)
+ without degrading general fluency.
+
+ PriMock-57 validation ties the earlier variant on entity F1 (since
+ PriMock audio doesn't contain specialty drug mentions to exercise the
+ improved priors). Users with dictated psychiatry / oncology /
+ cardiology content should see the benefit directly.
 
  ## Usage
 
 
  ```
  # Extract MTSamples (galileo-ai/medical_transcription_40, text column).
  # Extract CodCodingCode/cleaned-clinical-conversations, stripping
+ # DOCTOR:/PATIENT: prefixes and deduping turns across overlapping rows.
+ # Generate ~8M words of class-aware drug dialogue using the curated
+ # drug_classes.json catalog (see scripts/kenlm_build/ in Vernacula).
  # Layer:
  # (for _ in 1..5; cat mtsamples.txt; end
+ # for _ in 1..2; cat synthetic-dialogue.txt; end
+ # for _ in 1..2; cat drug-dialogue.txt; end) > en-medical.corpus.txt
  # Tokenise with nvidia/parakeet-tdt-0.6b-v3 tokenizer.json, then
  # lmplz --order 3 --prune "0 0 1" --discount_fallback
  # gzip the ARPA.
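
The generation and layering steps in the recipe above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the Vernacula build scripts: the schema of `drug_classes.json` is not shown in this commit, so the `DRUG_CLASSES` layout, the class keys, and the function names `generate_drug_dialogue` and `layer_corpus` are all hypothetical. The four templates are taken from the README examples, and the tiny catalog stands in for the ~330-drug, 35-class one.

```python
import random

# Hypothetical miniature catalog: class -> generic drug names + matching conditions.
# The real drug_classes.json schema is not shown in the commit; this layout is assumed.
DRUG_CLASSES = {
    "mood_stabiliser": {"drugs": ["lithium"], "conditions": ["bipolar disorder"]},
    "anticoagulant": {"drugs": ["apixaban"], "conditions": ["atrial fibrillation"]},
    "antineoplastic": {"drugs": ["paclitaxel"], "conditions": ["breast cancer"]},
    "ssri": {"drugs": ["sertraline"], "conditions": ["depression"]},
}

# Speech-register templates modelled on the README examples.
TEMPLATES = [
    "started on {drug} for {condition}",
    "I've been taking {drug} for my {condition}",
    "patient is on {drug} for {condition}",
    "can I get a refill of my {drug}",
]


def generate_drug_dialogue(catalog, templates, n_lines, seed=0):
    """Fill templates so that drug and condition always come from the same class."""
    rng = random.Random(seed)
    class_names = sorted(catalog)
    lines = []
    for _ in range(n_lines):
        entry = catalog[rng.choice(class_names)]
        # Both slots are drawn from one class entry, which is what rules out
        # pairings such as "lithium for sinusitis".
        drug = rng.choice(entry["drugs"])
        condition = rng.choice(entry["conditions"])
        template = rng.choice(templates)
        # str.format ignores unused keyword arguments, so drug-only templates work.
        lines.append(template.format(drug=drug, condition=condition))
    return lines


def layer_corpus(parts):
    """Concatenate corpora, repeating each one `repeat` times (e.g. 5x / 2x / 2x)."""
    layered = []
    for lines, repeat in parts:
        layered.extend(lines * repeat)
    return layered


if __name__ == "__main__":
    drug_lines = generate_drug_dialogue(DRUG_CLASSES, TEMPLATES, n_lines=5)
    # Placeholder strings stand in for the MTSamples and synthetic-dialogue text.
    corpus = layer_corpus([
        (["mtsamples placeholder line"], 5),
        (["synthetic dialogue placeholder line"], 2),
        (drug_lines, 2),
    ])
    print("\n".join(corpus))
```

Sampling the class first and then the drug and condition within it keeps the pairing constraint local to one lookup, at the cost of weighting classes equally regardless of how many drugs each contains; a real build might weight by class size instead.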