christopherthompson81 committed on
Commit d893e53 · verified · 1 Parent(s): 08f2ebd

Document en-medical

  1. README.md +34 -2
README.md CHANGED
@@ -31,6 +31,7 @@ against them too, but only v3 has been validated.
  | File | Corpus mix | Order | Size (gz) | Target register |
  |---|---|---|---|---|
  | `en-general.arpa.gz` | 3× GigaSpeech `s` + 1× People's Speech `clean` (~22 M subwords) | 4 | 67 MB | Conversational English with cased + punctuated output |
+ | `en-medical.arpa.gz` | MTSamples + HealthCareMagic med-dialog + PubMed abstracts (~22 M words) | 4 | 63 MB | Clinical dictation + patient↔doctor dialogue + biomedical vocabulary |

  More languages / domains coming as they're validated.

@@ -62,6 +63,29 @@ lm = kenlm.Model("en-general.arpa.gz")
  score = lm.score("42 17 5", bos=False, eos=False)
  ```

+ ## How `en-medical.arpa.gz` was built
+
+ ```
+ # Corpus:
+ # MTSamples (galileo-ai/medical_transcription_40 mirror — ~2.2 M words)
+ # Clinical dictation style: SOAP notes, H&P, op reports.
+ # HealthCareMagic (lighteval/med_dialog/healthcaremagic — ~15 M words)
+ # Patient ↔ doctor Q&A dialogue.
+ # PubMed abstracts (MedRAG/pubmed, capped — ~5 M words)
+ # Formal biomedical prose; vocabulary coverage only.
+ #
+ # No upweighting. The three sources are roughly complementary in register:
+ # dictation for "how a clinician talks about a case", dialogue for
+ # conversational medical register, abstracts for specialty vocabulary.
+
+ # Tokenised with nvidia/parakeet-tdt-0.6b-v3 tokenizer.json.
+ # Built with KenLM's lmplz, same flags as en-general.
+ ```
+
+ Upstream corpora have public-redistribution precedent in the medical NLP
+ literature; attribution is required under the CC-BY-4.0 umbrella of this
+ derivative.
+
  ## How `en-general.arpa.gz` was built

  ```
@@ -102,9 +126,17 @@ This LM is a derivative of its training corpora. It's released under
  Upstream corpora (attribution required):

  - [MLCommons/peoples_speech](https://huggingface.co/datasets/MLCommons/peoples_speech)
- — transcripts portion, CC-BY-4.0.
+ — transcripts portion, CC-BY-4.0. (en-general)
  - [speechcolab/gigaspeech](https://huggingface.co/datasets/speechcolab/gigaspeech)
- — transcripts portion, Apache 2.0.
+ — transcripts portion, Apache 2.0. (en-general)
+ - [galileo-ai/medical_transcription_40](https://huggingface.co/datasets/galileo-ai/medical_transcription_40)
+ — MTSamples mirror. (en-medical)
+ - [lighteval/med_dialog](https://huggingface.co/datasets/lighteval/med_dialog)
+ — HealthCareMagic config; originally He et al., *MedDialog: Large-scale
+ Medical Dialogue Datasets* (EMNLP 2020). (en-medical)
+ - [MedRAG/pubmed](https://huggingface.co/datasets/MedRAG/pubmed)
+ — PubMed abstracts, US National Library of Medicine public data.
+ (en-medical)

  ## Caveats
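
Reviewer note on the `en-medical` corpus mix: the build notes say "No upweighting", so each source's influence on the n-gram counts should be roughly proportional to its size. The sketch below works that out from the approximate word counts quoted in the diff (2.2 M, 15 M, 5 M). The names `corpus_mwords` and `corpus_shares` are illustrative and not part of the repository, and the shares are by word count, which is only an approximation of the subword-token shares the LM actually sees after tokenisation.

```python
# Approximate word counts (in millions) quoted in the en-medical build notes.
corpus_mwords = {
    "MTSamples": 2.2,
    "HealthCareMagic": 15.0,
    "PubMed abstracts": 5.0,
}


def corpus_shares(counts):
    """Return each source's fraction of the total, assuming no reweighting."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total_mwords = sum(corpus_mwords.values())
shares = corpus_shares(corpus_mwords)

# Print sources from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {share:6.1%}")
print(f"total ~{total_mwords:.1f} M words")
```

Under these figures the total is about 22.2 M words, consistent with the "~22 M words" in the table row, and the dialogue source accounts for roughly two thirds of the text while dictation is about a tenth. If the clinical-dictation register matters most for a given use, that imbalance is worth keeping in mind when evaluating the model.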