Document en-medical
README.md
CHANGED
|
@@ -31,6 +31,7 @@ against them too, but only v3 has been validated.
|
|
| 31 |
| File | Corpus mix | Order | Size (gz) | Target register |
|
| 32 |
|---|---|---|---|---|
|
| 33 |
| `en-general.arpa.gz` | 3× GigaSpeech `s` + 1× People's Speech `clean` (~22 M subwords) | 4 | 67 MB | Conversational English with cased + punctuated output |
|
|
|
|
| 34 |
|
| 35 |
More languages / domains coming as they're validated.
|
| 36 |
|
|
@@ -62,6 +63,29 @@ lm = kenlm.Model("en-general.arpa.gz")
|
|
| 62 |
score = lm.score("42 17 5", bos=False, eos=False)
|
| 63 |
```
|
| 64 |
|
| 65 |
## How `en-general.arpa.gz` was built
|
| 66 |
|
| 67 |
```
|
|
@@ -102,9 +126,17 @@ This LM is a derivative of its training corpora. It's released under
|
|
| 102 |
Upstream corpora (attribution required):
|
| 103 |
|
| 104 |
- [MLCommons/peoples_speech](https://huggingface.co/datasets/MLCommons/peoples_speech)
|
| 105 |
-
— transcripts portion, CC-BY-4.0.
|
| 106 |
- [speechcolab/gigaspeech](https://huggingface.co/datasets/speechcolab/gigaspeech)
|
| 107 |
-
— transcripts portion, Apache 2.0.
|
| 108 |
|
| 109 |
## Caveats
|
| 110 |
|
|
|
|
| 31 |
| File | Corpus mix | Order | Size (gz) | Target register |
|
| 32 |
|---|---|---|---|---|
|
| 33 |
| `en-general.arpa.gz` | 3× GigaSpeech `s` + 1× People's Speech `clean` (~22 M subwords) | 4 | 67 MB | Conversational English with cased + punctuated output |
|
| 34 |
+
| `en-medical.arpa.gz` | MTSamples + HealthCareMagic med-dialog + PubMed abstracts (~22 M words) | 4 | 63 MB | Clinical dictation + patient↔doctor dialogue + biomedical vocabulary |
|
| 35 |
|
| 36 |
More languages / domains coming as they're validated.
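
The `~22 M words` figure for `en-medical` can be cross-checked against the approximate per-source counts given in its build notes (MTSamples ~2.2 M, HealthCareMagic ~15 M, PubMed abstracts ~5 M). Since no source is upweighted, each source's share of the training text is simply its size divided by the total. The sketch below does that arithmetic; the dictionary and function names are illustrative and not part of the repository.

```python
# Approximate word counts in millions, copied from the en-medical build notes.
SOURCES_M_WORDS = {
    "MTSamples": 2.2,
    "HealthCareMagic": 15.0,
    "PubMed abstracts": 5.0,
}


def corpus_shares(sizes):
    """Return each source's fraction of the total when nothing is upweighted."""
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


total_m = sum(SOURCES_M_WORDS.values())
shares = corpus_shares(SOURCES_M_WORDS)

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"total: ~{total_m:.1f} M words")
```

With these rounded inputs, the dialogue data accounts for roughly two thirds of the text, so the model is likely to lean towards the patient↔doctor register more than the three-way source list alone suggests.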
|
| 37 |
|
|
|
|
| 63 |
score = lm.score("42 17 5", bos=False, eos=False)
|
| 64 |
```
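
The scoring call above passes a string of space-separated token IDs rather than plain text. A minimal sketch of scoring ordinary text follows, assuming the LM vocabulary consists of the decimal IDs produced by the same `tokenizer.json` used to build the model, and assuming special tokens should be left out; both are assumptions to verify against your setup. The helper names `ids_to_lm_string` and `score_text` are hypothetical.

```python
from typing import Iterable


def ids_to_lm_string(token_ids: Iterable[int]) -> str:
    """Render token IDs in the space-separated form used by the scoring call."""
    return " ".join(str(int(t)) for t in token_ids)


def score_text(lm, tokenizer, text: str) -> float:
    """Tokenise `text` and return the LM score for the resulting ID sequence.

    `lm` is expected to behave like kenlm.Model and `tokenizer` like
    tokenizers.Tokenizer; both are passed in so the helper is easy to test.
    """
    # Assumption: special tokens are not part of the LM vocabulary.
    ids = tokenizer.encode(text, add_special_tokens=False).ids
    return lm.score(ids_to_lm_string(ids), bos=False, eos=False)


# Usage sketch (needs the kenlm and tokenizers packages plus local copies
# of the model and tokenizer files):
#   import kenlm
#   from tokenizers import Tokenizer
#   lm = kenlm.Model("en-medical.arpa.gz")
#   tok = Tokenizer.from_file("tokenizer.json")
#   print(score_text(lm, tok, "patient denies chest pain"))
```

KenLM's `score` returns a log10 probability, so values are negative and a higher (less negative) value means the sequence is more likely under the model.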
|
| 65 |
|
| 66 |
+
## How `en-medical.arpa.gz` was built
|
| 67 |
+
|
| 68 |
+
```
|
| 69 |
+
# Corpus:
|
| 70 |
+
# MTSamples (galileo-ai/medical_transcription_40 mirror — ~2.2 M words)
|
| 71 |
+
# Clinical dictation style: SOAP notes, H&P, op reports.
|
| 72 |
+
# HealthCareMagic (lighteval/med_dialog/healthcaremagic — ~15 M words)
|
| 73 |
+
# Patient ↔ doctor Q&A dialogue.
|
| 74 |
+
# PubMed abstracts (MedRAG/pubmed, capped — ~5 M words)
|
| 75 |
+
# Formal biomedical prose; vocabulary coverage only.
|
| 76 |
+
#
|
| 77 |
+
# No upweighting. The three sources are roughly complementary in register:
|
| 78 |
+
# dictation for "how a clinician talks about a case", dialogue for
|
| 79 |
+
# conversational medical register, abstracts for specialty vocabulary.
|
| 80 |
+
|
| 81 |
+
# Tokenised with nvidia/parakeet-tdt-0.6b-v3 tokenizer.json.
|
| 82 |
+
# Built with KenLM's lmplz, same flags as en-general.
|
| 83 |
+
```
|
| 84 |
+
|
| 85 |
+
The upstream corpora have precedent for public redistribution in the medical NLP
|
| 86 |
+
literature. This derivative is released under CC-BY-4.0, which requires
|
| 87 |
+
attribution.
|
| 88 |
+
|
| 89 |
## How `en-general.arpa.gz` was built
|
| 90 |
|
| 91 |
```
|
|
|
|
| 126 |
Upstream corpora (attribution required):
|
| 127 |
|
| 128 |
- [MLCommons/peoples_speech](https://huggingface.co/datasets/MLCommons/peoples_speech)
|
| 129 |
+
— transcripts portion, CC-BY-4.0. (en-general)
|
| 130 |
- [speechcolab/gigaspeech](https://huggingface.co/datasets/speechcolab/gigaspeech)
|
| 131 |
+
— transcripts portion, Apache 2.0. (en-general)
|
| 132 |
+
- [galileo-ai/medical_transcription_40](https://huggingface.co/datasets/galileo-ai/medical_transcription_40)
|
| 133 |
+
— MTSamples mirror. (en-medical)
|
| 134 |
+
- [lighteval/med_dialog](https://huggingface.co/datasets/lighteval/med_dialog)
|
| 135 |
+
— HealthCareMagic config; originally He et al., *MedDialog: Large-scale
|
| 136 |
+
Medical Dialogue Datasets* (EMNLP 2020). (en-medical)
|
| 137 |
+
- [MedRAG/pubmed](https://huggingface.co/datasets/MedRAG/pubmed)
|
| 138 |
+
— PubMed abstracts, US National Library of Medicine public data.
|
| 139 |
+
(en-medical)
|
| 140 |
|
| 141 |
## Caveats
|
| 142 |
|