christopherthompson81/kenlm-parakeet

@@ -31,6 +31,7 @@ against them too, but only v3 has been validated.
 | File | Corpus mix | Order | Size (gz) | Target register |
 |---|---|---|---|---|
 | `en-general.arpa.gz` | 3× GigaSpeech `s` + 1× People's Speech `clean` (~22 M subwords) | 4 | 67 MB | Conversational English with cased + punctuated output |
 More languages / domains coming as they're validated.
@@ -62,6 +63,29 @@ lm = kenlm.Model("en-general.arpa.gz")
 score = lm.score("42 17 5", bos=False, eos=False)
 ```
 ## How `en-general.arpa.gz` was built
 ```
@@ -102,9 +126,17 @@ This LM is a derivative of its training corpora. It's released under
 Upstream corpora (attribution required):
 - [MLCommons/peoples_speech](https://huggingface.co/datasets/MLCommons/peoples_speech)
-  — transcripts portion, CC-BY-4.0.
 - [speechcolab/gigaspeech](https://huggingface.co/datasets/speechcolab/gigaspeech)
-  — transcripts portion, Apache 2.0.
 ## Caveats

 | File | Corpus mix | Order | Size (gz) | Target register |
 |---|---|---|---|---|
 | `en-general.arpa.gz` | 3× GigaSpeech `s` + 1× People's Speech `clean` (~22 M subwords) | 4 | 67 MB | Conversational English with cased + punctuated output |
+| `en-medical.arpa.gz` | MTSamples + HealthCareMagic med-dialog + PubMed abstracts (~22 M words) | 4 | 63 MB | Clinical dictation + patient↔doctor dialogue + biomedical vocabulary |
 More languages / domains coming as they're validated.
 score = lm.score("42 17 5", bos=False, eos=False)
 ```
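The query string in the snippet above is a run of space-separated integers, which suggests the LM is queried on subword token IDs rather than raw text. A minimal sketch of that conversion follows, assuming the IDs come from the Parakeet tokenizer; the helper name `ids_to_lm_string` and the example IDs are hypothetical and not part of this repo.

```python
from typing import Iterable


def ids_to_lm_string(token_ids: Iterable[int]) -> str:
    """Join subword token IDs into the space-separated form passed to lm.score()."""
    return " ".join(str(int(t)) for t in token_ids)


# Hypothetical IDs; real ones would come from the tokenizer.
query = ids_to_lm_string([42, 17, 5])
print(query)

# With the kenlm package installed, this string could then be scored as in the
# snippet above: lm.score(query, bos=False, eos=False)
```

Keeping the conversion in one small function makes it harder for the decoder and the LM to disagree on how IDs are serialised.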
+## How `en-medical.arpa.gz` was built
+```
+# Corpus:
+#   MTSamples            (galileo-ai/medical_transcription_40 mirror — ~2.2 M words)
+#                        Clinical dictation style: SOAP notes, H&P, op reports.
+#   HealthCareMagic      (lighteval/med_dialog/healthcaremagic — ~15 M words)
+#                        Patient ↔ doctor Q&A dialogue.
+#   PubMed abstracts     (MedRAG/pubmed, capped — ~5 M words)
+#                        Formal biomedical prose; vocabulary coverage only.
+#
+# No upweighting. The three sources are roughly complementary in register:
+# dictation for "how a clinician talks about a case", dialogue for
+# conversational medical register, abstracts for specialty vocabulary.
+# Tokenised with nvidia/parakeet-tdt-0.6b-v3 tokenizer.json.
+# Built with KenLM's lmplz, same flags as en-general.
+```
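Because no source is upweighted, each corpus contributes in proportion to its raw size. The sketch below works out those shares from the approximate word counts listed in the build notes; the variable names are illustrative, and the figures are the rounded ones quoted above rather than exact counts.

```python
# Approximate word counts (millions) as listed in the build notes above.
corpus_mwords = {
    "MTSamples": 2.2,
    "HealthCareMagic": 15.0,
    "PubMed abstracts": 5.0,
}

total = sum(corpus_mwords.values())

# With no upweighting, each source's share is its raw size over the total.
shares = {name: size / total for name, size in corpus_mwords.items()}

print(f"total: ~{total:.1f} M words")
for name, share in shares.items():
    print(f"{name:<18} {share:6.1%}")
```

On these figures the dialogue corpus makes up roughly two thirds of the training text, so the n-gram statistics likely lean toward the patient↔doctor register, with dictation at about a tenth of the mix.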
+Upstream corpora have public-redistribution precedent in the medical NLP
+literature; attribution is required under the CC-BY-4.0 umbrella of this
+derivative.
 ## How `en-general.arpa.gz` was built
 ```
 Upstream corpora (attribution required):
 - [MLCommons/peoples_speech](https://huggingface.co/datasets/MLCommons/peoples_speech)
+  — transcripts portion, CC-BY-4.0. (en-general)
 - [speechcolab/gigaspeech](https://huggingface.co/datasets/speechcolab/gigaspeech)
+  — transcripts portion, Apache 2.0. (en-general)
+- [galileo-ai/medical_transcription_40](https://huggingface.co/datasets/galileo-ai/medical_transcription_40)
+  — MTSamples mirror. (en-medical)
+- [lighteval/med_dialog](https://huggingface.co/datasets/lighteval/med_dialog)
+  — HealthCareMagic config; originally He et al., *MedDialog: Large-scale
+  Medical Dialogue Datasets* (EMNLP 2020). (en-medical)
+- [MedRAG/pubmed](https://huggingface.co/datasets/MedRAG/pubmed)
+  — PubMed abstracts, US National Library of Medicine public data.
+  (en-medical)
 ## Caveats