Upload README.md with huggingface_hub
Browse files
README.md
CHANGED
|
@@ -1,199 +1,179 @@
|
|
| 1 |
---
|
| 2 |
-
|
| 3 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 4 |
---
|
| 5 |
|
| 6 |
-
#
|
| 7 |
-
|
| 8 |
-
<!-- Provide a quick summary of what the model is/does. -->
|
| 9 |
-
|
| 10 |
|
|
|
|
| 11 |
|
| 12 |
## Model Details
|
| 13 |
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
- **
|
| 34 |
-
- **
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
###
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
##
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
#
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
**BibTeX:**
|
| 176 |
-
|
| 177 |
-
[More Information Needed]
|
| 178 |
-
|
| 179 |
-
**APA:**
|
| 180 |
-
|
| 181 |
-
[More Information Needed]
|
| 182 |
-
|
| 183 |
-
## Glossary [optional]
|
| 184 |
-
|
| 185 |
-
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
|
| 186 |
-
|
| 187 |
-
[More Information Needed]
|
| 188 |
-
|
| 189 |
-
## More Information [optional]
|
| 190 |
-
|
| 191 |
-
[More Information Needed]
|
| 192 |
-
|
| 193 |
-
## Model Card Authors [optional]
|
| 194 |
-
|
| 195 |
-
[More Information Needed]
|
| 196 |
-
|
| 197 |
-
## Model Card Contact
|
| 198 |
-
|
| 199 |
-
[More Information Needed]
|
|
|
|
| 1 |
---
|
| 2 |
+
license: mit
|
| 3 |
+
language:
|
| 4 |
+
- en
|
| 5 |
+
tags:
|
| 6 |
+
- text-classification
|
| 7 |
+
- medical
|
| 8 |
+
- nhs
|
| 9 |
+
- clinical-letters
|
| 10 |
+
- distilbert
|
| 11 |
+
pipeline_tag: text-classification
|
| 12 |
---
|
| 13 |
|
| 14 |
+
# NHS Medical Letter Classifier
|
|
|
|
|
|
|
|
|
|
| 15 |
|
| 16 |
+
Fine-tuned **DistilBERT** (`distilbert-base-uncased`) for classifying OCR'd NHS medical clinic letters into 49 letter type categories.
|
| 17 |
|
| 18 |
## Model Details
|
| 19 |
|
| 20 |
+
| Parameter | Value |
|
| 21 |
+
|---|---|
|
| 22 |
+
| Base model | `distilbert-base-uncased` |
|
| 23 |
+
| Training samples | 13,672 |
|
| 24 |
+
| Classes | 49 |
|
| 25 |
+
| Epochs | 6 |
|
| 26 |
+
| Batch size | 16 |
|
| 27 |
+
| Learning rate | 2e-5 |
|
| 28 |
+
| Max sequence length | 512 tokens |
|
| 29 |
+
| Cleanlab corrections | 212 labels relabeled (1.6% of dataset) |
|
| 30 |
+
|
| 31 |
+
## How We Got Here: Experiment Journey
|
| 32 |
+
|
| 33 |
+
### 1. Baseline: TF-IDF + LinearSVC
|
| 34 |
+
- **Approach:** TfidfVectorizer (unigram+bigram, 50k features) with CalibratedClassifierCV(LinearSVC)
|
| 35 |
+
- **Result:** ~91% accuracy on the original label set
|
| 36 |
+
- **Takeaway:** Strong baseline, but limited by bag-of-words representation
|
| 37 |
+
|
| 38 |
+
### 2. Label Merging (Critical Improvement)
|
| 39 |
+
- **Approach:** Consolidated synonymous labels (e.g., "Nephrology" to "Renal", "Minor Illness Consultation" to "Pharmacy") and dropped ambiguous/administrative labels
|
| 40 |
+
- **Result:** Accuracy jumped from ~91% to ~96%
|
| 41 |
+
- **Takeaway:** Label quality matters more than model architecture. Reduced label set from ~51 to 49 meaningful categories
|
| 42 |
+
|
| 43 |
+
### 3. DistilBERT Baseline (Our Core Model)
|
| 44 |
+
- **Approach:** Fine-tuned `distilbert-base-uncased`, 4 epochs, 512 tokens, 70/10/20 stratified split
|
| 45 |
+
- **Result:** Top-1: 95.76% | Top-3: 98.06% | Top-5: 98.61%
|
| 46 |
+
- **Takeaway:** Strong performance, established as the baseline for all further experiments
|
| 47 |
+
|
| 48 |
+
### 4. ClinicalBERT & BioClinicalBERT
|
| 49 |
+
- **Approach:** Tested domain-specific models (`medicalai/ClinicalBERT`, `emilyalsentzer/Bio_ClinicalBERT`)
|
| 50 |
+
- **Result:** Similar to DistilBERT (~95-96%), no meaningful improvement
|
| 51 |
+
- **Takeaway:** General-purpose DistilBERT captures enough for this task; domain pre-training didn't help
|
| 52 |
+
|
| 53 |
+
### 5. Longformer (1024 tokens)
|
| 54 |
+
- **Approach:** `allenai/longformer-base-4096` at 1024 tokens with global attention on CLS, case-sensitive
|
| 55 |
+
- **Result:** Comparable to DistilBERT at 512 tokens
|
| 56 |
+
- **Takeaway:** Most discriminative information is in the first 512 tokens; longer context doesn't help
|
| 57 |
+
|
| 58 |
+
### 6. Hierarchical Architecture
|
| 59 |
+
- **Approach:** Two-stage: DistilBERT body for CLS embeddings, per-clinic LogisticRegression heads. 51 fine labels mapped to 25 broad categories
|
| 60 |
+
- **Result:** Did not outperform flat DistilBERT
|
| 61 |
+
- **Takeaway:** The flat classification space works well; hierarchical routing adds complexity without benefit
|
| 62 |
+
|
| 63 |
+
### 7. LLM Relabeling (GPT-5-mini)
|
| 64 |
+
- **Approach:** Used OpenAI Batch API to get GPT-5-mini to reclassify all 13,672 samples. Trained DistilBERT on LLM-assigned labels
|
| 65 |
+
- **Result:** 86.22% vs original labels | 93.24% vs LLM labels (Top-1)
|
| 66 |
+
- **Takeaway:** LLM agrees with original labels ~85.7% of the time. LLM labels are different but not better — the original clinical labels carry domain knowledge the LLM lacks
|
| 67 |
+
|
| 68 |
+
### 8. Consensus Relabeling
|
| 69 |
+
- **Approach:** Only change labels where both BERT and GPT-5-mini agree the original label is wrong
|
| 70 |
+
- **Result:** Only 4 out of 9,569 samples met the consensus criteria
|
| 71 |
+
- **Takeaway:** BERT memorizes its training labels, so it almost never disagrees with originals on training data. Consensus is too strict
|
| 72 |
+
|
| 73 |
+
### 9. Soft Knowledge Distillation
|
| 74 |
+
- **Approach:** Got GPT-5-mini top-5 predictions with confidence scores as soft labels. Trained with blended loss: alpha * CE(hard) + (1-alpha) * KL(soft || student), alpha=0.5
|
| 75 |
+
- **Result:** Top-1: 95.32% (-0.44pp) | Top-3: 97.48% (-0.58pp)
|
| 76 |
+
- **Takeaway:** LLM self-reported confidence scores are too noisy/uniform. Soft KL loss stayed flat at ~3.5. Would need actual logprobs for this to work
|
| 77 |
+
|
| 78 |
+
### 10. Cleanlab: Remove Mislabeled Samples
|
| 79 |
+
- **Approach:** Confident learning (Northcutt et al. 2021). 3-fold cross-validation for out-of-sample probabilities, then `find_label_issues()` to detect mislabeled samples. Removed 142 flagged training samples and retrained
|
| 80 |
+
- **Result:** Top-1: 95.90% (+0.14pp) | Top-3: 97.70% (-0.36pp)
|
| 81 |
+
- **Takeaway:** Small top-1 gain, but removing ambiguous samples hurt ranked predictions. Manual inspection confirmed ~99% of flagged samples were genuinely mislabeled
|
| 82 |
+
|
| 83 |
+
### 11. Cleanlab: Relabel Instead of Remove
|
| 84 |
+
- **Approach:** Same cleanlab detection, but replaced wrong labels with model's predicted label instead of removing samples
|
| 85 |
+
- **Result (vs original test labels):** Top-1: 95.80% | Top-3: 97.92% | Top-5: 98.46%
|
| 86 |
+
- **Result (vs corrected test labels):** Top-1: 98.06% | Top-3: 99.09% | Top-5: 99.38%
|
| 87 |
+
- **Takeaway:** The ~2pp gap between original and corrected evaluation reveals that the remaining "errors" are mostly test set noise, not model mistakes. True model performance is ~98% top-1
|
| 88 |
+
|
| 89 |
+
### 12. Production Model (This Model)
|
| 90 |
+
- **Approach:** Fresh 3-fold cleanlab on the **entire** dataset (13,672 samples). Found 212 mislabeled samples (1.6%), relabeled all. Trained on full corrected dataset for 6 epochs
|
| 91 |
+
- **Sanity check:** 99.74% accuracy on training data (expected, since model saw all data)
|
| 92 |
+
- **Estimated true accuracy:** ~98% top-1, ~99% top-3 based on corrected-label evaluation
|
| 93 |
+
|
| 94 |
+
## Key Findings
|
| 95 |
+
|
| 96 |
+
1. **Label quality > model architecture.** Label merging (+5pp) and cleanlab corrections (+2pp true accuracy) had more impact than any model change
|
| 97 |
+
2. **DistilBERT is sufficient.** Domain-specific models (ClinicalBERT, BioClinicalBERT) and longer context (Longformer) didn't help
|
| 98 |
+
3. **~1.6% of labels are wrong.** Discharge summary (9.1%), Paediatrics (7.2%), and Physiotherapy (6.8%) are the noisiest classes
|
| 99 |
+
4. **The model is better than naive metrics suggest.** When evaluated against corrected labels, top-1 jumps from ~96% to ~98%
|
| 100 |
+
|
| 101 |
+
## Labels (49 classes)
|
| 102 |
+
|
| 103 |
+
- `A&E`
|
| 104 |
+
- `Ambulance Notification`
|
| 105 |
+
- `Audiology`
|
| 106 |
+
- `Bowel Cancer Screening`
|
| 107 |
+
- `Breast Clinic`
|
| 108 |
+
- `Cancer Screening`
|
| 109 |
+
- `Cardiology`
|
| 110 |
+
- `Colposcopy`
|
| 111 |
+
- `Dermatology`
|
| 112 |
+
- `Diabetes & Endocrine`
|
| 113 |
+
- `Diet Services`
|
| 114 |
+
- `Discharge summary`
|
| 115 |
+
- `ENT`
|
| 116 |
+
- `Echocardiogram`
|
| 117 |
+
- `Elderly Care`
|
| 118 |
+
- `Gastroenterology`
|
| 119 |
+
- `General Surgery`
|
| 120 |
+
- `Genetics`
|
| 121 |
+
- `Haematology`
|
| 122 |
+
- `INR`
|
| 123 |
+
- `Immunology`
|
| 124 |
+
- `Mammogram`
|
| 125 |
+
- `Maternity`
|
| 126 |
+
- `Maxillofacial`
|
| 127 |
+
- `Mental Health`
|
| 128 |
+
- `Neurology`
|
| 129 |
+
- `Neurosurgery`
|
| 130 |
+
- `Obstetrics & Gynaecology`
|
| 131 |
+
- `Oncology`
|
| 132 |
+
- `Ophthalmology`
|
| 133 |
+
- `Orthopaedics`
|
| 134 |
+
- `Out of Hours`
|
| 135 |
+
- `Paediatrics`
|
| 136 |
+
- `Pain Management`
|
| 137 |
+
- `Pharmacy`
|
| 138 |
+
- `Physiotherapy`
|
| 139 |
+
- `Plastic Surgery`
|
| 140 |
+
- `Radiology`
|
| 141 |
+
- `Renal`
|
| 142 |
+
- `Respiratory`
|
| 143 |
+
- `Retinal Screening`
|
| 144 |
+
- `Rheumatology`
|
| 145 |
+
- `Sexual Health`
|
| 146 |
+
- `Speech and Language Therapy`
|
| 147 |
+
- `Stroke Services`
|
| 148 |
+
- `Urgent Care Centre`
|
| 149 |
+
- `Urology`
|
| 150 |
+
- `Vascular`
|
| 151 |
+
- `Walk in Centre`
|
| 152 |
+
|
| 153 |
+
## Usage
|
| 154 |
+
|
| 155 |
+
```python
|
| 156 |
+
from transformers import AutoTokenizer, AutoModelForSequenceClassification
|
| 157 |
+
import torch, json
|
| 158 |
+
|
| 159 |
+
model = AutoModelForSequenceClassification.from_pretrained("mansour94/kynoby-william-bert-classifier")
|
| 160 |
+
tokenizer = AutoTokenizer.from_pretrained("mansour94/kynoby-william-bert-classifier")
|
| 161 |
+
|
| 162 |
+
# Load label map
|
| 163 |
+
from huggingface_hub import hf_hub_download
|
| 164 |
+
label_map = json.load(open(hf_hub_download("mansour94/kynoby-william-bert-classifier", "label_map.json")))
|
| 165 |
+
id2label = {int(k): v for k, v in label_map["id2label"].items()}
|
| 166 |
+
|
| 167 |
+
text = "Dear Dr Smith, I am writing to inform you about the patient's ophthalmology appointment..."
|
| 168 |
+
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
|
| 169 |
+
with torch.no_grad():
|
| 170 |
+
logits = model(**inputs).logits
|
| 171 |
+
probs = torch.softmax(logits, dim=-1)
|
| 172 |
+
|
| 173 |
+
# Top-3 predictions
|
| 174 |
+
top3 = torch.topk(probs, 3)
|
| 175 |
+
for i in range(3):
|
| 176 |
+
idx = top3.indices[0][i].item()
|
| 177 |
+
conf = top3.values[0][i].item()
|
| 178 |
+
print(f" {id2label[idx]}: {conf:.1%}")
|
| 179 |
+
```
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|