---
license: mit
language:
- en
tags:
- text-classification
- medical
- nhs
- clinical-letters
- distilbert
pipeline_tag: text-classification
---
# NHS Medical Letter Classifier
Fine-tuned **DistilBERT** (`distilbert-base-uncased`) for classifying OCR'd NHS medical clinic letters into 49 letter-type categories.
## Model Details
| Parameter | Value |
|---|---|
| Base model | `distilbert-base-uncased` |
| Training samples | 13,672 |
| Classes | 49 |
| Epochs | 6 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Max sequence length | 512 tokens |
| Cleanlab corrections | 212 labels relabeled (1.6% of dataset) |
## How We Got Here: Experiment Journey
### 1. Baseline: TF-IDF + LinearSVC
- **Approach:** `TfidfVectorizer` (unigrams + bigrams, 50k features) feeding `CalibratedClassifierCV(LinearSVC)` (sketched after this list)
- **Result:** ~91% accuracy on the original label set
- **Takeaway:** Strong baseline, but limited by bag-of-words representation
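A minimal sketch of this baseline, assuming scikit-learn and in-memory lists `train_texts`, `train_labels`, `test_texts`, `test_labels` (variable names and split are illustrative; the vectorizer settings come from the notes above):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Unigrams + bigrams capped at 50k features; calibration turns the
# LinearSVC decision scores into usable probabilities
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50_000),
    CalibratedClassifierCV(LinearSVC()),
)
baseline.fit(train_texts, train_labels)
print(f"accuracy: {baseline.score(test_texts, test_labels):.3f}")  # ~0.91
```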
### 2. Label Merging (Critical Improvement)
- **Approach:** Consolidated synonymous labels (e.g., "Nephrology" → "Renal", "Minor Illness Consultation" → "Pharmacy") and dropped ambiguous/administrative labels (see the sketch after this list)
- **Result:** Accuracy jumped from ~91% to ~96%
- **Takeaway:** Label quality matters more than model architecture. Merging reduced the label set from ~51 to 49 meaningful categories
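The merge itself is a simple mapping pass over the label column. A hedged sketch with pandas; the two entries shown are the examples above, and the full merge map and dropped-label set are not reproduced here:

```python
import pandas as pd

# Illustrative subset of the consolidation map (full map not published)
LABEL_MERGES = {
    "Nephrology": "Renal",
    "Minor Illness Consultation": "Pharmacy",
}

def consolidate(df: pd.DataFrame, dropped_labels: set) -> pd.DataFrame:
    out = df.copy()
    out["label"] = out["label"].replace(LABEL_MERGES)
    return out[~out["label"].isin(dropped_labels)]  # drop ambiguous/admin labels
```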
### 3. DistilBERT Baseline (Our Core Model)
- **Approach:** Fine-tuned `distilbert-base-uncased`, 4 epochs, 512 tokens, 70/10/20 stratified split
- **Result:** Top-1: 95.76% | Top-3: 98.06% | Top-5: 98.61%
- **Takeaway:** Strong performance, established as the baseline for all further experiments
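A sketch of the fine-tuning setup using the hyperparameters from the table above; the `Trainer` wiring, dataset objects, and column names are assumptions, not the original training script:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=49
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

args = TrainingArguments(
    output_dir="distilbert-letters",
    num_train_epochs=4,               # 4 for this baseline; the final model used 6
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds.map(tokenize, batched=True),  # assumed HF Dataset with "text"/"label" columns
    eval_dataset=val_ds.map(tokenize, batched=True),
    tokenizer=tokenizer,              # enables dynamic padding via the default collator
)
trainer.train()
```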
### 4. ClinicalBERT & BioClinicalBERT
- **Approach:** Tested domain-specific models (`medicalai/ClinicalBERT`, `emilyalsentzer/Bio_ClinicalBERT`)
- **Result:** Similar to DistilBERT (~95-96%), no meaningful improvement
- **Takeaway:** General-purpose DistilBERT captures enough for this task; domain pre-training didn't help
### 5. Longformer (1024 tokens)
- **Approach:** `allenai/longformer-base-4096` at 1024 tokens with global attention on CLS, case-sensitive
- **Result:** Comparable to DistilBERT at 512 tokens
- **Takeaway:** Most discriminative information is in the first 512 tokens; longer context doesn't help
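One wiring detail worth recording: Longformer takes an explicit global-attention mask, with global attention on the CLS position as noted above. The snippet below is a reconstruction, not the original run:

```python
import torch
from transformers import LongformerForSequenceClassification, LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=49
)

text = "Dear Dr Smith, ..."  # placeholder letter text
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

# Sliding-window attention everywhere, global attention on CLS only
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1
logits = model(**inputs, global_attention_mask=global_attention_mask).logits
```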
### 6. Hierarchical Architecture
- **Approach:** Two-stage: DistilBERT body for CLS embeddings, per-clinic LogisticRegression heads. 51 fine labels mapped to 25 broad categories
- **Result:** Did not outperform flat DistilBERT
- **Takeaway:** The flat classification space works well; hierarchical routing adds complexity without benefit
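Roughly, the two-stage setup looked like the sketch below: a frozen DistilBERT body produces CLS embeddings, a router picks a broad category, and a per-category head picks the fine label. The helper and data arrays are assumptions:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
body = AutoModel.from_pretrained("distilbert-base-uncased").eval()

@torch.no_grad()
def cls_embedding(text: str) -> np.ndarray:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    return body(**inputs).last_hidden_state[0, 0].numpy()  # CLS token vector

# Assumed numpy arrays: texts, broad (~25 broad categories), fine (51 fine labels)
X = np.stack([cls_embedding(t) for t in texts])
router = LogisticRegression(max_iter=1000).fit(X, broad)
heads = {b: LogisticRegression(max_iter=1000).fit(X[broad == b], fine[broad == b])
         for b in np.unique(broad)}  # a category with a single fine label needs a trivial head instead

def predict(text: str) -> str:
    x = cls_embedding(text).reshape(1, -1)
    return heads[router.predict(x)[0]].predict(x)[0]
```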
### 7. LLM Relabeling (GPT-5-mini)
- **Approach:** Used the OpenAI Batch API to have GPT-5-mini reclassify all 13,672 samples (request construction sketched below), then trained DistilBERT on the LLM-assigned labels
- **Result:** 86.22% vs original labels | 93.24% vs LLM labels (Top-1)
- **Takeaway:** The LLM agrees with the original labels ~85.7% of the time. LLM labels are different but not better: the original clinical labels carry domain knowledge the LLM lacks
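The batch submission looks roughly like this; the prompt wording is an assumption, while the Batch API request format is standard:

```python
import json
from openai import OpenAI

client = OpenAI()
with open("relabel_requests.jsonl", "w") as f:
    for i, text in enumerate(texts):  # assumed list of letter texts
        f.write(json.dumps({
            "custom_id": f"sample-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-5-mini",
                "messages": [
                    {"role": "system", "content": "Classify this NHS clinic letter "
                     "into one of the 49 letter types. Reply with the label only."},
                    {"role": "user", "content": text},
                ],
            },
        }) + "\n")

batch_file = client.files.create(file=open("relabel_requests.jsonl", "rb"),
                                 purpose="batch")
batch = client.batches.create(input_file_id=batch_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")
```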
### 8. Consensus Relabeling
- **Approach:** Only change labels where both BERT and GPT-5-mini agree the original label is wrong
- **Result:** Only 4 out of 9,569 samples met the consensus criteria
- **Takeaway:** BERT memorizes its training labels, so it almost never disagrees with the originals on training data; consensus is too strict
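The consensus rule itself is one filter over three parallel label arrays (names illustrative), which makes the near-zero hit rate easy to see: a flip requires BERT and the LLM to agree with each other *and* disagree with the original:

```python
# Flip only where BERT and GPT-5-mini agree on a label that differs from the original
consensus_flips = [i for i, (b, g, o) in enumerate(zip(bert_pred, llm_pred, original))
                   if b == g and b != o]  # found 4 of 9,569 samples
```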
### 9. Soft Knowledge Distillation
- **Approach:** Got GPT-5-mini top-5 predictions with confidence scores as soft labels, then trained with the blended loss `alpha * CE(hard) + (1 - alpha) * KL(soft || student)`, `alpha = 0.5` (sketched below)
- **Result:** Top-1: 95.32% (-0.44pp) | Top-3: 97.48% (-0.58pp)
- **Takeaway:** LLM self-reported confidence scores are too noisy/uniform; the soft KL loss stayed flat at ~3.5. Actual logprobs would be needed for this to work
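A minimal sketch of the blended objective, assuming `soft_targets` is a `(batch, 49)` distribution expanded from the LLM's top-5 labels and confidences:

```python
import torch.nn.functional as F

def blended_loss(logits, hard_labels, soft_targets, alpha=0.5):
    ce = F.cross_entropy(logits, hard_labels)
    # KL(soft || student): kl_div takes the student's log-probs as input
    # and the teacher distribution as target
    kl = F.kl_div(F.log_softmax(logits, dim=-1), soft_targets, reduction="batchmean")
    return alpha * ce + (1 - alpha) * kl
```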
### 10. Cleanlab: Remove Mislabeled Samples
- **Approach:** Confident learning (Northcutt et al. 2021). 3-fold cross-validation for out-of-sample probabilities, then `find_label_issues()` to detect mislabeled samples. Removed 142 flagged training samples and retrained
- **Result:** Top-1: 95.90% (+0.14pp) | Top-3: 97.70% (-0.36pp)
- **Takeaway:** Small top-1 gain, but removing ambiguous samples hurt ranked predictions. Manual inspection confirmed ~99% of flagged samples were genuinely mislabeled
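The detection step, roughly as run (the 3-fold CV loop that produces the out-of-sample `pred_probs` is elided):

```python
import numpy as np
from cleanlab.filter import find_label_issues

# labels: (n,) encoded original labels; pred_probs: (n, 49) out-of-sample softmax probs
issue_idx = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
keep = np.setdiff1d(np.arange(len(labels)), issue_idx)  # step 10: drop the flagged samples
```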
### 11. Cleanlab: Relabel Instead of Remove
- **Approach:** Same cleanlab detection, but replaced wrong labels with model's predicted label instead of removing samples
- **Result (vs original test labels):** Top-1: 95.80% | Top-3: 97.92% | Top-5: 98.46%
- **Result (vs corrected test labels):** Top-1: 98.06% | Top-3: 99.09% | Top-5: 99.38%
- **Takeaway:** The ~2pp gap between original and corrected evaluation reveals that the remaining "errors" are mostly test set noise, not model mistakes. True model performance is ~98% top-1
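Continuing the snippet from step 10, the relabeling variant keeps every sample and swaps each flagged label for the cross-validated model's own prediction:

```python
labels_corrected = labels.copy()
labels_corrected[issue_idx] = pred_probs[issue_idx].argmax(axis=1)  # fix instead of drop
```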
### 12. Production Model (This Model)
- **Approach:** Fresh 3-fold cleanlab on the **entire** dataset (13,672 samples). Found 212 mislabeled samples (1.6%), relabeled all. Trained on full corrected dataset for 6 epochs
- **Sanity check:** 99.74% accuracy on training data (expected, since the model saw all the data)
- **Estimated true accuracy:** ~98% top-1, ~99% top-3 based on corrected-label evaluation
## Key Findings
1. **Label quality > model architecture.** Label merging (+5pp) and cleanlab corrections (+2pp true accuracy) had more impact than any model change
2. **DistilBERT is sufficient.** Domain-specific models (ClinicalBERT, BioClinicalBERT) and longer context (Longformer) didn't help
3. **~1.6% of labels are wrong.** Discharge summary (9.1%), Paediatrics (7.2%), and Physiotherapy (6.8%) are the noisiest classes
4. **The model is better than naive metrics suggest.** When evaluated against corrected labels, top-1 jumps from ~96% to ~98%
## Labels (49 classes)
- `A&E`
- `Ambulance Notification`
- `Audiology`
- `Bowel Cancer Screening`
- `Breast Clinic`
- `Cancer Screening`
- `Cardiology`
- `Colposcopy`
- `Dermatology`
- `Diabetes & Endocrine`
- `Diet Services`
- `Discharge summary`
- `ENT`
- `Echocardiogram`
- `Elderly Care`
- `Gastroenterology`
- `General Surgery`
- `Genetics`
- `Haematology`
- `INR`
- `Immunology`
- `Mammogram`
- `Maternity`
- `Maxillofacial`
- `Mental Health`
- `Neurology`
- `Neurosurgery`
- `Obstetrics & Gynaecology`
- `Oncology`
- `Ophthalmology`
- `Orthopaedics`
- `Out of Hours`
- `Paediatrics`
- `Pain Management`
- `Pharmacy`
- `Physiotherapy`
- `Plastic Surgery`
- `Radiology`
- `Renal`
- `Respiratory`
- `Retinal Screening`
- `Rheumatology`
- `Sexual Health`
- `Speech and Language Therapy`
- `Stroke Services`
- `Urgent Care Centre`
- `Urology`
- `Vascular`
- `Walk in Centre`
## Usage
```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo = "mansour94/kynoby-william-bert-classifier"
model = AutoModelForSequenceClassification.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

# Load the id -> label mapping shipped alongside the weights
with open(hf_hub_download(repo, "label_map.json")) as f:
    label_map = json.load(f)
id2label = {int(k): v for k, v in label_map["id2label"].items()}

text = "Dear Dr Smith, I am writing to inform you about the patient's ophthalmology appointment..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)

# Top-3 predictions with confidences
top3 = torch.topk(probs, 3)
for idx, conf in zip(top3.indices[0].tolist(), top3.values[0].tolist()):
    print(f"  {id2label[idx]}: {conf:.1%}")
```