---
license: mit
language:
- en
tags:
- text-classification
- medical
- nhs
- clinical-letters
- distilbert
pipeline_tag: text-classification
---

# NHS Medical Letter Classifier

Fine-tuned **DistilBERT** (`distilbert-base-uncased`) for classifying OCR'd NHS medical clinic letters into 49 letter-type categories.

## Model Details

| Parameter | Value |
|---|---|
| Base model | `distilbert-base-uncased` |
| Training samples | 13,672 |
| Classes | 49 |
| Epochs | 6 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Max sequence length | 512 tokens |
| Cleanlab corrections | 212 labels relabeled (1.6% of dataset) |
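
For concreteness, here is a minimal fine-tuning sketch matching the hyperparameters above, using the 🤗 `Trainer` API. The toy dataset and output directory are placeholders (the real letters are not public); only the hyperparameters come from this card.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

# Toy stand-in for the 13,672 OCR'd letters.
train = Dataset.from_dict({
    "text": ["Dear Dr Smith, following the cardiology clinic today..."],
    "label": [0],
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=49
)

def tokenize(batch):
    # Letters are truncated to the first 512 tokens (see the table above).
    return tokenizer(batch["text"], truncation=True, max_length=512)

args = TrainingArguments(
    output_dir="nhs-letter-classifier",  # placeholder
    num_train_epochs=6,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train.map(tokenize, batched=True),
    data_collator=DataCollatorWithPadding(tokenizer),
).train()
```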

## How We Got Here: Experiment Journey

### 1. Baseline: TF-IDF + LinearSVC
- **Approach:** `TfidfVectorizer` (unigrams + bigrams, 50k features) feeding a `CalibratedClassifierCV(LinearSVC)` classifier; see the sketch below
- **Result:** ~91% accuracy on the original label set
- **Takeaway:** Strong baseline, but limited by its bag-of-words representation
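
A minimal scikit-learn reconstruction of this baseline; the toy texts stand in for the real letters:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the OCR'd letters (5 per class so 5-fold calibration works).
texts = ["cardiology clinic follow-up letter"] * 5 + ["discharge summary from ward 3"] * 5
labels = ["Cardiology"] * 5 + ["Discharge summary"] * 5

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50_000),  # unigrams + bigrams, 50k features
    CalibratedClassifierCV(LinearSVC()),  # calibrated probabilities on top of the SVM
)
clf.fit(texts, labels)
print(clf.predict(["discharge letter for ward patient"]))
```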

### 2. Label Merging (Critical Improvement)
- **Approach:** Consolidated synonymous labels (e.g., "Nephrology" → "Renal", "Minor Illness Consultation" → "Pharmacy") and dropped ambiguous or administrative labels; a toy version of the merge is sketched below
- **Result:** Accuracy jumped from ~91% to ~96%
- **Takeaway:** Label quality matters more than model architecture. Reduced the label set from ~51 to 49 meaningful categories
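
Illustratively, with the two merges named above (the full ~51 → 49 consolidation map is not published, so anything beyond these two entries is an assumption):

```python
import pandas as pd

# Only these two merges are documented in this card; the full map is internal.
MERGE = {"Nephrology": "Renal", "Minor Illness Consultation": "Pharmacy"}

df = pd.DataFrame({"label": ["Nephrology", "Cardiology", "Minor Illness Consultation"]})
df["label"] = df["label"].replace(MERGE)
print(df["label"].tolist())  # ['Renal', 'Cardiology', 'Pharmacy']
```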

### 3. DistilBERT Baseline (Our Core Model)
- **Approach:** Fine-tuned `distilbert-base-uncased` for 4 epochs at 512 tokens on a 70/10/20 stratified split (sketched below)
- **Result:** Top-1: 95.76% | Top-3: 98.06% | Top-5: 98.61%
- **Takeaway:** Strong performance; established as the baseline for all further experiments
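
The 70/10/20 stratified split can be reproduced with two chained scikit-learn splits (toy data below; the real inputs are the letters and their labels):

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the 13,672 letters and their labels.
texts = [f"letter {i}" for i in range(100)]
labels = [i % 5 for i in range(100)]

# Hold out 30%, then split it 1:2 into validation (10%) and test (20%).
X_train, X_tmp, y_train, y_tmp = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=2 / 3, stratify=y_tmp, random_state=42
)
```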

### 4. ClinicalBERT & BioClinicalBERT
- **Approach:** Tested domain-specific models (`medicalai/ClinicalBERT`, `emilyalsentzer/Bio_ClinicalBERT`)
- **Result:** Similar to DistilBERT (~95-96%), no meaningful improvement
- **Takeaway:** General-purpose DistilBERT captures enough for this task; domain pre-training didn't help

### 5. Longformer (1024 tokens)
- **Approach:** `allenai/longformer-base-4096` at 1024 tokens with global attention on the CLS token, case-sensitive; see the sketch below
- **Result:** Comparable to DistilBERT at 512 tokens
- **Takeaway:** Most of the discriminative information is in the first 512 tokens; longer context doesn't help
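
A sketch of this configuration (inference only; fine-tuning details beyond the 1024-token window and CLS global attention are assumptions):

```python
import torch
from transformers import AutoTokenizer, LongformerForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=49
)

inputs = tokenizer("Dear Dr Smith, ...", return_tensors="pt",
                   truncation=True, max_length=1024)

# Global attention on the CLS token only; all other tokens use local attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    logits = model(**inputs, global_attention_mask=global_attention_mask).logits
```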

### 6. Hierarchical Architecture
- **Approach:** Two-stage: a DistilBERT body produces CLS embeddings, then per-clinic LogisticRegression heads classify them; 51 fine labels were mapped to 25 broad categories for routing (sketch below)
- **Result:** Did not outperform flat DistilBERT
- **Takeaway:** The flat classification space works well; hierarchical routing adds complexity without benefit
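
A compressed sketch of the two-stage idea; the per-clinic routing and the 51 → 25 mapping are collapsed into a single head here:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModel

# Stage 1: DistilBERT body as a frozen feature extractor (CLS embeddings).
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
body = AutoModel.from_pretrained("distilbert-base-uncased")

def cls_embed(texts):
    enc = tokenizer(texts, return_tensors="pt", truncation=True,
                    max_length=512, padding=True)
    with torch.no_grad():
        return body(**enc).last_hidden_state[:, 0].numpy()  # CLS token

# Stage 2: LogisticRegression head(s). The experiment trained one head per
# broad clinic group; a single head stands in for that here.
X = cls_embed(["cardiology clinic letter", "discharge summary for ward 3"])
head = LogisticRegression(max_iter=1000).fit(X, [0, 1])
```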

### 7. LLM Relabeling (GPT-5-mini)
- **Approach:** Used the OpenAI Batch API to have GPT-5-mini reclassify all 13,672 samples, then trained DistilBERT on the LLM-assigned labels (batch-submission sketch below)
- **Result:** 86.22% vs original labels | 93.24% vs LLM labels (Top-1)
- **Takeaway:** The LLM agrees with the original labels ~85.7% of the time. LLM labels are different but not better: the original clinical labels carry domain knowledge the LLM lacks
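
A hedged sketch of the batch submission; the actual prompt, response schema, and post-processing are not published, so this only shows the Batch API mechanics:

```python
import json
from openai import OpenAI

client = OpenAI()
letters = ["Dear Dr Smith, ..."]  # the 13,672 OCR'd letters in practice

# One request per letter, written as a JSONL batch file.
with open("relabel.jsonl", "w") as f:
    for i, letter in enumerate(letters):
        f.write(json.dumps({
            "custom_id": f"letter-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-5-mini",
                "messages": [{
                    "role": "user",
                    # Illustrative prompt; the real one is not published.
                    "content": f"Classify this NHS letter into one of 49 types:\n{letter}",
                }],
            },
        }) + "\n")

batch_file = client.files.create(file=open("relabel.jsonl", "rb"), purpose="batch")
batch = client.batches.create(input_file_id=batch_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")
print(batch.id)
```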

### 8. Consensus Relabeling
- **Approach:** Only change a label when both BERT and GPT-5-mini agree the original label is wrong (see the snippet below)
- **Result:** Only 4 out of 9,569 samples met the consensus criteria
- **Takeaway:** BERT memorizes its training labels, so it almost never disagrees with the originals on training data. Consensus is too strict
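
The consensus rule itself is a one-liner; toy arrays stand in for `bert_pred` and `llm_pred`, which would come from the two models:

```python
import numpy as np

original = np.array([0, 1, 2, 3])   # original clinical labels (toy)
bert_pred = np.array([0, 2, 2, 4])  # BERT predictions (toy)
llm_pred = np.array([1, 2, 2, 4])   # GPT-5-mini predictions (toy)

# Relabel only where both models disagree with the original AND agree with each other.
consensus = (bert_pred == llm_pred) & (bert_pred != original)
relabeled = np.where(consensus, bert_pred, original)
print(consensus.sum(), "labels changed")  # 2 in this toy example
```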

### 9. Soft Knowledge Distillation
- **Approach:** Collected GPT-5-mini's top-5 predictions with confidence scores as soft labels, then trained with a blended loss: `alpha * CE(hard) + (1 - alpha) * KL(soft || student)` with `alpha = 0.5` (see the sketch below)
- **Result:** Top-1: 95.32% (-0.44pp) | Top-3: 97.48% (-0.58pp)
- **Takeaway:** LLM self-reported confidence scores are too noisy and uniform; the soft KL loss stayed flat at ~3.5. Actual logprobs would be needed for this to work
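
A PyTorch sketch of the blended loss; the formula is the one above, while the construction of `teacher_probs` from the LLM's top-5 confidences is an assumption:

```python
import torch
import torch.nn.functional as F

def distillation_loss(logits, hard_labels, teacher_probs, alpha=0.5):
    """alpha * CE(hard) + (1 - alpha) * KL(soft || student)."""
    ce = F.cross_entropy(logits, hard_labels)
    log_student = F.log_softmax(logits, dim=-1)
    # F.kl_div(input=log q, target=p) computes KL(p || q), i.e. KL(soft || student).
    kl = F.kl_div(log_student, teacher_probs, reduction="batchmean")
    return alpha * ce + (1 - alpha) * kl

# Toy check: 2 samples, 49 classes; teacher_probs rows sum to 1.
logits = torch.randn(2, 49)
hard = torch.tensor([3, 7])
teacher = torch.softmax(torch.randn(2, 49), dim=-1)
print(distillation_loss(logits, hard, teacher))
```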

### 10. Cleanlab: Remove Mislabeled Samples
- **Approach:** Confident learning (Northcutt et al., 2021): 3-fold cross-validation for out-of-sample probabilities, then `find_label_issues()` to detect mislabeled samples (sketch below). Removed the 142 flagged training samples and retrained
- **Result:** Top-1: 95.90% (+0.14pp) | Top-3: 97.70% (-0.36pp)
- **Takeaway:** Small top-1 gain, but removing ambiguous samples hurt ranked predictions. Manual inspection confirmed ~99% of flagged samples were genuinely mislabeled
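
A sketch of the detection step; in the experiment, `pred_probs` are out-of-sample probabilities from 3-fold cross-validation with DistilBERT, so the random arrays here are just stand-ins to keep the snippet runnable:

```python
import numpy as np
from cleanlab.filter import find_label_issues

# Random stand-ins for out-of-sample (n_samples, 49) probabilities and labels.
rng = np.random.default_rng(0)
pred_probs = rng.dirichlet(np.ones(49), size=200)
labels = rng.integers(0, 49, size=200)

issue_idx = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",  # most suspicious first
)
print(len(issue_idx), "suspected label issues")
```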

### 11. Cleanlab: Relabel Instead of Remove
- **Approach:** Same cleanlab detection, but replaced wrong labels with the model's predicted label instead of removing the samples (see the continuation of the sketch below)
- **Result (vs original test labels):** Top-1: 95.80% | Top-3: 97.92% | Top-5: 98.46%
- **Result (vs corrected test labels):** Top-1: 98.06% | Top-3: 99.09% | Top-5: 99.38%
- **Takeaway:** The ~2pp gap between the original and corrected evaluations reveals that the remaining "errors" are mostly test-set noise, not model mistakes. True model performance is ~98% top-1
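
Continuing the previous sketch, relabelling instead of removing is a two-line change:

```python
# Replace each flagged label with the model's most confident class.
corrected = labels.copy()
corrected[issue_idx] = pred_probs[issue_idx].argmax(axis=1)
```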

### 12. Production Model (This Model)
- **Approach:** Ran a fresh 3-fold cleanlab pass on the **entire** dataset (13,672 samples), found 212 mislabeled samples (1.6%), relabeled all of them, and trained on the full corrected dataset for 6 epochs
- **Sanity check:** 99.74% accuracy on the training data (expected, since the model saw all of it)
- **Estimated true accuracy:** ~98% top-1, ~99% top-3, based on the corrected-label evaluation

## Key Findings

1. **Label quality > model architecture.** Label merging (+5pp) and cleanlab corrections (+2pp true accuracy) had more impact than any model change
2. **DistilBERT is sufficient.** Domain-specific models (ClinicalBERT, BioClinicalBERT) and longer context (Longformer) didn't help
3. **~1.6% of labels are wrong.** Discharge summary (9.1%), Paediatrics (7.2%), and Physiotherapy (6.8%) are the noisiest classes
4. **The model is better than naive metrics suggest.** When evaluated against corrected labels, top-1 jumps from ~96% to ~98%

## Labels (49 classes)

- `A&E`
- `Ambulance Notification`
- `Audiology`
- `Bowel Cancer Screening`
- `Breast Clinic`
- `Cancer Screening`
- `Cardiology`
- `Colposcopy`
- `Dermatology`
- `Diabetes & Endocrine`
- `Diet Services`
- `Discharge summary`
- `ENT`
- `Echocardiogram`
- `Elderly Care`
- `Gastroenterology`
- `General Surgery`
- `Genetics`
- `Haematology`
- `INR`
- `Immunology`
- `Mammogram`
- `Maternity`
- `Maxillofacial`
- `Mental Health`
- `Neurology`
- `Neurosurgery`
- `Obstetrics & Gynaecology`
- `Oncology`
- `Ophthalmology`
- `Orthopaedics`
- `Out of Hours`
- `Paediatrics`
- `Pain Management`
- `Pharmacy`
- `Physiotherapy`
- `Plastic Surgery`
- `Radiology`
- `Renal`
- `Respiratory`
- `Retinal Screening`
- `Rheumatology`
- `Sexual Health`
- `Speech and Language Therapy`
- `Stroke Services`
- `Urgent Care Centre`
- `Urology`
- `Vascular`
- `Walk in Centre`

## Usage

```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo_id = "mansour94/kynoby-william-bert-classifier"
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Load the id -> label-name map shipped with the model
label_map = json.load(open(hf_hub_download(repo_id, "label_map.json")))
id2label = {int(k): v for k, v in label_map["id2label"].items()}

text = "Dear Dr Smith, I am writing to inform you about the patient's ophthalmology appointment..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)

# Top-3 predicted letter types with confidence
top3 = torch.topk(probs, 3)
for i in range(3):
    idx = top3.indices[0][i].item()
    conf = top3.values[0][i].item()
    print(f"{id2label[idx]}: {conf:.1%}")
```