# FraudLens: ICD-10 Clinical Coding Model
LoRA adapter for Llama 3.1 8B that predicts ICD-10-CM diagnosis codes from hospital discharge summaries. Built as the core model for a clinical-coding consistency auditor.
## Quick Start

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit"
adapter = "saxon11/fraudlens-icd10-llama3.1-8b-lora"

# Load the 4-bit base model, then attach the LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, load_in_4bit=True, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)

messages = [
    {"role": "system", "content": "You are a clinical coding auditor..."},
    {"role": "user", "content": "<your discharge summary>"},
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
# temperature only takes effect when sampling is enabled
outputs = model.generate(inputs.to(model.device), max_new_tokens=200, do_sample=True, temperature=0.3)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```
## Output Format

```xml
<reasoning>Patient presented with acute respiratory failure requiring BiPAP.
Acute kidney injury with creatinine 3.2. Type 2 diabetes managed with insulin.
Chronic hypertension on home meds.</reasoning>
<codes>
<code>J9601</code>
<code>N179</code>
<code>E119</code>
<code>I10</code>
</codes>
```
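Downstream auditing code needs to pull the code list out of this XML-style wrapper. A minimal sketch (the `parse_codes` helper name and its regexes are my own, not part of the model card):

```python
import re

def parse_codes(completion: str) -> list[str]:
    """Extract ICD-10 codes from the model's <codes> block."""
    block = re.search(r"<codes>(.*?)</codes>", completion, re.DOTALL)
    if block is None:
        return []  # model failed to emit a codes block
    return re.findall(r"<code>([A-Z][0-9A-Z.]+)</code>", block.group(1))

sample = """<reasoning>AKI with creatinine 3.2.</reasoning>
<codes>
<code>J9601</code>
<code>N179</code>
</codes>"""
print(parse_codes(sample))  # ['J9601', 'N179']
```

Returning an empty list on a missing block keeps an auditing pipeline robust to the occasional format failure noted under Training Journey.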
## Evaluation (50-case held-out test set)
| Metric | Value |
|---|---|
| Precision | 0.387 |
| Recall | 0.272 |
| F1 | 0.294 |
| Avg predicted codes | 3.4 |
| Avg true codes | 5.6 |
The model is conservative: precision (0.387) runs well ahead of recall (0.272), and it under-predicts acute conditions. It reliably captures chronic conditions (hypertension, diabetes, hyperlipidemia) but often misses acute organ failures and metabolic derangements.
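For anyone reproducing these numbers, a set-based scorer over per-case code lists looks roughly like this. The card does not state whether the metrics are micro- or macro-averaged; this sketch assumes micro-averaging, and the function name is my own:

```python
def micro_prf(pred_sets, true_sets):
    """Micro-averaged precision/recall/F1 over per-case ICD-10 code sets."""
    tp = fp = fn = 0
    for pred, true in zip(pred_sets, true_sets):
        pred, true = set(pred), set(true)
        tp += len(pred & true)   # codes predicted and present
        fp += len(pred - true)   # codes predicted but absent
        fn += len(true - pred)   # codes present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = micro_prf([["I10", "E119"]], [["I10", "E119", "N179"]])
# p = 1.0, r ≈ 0.667, f = 0.8
```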
## Training
Two-phase approach:
- SFT Warm-Start (200 steps): Supervised fine-tuning on filtered high-acuity MIMIC-IV cases to teach ICD-10 vocabulary and output format
- GRPO Reinforcement Learning (750 steps): Group Relative Policy Optimization with a multi-component reward:
  - XML structure reward (0-0.125)
  - Format compliance: reasoning + codes (0-0.5)
  - Valid ICD-10 code reward (0-1.0)
  - Two-level F1: exact match + category-level (0-6.5)
  - Brevity penalty (prevents code dumping)
  - Repetition penalty (prevents duplicate codes)
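The actual reward functions are not published with the card. The sketch below only illustrates how components with these ranges might combine into one scalar; the weights, regexes, and penalty constant are guesses, and the category-level F1 and brevity terms are omitted for brevity:

```python
import re

def coding_reward(completion: str, true_codes: set, valid_codes: set) -> float:
    """Illustrative composite reward; ranges follow the card, details are assumptions."""
    reward = 0.0
    # XML structure reward (up to 0.125): all four tags present
    if all(t in completion for t in ("<reasoning>", "</reasoning>", "<codes>", "</codes>")):
        reward += 0.125
    # Format compliance (up to 0.5): non-empty reasoning followed by a codes block
    if re.search(r"<reasoning>.+?</reasoning>\s*<codes>", completion, re.DOTALL):
        reward += 0.5
    pred = re.findall(r"<code>([A-Z0-9.]+)</code>", completion)
    if not pred:
        return reward
    # Valid-code reward (up to 1.0): fraction of predictions in the ICD-10-CM codebook
    reward += sum(c in valid_codes for c in pred) / len(pred)
    # Exact-match F1, scaled (the card spreads up to 6.5 across exact + category level)
    tp = len(set(pred) & true_codes)
    prec = tp / len(set(pred))
    rec = tp / max(len(true_codes), 1)
    if prec + rec:
        reward += 6.5 * 2 * prec * rec / (prec + rec)
    # Repetition penalty: dock duplicated codes (constant is hypothetical)
    reward -= 0.1 * (len(pred) - len(set(pred)))
    return reward

good = "<reasoning>AKI with creatinine 3.2.</reasoning>\n<codes>\n<code>N179</code>\n</codes>"
print(coding_reward(good, {"N179"}, {"N179", "I10"}))  # 8.125
```

Weighting F1 far above the format terms matches the card's ranges: once formatting is learned, nearly all of the gradient signal comes from coding accuracy.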
Training data: MIMIC-IV discharge summaries (PhysioNet) with linked ICD-10-CM codes. ~92K training examples, 50 held-out test cases.
## Hyperparameters
| Parameter | Value |
|---|---|
| Base model | Llama 3.1 8B Instruct (4-bit) |
| LoRA rank | 32 |
| LoRA alpha | 32 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| GRPO learning rate | 1e-6 |
| SFT learning rate | 2e-5 |
| Max sequence length | 5120 |
| Batch size | 4 |
| Num generations | 4 |
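The LoRA rows of the table translate to a PEFT config along these lines (a sketch: `lora_dropout` and `task_type` are not stated in the card and are assumed defaults):

```python
from peft import LoraConfig

# Mirrors the hyperparameter table above; dropout/task_type are assumptions.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
```

With alpha equal to rank, the adapter's effective scaling factor (alpha / r) is 1.0.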
## Training Journey
14 iterations of reward engineering across 4 weeks:
- Runs 1-3: Pure GRPO failed (F1 = 0.026). Model couldn't learn ICD-10 codes from scratch.
- Run 7: SFT warm-start breakthrough (test F1 = 0.303), but output format collapsed.
- Runs 8-13: Iterative reward fixes: soft format multiplier, brevity penalty, filtered SFT data.
- Runs 14-15: Best balanced results. Filtered SFT on high-acuity cases + anti-repetition + relaxed brevity.
## Limitations
- Under-prediction: Averages 3.4 codes vs 5.6 true. Misses acute conditions.
- Single-center training data: MIMIC-IV is from Beth Israel Deaconess only.
- Research only: Not validated for clinical use.
- Generic reasoning: Tends to use template reasoning rather than case-specific analysis.
## Framework Versions
- Transformers: 4.57.1
- PEFT: 0.18.1
- TRL: 0.23.1
- PyTorch: 2.9.0+cu128
- Unsloth: latest