FraudLens: ICD-10 Clinical Coding Model

LoRA adapter for Llama 3.1 8B that predicts ICD-10-CM diagnosis codes from hospital discharge summaries. Built as the core model for a clinical-coding consistency auditor.

Quick Start

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit"
adapter = "saxon11/fraudlens-icd10-llama3.1-8b-lora"

tokenizer = AutoTokenizer.from_pretrained(base_model)
# The base checkpoint is already bitsandbytes-4bit quantized, so no extra quantization arguments are needed
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)

messages = [
    {"role": "system", "content": "You are a clinical coding auditor..."},
    {"role": "user", "content": "<your discharge summary>"},
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
# do_sample=True is required for temperature to take effect; greedy decoding ignores it
outputs = model.generate(inputs.to(model.device), max_new_tokens=200, do_sample=True, temperature=0.3)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))

Output Format

<reasoning>Patient presented with acute respiratory failure requiring BiPAP.
Acute kidney injury with creatinine 3.2. Type 2 diabetes managed with insulin.
Chronic hypertension on home meds.</reasoning>
<codes>
<code>J9601</code>
<code>N179</code>
<code>E119</code>
<code>I10</code>
</codes>
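The tagged output above can be parsed with a small helper. This is a sketch, not part of the released package; the regexes assume the exact `<reasoning>`/`<codes>`/`<code>` structure shown:

```python
import re

def parse_prediction(text: str) -> tuple[str, list[str]]:
    """Extract the reasoning string and the list of ICD-10 codes
    from the model's XML-style output."""
    m = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    reasoning = m.group(1).strip() if m else ""
    codes = re.findall(r"<code>([A-Z][0-9A-Z.]+)</code>", text)
    return reasoning, codes

sample = """<reasoning>Acute respiratory failure requiring BiPAP.</reasoning>
<codes>
<code>J9601</code>
<code>N179</code>
</codes>"""
print(parse_prediction(sample)[1])  # ['J9601', 'N179']
```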

Evaluation (50-case held-out test set)

Metric                Value
Precision             0.387
Recall                0.272
F1                    0.294
Avg. predicted codes  3.4
Avg. true codes       5.6

The model is conservative: precision (0.387) exceeds recall (0.272), and it predicts fewer codes than the ground truth. It reliably captures chronic conditions (hypertension, diabetes, hyperlipidemia) but often misses acute organ failures and metabolic derangements.
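The card does not spell out the scoring protocol; a plausible sketch is set-based precision/recall/F1 per case, averaged over the test set. The example case below (chronic conditions captured, acute ones missed) is illustrative, not drawn from the actual test set:

```python
def prf1(pred: set[str], true: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one case."""
    tp = len(pred & true)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(true) if true else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Model captures the chronic conditions but misses the acute ones.
pred = {"I10", "E119", "E785"}                    # hypertension, T2DM, hyperlipidemia
true = {"I10", "E119", "E785", "J9601", "N179"}   # plus acute resp. failure and AKI
p, r, f1 = prf1(pred, true)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=1.00 R=0.60 F1=0.75
```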

Training

Two-phase approach:

  1. SFT Warm-Start (200 steps): Supervised fine-tuning on filtered high-acuity MIMIC-IV cases to teach ICD-10 vocabulary and output format
  2. GRPO Reinforcement Learning (750 steps): Group Relative Policy Optimization with multi-component reward:
    • XML structure reward (0-0.125)
    • Format compliance: reasoning + codes present (0-0.5)
    • Valid ICD-10 code reward (0-1.0)
    • Two-level F1: exact match + category-level (0-6.5)
    • Brevity penalty (prevents code dumping)
    • Repetition penalty (prevents duplicate codes)
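The card gives the component weight ranges but not their implementations; the sketch below assumes plausible forms for each term (including the penalty shapes) and uses a tiny stand-in for the full ICD-10-CM code table. "Category-level" means the 3-character ICD-10 category prefix (e.g. N179 → N17):

```python
import re

VALID_CODES = {"J9601", "N179", "E119", "I10"}  # stand-in for the full ICD-10-CM table

def _f1(pred: set, true: set) -> float:
    tp = len(pred & true)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(true) if true else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def reward(completion: str, true_codes: list[str]) -> float:
    pred = re.findall(r"<code>([0-9A-Z.]+)</code>", completion)
    score = 0.0
    # XML structure (0-0.125): reasoning block followed by a codes block
    if re.search(r"<reasoning>.*?</reasoning>\s*<codes>.*?</codes>", completion, re.DOTALL):
        score += 0.125
    # Format compliance (0-0.5): non-empty reasoning plus at least one code
    m = re.search(r"<reasoning>(.*?)</reasoning>", completion, re.DOTALL)
    if m and m.group(1).strip() and pred:
        score += 0.5
    # Valid-code reward (0-1.0): fraction of predictions in the code table
    if pred:
        score += sum(c in VALID_CODES for c in pred) / len(pred)
    # Two-level F1 (0-6.5): exact codes + 3-character categories
    p, t = set(pred), set(true_codes)
    score += 3.25 * _f1(p, t)
    score += 3.25 * _f1({c[:3] for c in p}, {c[:3] for c in t})
    # Brevity and repetition penalties (assumed shapes)
    score -= 0.1 * max(0, len(pred) - len(true_codes) - 2)  # discourage code dumping
    score -= 0.2 * (len(pred) - len(set(pred)))             # discourage duplicates
    return score
```

A well-formed completion with exactly the true codes would score the maximum 8.125 under these assumptions (0.125 + 0.5 + 1.0 + 6.5).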

Training data: MIMIC-IV discharge summaries (PhysioNet) with linked ICD-10-CM codes. ~92K training examples, 50 held-out test cases.

Hyperparameters

Parameter            Value
Base model           Llama 3.1 8B Instruct (4-bit)
LoRA rank            32
LoRA alpha           32
Target modules       q, k, v, o, gate, up, down proj
GRPO learning rate   1e-6
SFT learning rate    2e-5
Max sequence length  5120
Batch size           4
Num generations      4
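The LoRA rows of the table map onto a PEFT `LoraConfig` roughly as follows. The module names use PEFT's standard Llama naming; `lora_dropout` and `bias` are not stated in the card and are assumed defaults:

```python
from peft import LoraConfig

# Hypothetical reconstruction of the adapter config from the table above.
lora_config = LoraConfig(
    r=32,                      # LoRA rank
    lora_alpha=32,             # LoRA alpha
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.0,          # not stated in the card; assumed
    bias="none",               # not stated in the card; assumed
    task_type="CAUSAL_LM",
)
```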

Training Journey

14 iterations of reward engineering across 4 weeks:

  • Runs 1-3: Pure GRPO failed (F1 = 0.026). Model couldn't learn ICD-10 codes from scratch.
  • Run 7: SFT warm-start breakthrough (test F1 = 0.303), but format collapsed.
  • Runs 8-13: Iterative reward fixes: soft format multiplier, brevity penalty, filtered SFT data.
  • Runs 14-15: Best balanced results. Filtered SFT on high-acuity cases + anti-repetition + relaxed brevity.

Limitations

  • Under-prediction: Averages 3.4 codes vs 5.6 true. Misses acute conditions.
  • Single-center training data: MIMIC-IV is from Beth Israel Deaconess only.
  • Research only: Not validated for clinical use.
  • Generic reasoning: Tends to use template reasoning rather than case-specific analysis.

Framework Versions

  • Transformers: 4.57.1
  • PEFT: 0.18.1
  • TRL: 0.23.1
  • PyTorch: 2.9.0+cu128
  • Unsloth: latest