BUILT WITH LLAMA!

Model Card for Llama-3.2-11B-CXR

Llama-3.2-11B-CXR is my first attempt at fine-tuning a general-purpose, open-weights vision-language model for chest X-ray structured report generation. The model has been fine-tuned to generate radiological reports in a structured JSON format, for example:

{
"Support devices": "None.",
"Cardiomediastinum": "Within normal limits.",
"Lungs": "Lungs are clear.",
"Pleura": "No pleural effusion or pneumothorax.",
"Skeleton": "No acute findings.",
"Upper abdomen": "No acute findings."
}

Model Details

Model Description

These are adapters for meta-llama/Llama-3.2-11B-Vision-Instruct, obtained through supervised fine-tuning (SFT) with low-rank adapters (LoRA) using a custom subset of publicly available frontal chest x-rays from the romprr/CXR_BioXAi_Hackathon_2024 dataset.

  • Developed, funded and shared by: Nakul Gupta
  • Model type: Multi-modal Large Language Model
  • Language(s) (NLP): SFT was performed in English, although the base model supports additional languages.
  • License: Llama 3.2 Community License (a custom, commercial license agreement).
  • Finetuned from model: meta-llama/Llama-3.2-11B-Vision-Instruct

Uses

This model is SOLELY intended for research and development purposes. It is by no means ready or meant for clinical use, nor has it been validated in a clinical setting.

Out-of-Scope Use

This model has NOT been validated for clinical use or evaluated by any regulatory bodies; it may hallucinate findings as well as miss them. It is intended for research and developmental use ONLY. The model's outputs are not intended to directly inform clinical diagnosis, patient management decisions, treatment recommendations, or any other direct clinical practice applications. All model outputs require independent verification and further investigation through established scientific research and development methodologies.

Bias, Risks, and Limitations

Results and model outputs are heavily dependent upon the specific prompt/instruction as well as sampling parameters (temperature, top_p, min_p, etc.). The model has been optimized only for single-turn, single-image evaluation. The model may also suffer from data contamination/leakage, where it may have been exposed to evaluation data during pre-training or fine-tuning, which may lead to overestimation of its true capabilities. Therefore, the model requires validation on datasets specific to each individual's or institution's use case.

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

base_model = "meta-llama/Llama-3.2-11B-Vision-Instruct"
adapter_id = "DeepRadiology/Llama-3.2-11B-CXR"

model = AutoModelForVision2Seq.from_pretrained(
    base_model,
    device_map='auto',
    torch_dtype=torch.bfloat16,
)

model.load_adapter(adapter_id)  # requires the peft library; the adapter is active once loaded
processor = AutoProcessor.from_pretrained(base_model)
image = Image.open("cxr.jpeg") # replace with your own example image

instruction = """You are an expert chest radiologist. Describe accurately what you see in this image. Use a \
structured report template with fields for: Support devices, Cardiomediastinum, Lungs, Pleura, Skeleton, and Upper \
abdomen. If there are no support devices, then report "None." for that field, if there are no pertinent \
Cardiomediastinal findings, report "Within normal limits." for that field. If there are no abnormal lung findings \
report "Lungs are clear." If there are no pertinent pleural findings, report "No pleural effusion or pneumothorax." \
For all other fields, if there are no pertinent findings, report "No acute findings." You must always generate a report\
 with the required fields."""

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256, temperature=0.7, min_p=0.1)
print(processor.decode(output[0], skip_special_tokens=True))
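Since the model is trained to emit a JSON object, the generated text can be parsed back into a Python dict for downstream use. A minimal sketch follows; the extraction regex and the validation helper are illustrative assumptions, not part of the model's contract:

```python
import json
import re

# The six report fields the model is trained to produce.
REQUIRED_FIELDS = ["Support devices", "Cardiomediastinum", "Lungs",
                   "Pleura", "Skeleton", "Upper abdomen"]

def parse_report(generated_text):
    """Extract the first JSON object from the generated text and
    check that all expected report fields are present."""
    match = re.search(r"\{.*\}", generated_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    report = json.loads(match.group(0))
    missing = [f for f in REQUIRED_FIELDS if f not in report]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return report

sample = ('Here is the report: {"Support devices": "None.", '
          '"Cardiomediastinum": "Within normal limits.", '
          '"Lungs": "Lungs are clear.", '
          '"Pleura": "No pleural effusion or pneumothorax.", '
          '"Skeleton": "No acute findings.", '
          '"Upper abdomen": "No acute findings."}')
report = parse_report(sample)
```

In practice the model's raw decode may include chat-template tokens around the JSON, which is why the sketch searches for the object rather than calling `json.loads` on the full string.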

Training Details

Training Data

romprr/CXR_BioXAi_Hackathon_2024.

Training Procedure

Preprocessing

The dataset was filtered using meta-llama/Llama-3.3-70B-Instruct to remove reports with references to prior studies (although this was not 100% successful). The remaining free-text reports were then converted into a structured report format, again using meta-llama/Llama-3.3-70B-Instruct. The final training set was approximately 33k X-rays.
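The LLM-based filtering step could be sanity-checked with a simple keyword heuristic. The sketch below is an illustration of that idea only; the keyword list is an assumption and is not the actual prompt or logic used during preprocessing:

```python
# Hypothetical keyword patterns that often signal a reference to a prior study.
PRIOR_PATTERNS = ["prior", "previous", "compared to", "comparison",
                  "interval change", "again seen", "unchanged"]

def mentions_prior(report_text):
    """Return True if the free-text report appears to reference a prior study."""
    text = report_text.lower()
    return any(pattern in text for pattern in PRIOR_PATTERNS)

reports = [
    "Lungs are clear. No pleural effusion.",
    "Unchanged from prior study, stable cardiomegaly.",
]
# Keep only reports with no apparent prior references.
kept = [r for r in reports if not mentions_prior(r)]
```

A keyword filter like this misses paraphrased references, which is presumably why an LLM was used for the actual filtering pass.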

Training Hyperparameters

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig

args = SFTConfig(
    num_train_epochs=3,
    per_device_train_batch_size=4, 
    gradient_accumulation_steps=8,
    gradient_checkpointing=False,
    optim='adamw_torch_fused',
    learning_rate=2e-5,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "fc1", "fc2"], 
    modules_to_save=['lm_head', 'embed_tokens'],
    task_type="CAUSAL_LM"
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

Training Time

4 days, 20 hours, 53 seconds on 3× NVIDIA RTX 3090s

Evaluation

Testing Data

Evaluation was performed on the publicly available IU-Xray and MIMIC-CXR datasets, using the 'test' splits and frontal X-rays only, as defined by RexRank.

Metrics

BLEU-2, BLEU-4, ROUGE-L, METEOR, and RadGraph F1 (simple, partial, and complete) metrics were assessed and compared against the baseline model as well as medgemma-4b-it. The same user prompt/instruction was used as during SFT. Llama-based models were sampled with temperature=0.1 and min_p=0.1; MedGemma used greedy decoding (do_sample=False).
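For intuition on what the n-gram metrics measure, here is a minimal sentence-level BLEU-style sketch. Real evaluations use established libraries (e.g. nltk or sacrebleu) with corpus-level smoothing; this simplified version is only illustrative:

```python
from collections import Counter
import math

def bleu_n(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU over n-grams up to max_n
    (no smoothing; single reference)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped n-gram overlap between candidate and reference.
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

score = bleu_n("no pleural effusion or pneumothorax",
               "no pleural effusion or pneumothorax", max_n=2)
```

RadGraph F1, by contrast, scores the overlap of extracted clinical entities and relations rather than surface n-grams, which is why it is the more clinically meaningful of the metrics reported here.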

Results

Llama-3.2-11B-CXR performs comparably to, or slightly better than, the medical-specific MedGemma-4b-it, besting it on the RadGraph (RG) metrics on both datasets.

| Model | Dataset | BLEU-2 | BLEU-4 | ROUGE-L | METEOR | RG-Simple | RG-Partial | RG-Complete |
|---|---|---|---|---|---|---|---|---|
| Llama-3.2-11B-Vision-Instruct | IU-Xray | 0.0474 | 0.0137 | 0.1787 | 0.2027 | 0.2065 | 0.1867 | 0.1095 |
| Llama-3.2-11B-CXR | IU-Xray | 0.0485 | 0.0099 | 0.2464 | 0.1132 | 0.2555 | 0.2438 | 0.1743 |
| MedGemma-4b-it | IU-Xray | 0.0627 | 0.0184 | 0.2024 | 0.2172 | 0.2327 | 0.2196 | 0.1831 |
| Llama-3.2-11B-Vision-Instruct | MIMIC-CXR | 0.0631 | 0.0159 | 0.1460 | 0.1574 | 0.1340 | 0.1201 | 0.0800 |
| Llama-3.2-11B-CXR | MIMIC-CXR | 0.0890 | 0.0242 | 0.1896 | 0.1568 | 0.1853 | 0.1677 | 0.1130 |
| MedGemma-4b-it | MIMIC-CXR | 0.0794 | 0.0218 | 0.1724 | 0.1887 | 0.1772 | 0.1597 | 0.1141 |

Summary

Presenting Llama-3.2-11B-CXR, a multi-modal, open-weights vision-language model (VLM) fine-tuned for chest X-ray report generation! The primary goal of this exercise was to demonstrate the potential for general-purpose VLMs to be repurposed for medical imaging tasks on consumer-grade hardware with publicly available datasets.

Model Card Contact

Nakul Gupta

Data Citations

MIMIC-CXR:

Johnson, A., Pollard, T., Mark, R., Berkowitz, S., & Horng, S. (2019). MIMIC-CXR Database (version 2.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/C2JT1Q

Johnson, A.E.W., Pollard, T.J., Berkowitz, S.J. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data 6, 317 (2019). https://doi.org/10.1038/s41597-019-0322-0

IU-Xray:

Demner-Fushman D, Kohli MD, Rosenman MB, Shooshan SE, Rodriguez L, Antani S, Thoma GR, McDonald CJ. Preparing a collection of radiology examinations for distribution and retrieval. J Am Med Inform Assoc. 2016 Mar;23(2):304-10. doi: 10.1093/jamia/ocv080. Epub 2015 Jul 1. PMID: 26133894; PMCID: PMC5009925.
